How to Spider a Site with JMeter - A Tutorial

Dmitri Tikhanski, Contributing Writer to the BlazeMeter blog | Nov 16 2016

When it comes to building a web application load test, you might want to simulate a group of users “crawling” the website and randomly clicking links. Personally, I don’t like this form of testing, as I believe a load test needs to be repeatable, so each consecutive test hits the same endpoints and produces the same throughput as the previous one. If the entry criteria are different, it’s harder to analyze the load test results.

 

But sometimes this rule isn’t applicable, especially for dynamic websites like blogs, news portals, and social networks, where new content is added very often or even in real time. This form of testing ensures that users get a smooth browsing experience, and it also checks for broken links and other unexpected errors.

 

This article covers the three most common approaches to simulating website “crawling”: clicking all the links found on a web page, using the HTML Link Parser, and building an advanced “spidering” test plan.

 

1. Clicking All Links Found in the Web Page

 

The process of getting the links with the Regular Expression Extractor is described in the Using Regular Expressions in JMeter article. The algorithm is as simple as this:

 

1. Extract all the links from the response with the Regular Expression Extractor and store them in JMeter Variables. The relevant regular expression would be something like:

 

<a[^>]* href="([^"]*)"

 

Don’t forget to set “Match No.” to -1 to extract all the links; otherwise only a single match will be returned.
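For example, given a response fragment like this (an illustrative link):

<a class="nav" href="/products/blazemeter">BlazeMeter</a>

the expression captures /products/blazemeter into the first group. With the Template set to $1$ and the Reference Name set to, say, link, JMeter stores the matches as the variables link_1, link_2, etc., with link_matchNr holding the total number of matches.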

 

[Screenshot: Regular Expression Extractor configuration in JMeter]

 

2. Use a ForEach Controller to iterate over the extracted links

 

3. Use an HTTP Request sampler to hit the URL stored in the ForEach Controller’s “Output Variable Name”, as sketched below
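For reference, the relevant settings might look like this (the variable names link and currentLink are illustrative, matching the Reference Name from step 1):

ForEach Controller:
  Input variable prefix: link
  Output variable name:  currentLink

HTTP Request (under the ForEach Controller):
  Path: ${currentLink}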

 

Demo:

 

[Screenshot: demo of the Regular Expression Extractor approach]

 

Pros:

● Simplicity of configuration

● Stability

● Failover and resilience

 

Cons:

● Regular Expressions are hard to develop and are sensitive to markup changes, hence fragile

● Not actually a “crawler” or “spider”, just sequential requests to the extracted links

 

2. Using HTML Link Parser

 

JMeter provides a special Test Element for this: the HTML Link Parser. This element is designed to extract HTML links and forms from the previous response and substitute the extracted values into the matching fields of the HTTP Request sampler. The HTML Link Parser can therefore be used to simulate crawling a website with minimal configuration. Here’s how:

 

1. Put the HTML Link Parser under a Logic Controller (e.g. a Simple Controller if you just need a “container”, or a While Controller to set a custom condition like a maximum number of hits)

 

2. Put the HTTP Request sampler under the Logic Controller and configure the Server Name or IP and Path fields to limit the extracted values to an “interesting” scope. You probably want to restrict crawling to the domain(s) belonging to the application under test: if a page links to an external resource, JMeter will otherwise follow it out onto the Internet. Perl5-style regular expressions can be used to scope the extracted links, as shown in the sketch below.
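Put together, the outline might look like this (example.com stands in for your application’s domain):

Simple Controller (or While Controller)
  HTML Link Parser
  HTTP Request
    Server Name or IP: (www\.)?example\.com    <- Perl5-style regexp, keeps the crawl on your domain(s)
    Path: /.*                                  <- any path on the site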

 

[Screenshot: HTML Link Parser configuration in JMeter]

 

Demo:

 

[Screenshot: demo of the HTML Link Parser approach]

 

Pros:

● Easy to configure and use

● Acts like a “spider”

 

Cons:

● Zero error tolerance: any failure to extract links from a response will cause cascading failures of all subsequent requests

 

3. Advanced “Spidering” Test Plan

 

Given the limitations of the above approaches, you might want to come up with a solution that won’t break on errors and will crawl the whole application under test. Below is a reference Test Plan outline that can be used as a skeleton for your “spider”:

● Open the main page

○ Extract all the links from it

○ Click on a random link

○ If the response has a “good” MIME type, extract all the links from it (if the link leads to an image, a PDF document, or the like, link extraction is skipped)

 

[Screenshot: the advanced “spidering” Test Plan in JMeter]

 

Explanation of the elements used (a field-level configuration sketch follows this list):

While Controller: to set the maximum number of requests so the test won’t run forever. If you limit the test duration via the scheduler, you can skip this controller

Once Only Controller: to execute a call to the main page only once

XPath Extractor: to filter out URLs that don’t belong to the application under test, as well as other links that are not interesting, like “mailto”, “callto”, etc. An example XPath query looks like this:

 

//a[(starts-with(@href,'/') or starts-with(@href,'.') or contains(@href,'${SITE}')) and not(contains(@href,'mailto'))]/@href
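For instance, with ${SITE} set to example.com (an illustrative value), the query keeps relative and same-site links and drops the rest:

kept:    /about    ./news/today    http://example.com/page
dropped: mailto:team@example.com    http://another-site.com/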

 

Using XPath is not a must, and in some cases it may be very memory-intensive, so you may need to consider other ways of fetching the links from the response. It is used here for demonstration purposes, as XPath queries are normally more human-readable than CSS/JQuery selectors and especially Regular Expressions.

 

__javaScript() function: actually does three things (see the sketch after this list):

a. Chooses a random link out of those extracted by the XPath Extractor

b. Removes the “../” bit from the beginning of the URL

c. Sets the HTTP Request title to the current random URL

Regular Expression Extractor: to extract the Content-Type header from the response

If Controller: the next round of link extraction will start only if the response has a matching content type
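To make this concrete, here is a minimal, field-level sketch of how these elements could be configured. The variable names (link, currentUrl, contentType) and the 100-request cap are illustrative assumptions, not the exact values from the reference plan:

While Controller -> Condition (caps the crawl length using the __counter function):

${__javaScript(${__counter(TRUE)} < 100)}

HTTP Request -> Path (picks a random link out of link_1..link_N produced by the XPath Extractor, strips a leading “../”, and stores the result in the currentUrl variable, which can also be referenced in the sampler’s Name field):

${__javaScript(var n=parseInt(vars.get('link_matchNr')); var u=vars.get('link_'+(Math.floor(Math.random()*n)+1)); u.indexOf('../')==0 ? u.substring(3) : u,currentUrl)}

Regular Expression Extractor -> Field to check: Response Headers, Reference Name: contentType, Regular Expression:

Content-Type: (.+)

If Controller -> Condition (continues only for HTML responses):

${__javaScript("${contentType}".indexOf("text/html") != -1)}

Note that inside a JMeter function, literal commas would have to be escaped as “\,”; the expressions above deliberately avoid them.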

 

Demo:

 

[Screenshot: demo of the advanced “spidering” Test Plan]

 

Pros:

● Extreme flexibility and configurability

 

Cons:

● Complexity

 

If you want to try the above scripts yourself, they are available in the jmeter-spider repository. As usual, feel free to use the discussion box below to share any feedback.
