📜 ⬆️ ⬇️

Choosing an html-parser for Apache.JMeter

The average quality of completeness of extracting links to embedded resources of Apache.JMeter html parsers
The average quality of the work of parsers (for seven sites)

I suggest:


As popular wisdom says: Believe believe, but ...

')

Project Description


When Apache.JMeter 3.0 came out, I was still in development, my colleagues and I tested it and started using it in my work. The first in combat testing was a new version of Artem Zetsigemon for one of the new services, pgu.mos.ru , where using Apache.JMeter 3.0, the quality of parsing of embedded resources was significantly improved compared to the previous version.
And then the question arose, what is the quality, how to measure it for different sites, what it was and how it became?

Materials and research results are reflected in the current article.

Test object


HtmlParser-s for Apache.JMeter 2.13 and Apache.JMeter 3.0 are tested .

Apache.JMeter 2.13 parsers:


Parses Apache.JMeter 3.0 :


Parsers parse the start pages of various websites:


The basis of testing


The basis was the changes in Apache.JMeter 3.0 , see http://jmeter.apache.org/changes.html .

Excerpts from the list of changes:

Core improvements


Dependencies refresh

Deprecated Libraries dropped to date:
  • htmllexer, htmlparser removed
  • jdom removed

Removed parser htmlparser and more unused library jdom .

Protocols and Load Testing improvements


Parallel downloads are much better:

  • Parsing of CSS imported files (through import ) or embedded resources (background, images, ...)

A new parser for CSS files has been added, links to other CSS files (via import ) and links to resources specified in CSS files will be extracted: background images, images, ...

Incompatible changes


  • Since it was a version of the parser for the embedded resources (replaced since 2.10 by the Lagarto based implementation), it has been updated with the htmlparser library (HtmlParserHTMLParser).
  • The following jars have been removed:

Removed parser htmlparser and more unused libraries htmllexer and jdom .

Improvements


HTTP Samplers and Test Script Recorder

  • Bug 59036 - FormCharSetFinder: Use JSoup instead of deprecated HTMLParser
  • Bug 59033 - Parallel Download plug-in parser for each mime types
  • Bug 59140 - Parallel Download: Add CSS Parsing to extract links from CSS files

To search for the accept-charset attribute in form tags, JSoup is now used instead of remote HTMLParser [Bug 59036]. Implemented CSS parser [Bug 59140] and this parser is used by default [Bug 59033].

Test objectives


Compare the work of all available parsers. In particular, compare parsers versions 2.13 and 3.0 to each other, make sure that the load of embedded resources has become more realistic and better.

Strategy


Stage 1:

  1. Download the starting pages of the list of sites using all 5 Apache.JMeter 2.13 parsers and write logs.
  2. Download the starting pages of the list of sites using all 4 Apache.JMeter 3.0 parsers and write logs.
  3. Analyze the Apache.JMeter logs and compare them with each other. Evaluate whether the load on embedded resources has become better, whether the list of downloaded embedded resources has expanded.

Stage 2:

  1. Download the starting pages of the list of popular sites using Google Chrome and webpagetest.org service.
  2. Analyze the reports from webpagetest.org and compare them with the results of the analysis of Apache.JMeter logs. Rate, realistic loading of embedded resources.

Testing approach


To accurately determine how many requests are sent during the opening of a page from Apache.JMeter, all requests are logged:


In order to be able to group requests by Apache.JMeter versions, parsers and sites, additional variables for each request will be logged:


results


Evaluation of the improvement of parsers for version 3.0 compared to version 2.13


There are no dramatic improvements in the completeness of parsing html-pages, there are worsening.

The essential difference is that in the Apache.JMeter 3.0 parsers there is a recursive loading of the Yandex Browser browser promotional materials page. This is manifested when downloading https://yandex.ru/ .

Sites with a small amount of content - a good result


On simple sites such as jmeter.apache.org, all parsers work the same way. By creating the same number of subqueries that the browser creates. The quality of the parsers for jmeter.apache.org is perfect, 100%.

Sites with a lot of content - bad result


But on a site like mos.ru, the parsers will find an average of 22 links to embedded resources, while a full page load with the browser loading all embedded resources - 144 requests. The quality is low.

Similarly, on the site habrahabr.ru , the Lagardo parser from Apache.JMeter 3.0 will find 55 links, while the browser will make 117 subqueries. The quality is 47.01%. Satisfactory quality of completeness of extracting links to embedded resources.

Number of subqueries using different parsers


Table on Google Docs: JMeter.HtmlParser.Compare (top table) .

Summary table comparing the work of various Apache.JMeter html parsers with the results of Google Chrome
Apache.JMeter statistics on versions and html parsers and its comparison with Google Chrome

Column description:


The good result of the parsers will be if the subqueries are as many as the Google Chrome browser does at the time of Document Complete , and excellent at the time of Fully Loaded . The measure of the realism of Apache.JMeter when using a particular parser is the proximity of the number of subqueries to the number of subqueries executed by the browser at the time of Fully Loaded .

If we exclude the results of testing site yandex.ru, where:


And as a measure of the quality of the work of parsers to accept the number of subqueries at the time of Fully Loaded , we get such a table of the average quality of work of the parsers.

The average quality of the parsers


Table on Google Docs: JMeter.HtmlParser.Compare (lower table) .

The average quality of the parsers
The average quality of the work of parsers (for seven sites, without yandex.ru)

The most accurate HTML Parser parser in Apache.JMeter 2.13. In Apache.JMeter 3.0, the Jsoup and JTidy parsers showed the same quality. Parser Lagarto behind the leaders. The parsing completeness for the Lagarto parser in Apache.JMeter 3.0 has decreased in comparison with Apache.JMeter 2.13.

The quality of the Lagarto parser on the current version of Apache.JMeter 3.0 was 32.73%, only a third of all the subqueries were sent, two thirds of the load on the static was not submitted.

Detailed analysis of missing links when working Apache.JMeter 3.0 site habrahabr.ru


Spreadsheet on Google Docs: Missed default Apache.JMeter 3.0 parser links to habrahabr.ru
Missing Link Summary
Missed Apache.JMeter 3.0 links when parsing habrahabr.ru

Missing a lot of promotional materials and statistics. And this is good.
Above I have already made a reservation that I will evaluate only quantitative characteristics. By numbers, it turns out that the quality of extracting links to embedded resources is low. Detailed analysis can show exactly what was missing. It may turn out that this is advertising and content from third-party resources, which need not be loaded as part of the load testing of a specific site.

Logs and their processing


Initial data


All logs are available at the link: https://drive.google.com/drive/folders/0B5nKzHDZ1RIiVkN4dDlFWDR1ZGM .

WebPageTest.org reports


sytekeywebpagetest.orgRaw page data (.csv)Raw object data (.csv)HTTP Archive (.har)
github.com160819_VF_FM8github.com.summary.csvgithub.com.details.csvgithub.com.har
google.ru160819_C9_FQDgoogle.com.summary.csvgoogle.com.details.csvgoogle.ru.har
habrahabr.ru160819_8N_FRBhabrahabr.ru.summary.csvhabrahabr.ru.details.csvhabrahabr.ru.har
jmeter.apache.org160819_CG_FSMjmeter.apache.org.summary.csvjmeter.apache.org.details.csvjmeter.apache.org.har
linkedin.com160819_K2_FY1linkedin.com.summary.csvlinkedin.com.details.csvlinkedin.com.har
mos.ru160819_91_G0Fmos.ru.summary.csvmos.ru.details.csvmos.ru.har
stackoverflow.com160819_S0_G18stackoverflow.com.summary.csvstackoverflow.com.details.csvstackoverflow.com.har
yandex.ru160819_MR_G1Ryandex.ru.summary.csvyandex.ru.details.csvyandex.ru.har

Report Images








The caps of the webpagetest.org reports from which Document Complete and Fully Loaded data were extracted

From the values ​​of the columns Document Complete and Fully Loaded, you need to exclude one query (root) to get the number of subqueries.

Logs Apache.JMeter


For processing, csv-logs are used, formed by the CsvLogWriter plugin:


The third-party plugin is used to request requests for embedded resources in the csv-log.

As a result of CsvLogWriter operation, a log is generated, the list of columns of which includes:



Log Processing Automation


Aggregation of Aps.JMeter csv logs is done using pandas with this python code:

 import pandas as pd import codecs from os import listdir import numpy as np #  -       . dirPath = "D:/project/jmeter.htmlParser.3.0.vs.2.13/logs" read_csv_param = dict( index_col=['timeStamp'], low_memory=False, sep = ";", na_values=[' ','','null']) #   csv-    . files = filter(lambda a: '.csv' in a, listdir(dirPath)) #    csv-  DataFrame dfs. csvfile = dirPath + "/" + files[0] print(files[0]) dfs = pd.read_csv(csvfile,**read_csv_param) for csvfile in files[1:]: print(csvfile) tempDfs = pd.read_csv(dirPath + "/" + csvfile, **read_csv_param) dfs = dfs.append(tempDfs) #dfs.to_excel(dirPath + "/total.xlsx") #     JSR223,      ,   HTTP Request Sampler. #  JSR223 URL ,  HTTP- URL . dfs = dfs[(pd.isnull(dfs.URL) == False)] #     ,   report.subrequests.html -   . #      ,    . #    -   ,   . pd.pivot_table(dfs, index=['siteKey', "jmeterVersion", "htmlParser"], values="URL", columns=["i"], aggfunc=lambda url: url.count()-1).to_html(dirPath + "/report.subrequest.count.html") 

Recursive download on yandex.ru


Apache.JMeter goes into recursion
Recursive loading of embedded resources for the current version of Apache.JMeter 3.0 with default settings (html-parser Lagarto) on the website yandex.ru

As seen:

  1. Apache.JMeter finds and follows the link https://yandex.ru/clck/redir/dtype=stred....7004fcb3793e79bb1ac9e&keyno=12
  2. Then it finds a new unique link https://yandex.ru/clck/redir/dtype=stred....cd1c46cad58fbfe2f61&keyno=12
  3. And so on, goes into recursion.

In this case, this is the picture inside the link to download Yandex Browser:

Markup fragment when working with which the recursion step occurs in Apache.JMeter
Fragment of the html-code of the site yandex.ru processing of which adds a new recursion step, links and a picture for downloading Yandex Browser

The parser finds the picture. JMeter tries to download it, gets a html page in response, there is again a link to the picture and other links. And everything repeats. Apache.JMeter behavior is correct.

And in Apache.JMeter 2.13, the recursion occurs only on the HtmlParser parser, why does not guess why it happens to others:



I will not guess. Seems a hopeless situation. But such situations do not happen. There is always a solution.

For example, you can try to specify Yandex Browser as User-Agent . Then the server probably will not show a picture for downloading the browser, or the picture will respond to the picture request, and there will be no recursion. This is a guess, did not check it.

Now the User-Agent for Google Chrome was specified in the script for synchronization with the work of webpagetest.org, and seeing the server not seeing your browser, apparently, suggests a link to your own.

The composition of the project



The test is easy to change for yourself, specify your sites and the desired number of iterations. All settings are specified in the test.bat file.

 CALL jmeter.2.13.bat http://stackoverflow.com/ 5 stackoverflow.com CALL jmeter.2.13.bat https://habrahabr.ru/ 5 habrahabr.ru CALL jmeter.2.13.bat https://yandex.ru/ 5 yandex.ru CALL jmeter.2.13.bat https://www.mos.ru/ 5 mos.ru CALL jmeter.2.13.bat http://jmeter.apache.org/ 5 jmeter.apache.org CALL jmeter.2.13.bat https://www.google.ru/ 5 google.ru CALL jmeter.2.13.bat https://www.linkedin.com/ 5 linkedin.com CALL jmeter.2.13.bat https://github.com/ 5 github.com CALL jmeter.3.0.bat http://stackoverflow.com/ 5 stackoverflow.com CALL jmeter.3.0.bat https://habrahabr.ru/ 5 habrahabr.ru CALL jmeter.3.0.bat https://yandex.ru/ 5 yandex.ru CALL jmeter.3.0.bat https://www.mos.ru/ 5 mos.ru CALL jmeter.3.0.bat http://jmeter.apache.org/ 5 jmeter.apache.org CALL jmeter.3.0.bat https://www.google.ru/ 5 google.ru CALL jmeter.3.0.bat https://www.linkedin.com/ 5 linkedin.com CALL jmeter.3.0.bat https://github.com/ 5 github.com 

You can then insert the results from an Excel file with customized formulas and get a descriptive table of results.
You can try to refine the parsers, and follow a similar method to track the improvement in the quality of parsing embedded resources.

findings


There is no special practical value in the article. But some useful conclusions can be made:

Source: https://habr.com/ru/post/308254/


All Articles