
What Do You Eat UserAgent With?



To begin with, of course, I should explain why anyone would eat this "user agent" at all. Or, more generally, start with what an agent even is. (By the way, does anyone know a Slavic-Orthodox translation of the term?) But I am hoping that the Habr reader either already knows and uses the useragent or does not need it, so I would rather not dwell on the preface. And so, my advice: use the useragent with regular expressions!


Of course, %username%, you probably use expressions regularly anyway, but those are a different kind, for the soul; I am talking about regex. One of the main tasks in my work is to correctly identify the capabilities of the end user's device and browser. Since we focus on mobile devices (cell phones), I will use them as the example. Unlike users of ordinary computers, users of mobile devices are severely limited in screen resolution, browser capabilities, and so on. We have a small database that is compiled and automatically updated from UAProf and WURFL. But agent headers (useragent headers) keep changing, and the number of variants keeps growing. Looking up a device by matching the agent one-to-one is out of the question, yet we still have to find it somehow. So we set out to understand how the useragent is built and what can be squeezed out of it.

Ingredients


Standards and format: as usual, nobody observes them. The useragent format varies from manufacturer to manufacturer and from series to series. In addition, most mobile operators like to rewrite headers.
The main blocks are supposed to be:
device/version browser/version (supported standards and technologies).
The first example, sonyericssonk530i/r6bc browser/netfront/3.3 profile/midp-2.0 configuration/cldc-1.1, tells us that brackets cannot be counted on, and the second example, mozilla/5.0 (symbianos/9.4; u; series60/5.0 nokia5800d-1/21.0.025; profile/midp-2.1 configuration/cldc-1.1) applewebkit/413 (khtml, like gecko) safari/413, gently hints that nobody is going to follow the order either. But it is still important for me to know when different agents come from the same device, for example the nokia n95:


Recipe


However, as you can see, some kind of logic does exist. After the slash (/) comes the version, a dynamic part that does not play a special role. The browser is always indicated. Tokens are separated by a space and/or a semicolon. Digging through the logs, we found a lot of garbage in the agent headers, so normalizing the string and picking out segments was the first step. Here is the toolkit we ended up with:
  1. Pick out what is actually the useragent: ([[(]?[a-z0-9._+;]\s?[/\-;:\\,*\s]*[)\]]?\s?)*
  2. Define the browser token: ((iemobile|kbrowser)\s[0-9.]+)|((up(\.link)?|netfront|obigo|opera\s?(mini|mobile)?|deckit|safari|(apple)?webkit|mozilla|openwave)/[0-9\.a-z\-]+\+?)|(browser/[a-z\-0-9]+/?[0-9\.a-z\-]+)|([a-z\.-]+browser[a-z\.-]*(/[0-9\.a-z\-]+)?)
  3. Define the profile and configuration: (((profile|configuration|java(platform)?)/[a-z]+-?)|((cldc|midp|wap)[\s\-]?))[0-9\.\-a-z]+
  4. Language: ((?<=[\s;\[\(])[a-z]{2}[\s-][a-z]{2}(?=[\s;\]\)]))|\[([a-z]{2,3}[\-_\s]?)+\]
  5. Version: [\s;/]+(v(er)?[\s.]*)?[0-9]+\.[0-9\.]+([a-z]{1,2}[0-9\.]*)?
  6. Sometimes the screen size in pixels is indicated: [0-9]{3}x[0-9]{3}
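
As a rough illustration, here is a minimal Python sketch that applies patterns 2, 3 and 6 to a lowercased useragent. It is not the code we ran: the function name `segment` and the returned dictionary are hypothetical, and the character classes are written with explicit `a-z` ranges and an escaped hyphen in the profile pattern, which is an assumption about the intended form.

```python
import re

# Pattern 2 (browser token); character classes use explicit a-z ranges (assumption).
BROWSER = re.compile(
    r"((iemobile|kbrowser)\s[0-9.]+)"
    r"|((up(\.link)?|netfront|obigo|opera\s?(mini|mobile)?|deckit|safari"
    r"|(apple)?webkit|mozilla|openwave)/[0-9\.a-z\-]+\+?)"
    r"|(browser/[a-z\-0-9]+/?[0-9\.a-z\-]+)"
    r"|([a-z\.-]+browser[a-z\.-]*(/[0-9\.a-z\-]+)?)"
)

# Pattern 3 (profile and configuration); the hyphen is escaped (assumption).
PROFILE = re.compile(
    r"(((profile|configuration|java(platform)?)/[a-z]+-?)"
    r"|((cldc|midp|wap)[\s\-]?))[0-9\.\-a-z]+"
)

# Pattern 6 (screen size in pixels).
SCREEN = re.compile(r"[0-9]{3}x[0-9]{3}")


def segment(useragent):
    """Return the browser, profile and screen segments found in a useragent."""
    ua = useragent.lower()  # the patterns only cover lowercase letters
    screen = SCREEN.search(ua)
    return {
        "browser": [m.group(0) for m in BROWSER.finditer(ua)],
        "profile": [m.group(0) for m in PROFILE.finditer(ua)],
        "screen": screen.group(0) if screen else None,
    }


result = segment(
    "sonyericssonk530i/r6bc browser/netfront/3.3 "
    "profile/midp-2.0 configuration/cldc-1.1"
)
print(result["browser"])  # ['browser/netfront/3.3']
print(result["profile"])  # ['profile/midp-2.0', 'configuration/cldc-1.1']
```

`finditer` is used instead of `findall` because the patterns contain many groups, and `findall` would return tuples of groups rather than the whole match.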

Naturally, we did not get a hundred percent result, but a run over 30,000 useragents showed that the segments were extracted correctly in 97% of cases. So the result is quite decent. But it was not enough for us. Some things have to be checked against the database, and there we face the same variety of models and agents. A simple and intuitive idea came up: search by model. That is, even though there are more than a dozen different useragents for the same Nokia N95, nokian95 is present in every one of them. The task would be trivial if we only had to detect one particular model (say, iPhone or not). An if-else would be enough for that. Life is more complicated, and there is simply no universal standard for how a model is written.

Dessert


We approached it from the opposite direction: clean the useragent of the tokens we have already learned to recognize.
Using the same expressions (with slight changes), I erase the useragent blocks one by one (pseudo-code: while useragent ismatch replace match with string.empty). What remains is a set of pieces I do not know in advance, some of which are garbage and one of which is the model. The simplest solution was to split the remainder into separate tokens with Split(' ', '/', ';') and look for the token that contains the manufacturer. We check which part contains one of the following strings:

"nokia", "motorola", "mot-", "moto-", "motorazr", "sonyericsson", "samsung", "sec-", "sgh-", "lg-", "lge", "lg", "sie-", "siemens", "ipod", "iphone", "ipaq", "spv", "i-mate", "mobilephone", "htc", "vodafone", "palm", "rover", "gigabyte", "asus", "alcatel", "mitsu", "verizon", "apple".


Now, out of the various long N95 useragents mentioned above, I am left with only nokian95 and nokian95_8gb, respectively. Here are some more examples of complete useragents and cleanup results:
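
The cleanup can be approximated with the Python sketch below. It is not the original implementation (the pseudo-code hints at .NET); the function name `extract_model` is hypothetical, only the browser, profile and version patterns are used, and their character classes are written with explicit `a-z` ranges as an assumption about the intended form. The two useragents in the usage lines are the ones quoted in the Ingredients section.

```python
import re

# Patterns for tokens we already know how to recognize (browser, profile, version).
KNOWN_TOKENS = [
    re.compile(
        r"((iemobile|kbrowser)\s[0-9.]+)"
        r"|((up(\.link)?|netfront|obigo|opera\s?(mini|mobile)?|deckit|safari"
        r"|(apple)?webkit|mozilla|openwave)/[0-9\.a-z\-]+\+?)"
        r"|(browser/[a-z\-0-9]+/?[0-9\.a-z\-]+)"
        r"|([a-z\.-]+browser[a-z\.-]*(/[0-9\.a-z\-]+)?)"
    ),
    re.compile(
        r"(((profile|configuration|java(platform)?)/[a-z]+-?)"
        r"|((cldc|midp|wap)[\s\-]?))[0-9\.\-a-z]+"
    ),
    re.compile(r"[\s;/]+(v(er)?[\s.]*)?[0-9]+\.[0-9\.]+([a-z]{1,2}[0-9\.]*)?"),
]

# Manufacturer markers, taken from the list in the text.
MANUFACTURERS = (
    "nokia", "motorola", "mot-", "moto-", "motorazr", "sonyericsson",
    "samsung", "sec-", "sgh-", "lg-", "lge", "lg", "sie-", "siemens",
    "ipod", "iphone", "ipaq", "spv", "i-mate", "mobilephone", "htc",
    "vodafone", "palm", "rover", "gigabyte", "asus", "alcatel", "mitsu",
    "verizon", "apple",
)


def extract_model(useragent):
    """Erase known tokens, then return the first leftover token naming a manufacturer."""
    ua = useragent.lower()
    for pattern in KNOWN_TOKENS:
        # "while useragent ismatch replace match with string.empty"
        while pattern.search(ua):
            ua = pattern.sub("", ua)
    # Equivalent of Split(' ', '/', ';')
    for token in re.split(r"[ /;]+", ua):
        if any(name in token for name in MANUFACTURERS):
            return token
    return None


print(extract_model(
    "sonyericssonk530i/r6bc browser/netfront/3.3 "
    "profile/midp-2.0 configuration/cldc-1.1"
))  # sonyericssonk530i
print(extract_model(
    "mozilla/5.0 (symbianos/9.4; u; series60/5.0 nokia5800d-1/21.0.025; "
    "profile/midp-2.1 configuration/cldc-1.1) applewebkit/413 "
    "(khtml, like gecko) safari/413"
))  # nokia5800d-1
```

The order of the patterns matters in this sketch: the version pattern runs last, otherwise it would eat the version out of a browser token before the browser pattern had a chance to remove the whole token.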


On the road


In addition to the browser, you may be interested in the WAP token (in short, WAP 1.0 = WML, WAP 2.0 = XHTML). The mmp (multimedia mobile processor) version should indicate support for audio/video codecs: 1.0 means mp3 audio only, while 2.0 adds 3gp video. The operating system and its version are most relevant for the iPhone: ip(hone|od).*?os\s*(v(er(sion)?)?)?[\s.]*([0-9._]+|[a-z]+)
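
A minimal Python sketch of how the iPhone expression could be used. The function name `iphone_os_version`, the conversion of underscores to dots, and the sample useragent (a typical iPhone string, not taken from our logs) are assumptions made for illustration; the character class is written as `a-z`.

```python
import re

# iPhone/iPod OS pattern from the text; group 5 holds the version or name.
IPHONE_OS = re.compile(
    r"ip(hone|od).*?os\s*(v(er(sion)?)?)?[\s.]*([0-9._]+|[a-z]+)"
)


def iphone_os_version(useragent):
    """Return the OS version of an iPhone/iPod useragent, or None if absent."""
    match = IPHONE_OS.search(useragent.lower())
    if match is None:
        return None
    # Versions are often written with underscores (3_1_2); normalize to dots.
    return match.group(5).replace("_", ".")


sample = "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_1_2 like Mac OS X; en-us)"
print(iphone_os_version(sample))  # 3.1.2
```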

Enjoy your meal


Checking against the database and fine-tuning brought the result to 99%. This is certainly obvious overfitting, but that was one of the goals (maximum accuracy for a particular audience and region). By the way, the regexes above are more abstract and, because of their universality, should give a higher error rate.

Source: https://habr.com/ru/post/80038/

