Parsing HTML in the browser to change block layout

Consider the task of parsing HTML on the client (in JavaScript), then applying our own styles and layout to the extracted data and displaying it in the right places on the page being viewed. Such loaders of pages and blocks are found in userscripts, where the developers of the loader have no connection to the developers of the site. But there is also reason to use the approach on ordinary sites, to separate the View completely from the Model.

This article turned out to be theoretical: given its length, I did not want to overload it with practical results. It is also hard to lay out steps by which anyone could pick up the ideas and start building something similar. Ideas have to be sown first, but earlier attempts to plant them here on Habr did not sprout, although I did not try very hard in that direction. The approach has evolved over the past six months and was even announced on Habr around April-May. The article explains how to do it and lists the benefits of the approach. It requires deep and specific JS programming. Judging by the results so far, it most likely makes sense to extract a library for similar tasks.

Introduction

When we parse pages and build new data from them inside a single HTML block, several algorithmic tasks intertwine, and a separate design pattern emerges from them. Its goal can be described as creating a dynamic template: some or all of its parts may be missing when it is constructed, but as data is added, a partially or fully assembled block appears, with its elements in the correct order and the necessary styles applied.

Example: a page already contains the announcement of an article, and the full article is loaded so that the missing parts are added next to the title. This case ran constantly in HabrAjax: it loaded full articles into the feed of announcements and into other feeds, and it was the main and most tangled part of the script (it worked on several types of pages). Second example: only the title and link of an article are present; we load the whole article and lay it out alongside, in a layout different from the original. Later we can change layouts and styles by setting a parameter, on the block or on the whole page, that selects the desired presentation.

The second example began as a wish to add a feature to a working script in which all the roles of the pattern were mixed together. Something even started to work, but with many flaws: the article had no title, tags, author or other attributes. And the old script began to fall apart: the part that handled questions and answers stopped coping with the changes. After a day of wrestling with spaghetti code it became clear that it was better to rewrite everything with a systematic approach. That is how the new code and the pattern came about.

Structure

Our handler of loadable articles has the following structure.

* AJAX article loader. The input is an address; the output is the HTML of the page with the article, plus a lot of data we do not need. That data may prove useful later (various informers and dynamically changing links, for example), so in the future the parser can extract more information from loaded pages.

* Page parser. In its current form it extracts the meaningful data blocks: the article, author, title, date, comments and article score. It can either keep the blocks' classes or strip them, so that the data is loaded into its own blocks with different presentation styles. In the long run, binding to the classes of the existing page is actually harmful: the page authors can rename a class at any time (date to published, published to current_date), and if those names are used in our styles, the styles have to change too. If we discard the classes of the original page, all changes stay inside the parser. Tracking changes is also the parser's job.

* Style loader. While the page display script has a single style, a loader is not needed: you can simply define the styles in CSS or load them once by script. But if you want to change styles dynamically, a loader makes it easy. The styles live in the DOM as lists of rules; one list is removed and another is loaded (by the standard method).

* Layout placement iterator. Depending on the state of the page being viewed, the iterator places layout blocks, filled with new data from the parser, in the right places on the page. Part of the data can be hidden and shown on a user's click; the display states are managed by the iterator (because it is, by definition, an iterator).

With responsibilities distributed this way, the loading system becomes manageable and less tangled than when it was written, as before, in one piece with no separation of subsystem roles. This is the essence of our newly invented block loading pattern.
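The division of roles can be sketched in a few lines. This is only an illustration under assumptions: the names loadBlock, fetchPage, parsePage, applyStyles and placeBlock are invented here and are not taken from the actual script. The point is that the four parts touch each other only through the data passed along the chain.

```javascript
// Hypothetical sketch: the four roles connected only by data flow.
// All function and property names are illustrative assumptions.
function loadBlock(url, roles, view, done) {
  // 1. AJAX loader: address in, raw HTML out (delivered to a callback).
  roles.fetchPage(url, function (html) {
    // 2. Parser: raw HTML in, plain data object out (null if not recognized).
    var data = roles.parsePage(html);
    if (data === null) {
      done({ ok: false, reason: "page not recognized" });
      return;
    }
    // 3. Style loader: activate the rule list for the chosen view.
    roles.applyStyles(view);
    // 4. Placement iterator: put the filled block in the right place.
    roles.placeBlock(data, view);
    done({ ok: true, data: data });
  });
}
```

Because every role is passed in as a function, any one of them can be replaced (for example, by a stub that returns a static test page) without touching the others.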

Order of operation and hooking in

For this nicely assembled system to start working, something has to launch it. The launch points are links and active areas (buttons) on an existing page. This can be either an ordinary site page adapted to launch the loader, or our own previously generated page that has similar buttons and launch links.

Launching from an ordinary page consists of intercepting user clicks. Instead of, for example, loading a new page with the article, a click on a link (title, cut, comments) activates the AJAX loader. Control then passes along the chain, and as a result a block with the loaded article and comments is formed next to the link or button.

Everything works until the site's authors change the layout of the original page. If the layout, classes or other things our script binds to have changed, it simply will not find its launch points and will not attach its handlers.

To control the article listing page better, it should also be processed in advance by the same parser, so that the site's layout and styles are not displayed at all. The task of launching then moves entirely into the parser: if it recognizes the start page (with the list of article titles), the user works from the outset with your page, built from a template entirely on client-side styles. Only one alarm can then come from outside: the page was not recognized.

If the site's layout has changed so much that the parser cannot find its control points, it reports this to the handler, which may decide, for example, to show the unrecognized page in a new window. Or in the existing window: the main thing is that the data is displayed, even if not under the system's control. The parser's developers, in turn, receive a signal that it needs to be updated for the changed conditions.

Practice shows that deep changes to a site's layout may not happen for years, while style changes occur regularly, once every two to several months. If you allow site styles to be displayed mixed with your own (user styles), you will have to fix display flaws regularly, with every change in the site's styles. If everything is shown entirely in your own styles, passed through the handler, a buffer is created for a clean display of data, one that can break only in the rare emergency of a complete change of layout and design. And a well-developed parser can handle even those cases correctly, by showing the source pages or by reporting small data failures in the usual places.

Ideally, the parser should also monitor all non-essential parts of the page in order to detect unusual changes (new blocks, informers, advertisements), and all innovations should be reported to the developers. For dynamically loaded blocks (both advertisements and useful parts), the parser should work on a hidden copy of the live source page. This mode is not always possible (a live page can break out of its frame into the parent window) and not always needed: it serves to periodically monitor unusual innovations on the site and is mostly of interest to developers. Fortunately, floating banners (a typical example of loadable blocks) almost never carry meaningful information related to the site's pages, although one can imagine the site's authors occasionally announcing their own promotions this way.
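The “page not recognized” signal could look roughly like this. It is a sketch under assumptions: the list of control points, the function names and the actions object are all hypothetical, and a real parser would check far more than two substrings.

```javascript
// Hypothetical control points: fragments the parser expects in a known page.
var CONTROL_POINTS = ['<div class="content', '<div class="infopanel"'];

// Report which control points are absent from the loaded HTML.
function checkControlPoints(html, points) {
  var missing = points.filter(function (p) {
    return html.indexOf(p) === -1;
  });
  return { recognized: missing.length === 0, missing: missing };
}

// Render through the system if possible; otherwise show the original page
// outside the system's control and tell the parser's developers what broke.
function handlePage(html, url, actions) {
  var report = checkControlPoints(html, CONTROL_POINTS);
  if (report.recognized) {
    actions.render(html);
  } else {
    actions.openOriginal(url);
    actions.notifyDevelopers(url, report.missing);
  }
  return report;
}
```

The list of missing points is passed along on purpose: it tells the developer which part of the parsing template needs attention.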

Application area

In this form, the handler of pages and blocks is built to work in userscripts, where the developers of the handler are not connected with the developers of the site and therefore learn about frontend changes only after the fact. Does it make sense to use the loader not as a userscript but as a script of the site itself?

By now the question looks rather rhetorical, with a single answer. The loader has at least one user-friendly feature that works just as well outside userscripts: displaying pages in several possible views, with users choosing their viewing style. The original page styles can become extremely sparse, because they are no longer needed for display (except for search engines). All the complexity of display moves into the loader, and for working with the site we (as the site's developers) get a loading buffer. We can now serve data not in pages but in blocks, as in ordinary Ajax, and not worry about presentation, since that is handled within the loader. As a bonus, the loader gives us a system for monitoring frontend changes that can verify the integrity of the data supplied. If something is forgotten at the block level, the loader will signal its absence while the pages are being tested.

Finally, it is possible to set up a test presentation of pages, in which the loader displays not only user data but also all the service data sent to the frontend: hidden blocks and hidden parameters that will come into play at some point but are easy to miss or forget in ordinary frontend tests. Selenium can check them too, if tests are written; the only difference is where the tests are written, in the loader or in the test harness.

As a result, we see that the loader can work as a subsystem of a site, making it simpler to change page views and separating presentation from data better than traditional layout and styling do. We can make the data layer much lighter, with no layout at all, if we work directly and only with the AJAX loader. There is, admittedly, nothing new in this: all Ajax sites work this way. What is special is that all the concern for loading is gathered in a single center and all output goes through it. Other data flows would be a violation of the pattern and a deviation from it.

What, then, about MVC synchronization? It can live as usual, independently of the loading pattern, being one of those very “violations”. It would look like this. The loading pattern has created a presentation page. In it, some “Backbone” (used here as a generic name) has created its synchronization unit and established a connection with the backend. So in one particular view, out of several, a View add-on has been created. If we change the view, this add-on has to be transplanted to the new View somehow, or its work restarted. In essence this is a conflict of patterns that has to be taken into account. But it is inevitable, because both patterns work with the view while knowing nothing about each other.

Otherwise, for the ease of changing views and the better separation of layout from data, it is worth trying this pattern as a subsystem of a site.

What about layout designers?

Cases where scripts actively intrude into the frontend always complicate layout work. When JS template engines are used, a smaller part of the developers proudly declares “we do not need layout designers”, while the rest stay silent, understanding that they have added more work for themselves.

To keep layout work simple, the page loading pattern retains the traditional HTML + CSS layout of block templates. A layout designer can work with the blocks and their styles without touching the rest of the JavaScript. If a new presentation of the pages is needed (a new layout, or “pulling on” a new design), we take all the blocks in the project, see which ones belong to the page, and change the templates and styles of each block in turn. All blocks are designed so that their layout does not depend (or hardly depends) on the layout of the outer blocks, except for common controlling CSS classes. Usually it is enough not to include outside classes in the styles and to describe the styles of the block's own outer HTML elements in each block. This is the “concept of independent blocks”, in Yandex's terms.

To see the results, you need a set of test pages. They are built from data blocks similar to the real ones and cover various display test cases. For userscripts, such data will be special test blocks on which the layout designer works out the presentation of the pages. It is convenient to view these blocks on a test domain or by mapping the real domain to a local address (the hosts file on Windows, /etc/hosts on Linux). The blocks themselves are plain static files.

The advantage of test pages for userscripts is that if the site is temporarily unavailable or has changed its design, work with the test pages is not interrupted, and after a design change, examples of your own previous designs and layouts remain available.

For a site script built on the loader pattern, test pages likewise let you keep a collection of designs regardless of the state of the backend. For this, the backend needs to provide (preferably by frontend means) switching to static test pages and their versions.

What if we need a JS template?

If an HTML template is not enough for the parser (for example, the template is needed as JSON because it is modified by settings), the layout designer loses the ability to work in a familiar environment. You can, of course, teach them to work with JSON instead of HTML, but this may well increase the number of layout errors each person makes. The best option (not tested in this application) is probably to write a JSON-HTML template converter in JS, so that work continues in the same familiar environment.
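One direction of such a converter, from a JSON description to HTML text, might be sketched as follows. This is an untested idea rather than part of the script: the node shape ({ tag, attrs, children }, with plain strings for text) and the function names are assumptions made for the example, and void elements such as br or img are not handled.

```javascript
// Escape text so that it is safe inside element content and attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Assumed node shape: { tag, attrs, children } or a plain string for text.
function jsonToHtml(node) {
  if (typeof node === "string") {
    return escapeHtml(node);
  }
  var attrs = node.attrs || {};
  var attrText = Object.keys(attrs).map(function (name) {
    return " " + name + '="' + escapeHtml(attrs[name]) + '"';
  }).join("");
  var inner = (node.children || []).map(function (child) {
    return jsonToHtml(child);
  }).join("");
  return "<" + node.tag + attrText + ">" + inner + "</" + node.tag + ">";
}
```

With this, the settings can modify the JSON tree, while the layout designer still looks at ordinary HTML output in the browser.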

How does it work?

Let's try to describe the principles of the pattern by showing the main parts of the code.

Let's build HTML templates for parsing an article and rewriting it into a new layout: the article's own layout, similar to the site's but not required to match it exactly. The template contains some control attributes, so it is not just a template for outputting a block (here, an article with comments) but also carries a number of user functions.

One template will not be enough, because we need to decouple our template from the site's page. The parsing template mirrors the site's elements and serves to check and extract information, while the display template serves to create the HTML blocks that present the data (here, articles with comments). For simplicity, though, we show an approximate template that works for both. We could even make them a single template, but that violates the approach, since it adds the work of chasing changes on the site.

The display template may therefore sometimes need correction, removing tags that are unnecessary for a given presentation. Alternatively, such blocks can be made invisible with styles, or the data can be hidden in tag attributes. For another layout variant we can build a different template if necessary.

The code (comments are optional and given for clarification):

<div class="post">
  <div class="published"> <!-- --> </div>
  <h1 class="title"> <!-- ( - , )--> </h1>
  <div class="hubs"> <!----> </div>
  <div class="content"> <!-- --> <div class="clear"></div> </div>
  <div class="btnBack" style="display: block;">
    <i>← </i>
    <div class="percent">
      <div class="gPercent"><div style="width:<!--  -->px"></div></div>
      <!-- -->, <i><!-- --></i>
    </div>
  </div>
  <div class="content c2"> <!-- --> <div class="clear"></div> </div>
  <div class="btnBack n2" style="display: block;"> <i>← </i> </div>
  <ul class="tags"> <!----> </ul>
  <div class="infopanel"> <!-- (,, , , ...)-->
    <div class="voting">
      <a href="#plus" class="plus" title=""></a>
      <div style="position: relative;" class="mark"> <a class="score" title=" " href="#">—</a> </div>
      <a href="#minus" class="minus" title=" "></a>
    </div>
    <div class="pageviews" title=""><!-- --></div>
    <div class="favorite"> <a class="add" title="  " href="#"> </a> </div>
    <div class="favs_count" title=" ,    "><!-- --></div>
    <div class="author"> <a title=" "> <!----> </a> <span class="rating" title=" "><!----></span> </div>
    <div class="informative"> <a title=" "></a> </div>
    <div class="showComm btnBack inln">→</div>
    <div class="published"><!----></div>
  </div>
  <div class="showComm btnBack" style="display: block;"> <i>← </i> </div>
  <div class="comments_list"> <h2 class="title"> <!-- --> </h2> <!----> </div>
  <div class="showComm btnBack n2" style="display: block;"> <i>← </i> </div>
</div>
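Filling a display template with parsed data can be very simple. The sketch below assumes, purely for illustration, that each HTML comment in the template holds a slot name such as title or date; the function name fillSlots and that naming convention are not taken from the actual script. Values are inserted verbatim, so they are assumed to be HTML that the parser has already cleaned.

```javascript
// Replace each <!-- name --> comment with data[name], if the parser supplied it.
// Slots without data are left untouched, so a partially filled block stays valid
// and can be completed later when more data arrives.
function fillSlots(template, data) {
  return template.replace(/<!--\s*(\w+)\s*-->/g, function (whole, name) {
    return Object.prototype.hasOwnProperty.call(data, name) ? data[name] : whole;
  });
}
```

Leaving unknown slots in place matches the idea of a dynamic template in which some parts may be missing at construction time.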

Using one such template everywhere, we get rid of the variability of the site's own similar templates for:
* articles,
* questions and answers,
* Sandbox articles,
* articles loaded from search,
* articles loaded from the favorites list.

For example, the question-and-answer format will match the article format. For full correspondence, however, the answers themselves have to be rebuilt; that is a further stage of parsing, and for now they can be left as they are, dependent on the site's current layout.

When does the presentation need logic of its own? For long articles, the article block needs the title, date and author shown in at least 2 places, at the top and at the bottom. Better still is to track the scroll position in long texts and, when the author is not visible, show it in a floating panel inside the window, along with the article title and date. Convenience increases by an order of magnitude, yet sites often do not do this, since it is an extra programming cost. With a built-in presentation template we can implement the necessary logic in script and styles once, for all similar blocks of the site and even for other similar sites. If the article is loaded into a page with several article blocks, the content of the floating panel changes with the context.

Initially, the script places the author and date data in 2 places, before and after the article. If the article is short, smaller than the window, the two fields are not both needed and the script hides one of them. If the article is longer, the script puts the data into a floating panel whenever it is not visible in its main places.

Sometimes in the question-and-answer format there is also a clarification of the question. If we parse the clarification together with the question, no further problems arise, except that we again depend on the site's layout. If we parse it separately, a “Question clarification” block appears in our template, which is always empty except when a question has a clarification. Here we can easily make it invisible, with zero height. If that were not possible, we would have to delete this block or introduce a special qa class for the display rules.
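The rule for the two author/date fields and the floating panel fits in one small function. This is a sketch under assumptions: the function name and its inputs (heights in pixels and two visibility flags that a caller would measure from the DOM) are invented for the example, and which of the two copies is hidden for a short article is an arbitrary choice here.

```javascript
// Decide where the author/date data should be shown.
// articleHeight, viewportHeight: pixel heights measured by the caller.
// topVisible, bottomVisible: whether each fixed copy is currently on screen.
function metaPlacement(articleHeight, viewportHeight, topVisible, bottomVisible) {
  if (articleHeight <= viewportHeight) {
    // Short article: one copy is enough, the second one is hidden.
    return { top: true, bottom: false, floating: false };
  }
  // Long article: keep both copies and float a third one
  // only while neither of them can be seen.
  return { top: true, bottom: true, floating: !topVisible && !bottomVisible };
}
```

A scroll handler would call this on every position change and apply the result through CSS classes, so the logic is written once for all article blocks.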

Parser

However, the synthesis of texts on the client is never complicated and needs little explanation. It is far more interesting to explain how the analysis of pages, that is parsing, works. Libraries for this already exist, but it is interesting to reinvent the wheel and then compare the complexity of the parser core with existing analogues. Provided there is no junk in the core, the remaining quality is determined by the time spent writing it and the depth of analysis of the entities in the task.

What do we need from parsing? In the simplest case, several text objects have to be extracted from the page. If these were our own pages from our own server, it would be better to exchange everything as JSON and transfer all the needed entities (objects). Since we are considering parsing ordinary third-party HTML pages, we have to analyze text (HTML with possible errors, that is, not XHTML).

For example, the old existing parser extracted 2 objects: the text of the article and the list of comments. This was done with 2 regular expressions (in theory one would do), and in more than a year of use the parsing had to be corrected only once, when the site's authors changed the strings the parsing relied on. A little later it became necessary to parse the tags (keywords) of the article as well. Here is the first regular expression, which parses the article and the tags.

 // Group 1 is the article body; group 3 is the inner HTML of the tag list.
 var conte = this.responseText.match(
   /<div class="content html_format">([\s\S]*?)<div class="clear"><\/div>\s+?<\/div>[\s\S]+?(<ul class="tags">|<div class="tags">)\s*([\s\S]*?)\s*(<\/div>|<\/ul>)[\s\S]*?<div class="infopanel"/m
 ); // null if the page layout was not recognized

In this way, with a few passes over the source text, we can extract the several objects we usually need. It is not very fast, but it is usually enough: there is no need to do a full DOM analysis only to take a piece of it afterwards.

If you want to speed up recognition, you can cut the whole text into pieces and then analyze the pieces. The regular-expression method as a whole is old-fashioned and slow, but it requires no advanced libraries and can work for years on the same site.

If you need a higher level of parsing, you will have to work on this part or look for third-party solutions. For example, the next thing wanted here (after pulling the scores and parameters out of the info panel, which is quick) is to parse the comment tree. And so as not to redo the analysis for every site, it would be good to write a general procedure. Then we can either keep the comments compact or arrange them as a tree in our own layout, which is better for presentation.

When the task of complete control over the elements of a block arises (to learn immediately whether something new has appeared in the layout), a complete parser of the tags in the string is needed. For now, we can safely postpone it.
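Cutting the text into pieces first can be done with plain indexOf, leaving only a small regular expression for each piece. A minimal sketch, assuming hypothetical function names and assuming that each tag in the list is a link inside the ul element:

```javascript
// Return the text between two markers, or null if either marker is missing.
function sliceBetween(html, startMarker, endMarker) {
  var start = html.indexOf(startMarker);
  if (start === -1) return null;
  var from = start + startMarker.length;
  var end = html.indexOf(endMarker, from);
  if (end === -1) return null;
  return html.slice(from, end);
}

// Analyze only the small piece that holds the tag list.
function extractTags(html) {
  var piece = sliceBetween(html, '<ul class="tags">', "</ul>");
  if (piece === null) return [];
  var tags = [];
  var re = /<a[^>]*>([^<]*)<\/a>/g;
  var m;
  while ((m = re.exec(piece)) !== null) {
    tags.push(m[1].trim());
  }
  return tags;
}
```

The expensive lazy quantifiers over the whole page disappear; each regular expression now runs over a few hundred characters at most.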
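The general procedure for comments can be split in two: a site-specific step that extracts a flat list of comments, and a site-independent step that builds the tree. The second step might look like the sketch below. The record shape ({ id, parentId, text }) is an assumption for the example; how those fields are extracted from a particular site is not shown.

```javascript
// Build a tree from a flat list of comments.
// Assumed input records: { id, parentId, text }, parentId null for top level.
function buildCommentTree(flat) {
  var byId = {};
  var roots = [];
  flat.forEach(function (c) {
    byId[c.id] = { id: c.id, text: c.text, children: [] };
  });
  flat.forEach(function (c) {
    var node = byId[c.id];
    var parent = c.parentId != null ? byId[c.parentId] : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      // Top-level comment, or a reply whose parent is missing from the page.
      roots.push(node);
    }
  });
  return roots;
}
```

Two passes are used so that a reply may appear in the list before its parent. The resulting tree can be rendered either compactly or fully expanded by the display template.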
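Until a complete tag parser exists, a much cruder check can already reveal that something new has appeared: collect all class names in the block and compare them with the list the parser knows. This is a stand-in under assumptions, not the full parser mentioned above; it looks only at double-quoted class attributes, and the function names are hypothetical.

```javascript
// Collect every class name used in double-quoted class attributes.
function collectClasses(html) {
  var seen = {};
  var re = /\bclass\s*=\s*"([^"]*)"/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    m[1].split(/\s+/).forEach(function (name) {
      if (name) seen[name] = true;
    });
  }
  return Object.keys(seen).sort();
}

// Class names that the parser has never seen before.
function findUnknownClasses(html, knownClasses) {
  return collectClasses(html).filter(function (name) {
    return knownClasses.indexOf(name) === -1;
  });
}
```

A non-empty result does not mean the page is broken, only that the developers should take a look.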

Different templates for parsing and synthesis

When we display our own representation of a block, we solve a problem very similar to parsing. Still, they are different tasks if we respect the rule that our presentation is independent of the original. There will be one template for parsing and another (or several) for building the view. This is a fundamental difference from the “userstyles + a small userscript” approach, which tries to merge the two rather similar templates to the point where the templates need not appear in the code at all: the source template is the block from the site, the final template is the same block slightly modified by scripts, and all visual differences are applied through styles.

With that approach (userstyles) the template is not visible in the code at all. It is, as it were, dissolved in the incoming data: invisibly present and obvious only to the script's author, and only while they work with it constantly. Consequently, any small change in the source (third-party) page code strongly affects everything downstream. The scripts may stop working; to keep them from throwing errors you need defensive data handling that anticipates any errors in the data. Styles get distorted and the page looks broken. Given that there are different browsers, correct display in each of them is also hard to maintain: any change to the original page can break the display anywhere.

If we separate the views used for analysis and synthesis of blocks, the unrelated complexities are separated too. The task becomes structured and the complexity of the solution decreases. Analysis (parsing) errors stay within parsing, and the worst they can cause is the absence of some or all of the data previously extracted from pages. The display of blocks comes to depend only on its own layout and, possibly, on the data it is filled with.

So we should not be surprised to find 2 or more similar templates in the resulting script. One is for parsing and verification (analysis of the site's block); another is for building our own representation of the block. A third and further ones are possible as other views of the same data, each needing its own HTML templates and CSS rules.

Thus the tastes of different groups of readers are satisfied. Previously everyone had to get used to, and “love”, the only available presentation of the site. Now, for those who are used to it, you can recreate an exact or nearly exact copy, or even pull off tricks such as keeping the old design (the site has introduced a progressive new design, and everyone stays on the old version). Those who think differently can use their own alternative designs and layouts, which are no longer easily broken by changes on the original site.

Technically, it makes sense to connect different layouts as modules of the main script. If the main script sees a layout module, it includes it as an alternative or as the main layout, depending on its settings.
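Connecting layouts as modules can be as simple as a registry that the main script consults. The sketch below is an assumption about how this might be arranged: the names layouts, registerLayout and pickLayout, the settings.layout key and the module shape ({ template, css }) are all invented for the example.

```javascript
// Registry of layout modules known to the main script.
var layouts = {};

// A layout module announces itself under a name.
// Assumed module shape: { template: "<html text>", css: "<css text>" }.
function registerLayout(name, layout) {
  layouts[name] = layout;
}

// Use the layout named in the settings if it was registered,
// otherwise fall back to the one registered as "default".
function pickLayout(settings) {
  if (settings.layout && layouts[settings.layout]) {
    return layouts[settings.layout];
  }
  return layouts["default"] || null;
}
```

A module that is not loaded simply never registers itself, so the main script keeps working with the default layout.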

Setting up the block in the case of a userscript

If we have no way to work with the server to debug the script, we use the full capabilities of JavaScript in design and debug mode. Want to put an empty template inside a finished page? Just write a script that inserts the template at the right place in the page's DOM.

 var tpArtic = '......';
 var rotPosts = '...  ...';
 if (rotPosts) ...     ...;

// (The code is intentionally not specified here, so as not to explain the long internal structure of its procedures)

Once you can see the template in the page, fill it with test content by hand and style it so that it does not depend on the styles of the surrounding page. Then use the tuned template and styles for their intended purpose.

If styles and templates are meant to be replaced by others, have the script insert and delete CSS rules according to its settings.
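Replacing one list of CSS rules with another can be sketched with the standard DOM calls getElementById, createElement, appendChild and removeChild. The function name swapStyles and the idea of keeping each view's rules in a single style element with a fixed id are assumptions for the example. The document is passed in as a parameter so that the function can also be tried against a stub.

```javascript
// Remove the previous rule list (if any) and insert a new one.
// doc is the page's document object; id names the style element we own.
function swapStyles(doc, id, cssText) {
  var old = doc.getElementById(id);
  if (old && old.parentNode) {
    old.parentNode.removeChild(old);
  }
  var style = doc.createElement("style");
  style.id = id;
  style.textContent = cssText;
  doc.head.appendChild(style);
  return style;
}
```

In a userscript this would be called as swapStyles(document, "view-style", css) whenever the user picks another presentation.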

Active blocks on the news site

It turns out that the main thing we want to display on news sites is described by a single template, which can be called “article with comments”. There are also user data and special site settings, but those templates are much less interesting to look at. So, having made at least one template of this kind, we can read through it any news and articles that fit it, and these prevail on about 90% of sites. A. Lebedev's grim prophecy that “sites will lose face” is coming true: thousands of designers will start to lose their bread, and their only hope lies with pensioners who have not yet learned to parse news with scripts. Unless you specifically provide a “News source” field in the template, the news itself will lose its face, or more precisely the part of it that shows who relayed it.

But as long as we stay on one site, this horrible apocalyptic scenario is not yet in effect, and you can safely distort the designer's pure vision of the source site: the user knows which site they are on.

Hidden powers

Having built a parser-synthesizer of blocks, we gain another source of power for our own client application. We have got rid of the monolithic layout that is fed to ordinary visitors of the site. Usually only ad-cutting tools interfere with a site's layout. Here we obtain the structure of the useful blocks in detail. We can save it in machine-readable JSON, send it to an archive server and serve it from there while the main server is down. In general, we get everything that ordinary RSS parsers, or ordinary (though usually server-side) page parsers, do. As for the site's advertising, here we of course touch on a different matter: not the reading of information but the site's model of earning from displaying ads. Rebuilding the model can resolve such issues. This is most likely a matter for the not-too-distant future, when advertising on plain pages fares even worse. For now we will not invent or defend a monetization model, but just think about information whose supply from a single seeder (the source site) can be interrupted.

Here numerous hard-working copy-pasters come to the rescue, who are themselves not averse to profiting from third-party advertising while showing fresh, useful content in the center of the page. So do search engine caches, which keep the useful content of a page for 1-2 weeks, refreshing it at intervals of hours. They therefore remember not the very latest state of the page before it became inaccessible (the site went down, the page was deleted, rewritten or cut) and, of course, always see the site through the eyes of a guest, an unauthorized user.

We, by contrast, immediately have a copy of the page in the very latest form the seeder let us see, and we are ready to share it with users who did not manage to view it. The result is something like a peer-to-peer network, with the one difference that browser clients cannot distribute content directly to other peers. They need one or more servers that act as backup seeders, significantly reducing the load on the original seeder.

While the original seeder works, the backup seeders simply accumulate content (without layout or advertising). Their role begins in the rare moments when the original seeder stops working. This is the hidden power that lies in the mechanism of client-side parsing of source pages. For it to start showing, the model must of course be refined and the useful data transferred to the backup server.
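What goes to the backup server is just the parser's output, serialized. A minimal sketch of such a record follows; the field names and the function name are assumptions made for the example, and the sending itself (as well as the server side) is deliberately left out.

```javascript
// Serialize the parsed blocks of one page into a JSON record for the archive.
// parsed is assumed to hold the parser's output: title, author, content, comments.
function toArchiveRecord(url, parsed, savedAt) {
  return JSON.stringify({
    url: url,
    savedAt: savedAt,          // when this client last saw the page
    title: parsed.title,
    author: parsed.author,
    content: parsed.content,   // article body without site layout or ads
    comments: parsed.comments || []
  });
}
```

Since the record carries the time at which the page was seen, the backup server can keep the freshest copy among those sent by different clients.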

Parsing your own page

Imagine that we have already loaded the page we are on and want to parse it. To rewrite it in our own layout, or to extract data so as to build a complete view of our own of the site's pages, we do not need to download the page again (by the URL of the current page): it is enough to use document.body.innerHTML, or the innerHTML of a deeper content part of the page. Just remember that almost all browsers normalize innerHTML as they see fit: they reorder tag attributes and sometimes add attributes of their own. The result is still text readable enough for the parser, which we already learned to parse in the previous step.

With a mechanism for parsing our own page, we can show it in our own presentation (design, layout) right away, provided that styles initially hide all the page content in the site's layout. Such a move can sometimes be useful, depending on how deeply your own display style penetrates the site's pages.
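Because innerHTML may reorder attributes, add new ones or change the case of tag names, patterns written for the raw server response should not rely on the exact text of an opening tag. One hedged way to cope, with hypothetical function names, is to find opening tags first and test their class attribute separately:

```javascript
// True if the opening tag has className among its (double-quoted) classes.
function hasClass(openTag, className) {
  var m = /\bclass\s*=\s*"([^"]*)"/.exec(openTag);
  return m !== null && m[1].split(/\s+/).indexOf(className) !== -1;
}

// Find opening tags of the given element that carry the given class,
// regardless of attribute order, extra attributes or tag-name case.
// tagName is assumed to be a plain word such as "div".
function findOpenTags(html, tagName, className) {
  var re = new RegExp("<" + tagName + "\\b[^>]*>", "gi");
  return (html.match(re) || []).filter(function (tag) {
    return hasClass(tag, className);
  });
}
```

The sketch breaks if an attribute value contains the character ">", which is one more reason to keep such checks inside the parser, where they can be corrected in one place.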

Results

This template and the way it is filled now make it possible to place a copy of the “article with comments” block anywhere, even in an empty area of the page. Previously copies were placed only in 5 cases where a block placeholder existed (in place of announcements); now articles can also be loaded from the footer, under 1 of its 3 articles. This needs some preparation of the code, of course: placing the button that activates loading and specifying where the article expands. It was possible with the old scripts too, but they were harder to read, less suited to development, unstructured, and would have required more preparatory code.

Finally, this code, which we have called the block loading pattern, can be used elsewhere, as described in the first half of the article. It cannot be taken as a ready procedure; it is literally a pattern, that is, a template for implementing similar solutions. It is somewhat like a very complex snippet: a piece of code that can be carried over and adapted to another place where the same set of tasks arises.

The prototype of this code with the loading pattern occupied 350 lines including comments, and already contained an attempt, made a year earlier, to build in an iterator over the states of the article block. In effect the iterator was “dissolved” in the code: it tracked 8 states and applied rules for opening the block depending on the current state. For example, when loading questions and answers it showed the detailed question and the answers, and when loading an article or its comments it showed either the article or the comments, depending on where the user clicked.

Now it takes more lines of code, but the functionality has grown and the code has become transparent (to the author) and manageable. Part of the data now lives in the template, whereas before it was generated from JavaScript code.

More is expected from the change of approach as a whole: instead of fiddly coding and tilting at the windmills of assorted page states, we get a different system and free our way of thinking from that of the source site's layout designers.
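An iterator that is not dissolved in the code can be reduced to two tables: which state follows which click, and which parts of the block are visible in each state. The sketch below uses four invented states and three invented click names purely for illustration; the prototype's actual 8 states are not reproduced here.

```javascript
// Hypothetical transition table: state -> click target -> next state.
var TRANSITIONS = {
  collapsed:          { title: "article", comments: "comments" },
  article:            { comments: "articleAndComments", back: "collapsed" },
  comments:           { title: "articleAndComments", back: "collapsed" },
  articleAndComments: { back: "collapsed" }
};

// Which parts of the block are shown in each state.
var VISIBLE = {
  collapsed:          { article: false, comments: false },
  article:            { article: true,  comments: false },
  comments:           { article: false, comments: true },
  articleAndComments: { article: true,  comments: true }
};

// Clicks that have no rule in the current state leave it unchanged.
function nextState(state, click) {
  var row = TRANSITIONS[state] || {};
  return row[click] || state;
}

function visibleParts(state) {
  return VISIBLE[state] || VISIBLE.collapsed;
}
```

Adding a new state then means adding a row to each table instead of hunting for conditions scattered through the handlers.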

Source: https://habr.com/ru/post/165139/
