
The State Duma launched an open API to search for bills

This article is the second in a series about new features of the State Duma website (Article 1).

The concept of open government is currently gaining popularity. For example, data.gov publishes large amounts of US government data, and similar UK material is published on data.gov.uk. An important aspect of publishing structured information is the ability to retrieve it in machine-readable form. Of course, an HTML table can be parsed quite successfully, but providing information in a form convenient for integration with external systems is an important indicator of openness. The development of an API for the bill search system has therefore become an important stage in implementing the “open government” concept within the State Duma website. Data on bills can now be easily integrated into external information systems. For example, an analytical portal can place a widget next to an article about a bill that reflects up-to-date information on its progress.

API features


In essence, the API implements a search for bills with a wide variety of parameters. In addition, it provides access to various reference directories and to the transcripts for each bill. The list of requests and their parameters can be found in the documentation.
Consider a few examples of search queries processed by the API:

Search results for these queries can be viewed here, and sample PHP source code is available in the documentation.

Interaction with the API follows the REST architectural style: the HTTP protocol is used, and responses are returned in XML, JSON (including JSONP) and RSS formats. The wide range of supported formats allows the API to be used in various situations. It can be accessed directly from the browser's JavaScript, requests can be sent from the server side of an application or even from a desktop client, and you can subscribe to a specific search query via RSS to follow the latest information directly.
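
For illustration, here is a minimal sketch of calling such an API from Python and reading the JSON response. The endpoint URL, parameter names and response fields below are assumptions made for the example; the real ones are listed in the documentation.

 # Hypothetical example of querying the bill search API; the URL,
 # parameters and response fields are illustrative, not the real ones.
 import requests

 BASE_URL = "http://api.duma.gov.ru/search.json"  # assumed endpoint

 params = {
     "name": "education",  # assumed full-text filter on the bill name
     "sort": "date",       # assumed sort order
     "page": 1,            # page number; most users never go past page 3
     "limit": 20,          # page size
 }

 response = requests.get(BASE_URL, params=params, timeout=10)
 response.raise_for_status()

 for bill in response.json().get("laws", []):  # assumed response field
     print(bill.get("number"), bill.get("name"))

The same search could presumably be consumed as a feed by requesting the RSS representation instead of JSON, which is how subscribing to a query from a feed reader would work.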

Statistics


Since the launch of the new State Duma website, more than 70 thousand people have used the bill search system. With the introduction of the API, the search promises to be in even greater demand, so it was important that it not only worked but worked quickly. Before starting the optimization itself, statistics were collected on the queries users make through the existing bill search form.

First, the frequency of use of the search parameters was analyzed (the percentages sum to more than 100, since several parameters can be used in one query).



As you can see, the distribution is very uneven, and it is immediately clear which parameters the data should be indexed by first. Of course, after the publication of the API the distribution of requests may differ somewhat from that of the site's bill search form, but it provides a good guide.

Now let's see how users make use of pagination.



The graph clearly shows that users rarely page far into the results: more than 97% of requests fall on the first 3 pages. Optimizing the retrieval of high-numbered pages could therefore be neglected.

Next, the distribution of queries by sort order was examined.



As you can see, very few people change the default sorting, so this sort order should be the first to be served by an index.

One method of optimizing performance is caching the results of search queries. For caching to make sense, queries must repeat often enough. The graph below shows how the number of queries with identical search parameters was distributed.



It can be seen that requests repeat often enough, so caching is justified.

To round out the statistics, here are the top 10 search queries.



One of the hottest topics at the time the statistics were collected was the law on education.

Optimization


Along with developing the API, there was also the task of migrating the bills database from Oracle to PostgreSQL (the data was imported using ora2pg). The following optimization methods were applied.

PostgreSQL's built-in full-text search with GIN indexes is used.

Indexes were built on the fields available for sorting and on those most frequently used for filtering (a sketch of both kinds of index follows below).
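
To make these two points concrete, here is a minimal sketch of creating such indexes from Python; the table and column names (bill, name, introduction_date, status_id) are assumptions made only for the example.

 # Sketch of the indexes described above; "bill" and its columns
 # are hypothetical names used only for illustration.
 import psycopg2

 conn = psycopg2.connect("dbname=duma")
 with conn, conn.cursor() as cur:
     # GIN index backing the built-in full-text search.
     cur.execute("""
         CREATE INDEX bill_fts_idx ON bill
             USING gin (to_tsvector('russian', name))
     """)
     # B-tree indexes on the sort field and the most frequent filter.
     cur.execute("CREATE INDEX bill_date_idx ON bill (introduction_date)")
     cur.execute("CREATE INDEX bill_status_idx ON bill (status_id)")
 conn.close()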

Previously, under Oracle, paged navigation was implemented by imposing a restriction on the rownum virtual column, which reflects the row number in the result set. That is, to show, for example, the first 20 rows returned by some query, the following construct was used.

 SELECT 
   x.*
 FROM 
   (main_query) x
 WHERE
   rownum <= 20


This was, generally speaking, not the most correct approach, but in the version used (Oracle 9i) it worked fine. When it was necessary to show, for example, 20 rows starting from row 100, the following construction was used: the inner query selects the first 120 rows, and the outer one keeps only the last 20 of them.

 SELECT 
   * 
 FROM (
   SELECT 
     rownum AS xrownum, 
     x.*
   FROM 
     (main_query) x
   WHERE
     rownum <= 120)
 WHERE
   xrownum > 100;


With the transition to PostgreSQL, the LIMIT and OFFSET constructs came into use. Moreover, for optimization purposes, LIMIT and OFFSET were pushed as deep as possible into the table joins. Consider an example. Suppose tables a and b are joined, and sorting is done on an indexed field. If LIMIT and OFFSET are applied to the entire query, the following query plan results.

 SELECT
   *
 FROM
     b
   JOIN
     a
   ON
     b.a_id = a.id
 ORDER BY
   b.value,
   b.id
 LIMIT 20
 OFFSET 100;


                                           QUERY PLAN
 ------------------------------------------------------------------------------------------------
  Limit  (cost=52.30..62.76 rows=20 width=39)
    ->  Nested Loop  (cost=0.00..522980.13 rows=1000000 width=39)
          ->  Index Scan using b_value_id_idx on b  (cost=0.00..242736.13 rows=1000000 width=27)
          ->  Index Scan using a_pkey on a  (cost=0.00..0.27 rows=1 width=12)
                Index Cond: (a.id = b.a_id)


That is, the b_value_id_idx index is scanned, and for each row found in table b, the matching row of table a is fetched via the a_pkey index. The first 100 rows found this way are skipped and the next 20 are returned. Now let's see what happens if LIMIT and OFFSET are applied exclusively to table b. More details on organizing page navigation can be found here.

 SELECT
   *
 FROM
   (
     SELECT
       *
     FROM
       b
     ORDER BY
        b.value,
       b.id
     LIMIT 20
     OFFSET 100
   ) b
   JOIN
     a
   ON
     b.a_id = a.id;


                                              QUERY PLAN
 -----------------------------------------------------------------------------------------------------
  Hash Join  (cost=29.44..49.39 rows=20 width=60)
    Hash Cond: (a.id = public.b.a_id)
    ->  Seq Scan on a  (cost=0.00..16.00 rows=1000 width=12)
    ->  Hash  (cost=29.19..29.19 rows=20 width=48)
          ->  Limit  (cost=24.16..28.99 rows=20 width=27)
                ->  Index Scan using b_value_id_idx on b  (cost=0.00..241620.14 rows=1000000 width=27)


As you can see, the join with table a is now performed only after the necessary rows of table b have been selected, which saves work. And if not 2 but, say, 10 tables are joined this way, the effect can be very significant. In principle, the planner has all the information needed to carry out this optimization itself, given that the b.a_id field is NOT NULL and a.id is the primary key, but it does not know how to do so; perhaps the situation will change in future versions. So you can adopt the rule of placing LIMIT and OFFSET as “deep” as possible: it won't make things any worse.

Caching of query results is organized using memcached. Each bill search request is characterized by:

There were two types of cache:

Thus, the data is stored in the cache compactly, which increases the number of requests that can be cached, although the cost of processing them grows slightly, since part of the data is loaded from the main database.
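
The exact cache layout is not spelled out above, but a minimal sketch of such a scheme, under the assumption that the compact cache stores only the IDs of matching bills keyed by the normalized search parameters, might look like this:

 # Sketch of the caching scheme; the key layout and the assumption
 # that only bill IDs are cached are illustrative, not the real design.
 import hashlib
 import json
 import memcache

 mc = memcache.Client(["127.0.0.1:11211"])

 def search_bills(params, fetch_ids, fetch_details, ttl=300):
     # Normalize the parameters so identical searches map to one key.
     key = "bills:" + hashlib.md5(
         json.dumps(params, sort_keys=True).encode("utf-8")
     ).hexdigest()

     ids = mc.get(key)
     if ids is None:
         ids = fetch_ids(params)      # run the real search in PostgreSQL
         mc.set(key, ids, time=ttl)   # cache only the compact ID list

     # Details are always loaded from the main database, which is the
     # extra per-hit cost mentioned above.
     return fetch_details(ids)

Storing only IDs keeps each entry small, so many more distinct queries fit in the same memcached capacity; the price is the extra round trip to PostgreSQL for the details.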

Optimizing the database reduced the processing time of a test sample of 65 thousand search queries from 10 hours to 15 minutes.

Summary


Thanks to the work done, everything that could previously be done through the bill search form on the site can now be done programmatically and used in a wide variety of situations. We hope that our efforts were not in vain and that the API will become a useful and sought-after service.

As an example, we decided to develop a mobile application, “Bill Search”, which will use all the features available in the API as well as subscriptions to search queries with notifications about new bills. We will publish it soon.

Source: https://habr.com/ru/post/133305/

