Flat GeoIP or single column range

In the article published on the eve (February 2012) entitled “ Determining a country by IP: testing the speed of algorithms ”, implementations were compared at the database and native implementation levels. We propose to consider an even more optimal and simple algorithm that can be implemented both in the database and in the native variant - flat ranges.

Implementation or algorithm?

Without at all detracting from the native version, let's try not to compare the implementations, namely the algorithms. The current algorithm uses two columns in the version with a database and also the boundaries of the ranges in the native. But what if we keep not the ranges, but only their starting points? A similar idea in the comments to the article has already been expressed: "... Everything should not be stored, only the ends of the ranges ...". Indeed, in this case, the search with us is clearly accelerated twice - we only need to find the beginning of the range, which includes the value. Those who notice that there are empty ranges, we will answer that they can be stored: the beginning of the void and the meaning that symbolizes it.

Simple ranges or nested?

In the example considered in the previous example, simple in our opinion, GeoLite Country was used, whose ranges do not intersect and there is no problem of nested ranges. In a more complicated case, such as, for example, the Russian IpGeoBase , the ranges mercilessly intersect and we can’t do with simple requests of the type needle between start and end . But the help of the search algorithm has come to the rescue!
')

The concept of flat ranges

Let's consider and compare the concept of flat ranges on the example of the database. Also, for simplicity, we will consider the option of finding one value at a time, rather than a list of values. The subtitle will be a question, and the two elements of the list will be the answers for the normal and flat ranges.

What it is?

Three columns: start, end, value
Two columns: start, value

How are empty ranges counted?

No, in the simple range they do not need to be taken into account, since in the case of a query, zero records will be returned to the empty range.
The beginning of the empty range is stored in the start column and a certain designation of emptiness in the value column - this can be NULL, but it can also be 0 if we use a composite primary key on start and value .

How many more posts do you get?

In the simplest case, the GeoLiteCountry type is 160222.
In a complex, type IpGeoBase - 39449.
In the complex and old type IpGeoBase of the previous format, where there are many consecutive ranges (for example, from November 2011) - 148666.
GeoLiteCountry - 161619. An increase of less than 1%.
IpGeoBase - 46047. An increase of only 16%.
IpGeoBase of the previous format - 31156. A decrease of as much as 80% !

How to search for it?

In the simple case: needle between start and end limit 1 .
In the complex, when we have nesting we need to connect an additional column diff, which will be used in the composite index and as a result we get a horror of the type:
Any of the cases will be simple: neddle> = start order by start desc limit 1 .

What do DB or any other search provider have to do?

First go in one direction, and then make sure that everything is fine from another direction. Those. generally make two moves.
Always go in one direction.

How to get it?

We already have it - any download from a file with geoip ranges.
Take a table with three columns and use special street magic to convert it into a table with three columns. See Appendix 1 .

Conclusion

Of course, you ask - how do we convert the data into a flat view and how much will it increase? We answer from the end: the volumes will increase approximately twice, and we will post the SQL conversions below with comments (see Appendix 1 ).
PS: Most likely, when applying the concept of flat ranges in the native version, everything will be even more cosmically instantaneous. Request to those who can check - check this statement.
PS2: Also, something tells me that in an era of more active IPv6, if by that time HUGEINTs are not common, flat ranges are also very useful if storage is in the form of CHARs.

Appendix 1. Creating flat ranges on the example of MySQL

Taken from github.com/garex/geoip-flat-range , namely from github.com/garex/geoip-flat-range/blob/master/01-create-flat-range-mysql.sql

-- Create intermediate table with 3 columns -- Change this to your columns and/or table drop table if exists t3; create table t3 select range_start f, range_end t, country_code v from countries_ips ; alter table t3 add primary key (f,t,v); -- Create target table with 2 columns and fill it with all distinct ranges borders drop table if exists t2; create table t2 select distinct border as f, (select max(v) from t3) as v from ( select f-1 as border from t3 union select f from t3 union select t from t3 union select t+1 from t3 ) inn order by f; -- Here we just reset value column as it was filled by max value to have auto created column with needed type update t2 set v = null; -- We can add PK here, as all our range borders are unique alter table t2 add primary key(f); -- Adding diff column, that will help us to order ranges during main update alter table t3 add column diff int(10) unsigned, add unique index dif_f(diff, f); update t3 set diff = tf; -- Create helper table, that will help to smooth main update drop table if exists t3diff; create table t3diff select distinct diff from t3 order by diff; -- Here are our MAIN update update t3diff, t2, t3 set t2.v = t3.v where t3.diff = t3diff.diff and t2.f between t3.f and t3.t; -- We dont' need 'em anymore drop table if exists t3; drop table if exists t3diff; -- We should remove records, that points to the same value and is one after another alter table t2 drop primary key; alter table t2 add column row_number int(10) unsigned not null auto_increment primary key; alter table t2 add column next_row_number int(10) unsigned not null; update t2 set next_row_number = row_number + 1; alter table t2 add unique index next_row_number_v (next_row_number, v); delete t2.* from t2, ( select cur.row_number from t2 as cur join t2 prev on cur.row_number = prev.next_row_number and cur.v = prev.v ) as inn where t2.row_number = inn.row_number; -- Also we dont' need first record delete from t2 where row_number = 1; -- Removing extra columns, that will not help us anymore -- And also adding primary key on key and value to just always use index instead of table alter table t2 drop column row_number, drop column next_row_number, drop primary key, drop index next_row_number_v, add primary key (f, v) ; -- ... And renaming target table to more human readable form -- Change table`s/columns` names/definitions to your tables/columns drop table if exists countries_ips_flat; alter table t2 rename to countries_ips_flat, change column f range_start int(20) unsigned not null default 0 first, change column v country_code varchar(2) not null default '' after range_start; -- Comparing records count and check, that's all is ok select (select count(*) from countries_ips) as default_range_records, (select count(*) from countries_ips_flat) as flat_range_records;

Source: https://habr.com/ru/post/138536/

All Articles