
PipelineDB: working with data streams


In previous publications we have already addressed the problem of processing events in real time. Today we would like to return to this topic and introduce a new and very interesting tool: the streaming DBMS PipelineDB.


PipelineDB is based on the PostgreSQL 9.4 code base and is fully compatible with it. Its first release took place in July 2015, and the enterprise version followed in January 2016.
Below we will compare PipelineDB with similar existing solutions, give brief installation and initial configuration instructions, and walk through a practical example.

Real-time data processing: a brief history



The principle behind PipelineDB can be summed up as "continuous queries, short-lived data." In a relational DBMS it is exactly the opposite: "short-lived queries, persistent data." PipelineDB does not store raw data; instead, data enters a stream and is processed on the fly, in motion.

The first attempts to create tools for processing data in motion date back to the late 1980s, when so-called Active Database Systems appeared. These were extensions to existing DBMSs for handling events by means of triggers and rules; examples of such solutions include HiPAC, Starburst, and ODE.
However, they never saw wide adoption: their scope of application was quite narrow, and the syntax of the rules was too complicated and confusing.

In the 1990s and early 2000s, Data Stream Management Systems came into being: TelegraphCQ (a fork of PostgreSQL), StreamBase, StreamSQL. These tools worked as follows: with the help of so-called window operators, streams were converted into tables, over which SQL queries could then be run.
The emergence of such solutions was undoubtedly a step forward, but they could not provide high speed and performance when working with large data streams.

Tools oriented toward processing data without storing it have become widespread in the last 5-6 years. Among the best-known examples are Storm and Heron; a relatively recent arrival is Apache Calcite. All of these solutions share complex installation and configuration, as well as a very high barrier to entry.

The advantages of PipelineDB over the tools mentioned above are obvious: it is simple to install and configure, and it can be queried with the familiar SQL language on top of the PostgreSQL code base.



Let's consider how PipelineDB works with data streams, starting with two key concepts: the "stream" and the "continuous view".

Streams and Continuous Views



"Stream" and "continuous presentation" are the main abstractions of PipelineDB.
A thread is a sequence of events. Writing events to a stream is done in the same way as writing to tables in relational db (for more details, see here ). When an event enters the stream, a timestamp is added to it.

Streams in PipelineDB play an auxiliary role: they supply data to continuous views. Unlike tables, streams do not require you to create a schema. Data can be written to a stream as long as it interacts with at least one continuous view.
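
For example, an event can be written with an ordinary INSERT (an illustrative sketch; users_stream and its columns are the hypothetical stream used in the examples below):

 -- Writing an event into a stream works like a regular table INSERT;
 -- arrival_timestamp is added by PipelineDB automatically.
 INSERT INTO users_stream (referrer, user_id) VALUES ('http://example.com', 42);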

A continuous view is a view over streams and tables that is updated incrementally as new data becomes available. Events matching the parameters of its query are incorporated into the continuous view.

To better understand how PipelineDB works, here are some examples of continuous views.

For example, you can create a continuous view that counts, per day, the number of unique visitors arriving at the site through external links:

 CREATE CONTINUOUS VIEW uniques AS
     SELECT date_trunc('day', arrival_timestamp) AS day,
            referrer::text,
            COUNT(DISTINCT user_id::integer)
     FROM users_stream
     GROUP BY day, referrer;


Another example is counting the number of ad impressions on a site in the last 5 minutes:

 CREATE CONTINUOUS VIEW imps AS
     SELECT COUNT(*)
     FROM imps_stream
     WHERE (arrival_timestamp > clock_timestamp() - interval '5 minutes');


As you can see, continuous views have the following form:

 CREATE CONTINUOUS VIEW name AS query 


When a continuous view is created, a SELECT operation is executed against streams; it selects the data matching the required parameters.
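
The view itself is then read like a regular table. For instance, a hypothetical query against the uniques view defined above (the unaliased COUNT column comes back as "count"):

 SELECT day, referrer, count FROM uniques ORDER BY day DESC;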

This covers the basic theory needed to understand how PipelineDB works. Let us proceed to the practical part: first we describe how to install and initialize PipelineDB, and then we walk through a practical example.

Installation and initial setup



The installation procedure for PipelineDB will be described for Ubuntu 14.04. If you are using another Linux distribution, refer to the official documentation.

To install PipelineDB, just run two commands:

 $ wget https://www.pipelinedb.com/download/0.9.3/ubuntu14
 $ sudo dpkg -i ubuntu14


After that, we initialize the PipelineDB server:

 $ pipeline-init -D [data directory]


In the -D option, you can specify the name of the new directory that will be created automatically. Here is a list of the contents of this directory:

 base          pg_hba.conf    pg_replslot   pg_subtrans   pipelinedb.auto.conf
 global        pg_ident.conf  pg_serial     pg_tblspc     pipelinedb.conf
 pg_clog       pg_logical     pg_snapshots  pg_twophase   postmaster.opts
 pg_commit_ts  pg_multixact   pg_stat       PG_VERSION    postmaster.pid
 pg_dynshmem   pg_notify      pg_stat_tmp   pg_xlog


The main PipelineDB settings are stored in the pipelinedb.conf file. They are almost the same as PostgreSQL settings.
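
For example, connection settings use the familiar postgresql.conf syntax (illustrative values, not recommendations):

 # pipelinedb.conf accepts the usual PostgreSQL options, e.g.:
 port = 5432
 max_connections = 100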

By default, PipelineDB cannot accept connections from remote hosts. To change this setting, open the file pipelinedb.conf, find the Connections and Authentication section in it, uncomment the first line and edit it as follows:

 listen_addresses = '*' 


After that, we list the allowed hosts in the pg_hba.conf file:

 host all all <IP address>/<netmask> md5


If we need to accept connections from all possible hosts, this line should look like this:

 host all all 0.0.0.0/0 md5 


That's all. PipelineDB is ready to go.
To run it in the background, run the following command:

 $ pipeline-ctl -D [data directory] -l pipelinedb.log start
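
Once the server is up, you can verify that it accepts connections by querying the default pipeline database (a quick sanity check, assuming the default port 5432):

 $ psql -h localhost -p 5432 -d pipeline -c "SELECT version();"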


Practical example: analyzing Wikipedia statistics



We have covered the necessary theory and described the installation and initial configuration of PipelineDB. Let us now turn to using PipelineDB in practice.

We will consider an interesting example given in the official PipelineDB documentation: analyzing hourly statistics of page views for Wikipedia and its sister projects (Wiktionary, Wikisource, Wikibooks, and others). These statistics are publicly available. Each record consists of the following fields:

 hour of the request | project | page title | number of views | total bytes served
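
For example, a record might look like this (an invented illustrative row, not actual data):

 2015-06-01 01:00:00 | en | Main_Page | 242332 | 4737756101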


We will be interested in the minimum, maximum, and average number of page views within an hour, as well as the 99th percentile of views.
First, we activate the execution of continuous queries:

 $ psql -h localhost -p 5432 -d pipeline -c "ACTIVATE" 


After that, create a continuous view:

 $ psql -h localhost -p 5432 -d pipeline -c "CREATE CONTINUOUS VIEW wiki_stats AS SELECT hour::timestamp, project::text, count(*) AS total_pages, sum(view_count::bigint) AS total_views, min(view_count) AS min_views, max(view_count) AS max_views, avg(view_count) AS avg_views, percentile_cont(0.99) WITHIN GROUP (ORDER BY view_count) AS p99_views, sum(size::bigint) AS total_bytes_served FROM wiki_stream GROUP BY hour, project;" CREATE CONTINUOUS VIEW 


The command above indicates that the continuous view will receive its data from the wiki_stream stream. To populate this stream, we download the data from the site, unzip it, write it to standard output, and then pipe it into PipelineDB using the COPY command:

 $ curl -sL http://pipelinedb.com/data/wiki-pagecounts | gunzip | \
     psql -h localhost -p 5432 -d pipeline -c "
     COPY wiki_stream (hour, project, title, view_count, size) FROM STDIN"


Note that the amount of data is very large (it is stored as archives of 80-90 MB each), so loading may take a long time. The download can be stopped at any time by pressing the standard Ctrl+C key combination.
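
Because continuous views are updated incrementally, you can already query wiki_stats while the COPY is still running, for example to watch the number of aggregated rows grow (assuming the view created above):

 $ psql -h localhost -p 5432 -d pipeline -c "SELECT count(*) FROM wiki_stats;"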

When the download is complete, execute the command:

 $ psql -h localhost -p 5432 -d pipeline -c " SELECT * FROM wiki_stats ORDER BY total_views DESC"; 


The result is returned as an ordinary table of per-hour, per-project aggregates.
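
Since the full rows are rather wide, it can be convenient to select only a few columns (a hypothetical narrowing of the query above, using the column aliases from the view definition):

 $ psql -h localhost -p 5432 -d pipeline -c "SELECT hour, project, total_views, p99_views FROM wiki_stats ORDER BY total_views DESC LIMIT 10;"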



Conclusion



PipelineDB is an interesting and promising product, and we hope it will continue to develop successfully.
If you have experience using PipelineDB in practice, we would be glad if you shared it in the comments.

For those who want to learn more, the official PipelineDB documentation is a good place to start.




Source: https://habr.com/ru/post/306978/

