Everything you need to know about partitioning (Part 1)

Part 2

Good evening / day / morning dear habra people! We continue to develop and complement the blog about my favorite open source rdbms Postgresql. Miraculously, it so happened that the topic of today's topic has never been raised here. I must say that the partitioning in postgresql is very well described in the documentation , but will it really stop me?).

Introduction

In general, sectioning is generally understood not as some kind of technology, but rather as an approach to database design, which appeared long before the DBMS started supporting so-called. partitioned tables. The idea is very simple - to divide the table into several smaller parts. There are two subspecies - horizontal and vertical sectioning.

Horizontal partitioning

Parts of a table contain different rows of it. Suppose we have a table of logs of some abstract application - LOGS. We can break it apart - one for logs for January 2009, another for February 2009, and so on.
')

Vertical partitioning

Parts of a table contain different columns. Finding an application for vertical partitioning (when it is really justified) is somewhat more difficult than for horizontal one. As a spherical horse, I propose to consider this option: the NEWS table has the columns ID, SHORTTEXT, LONGTEXT, and let the LONGTEXT field be used much less frequently than the first two. In this case, it makes sense to split the NEWS table into columns (create two tables for SHORTTEXT and LONGTEXT, respectively, connected by primary keys + create view NEWS containing both columns). Thus, when we only need a description of the news, the DBMS will not have to read the entire text of the news from the disk.

Sectioning support in modern DBMS

Most modern DBMSs support partitioning of tables in one form or another.

Oracle - supports partitioning starting with version 8. Working with sections on the one hand is very simple (in general, you can not think about them, you work like with a regular table *), and on the other hand everything is very flexible. Sections can be divided into "subpartitions", delete, divide, transfer. Various options for indexing a partitioned table are supported (global index, partitioned index). A reference to the volumetric description.
Microsoft SQL Server - support for partitioning appeared recently (in 2005). The first impression of use is “Well, finally !! :)”, the second is “It works, everything seems to be ok.” Msdn documentation
MySQL - supports starting with version 5.1. Very good description on Habré
And so on…

* - Of course, there is a standard set of difficulties - create a new section in time, throw out an old one, etc., but somehow everything is simple and clear.

Postgresql partitioning

The partitioning of tables in postgresql is slightly different in implementation from other databases. The basis for partitioning is the inheritance of tables (a thing exclusively inherent in postgresql). That is, we should have a main table (master table), and its sections will be successor tables. We will consider sectioning on the example of a task close to reality.

Formulation of the problem

The database is used to collect and analyze data about site visitors / sites. The data volumes are large enough to reflect on partitioning. In the analysis, in most cases, data for the last day is used.
1. Create the main table:

CREATE TABLE analytics.events ( event_id BIGINT DEFAULT nextval( 'analytics.seq_events' ) PRIMARY KEY , user_id UUID NOT NULL , event_type_id SMALLINT NOT NULL , event_time TIMESTAMP DEFAULT now() NOT NULL , url VARCHAR (1024) NOT NULL , referrer VARCHAR (1024), ip INET NOT NULL ); * This source code was highlighted with Source Code Highlighter .

2. We will partition by day across the event_time field. Every day we will create a new section. Name the sections by the rule: analytics.events_DDMMYYYY. Here is an example section for the 1st of January 2010.

CREATE TABLE analytics.events_01012010 ( event_id BIGINT DEFAULT nextval( 'analytics.seq_events' ) PRIMARY KEY , CHECK ( event_time >= TIMESTAMP '2010-01-01 00:00:00' AND event_time < TIMESTAMP '2010-01-02 00:00:00' ) ) INHERITS (analytics.events); * This source code was highlighted with Source Code Highlighter .

When creating a section, we explicitly set the event_id field (PRIMARY KEY is not inherited) and create a CHECK CONSTRAINT on the event_time field so as not to insert too much.

3. Create an index on the event_time field. When splitting the table into sections, we mean that most queries to the table events will use the condition on the event_time field, so an index on this field will help us a lot.

CREATE INDEX events_01012010_event_time_idx ON analytics.events_01012010 USING btree(event_time); * This source code was highlighted with Source Code Highlighter .

4. We want to ensure that when inserted into the main table, the data will be in the section intended for them. To do this, do the next trick - create a trigger that will control the data flow.

CREATE OR REPLACE FUNCTION analytics.events_insert_trigger() RETURNS TRIGGER AS $$ BEGIN IF ( NEW .event_time >= TIMESTAMP '2010-01-01 00:00:00' AND NEW .event_time < TIMESTAMP '2010-01-02 00:00:00' ) THEN INSERT INTO analytics.events_01012010 VALUES ( NEW .*); ELSE RAISE EXCEPTION 'Date % is out of range. Fix analytics.events_insert_trigger' , NEW .event_time; END IF ; RETURN NULL ; END ; $$ LANGUAGE plpgsql; * This source code was highlighted with Source Code Highlighter .

CREATE TRIGGER events_before_insert BEFORE INSERT ON analytics.events FOR EACH ROW EXECUTE PROCEDURE analytics.events_insert_trigger(); * This source code was highlighted with Source Code Highlighter .

5. Everything is ready, we now have a partitioned analytics.events table. We can begin to violently analyze its data. By the way, we created CHECK constraints not only to protect sections from incorrect data. Postgresql can use them when scheduling a query (though with a live index on event_time, this will give a minimal gain), just use the constraint_exclusion directive:

SET constraint_exclusion = on ; SELECT * FROM analytics.events WHERE event_time > CURRENT_DATE ; * This source code was highlighted with Source Code Highlighter .

The end of the first part

So what do we have? Let the points:
1. Table events, divided into sections, analysis of available data for the last day becomes easier and faster.
2. The horror of realizing that all this needs to be somehow supported, to create sections on time, not forgetting to change the trigger accordingly.

In the second part I will talk about how easy and carefree to work with partitioned tables.

UPD1: Replaced Partitioning for Partitioning
UPD2:
Based on the comments of one of the readers who, unfortunately, does not have an account on Habré:
With inheritance associated with a few points that should be considered when designing. Sections do not inherit the primary key and foreign keys on their columns. That is, when creating a section, you need to explicitly create PRIMARY KEY and FOREIGN KEYs on the section columns. From myself, I note that creating FOREIGN KEY on columns of a partitioned table is not the best way. In most cases, a partitioned table is a “fact table” and itself refers to the “dimension” of the table.

Source: https://habr.com/ru/post/74984/

All Articles