As the number of active Instagram users has steadily grown, Postgres has remained our solid foundation and the canonical data store for most user-generated data. And although less than a year ago we wrote about how we store a large amount of data at Instagram at 90 likes per second, we are now processing more than 10,000 likes per second at peak, and our fundamental storage technology has not changed.
Over the past two and a half years, we have learned a few things and picked up a few tools for scaling Postgres, and we want to share them: the things we wish we had known when launching Instagram. Some of them are specific to Postgres, others apply to other databases as well. To learn how we scale Postgres horizontally, see our Sharding and IDs at Instagram post.
1. Partial Indexes
If your queries frequently filter on a particular characteristic, and that characteristic is present in only a minority of the rows in your database, partial indexes can help you a lot.
For example, when searching for tags on Instagram, we try to surface the tags under which many photos can be found. Although we use technologies such as ElasticSearch for fancier searches in our application, this is one case where the database does a good job on its own. Let's look at how Postgres performs when searching for tags and sorting them by the number of photos:
EXPLAIN ANALYZE SELECT id from tags WHERE name LIKE 'snow%' ORDER BY media_count DESC LIMIT 10;

QUERY PLAN
 Limit  (cost=1780.73..1780.75 rows=10 width=32) (actual time=215.211..215.228 rows=10 loops=1)
   ->  Sort  (cost=1780.73..1819.36 rows=15455 width=32) (actual time=215.209..215.215 rows=10 loops=1)
         Sort Key: media_count
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using tags_search on tags_tag  (cost=0.00..1446.75 rows=15455 width=32) (actual time=0.020..162.708 rows=64572 loops=1)
               Index Cond: (((name)::text ~>=~ 'snow'::text) AND ((name)::text ~<~ 'snox'::text))
               Filter: ((name)::text ~~ 'snow%'::text)
 Total runtime: 215.275 ms
(8 rows)
Notice that Postgres has to sort through 15,000 rows to get the right result. And since tags (for example) follow a long-tail distribution, we can instead try first to show only tags that have 100 or more photos, like so:
CREATE INDEX CONCURRENTLY on tags (name text_pattern_ops) WHERE media_count >= 100
And our query plan now looks like this:
EXPLAIN ANALYZE SELECT * from tags WHERE name LIKE 'snow%' AND media_count >= 100 ORDER BY media_count DESC LIMIT 10;

QUERY PLAN
 Limit  (cost=224.73..224.75 rows=10 width=32) (actual time=3.088..3.105 rows=10 loops=1)
   ->  Sort  (cost=224.73..225.15 rows=169 width=32) (actual time=3.086..3.090 rows=10 loops=1)
         Sort Key: media_count
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using tags_tag_name_idx on tags_tag  (cost=0.00..221.07 rows=169 width=32) (actual time=0.021..2.360 rows=924 loops=1)
               Index Cond: (((name)::text ~>=~ 'snow'::text) AND ((name)::text ~<~ 'snox'::text))
               Filter: ((name)::text ~~ 'snow%'::text)
 Total runtime: 3.137 ms
(8 rows)
Notice that Postgres now only has to visit 169 rows, which is much faster. The Postgres query planner is also good at evaluating constraints: if you later decide that you want to query only tags with at least 500 photos, i.e. a subset of this index, it will still use the correct partial index.
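To illustrate that last point, here is a minimal sketch in Python/psycopg2 of checking the plan for a stricter threshold; the connection string is a hypothetical placeholder, and the table and column names follow the examples above.

import psycopg2

conn = psycopg2.connect("dbname=insta")  # hypothetical connection string
cur = conn.cursor()
cur.execute("""
    EXPLAIN
    SELECT id FROM tags
     WHERE name LIKE %s AND media_count >= 500
     ORDER BY media_count DESC
     LIMIT 10
""", ("snow%",))
for (line,) in cur.fetchall():
    print(line)  # the plan should still show the partial index from above being used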
2. Functional Indexes
For some of our tables, we need to index strings (for example, 64-character base64 tokens) that are quite long, and creating an index on those strings would mean duplicating a large amount of data. In this case, Postgres functional indexes can be very useful:
CREATE INDEX CONCURRENTLY on tokens (substr(token, 0, 8))
This way, Postgres uses the index to find the set of records matching the prefix, then filters them to find the exact match. The index ends up taking roughly one-tenth the space it would have had we indexed the whole string.
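A minimal sketch of how a lookup might take advantage of that index, assuming a tokens table with id and token columns as in the example above; the connection string and token value are hypothetical. The WHERE clause repeats the indexed expression so the planner can use the functional index, and the exact comparison then filters the handful of candidate rows.

import psycopg2

conn = psycopg2.connect("dbname=insta")  # hypothetical connection string
cur = conn.cursor()
token = "c29tZS1sb25nLWJhc2U2NC10b2tlbg"  # hypothetical token value
cur.execute("""
    SELECT id
      FROM tokens
     WHERE substr(token, 0, 8) = substr(%s, 0, 8)  -- matches the functional index
       AND token = %s                              -- exact match on the few candidates
""", (token, token))
row = cur.fetchone()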
3. pg_reorg for compaction
Over time, Postgres tables can become fragmented on disk (due to Postgres' concurrent MVCC model, for example). Also, rows are most often not inserted in the order in which you want to retrieve them. For example, if you frequently request all likes created by a single user, it is nice if those likes are stored contiguously on disk, to minimize disk seeks.
Our solution for this is to use the pg_reorg utility, which performs these steps in the table optimization process:
- Gets an exclusive lock on the table
- Creates a temporary table to accumulate changes, and adds a trigger to the source table, which replicates any changes to this temporary table.
- Makes CREATE TABLE using SELECT FROM ... ORDER BY, which creates a new table in indexed order on disk.
- Synchronizes changes from the temporary table that occurred after SELECT FROM was run.
- Switches to the new table
There are some details around obtaining the locks, etc., but that is the general approach. We vetted this tool and ran a number of tests before running it in production, and we have run many reorganizations across hundreds of machines without any problems.
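To make the core idea concrete, here is a minimal sketch (in Python/psycopg2) of the rebuild step described above, not of how pg_reorg itself is invoked; the likes table, its columns, and the connection string are illustrative assumptions.

import psycopg2

conn = psycopg2.connect("dbname=insta")  # hypothetical connection string
cur = conn.cursor()
# Copy the table into a new one in index order, so rows that are queried
# together (all likes by one user) end up contiguous on disk.
cur.execute("""
    CREATE TABLE likes_compacted AS
        SELECT * FROM likes
         ORDER BY user_id, id
""")
conn.commit()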
4. WAL-E for WAL archiving and backups
We use and contribute to the development of WAL-E, a toolkit written by Heroku for continuous archiving of Postgres WAL (Write-Ahead Log) files. Using WAL-E has greatly simplified our backup process and the bootstrapping of new replica databases.
At its core, WAL-E is a program that archives every WAL file generated by your PG server to Amazon S3, using Postgres' archive_command. These WAL files can then be used, in combination with a base backup, to restore the database to any point in time since that backup. The combination of regular base backups and the WAL archive lets us quickly bring up a new read-only replica or failover slave (a replica to use in case the primary database fails).
We wrote a simple wrapper script to monitor repeated failures to archive a single file, and it is available on GitHub.
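As a rough illustration of the idea (this is not the actual script from GitHub, just a sketch under stated assumptions), an archive_command wrapper might call wal-e wal-push and alert when the same WAL segment keeps failing; the log path and alert threshold below are hypothetical.

import subprocess
import sys

FAILURE_LOG = "/tmp/wal-push-failures.log"  # hypothetical location
ALERT_AFTER = 3  # hypothetical threshold

def count_failures(segment):
    try:
        with open(FAILURE_LOG) as f:
            return sum(1 for line in f if line.strip() == segment)
    except IOError:
        return 0

def main():
    segment = sys.argv[1]  # Postgres passes the WAL file path as %p
    result = subprocess.call(["wal-e", "wal-push", segment])
    if result != 0:
        with open(FAILURE_LOG, "a") as f:
            f.write(segment + "\n")
        if count_failures(segment) >= ALERT_AFTER:
            # here one might page an operator or post to a monitoring system
            sys.stderr.write("wal-push failed repeatedly for %s\n" % segment)
    sys.exit(result)

if __name__ == "__main__":
    main()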
5. Auto-commit mode and asynchronous mode in psycopg2
Over time, we have started using more of the advanced features of psycopg2, the Python driver for Postgres.
The first is autocommit mode. In this mode, psycopg2 does not issue BEGIN/COMMIT for any queries; instead, each query runs in its own single-statement transaction. This is particularly useful for read-only queries, for which transaction semantics are not needed. Turning the mode on is very simple:
connection.autocommit = True
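For context, a minimal sketch of that setting in use; the connection string and query are illustrative assumptions.

import psycopg2

conn = psycopg2.connect("dbname=insta")  # hypothetical connection string
conn.autocommit = True  # each statement runs in its own transaction, no BEGIN/COMMIT round-trips

cur = conn.cursor()
cur.execute("SELECT id, media_count FROM tags WHERE name = %s", ("snow",))
print(cur.fetchone())
# no conn.commit() is needed, so the connection can go back to the pool right away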
Autocommit mode significantly cut down on the back-and-forth between our application servers and databases, and lowered CPU usage on the database machines as well. Further, since we use PgBouncer for connection pooling, this change allows connections to be returned to the pool sooner.
More details on how this works in Django are available here.
Another useful feature in psycopg2 is the ability to register a wait_callback for coroutine support. Using it allows you to issue queries concurrently across several connections at once, which is useful for fan-out queries that hit multiple nodes: the socket wakes up and notifies when there is data to be read (we use Python's select module for handling the wake-ups). It also plays well with cooperative multithreading libraries such as eventlet or gevent. An example implementation is psycogreen.
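A minimal sketch of that pattern, assuming gevent and psycogreen are installed; the shard DSNs, table, and columns below are illustrative assumptions, not our actual schema.

import gevent
from psycogreen.gevent import patch_psycopg
import psycopg2

patch_psycopg()  # registers a gevent-friendly wait_callback for psycopg2

def fetch_likes(dsn, user_id):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("SELECT media_id FROM likes WHERE user_id = %s", (user_id,))
    rows = cur.fetchall()
    conn.close()
    return rows

# While one connection is waiting on its socket, gevent switches to another
# greenlet, so the two shard queries effectively run in parallel.
jobs = [gevent.spawn(fetch_likes, dsn, 42)
        for dsn in ("dbname=shard0", "dbname=shard1")]  # hypothetical shard DSNs
gevent.joinall(jobs)
results = [job.value for job in jobs]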
In general, we are very happy with Postgres' performance and reliability. If you are interested in working on one of the world's largest Postgres installations with a small team of infrastructure hackers, get in touch with us at infrajobs <at> instagram.com
From the translator:
Please report translation and spelling errors in a private message. And if you are interested in working on mobile applications in a small but very proud team based in St. Petersburg, get in touch with us as well; we are open to offers!