Technology stack under consideration: PostgreSQL 9.3, Python 2.7 with the psycopg2 module installed.
Problem:
How often in your work have you had to process large tables (more than 10 million records)? I think you will agree that this task is quite resource-intensive, both in processing time and in the system resources involved. Today I will try to show an alternative way to solve the problem.
Proposal:
PostgreSQL has an excellent command for working with large volumes of data: COPY. It allows us to read and write huge amounts of information to and from a table. In this article we will look at the reading mode.
According to the COPY documentation, the data we read can be sent to a file or to the STDOUT stream, in various formats including csv. We will try to use this to maximum advantage.
Preparation:
As a guinea pig, we will create a table with 1 million records and write a small script that reflects the essence of the method. The SQL file can be found in my git repository (the link is at the bottom of the article).
Also, do not forget to install the psycopg2 module!
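For reference, here is a minimal sketch of how such a test table could be created and populated; the connection settings, column names, and generated values are illustrative assumptions, the actual SQL file lives in the repository linked below.

# Minimal sketch (illustrative names and values): create and populate a 1M-row test table.
import psycopg2

conn = psycopg2.connect(dbname="test", user="postgres", host="localhost")  # assumed connection settings
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS big_data (
        id    serial PRIMARY KEY,
        fname text,
        lname text
    )
""")
# generate_series lets the server produce the synthetic rows itself
cur.execute("""
    INSERT INTO big_data (fname, lname)
    SELECT 'fname_' || i, 'lname_' || i
    FROM generate_series(1, 1000000) AS i
""")
conn.commit()
cur.close()
conn.close()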
Implementation:
To fetch the data, we will use the wonderful copy_expert function, which allows us to execute COPY statements from a Python client.
query = """ SELECT * from big_data inner join big_data as t1 USING(fname) """ output = StringIO() self.cursor.copy_expert("COPY (%s) TO STDOUT (FORMAT 'csv', HEADER true)" % query, output) data = output.getvalue() output.close() result = list() for item in getResults(data):
Explanation of the code:
- In the query, we join the table with itself just to make it more complex (it has been observed that the gain in speed is directly proportional to the complexity of the query);
- As a buffer, we use a StringIO object, into which we write the data from the cursor;
- The string is parsed by the getResults generator;
- For convenience of interpretation, I convert all empty strings to None, since after COPY all values arrive as strings;
- Note that I use the csv format with a leading header line; you will see why a little later.
Generator code:
def getResults(stream):
    """ get result generator """
    f = StringIO(stream)
    result = csv.DictReader(f, restkey=None)
    for item in result:
        yield item
    f.close()
Explanations:
- As you can see from the listing, we again use the already familiar StringIO buffer;
- To convert each csv row into a dictionary, we use DictReader from the standard csv library. By default, it takes the first line as the list of field names (see the short illustration below).
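As a small illustration (the sample data here is made up), this is how DictReader maps the header line produced by HEADER true onto dictionary keys:

import csv
from StringIO import StringIO  # io.StringIO on Python 3

# roughly what a tiny COPY ... (FORMAT 'csv', HEADER true) output could look like
raw = "id,fname,lname\n1,John,Doe\n2,Jane,\n"
reader = csv.DictReader(StringIO(raw), restkey=None)
rows = [dict((k, v if v else None) for k, v in row.items()) for row in reader]
# rows == [{'id': '1', 'fname': 'John', 'lname': 'Doe'},
#          {'id': '2', 'fname': 'Jane', 'lname': None}]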
That's all we need!
My configuration: MacBook Air 2013, 1.3 GHz Intel Core i5 processor, 4 GB 1600 MHz DDR3 RAM, SSD.
P.S.: I want to note that this approach does not always speed up reading. If you have a fairly simple table of 3-5 fields, you will not notice a tangible difference (at least up to 1 million records). With complex queries, however, this method gives a huge boost: the speed-up reaches 10-20 times! The hardware configuration on which the script runs also has a very strong influence.
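If you want to reproduce such a comparison yourself, here is a rough sketch; the connection settings, table, and filter condition are the same illustrative assumptions as above, not the exact queries from my measurements:

import time
import psycopg2
from StringIO import StringIO  # io.StringIO on Python 3

conn = psycopg2.connect(dbname="test", user="postgres", host="localhost")
cur = conn.cursor()
query = "SELECT * FROM big_data WHERE lname LIKE '%742'"  # condition on a non-indexed field

# plain SELECT + fetchall
start = time.time()
cur.execute(query)
rows = cur.fetchall()
print("fetchall:    %.2f s, %d rows" % (time.time() - start, len(rows)))

# the same query through COPY ... TO STDOUT
start = time.time()
buf = StringIO()
cur.copy_expert("COPY (%s) TO STDOUT (FORMAT 'csv', HEADER true)" % query, buf)
print("copy_expert: %.2f s" % (time.time() - start))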
All code can be found in the git repository:
https://github.com/drizgolovicha/python_bulk_read
I would be glad to hear comments and suggestions for optimization!
Thank you for reading to the end.
UPD:
Measurement results for a sample of 14k records:
- Direct SELECT with a WHERE condition on a non-indexed field: 21.4 s
- COPY of the previous query: 13.1 s
- The same SELECT, but from a materialized view with an index on the field: 12.6 s
- COPY from the materialized view: 1.8 s (see the sketch below)
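For completeness, a sketch of that materialized-view variant; the view name, column aliases, indexed field, and filter are illustrative assumptions rather than the exact objects from my measurements:

import psycopg2
from StringIO import StringIO  # io.StringIO on Python 3

conn = psycopg2.connect(dbname="test", user="postgres", host="localhost")
cur = conn.cursor()

# materialize the heavy self-join once; alias columns so their names stay unique
cur.execute("""
    CREATE MATERIALIZED VIEW big_data_mv AS
    SELECT t0.id, t0.fname, t0.lname, t1.id AS id2, t1.lname AS lname2
    FROM big_data AS t0
    INNER JOIN big_data AS t1 USING (fname)
""")
# index the field used in the WHERE condition
cur.execute("CREATE INDEX big_data_mv_lname_idx ON big_data_mv (lname)")
conn.commit()

# COPY straight from the materialized view; the filter can now use the index
buf = StringIO()
cur.copy_expert(
    "COPY (SELECT * FROM big_data_mv WHERE lname = 'lname_742') TO STDOUT (FORMAT 'csv', HEADER true)",
    buf,
)
data = buf.getvalue()

The idea is that the expensive join is paid once when the view is created (or refreshed), and COPY then only has to stream the already-computed rows out.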