The story of one experiment with Cython and C ++ vector

One ~~warm~~ on a cold winter evening, I wanted to warm up in the office and test the theory of one colleague that the C ++ vector could cope with the task faster than the CPython list.

In the company we are developing products based on Django and it so happened that it was necessary to process one large array of dictionaries. A colleague suggested that the implementation in C ++ would be much faster, but I had the feeling that Guido and the community were probably a little steeper than us in C and could have already decided and avoided all the pitfalls, having implemented everything much faster.

To test the theory, I decided to write a small test file in which I decided to loop through the insertion of 1M dictionaries of the same content into an array and vector 100 times in a row.

The results, though expected, were also sudden.

It so happened that we actively use Cython, so in general, the results will differ in fully CPython implementation.

Stand

Calculate Linux onegreyonewhite 4.18.14-calculate # 1 SMP PREEMPT Sat Oct 13 21:03:27 UTC 2018 x86_64 Intel® Core (TM) i7-4770 CPU @ 3.40GHz GenuineIntel GNU / Linux
Python 2.7 and 3.6
Cython 0.28.3
gcc (Gentoo 7.3.0-r3 p1.4)

Script

By the way, I had to tinker here. In order to get the most real numbers (i.e., not just to make it super-optimized, but also so that we can use it without dancing with a tambourine), we had to do everything in the main script, and reduce all additional .h to a minimum.

The first problem was that the Cython wrapper for the vector does not want to work in this form:

#    ctypedef vector[object] dict_vec #     (   vector.push_back(dict())) ctypedef vector[PyObject*] dict_vec #   ,   ( ,    object   PyObject.) ctypedef vector[PyObject] dict_vec

With all this, they got an error that it is impossible to cast dict to PyObject. Of course, these are Cython problems, but since we use it, we need to solve this particular problem.
I had to make a little crutch in the form

 #include "Python.h" static PyObject * convert_to_pyobject(PyObject *obj) { return obj; }

The most amazing thing is that it worked. What scares me the most is that I do not fully understand why and what the consequences are.

Final Sources

cython_experiments.h

 #include "Python.h" static PyObject * convert_to_pyobject(PyObject *obj) { return obj; }

cython_experiments.pyx

 # -*- coding: utf-8 -*- # distutils: language = c++ # distutils: include=['./'] # distutils: extra_compile_args=["-O1"] from __future__ import unicode_literals import time from libc.stdlib cimport free from cpython.dict cimport PyDict_New, PyDict_SetItemString from cpython.ref cimport PyObject from libcpp.string cimport string from libcpp.vector cimport vector cdef extern from "cython_experiments.h": PyObject* convert_to_pyobject(object obj) ctypedef vector[PyObject*] dict_vec range_attempts = 10 ** 6 # Insert time cdef test_list(): t_start = time.time() data_list = list() for i from 0 <= i < range_attempts: data_list.append(dict( name = 'test_{}'.format(i), test_data = i, test_data2 = str(i), test_data3 = range(10), )) del data_list return time.time() - t_start cdef test_vector(): t_start = time.time() cdef dict_vec *data_list data_list = new dict_vec() data_list.resize(range_attempts) for i from 0 <= i < range_attempts: data = PyDict_New() PyDict_SetItemString(data, 'name', 'test_{}'.format(i)) PyDict_SetItemString(data, 'test_data', i) PyDict_SetItemString(data, 'test_data2', str(i)) PyDict_SetItemString(data, 'test_data3', range(10)) data_list.push_back(convert_to_pyobject(data)) free(data_list) return time.time() - t_start # Get statistic times = dict(list=[], vector=[]) attempts = 100 for i from 0 <= i < attempts: times['list'].append(test_list()) times['vector'].append(test_vector()) print(''' Attempt: {} List time: {} Vector time: {} '''.format(i, times['list'][-1], times['vector'][-1])) avg_list = sum(times['list']) / attempts avg_vector = sum(times['vector']) / attempts print(''' Statistics: attempts: {} list avg time: {} vector avg time: {} '''.format(attempts, avg_list, avg_vector))

Attempt 1

I really want to be able to collect * .whl for the project and that it all started on almost any system, so the optimization flag was first set to 0. This led to a strange result:

 Python 2.7 Statistics: attempts: 100 list avg time: 2.61709237576 vector avg time: 2.92562381506

After some thought, I decided that we would still use the -O1 flag, so I set it up and got it:

 Python 2.7 Statistics: attempts: 100 list avg time: 2.49274396896 vector avg time: 0.922211170197

Once I got a little excited: all the same, the belief in the professionalism of Guido and Co. let me down. But then, I noticed that the script is somehow suspiciously eating the memory, and by the end it ate up about 20GB of RAM. The problem was as follows: in the final script, you can observe the function free, after passing the cycle. At this iteration it was not yet. Then I thought ...

And if I do not disable gc?

Between attempts I did gc.disable () and after trying gc.enable () . I run the build and script and get:

 Python 2.7 Statistics: attempts: 100 list avg time: 1.00309731514 vector avg time: 0.941153049469

In general, the difference is not big, so I thought that there is no point ~~overpay~~ try to somehow pervert and just use CPython, but still collect it with Cython.
Probably many have a question: "And what about memory?" The most amazing (no) is that nothing. She grew at the same rate and in the same amount. An article came to mind, but I didn’t want to go into Python sources. And this meant only one thing - the problem in the implementation of the vector.

The final

After long torment with type conversion, namely, that the vector takes a pointer to the dictionary, that final script was obtained and with gc turned on, I received an average difference of 2.6 times (vector faster) and relatively good work with memory.

Suddenly it dawned on me that I collect everything only under Py2.7 and did not even try to do anything with 3.6.

And here I was really surprised (after previous results, the surprise was natural):

 Python 3.6 Statistics: attempts: 100 list avg time: 0.8771139788627624 vector avg time: 1.075702157020569 Python 2.7 Statistics: attempts: 100 list avg time: 2.61709237576 vector avg time: 0.92562381506

With all this, gc still worked, the memory was not erased and it was the same script. Understanding that after a little more than a year, I would have to say goodbye to 2.7, I still didn’t give rest that there was such a difference. Most often, I heard / read / experimented and Py3.6 was slower than Py2.7. However, the guys from Cython-developers did something incredible and changed the situation at the root.

Total

After this experiment, we decided not to bother much with the support of Python 2.7 and the alteration of any parts of the applications in C ++, simply because it is not worth it. Everything has already been written to us, we can only use this correctly to solve a specific problem.

UPD 12/24/2018:
According to the advice of iCpu, and after attacks to the side, what is being checked is not understanding what and how, I tried to rewrite the C ++ part as convenient as possible for further development, as well as minimize abstractions. It turned out even worse:

The result of poor knowledge of C ++

cython_experiments.h

 #include "Python.h" #include <vector> #include <algorithm> #ifndef PyString_AsString #define PyString_AsString PyUnicode_AsUTF8 #define PyString_FromString PyUnicode_FromString #endif typedef struct { char* name; bool reverse; } sortFiled; class cmpclass { public: cmpclass(std::vector<char*> fields) { for (std::vector<char*>::iterator it = fields.begin() ; it < fields.end(); it++){ bool is_reverse = false; char* name; if (it[0] == "-"){ is_reverse = true; for(int i=1; i<strlen(*it); ++i) name[i] = *it[i]; } else { name = *it; } sortFiled field = {name, is_reverse}; this->fields_to_cmp.push_back(field); } } ~cmpclass() { this->fields_to_cmp.clear(); this->fields_to_cmp.shrink_to_fit(); } bool operator() (PyObject* left, PyObject* right) { // bool result = false; for (std::vector<sortFiled>::iterator it = this->fields_to_cmp.begin() ; it < this->fields_to_cmp.end(); it++){ // PyObject* str_name = PyString_FromString(it->name); PyObject* right_value = PyDict_GetItem(right, str_name); PyObject* left_value = PyDict_GetItem(left, str_name); if(!it->reverse){ result = left_value < right_value; } else { result = (left_value > right_value); } PyObject_Free(str_name); if(!result) return false; } return true; } private: std::vector<sortFiled> fields_to_cmp; }; void vector_multikeysort(std::vector<PyObject *> items, PyObject* columns, bool reverse) { std::vector<char *> _columns; for (int i=0; i<PyList_GET_SIZE(columns); ++i) { PyObject* item = PyList_GetItem(columns, i); char* item_str = PyString_AsString(item); _columns.push_back(item_str); } cmpclass cmp_obj(_columns); std::sort(items.begin(), items.end(), cmp_obj); if(reverse) std::reverse(items.begin(), items.end()); } std::vector<PyObject *> _test_vector(PyObject* store_data_list, PyObject* columns, bool reverse = false) { int range_attempts = PyList_GET_SIZE(store_data_list); std::vector<PyObject *> data_list; for (int i=0; i<range_attempts; ++i) { data_list.push_back(PyList_GetItem(store_data_list, i)); } vector_multikeysort(data_list, columns, reverse); return data_list; }

cython_experiments.pyx

 # -*- coding: utf-8 -*- # distutils: language = c++ # distutils: include=['./'] # distutils: extra_compile_args=["-O2", "-ftree-vectorize"] from __future__ import unicode_literals import time from libc.stdlib cimport free from cpython.dict cimport PyDict_New, PyDict_SetItemString from cpython.ref cimport PyObject from libcpp.string cimport string from libcpp.vector cimport vector import gc cdef extern from "cython_experiments.h": vector[PyObject*] _test_vector(object store_data_list, object columns, int reverse) range_attempts = 10 ** 6 store_data_list = list() for i from 0 <= i < range_attempts: store_data_list.append(dict( name = 'test_{}'.format(i), test_data = i, test_data2 = str(i), test_data3 = range(10), )) # Insert time def multikeysort(items, columns, reverse=False): items = list(items) columns = list(columns) columns.reverse() for column in columns: # pylint: disable=cell-var-from-loop is_reverse = column.startswith('-') if is_reverse: column = column[1:] items.sort(key=lambda row: row[column], reverse=is_reverse) if reverse: items.reverse() return items cdef test_list(): t_start = time.time() data_list = list() for i in store_data_list: data_list.append(i) data_list = multikeysort(data_list, ('name', '-test_data'), True) for i in data_list: i del data_list return time.time() - t_start cdef test_vector(): t_start = time.time() data_list = _test_vector(store_data_list, ['name', '-test_data'], 1) for i in data_list: i return time.time() - t_start # Get statistic times = dict(list=[], vector=[]) attempts = 10 gc.disable() for i from 0 <= i < attempts: times['list'].append(test_list()) times['vector'].append(test_vector()) gc.collect() print(''' Attempt: {} List time: {} Vector time: {} '''.format(i, times['list'][-1], times['vector'][-1])) del store_data_list avg_list = sum(times['list']) / attempts avg_vector = sum(times['vector']) / attempts print(''' Statistics: attempts: {} list avg time: {} vector avg time: {} '''.format(attempts, avg_list, avg_vector))

 Python 3.6 Statistics: attempts: 10 list avg time: 0.2640914678573608 vector avg time: 2.5774293661117555

Any idea what could be improved in the coparator to make it work faster?

Source: https://habr.com/ru/post/433852/

All Articles