Parsing WKB format without third-party libraries

In the process of working on one task in the Mail.Ru Map project, it became necessary to read the WKB format. Of course, you could use GDAL , but you only needed reading, and all other features were superfluous. And so this small bicycle was born.
I want to immediately warn you that the functionality is implemented only within the required framework and only basic types are supported, such as:

point
line
polygon
multiline
multipolygon

Recoding of the byte order in a word is not supported. And so - take it and use it.

For convenience, a vrapper class representing a memory region as an STL-like container is also useful; there's nothing to even comment on. It can be taken from here . And additional simplest classes for geometric primitives: points and rectangles.

Class code

namespace ml { struct point_d { double x; double y; }; inline std::ostream & operator << (std::ostream &s, point_d const &p) { return s << px << ", " << py; } inline double distance(point_d const &p1, point_d const &p2) { double dx = (p1.x-p2.x); double dy = (p1.y-p2.y); return sqrt(dx*dx+dy*dy); } struct rect_d { point_d min; point_d max; static rect_d void_rect() { return rect_d(std::numeric_limits<T>::max(),std::numeric_limits<T>::max(), -std::numeric_limits<T>::max(),-std::numeric_limits<T>::max()); } bool empty() const { return (*this == rect_d::void_rect() || (width()==0 && height()==0)); } inline double height() const { return fabs(max.y-min.y); } inline double width() const { return fabs(max.x-min.x); } rect_d & extend(const rect_d &r) { if(r.empty()) return *this; min.x = std::min(min.x,std::min(r.min.x,r.max.x)); min.y = std::min(min.y,std::min(r.min.y,r.max.y)); max.x = std::max(max.x,std::max(r.min.x,r.max.x)); max.y = std::max(max.y,std::max(r.min.y,r.max.y)); return *this; } template< typename C > rect_d & bounds( C begin, C end) { for(; begin != end; ++begin) { min.x = std::min(min.x,begin->x); min.y = std::min(min.y,begin->y); max.x = std::max(max.x,begin->x); max.y = std::max(max.y,begin->y); } return *this; } }; inline std::ostream & operator << (std::ostream &s,const rect_d &r) { return s << '[' << r.min << ',' << r.max << ']'; } } // namespace ml

To use the class, you must first load the data in WKB format into memory. How this will be done, let it be left to the discretion of who will use it.
It was supposed to use the written class approximately as follows:

 char *data = .... ml::wkb shape(data); std::cout << ml::wkb::type_to_text(shape.type()) << std::endl; std::cout << shape.bounds() << std::endl;

The first version turned out to be a solution to the forehead, and it looked like this:

First version

 namespace ml { class wkb { char const * m_data; public: enum type_e{ wkb_point = 1, wkb_line_string = 2, wkb_polygon = 3, wkb_multi_point = 4, wkb_multi_line_string = 5, wkb_multi_polygon = 6, wkb_geometry_collection = 7 }; static char const *type_to_text(type_e type) { switch (type) { case wkb_point: return "wkb_point"; case wkb_line_string: return "wkb_line_string"; case wkb_polygon: return "wkb_polygon"; case wkb_multi_point: return "wkb_multi_point"; case wkb_multi_line_string: return "wkb_multi_line_string"; case wkb_multi_polygon: return "wkb_multi_polygon"; case wkb_geometry_collection: return "wkb_geometry_collection"; default: return "unknown"; } } typedef container<ml::point_d> line; wkb(char const *data) : m_data(data) {} wkb & operator = (wkb const &d) { m_data = d.m_data; return *this; } inline char byte_order() const { return m_data[0]; } inline type_e type() const { return (type_e)*(int*)&m_data[1]; } inline int size() const { return *(int*)&m_data[5]; } line linestring() const { return line((line::value_type*)&m_data[9], size()); } ml::point_d const & point() const { return *(ml::point_d*)&m_data[5]; } wkb linestring(size_t index) const { size_t offset = 9; for(size_t i=0; i<size(); ++i) { wkb l(&m_data[offset]); if (i == index) return l; offset += 9 + 16 * l.size(); } return wkb(0); } line ring(size_t index) const { size_t offset = 9; for(size_t i=0; i<size(); ++i) { int point_count = *(int*)&m_data[offset]; if(i == index) return line((line::value_type*)&m_data[offset+4], point_count); offset = offset + 4 + 16*point_count; } return line(NULL,(size_t)0); } wkb polygon(size_t index) const { size_t offset = 9; for(size_t i=0; i<this->size(); ++i) { wkb o(&m_data[offset]); if(i == index) return o; for(size_t j=0; j<o.size(); ++j) offset += o.ring(j).size() * 16 + 4; offset += 9; } return wkb(0); } ml::rect_d bounds() const { ml::rect_d r(ml::rect_d::void_rect()); switch (type()) { case wkb_polygon: { for(size_t i=0; i < size(); ++i) { line l = ring(i); r.bounds(l.begin(),l.end()); } } break; case wkb_multi_polygon: { for(size_t i=0; i < size(); ++i) { r.extend(polygon(i).bounds()); } } break; case wkb_point: { r.min.x = r.max.x = *(double*)&m_data[5]; r.min.y = r.max.y = *(double*)&m_data[13]; } break; case wkb_line_string: { line l = linestring(); r.bounds(l.begin(),l.end()); } break; case wkb_multi_line_string: { for(size_t i=0; i < size(); ++i) { r.extend(linestring(i).bounds()); } } break; default: break; } return r; } }; } // namespace ml

In principle, everything worked and gave correct results, but (where without this “but”) when we had to process several million objects, the time spent on processing did not suit us.
')
As a result of the profiling, it became clear why a lot of time was spent on calculating the necessary part of the WKB. Many intermediate objects were created on the stack and there was a small but recursion available. Scratching the back of the head, we decided to refine the code somewhat and get rid of unnecessary recursion, and indeed of unnecessary actions. To begin with, we added the code to calculate the end of the current element and get the beginning of the next element:

  inline const char * end_of_ring(const char *ptr) const { return ptr + 4 + (*(unsigned int *)(ptr) * 16); } inline const char * end_of_polygon(const char *ptr) const { for (size_t sz = *(unsigned int*)((ptr+=9)-4);sz--;) ptr += (*(unsigned int *)(ptr) * 16) + 4; return ptr; }

After that, the methods for obtaining elements were rewritten, they became simpler and more obvious:

Hidden text

  line linestring() const { return line((line::value_type*)(m_data+9), size()); } ml::point_d const & point() const { return *(ml::point_d*)(m_data+5); } wkb linestring(size_t index) const { assert(index>=0 && index<m_size); const char *ptr = m_data+9; while (index--) { ptr += 9 + (*(unsigned int*)(ptr+5) * 16); } return wkb(ptr); } wkb polygon(size_t index) const { assert(index>=0 && index<m_size); const char *ptr = m_data+9; while (index--) { ptr = end_of_polygon(ptr); } return wkb(ptr); } line ring(size_t index) const { assert(index>=0 && index<m_size); const char *ptr = m_data+9; while (index--) { ptr = end_of_ring(ptr); } return line((line::value_type*)(ptr+4),(line::value_type*)end_of_ring(ptr)); }

This has already increased productivity. But given the specifics of processing points, it was possible to further improve the architecture and convenience by adding the following method:

Hidden text

 template< typename T > T & apply( T & func) const { switch (type()) { case wkb_polygon: { const char * ptr = m_data+9; for(size_t i=0; i<m_size; ++i) { ml::point_d *begin = (ml::point_d*)(ptr+4); ml::point_d *end = begin + *(unsigned int *)(ptr); ptr = (const char *)end; func(begin,end); } } break; case wkb_multi_polygon: { const char * ptr = m_data+9; for(size_t i=0; i<m_size; ++i) { for (size_t sz = *(unsigned int*)((ptr+=9)-4);sz--;) { ml::point_d *begin = (ml::point_d*)(ptr+4); ml::point_d *end = begin + *(unsigned int *)(ptr); ptr = (const char *)end; func(begin,end); } } } break; case wkb_point: { ml::point_d *begin = (ml::point_d*)(m_data+5); ml::point_d *end = begin + 1; func(begin,end); } break; case wkb_line_string: { ml::point_d *begin = (ml::point_d*)(m_data+9); ml::point_d *end = begin + size(); func(begin,end); } break; case wkb_multi_line_string: { const char * ptr = m_data+9; for(size_t i=0; i<m_size; ++i) { ml::point_d *begin = (ml::point_d*)(ptr+9); ml::point_d *end = begin + *(unsigned int *)(ptr+5); ptr = (const char *)end; func(begin,end); } } break; default: break; } return func; }

we add a functor to compute a descriptive rectangle like this:

  struct bounds_counter { ml::rect_d r; bounds_counter() : r(ml::rect_d::void_rect()) {} void operator()(ml::point_d const *begin, ml::point_d const *end) { r.bounds(begin,end); } ml::rect_d const & bounds() const { return r; } };

bounds method after rewriting began to look like this:

 ml::rect_d ml::wkb::bounds() const { struct bounds_counter { ml::rect_d r; bounds_counter() : r(ml::rect_d::void_rect()) {} void operator()(ml::point_d const *begin, ml::point_d const *end) { r.bounds(begin,end); } ml::rect_d const & bounds() const { return r; } }; bounds_counter b; return apply(b).bounds(); }

This option will not work for g ++, it follows standards more strictly:

14.3.1 / 2: for a type of parameter.

but it is fully functional in clang and, judging by the search on the Internet, in MSVC ++ (honestly, we did not check). For universality and correctness, the definition of a functor should be placed in an unnamed namespace.

After all the improvements, the task execution time has decreased about four times. Also, when using this approach, it was convenient to write output of objects in JSON format:

Hidden text

 std::string ml::wkb::to_geo_json() const { struct point_formatter { std::stringstream &ss; unsigned int counter; bool parts; point_formatter(std::stringstream &s) : ss(s),counter(0), parts(false) {} void operator() (ml::point_d const *begin, ml::point_d const *end) { if (parts) ss << (counter ? ",[" : "["); ss << "[" << begin->x << "," << begin->y << "]"; for(++begin;begin != end;++begin){ ss << ",[" << begin->x << "," << begin->y << "]"; } if (parts) ss << "]"; ++counter; } point_formatter & make_parts(bool p) {parts = p; counter = 0; return *this;} }; std::stringstream ss; point_formatter fmt(ss); ss << std::fixed << std::showpoint << std::setprecision(6); ss << "{"; switch(type()) { case wkb_point: { ss << "\"type\":\"Point\",\"coordinates\":"; apply(fmt); } break; case wkb_line_string: { ss << "\"type\":\"LineString\",\"coordinates\": ["; apply(fmt); ss << "]"; } break; case wkb_multi_line_string: { ss << "\"type\":\"MultiLineString\",\"coordinates\": ["; apply(fmt.make_parts(true)); ss << "]"; } break; case wkb_polygon: { ss << "\"type\":\"Polygon\",\"coordinates\": ["; apply(fmt.make_parts(true)); ss << "]"; } break; case wkb_multi_polygon: { ss << "\"type\":\"MultiPolygon\",\"coordinates\": ["; for(size_t i=0; i < size(); ++i) { ss << (i ? ",[" : "["); polygon(i).apply(fmt.make_parts(true)); ss << "]"; } ss << "]"; } break; default: break; } ss << "}"; return ss.str(); }

The result was a fairly convenient class for working with data presented in WKB format. It became convenient to add new operations on objects. Reached decent performance.

All the code in the article can be found here and here .

Ershov Sergey,
programmer project Maps@Mail.Ru

Source: https://habr.com/ru/post/167517/

All Articles

Parsing WKB format without third-party libraries

More articles: