Map / Reduce DIY - Apache CouchDb

Fair warning: my opinion makes no claim to objectivity. But relational databases have never inspired me, to put it mildly.

No, I fully understand their use when an application really is about processing and storing large amounts of data: ERP systems, all sorts of storage systems, statistics along the lines of "last month they sold a hundred thousand pencils, this month two hundred thousand."

On the other hand, in most cases, when it comes to desktop (or web) applications that have no need to churn through millions of primitive records and instead work with relatively high-level, complex objects, the essence of “database design and engineering” comes down to repeating two actions:

a) split these high-level objects into a pile of simple fields (numbers, strings, and the complex dependencies between them) and scatter them across dozens of tables. This is usually not very difficult, but some (or many?) data types do not fit this model pleasantly or naturally - for example, tags on blog entries;

b) then doggedly assemble those fields back into objects, using four-story JOINs, megabytes of wrapper code, and ORM layers of varying clumsiness, depending on the developer’s qualifications; in short, fight the infamous O/R impedance mismatch by any means available. Meanwhile, handwritten JOINs are no miracle of performance or flexibility, and those generated automatically by a clever wrapper layer even less so.

In principle, ORM libraries in dynamic languages (see SQLAlchemy) are quite pleasant to use; however, they do not offer an elegant solution to another painful issue: upgrading the schema.

In general, many applications use databases to store complex data structures, while truly complex queries that exploit the internal dependencies in that data are rarely needed in practice, or not needed at all (apart from the mega-JOINs required just to pull your structures back out of the database). An ordinary RDBMS seems a poor fit for such applications: the problems mentioned above are painful to solve, and the millions of man-hours that database developers have spent implementing other features are of no use to them.

One solution is object-oriented databases, which are becoming quite popular and deserve a separate discussion. They solve the ORM problem transparently, but if we are talking about web applications (which interest us greatly in light of the promised version of defun.ru :), object-oriented databases are not quite what the doctor ordered: they do not solve the problems of horizontal scalability and data distribution, and the web is first of all a lot of textual information, which it would be nice to take into account somehow.

So, CouchDb is a document-oriented database. It stores documents - objects made up of a set of fields with arbitrary structure. Each document has only two required service fields: a name (_id) and a version (_rev). Names are unique and live in a flat namespace - imagine a giant directory of document files, like this:

  {
    "_id": "63086444D554D3094C080F96D5005B03",
    "_rev": "1837603925",
    "author": "lrrr",
    "tags": ["baz", "test", "ru"],
    "url": "http:\/\/incubator.apache.org/couchdb",
    "title": "couchdb home",
    "description": "boo boo ba ba",
    "type": "story",
    "comments": 1,
    "votes": 2
  }

Versions are needed to organize concurrent access to the database. Remember how your version control system works: if we want to change a document, we simply fetch it, change it, and try to put it back. If its version has not changed in the meantime, everything is fine; if it has, we can simply try to make the same changes again on the new document, or otherwise do some kind of merge (depending on the application). This is called optimistic locking. The main advantage is that nobody locks the document while it is being edited, so nobody has to wait for it to be unlocked. By the way, a similar mechanism is also available in some modern RDBMSs, only at the level of table rows (see http://www.google.com/search?q=%22row+versioning%22 ).
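To make the fetch, change, put-back cycle concrete, here is a minimal sketch in Python. It is a toy in-memory store, not CouchDb's actual implementation: the names DocumentStore, ConflictError and update_with_retry are made up for illustration, and the random revision string is an assumption about how a new version might be labeled.

```python
import copy
import uuid


class ConflictError(Exception):
    """Raised when a write is based on an out-of-date revision."""


class DocumentStore:
    """Hypothetical in-memory store that checks revisions on every write."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        # Hand out a deep copy so each caller edits its own version.
        return copy.deepcopy(self._docs[doc_id])

    def put(self, doc):
        current = self._docs.get(doc["_id"])
        # Reject the write if someone else has stored a newer revision.
        if current is not None and current["_rev"] != doc.get("_rev"):
            raise ConflictError(doc["_id"])
        stored = copy.deepcopy(doc)
        stored["_rev"] = uuid.uuid4().hex  # assumption: any fresh token will do
        self._docs[doc["_id"]] = stored
        return stored["_rev"]


def update_with_retry(store, doc_id, change, attempts=3):
    """Re-fetch the document and re-apply the change after each conflict."""
    for _ in range(attempts):
        doc = store.get(doc_id)
        change(doc)
        try:
            return store.put(doc)
        except ConflictError:
            continue
    raise ConflictError(doc_id)
```

Nothing is ever locked here: two writers can both fetch the same document, the first put wins, and the second gets a ConflictError and decides for itself whether to retry or merge.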

The interface to CouchDb is HTTP only, strictly REST, and the server's responses come in JSON format. At first this is somewhat alarming, since it is not the most efficient protocol, but because high-level documents are stored whole, there is no need to make 5-10 database queries for each one. And the advantages are many: first, any language can work with HTTP and JSON (and if it can't, it is easy to teach); second, it is easy to debug; third, CouchDb understands the HTTP ETag and If-None-Match headers, which means an HTTP cache can be bolted onto the database effortlessly.
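The caching point is easier to see with a small simulation. The sketch below does no real networking and does not use CouchDb's API: handle_get and CachingClient are hypothetical names, and using the document version as the ETag value is an assumption made for the example. It only shows the general If-None-Match exchange: the client sends the ETag it already has, and an unchanged document comes back as 304 with an empty body.

```python
import json


def handle_get(docs, doc_id, if_none_match=None):
    """Simulated server side: return (status, headers, body) for one document."""
    doc = docs[doc_id]
    etag = '"%s"' % doc["_rev"]  # assumption: the version doubles as the ETag
    if if_none_match == etag:
        # The client's copy is still current, so no body is sent.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, json.dumps(doc).encode("utf-8")


class CachingClient:
    """Simulated client that revalidates cached documents by ETag."""

    def __init__(self, docs):
        self._docs = docs
        self._cache = {}  # doc_id -> (etag, body)
        self.bytes_transferred = 0

    def fetch(self, doc_id):
        cached = self._cache.get(doc_id)
        etag = cached[0] if cached else None
        status, headers, body = handle_get(self._docs, doc_id, etag)
        self.bytes_transferred += len(body)
        if status == 304:
            # Reuse the body received earlier.
            return json.loads(cached[1])
        self._cache[doc_id] = (headers["ETag"], body)
        return json.loads(body)
```

Fetching the same unchanged document twice transfers its body only once; after the document gets a new version, the next fetch receives a full 200 response again.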

And all of this should scale out perfectly - after all, Amazon SimpleDb and Google BigTable are built around the same pattern. By an amazing coincidence, by the way, both SimpleDb and CouchDb are written in Erlang ;)

What distinguishes CouchDb from the Google and Amazon services is its more “advanced” functionality for querying data.

Naturally, less structured data is harder to process, and since we care so much about scalability, these queries should also be easy to distribute across a cluster of database servers. To do this, CouchDb uses the map / reduce pattern described in a famous paper by Google engineers.

In practice, it looks like this: view functions (the actual map () and reduce ()) are stored on the server in special documents; they transform the set of documents as needed and can be accessed through the same REST interface. They can be computed incrementally, with intermediate results preserved: if one or two documents were added or changed between two calls, the function is called only for those documents. They are written in JavaScript, but you can easily plug in Python, Ruby, or something else instead.

As an added bonus, there is support for full-text document search using any external library (so far the authors have hooked the Apache Lucene search engine up to CouchDb).
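Here is a rough sketch of the idea in Python rather than JavaScript, counting stories per tag. It is not CouchDb's view engine: map_tags, reduce_sum and IncrementalView are hypothetical names, and for simplicity only the map results are cached per document version, while reduce is recomputed on every query.

```python
from collections import defaultdict


def map_tags(doc):
    """Map step: emit one (tag, 1) pair for every tag of a story document."""
    if doc.get("type") == "story":
        for tag in doc.get("tags", []):
            yield tag, 1


def reduce_sum(values):
    """Reduce step: collapse all values emitted for one key into a total."""
    return sum(values)


class IncrementalView:
    """Hypothetical view that re-runs map only for new or changed documents."""

    def __init__(self, map_fn, reduce_fn):
        self._map_fn = map_fn
        self._reduce_fn = reduce_fn
        self._emitted = {}  # doc_id -> (rev, list of (key, value) pairs)
        self.map_calls = 0  # counts how often map actually ran

    def _refresh(self, docs):
        # Forget results for documents that no longer exist.
        for doc_id in list(self._emitted):
            if doc_id not in docs:
                del self._emitted[doc_id]
        # Run map only where the stored version differs from the cached one.
        for doc_id, doc in docs.items():
            cached = self._emitted.get(doc_id)
            if cached is None or cached[0] != doc["_rev"]:
                self.map_calls += 1
                self._emitted[doc_id] = (doc["_rev"], list(self._map_fn(doc)))

    def query(self, docs):
        self._refresh(docs)
        grouped = defaultdict(list)
        for _rev, pairs in self._emitted.values():
            for key, value in pairs:
                grouped[key].append(value)
        return {key: self._reduce_fn(vals) for key, vals in grouped.items()}
```

With three documents in the database, the first query calls map three times; after one of them changes, the next query calls it only once more. Because each map call looks at a single document, the work can in principle be split across servers, which is the whole attraction of the pattern.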

* * *

In the end, it is customary to kick the technology under review a little, but I would feel bad kicking CouchDb - it makes too pleasant an impression. Although, of course, this is still only an alpha version, with all that implies (reduce, for example, appeared in trunk three days ago). Yes, it is very slow: so far it handles on the order of tens of inserts per second (unless you use bulk update mode). And yes, it eats a lot of disk space, since all intermediate versions of a document are kept unless they are periodically deleted with a special “Compact Database” function; this, however, can be done in parallel without stopping the application. Still, for an alpha the system is very stable and already has, among other things, a very pleasant and functional web interface for administration and development.

Original post

More links:

Official site of the project
Damien Katz - Lead CouchDb Programmer
Two more blogs: Christopher Lenz, Jan Lehnardt
Top 10 Reasons to Avoid Document Databases FUD - a good article on why you should not be afraid of non-relational databases
Amazon SimpleDB and CouchDB compared
Ajatus - a distributed CRM system using CouchDb

Source: https://habr.com/ru/post/25841/
