
Down with the shackles of MongoDB

Many of us rushed to master MongoDB with enthusiasm: the convenient JSON format, the flexible schema (or rather its complete absence), and the fact that only minutes pass from installing the system to first use are genuinely attractive. But after a while, once Mongo is firmly wired into the project, disappointment sets in. The simplest queries require constant trips to the documentation, slightly more complex ones can eat almost a whole working day, and if you need to join different collections, then alas...



And so some are returning to Postgres with its partial JSON support...



But, fortunately, a full-fledged replacement for Mongo is already being forged: AsterixDB, a full-fledged semi-structured Big Data DBMS, is on its way. The project is headed by UCI professor Michael Carey, a student of the legendary database pioneer Michael Stonebraker.


The project started simply as a research initiative in the field of Big Data and initially focused on creating a common stack for MapReduce and SQL. But a few years ago it was decided to build a Big Data JSON database on top of it. According to Michael Carey, “AsterixDB is Mongo done right.” So what are AsterixDB's main selling points?



1. Schema. Yes, a schema is still useful, and it is not necessary to get rid of it completely. I am sure that in any JSON store some of the fields are known in advance, fixed, and not subject to change. But naturally, Asterix does not force you to design the entire data schema up front. You can continue to live without a schema. But if you want to introduce a bit of order, the fields that are fixed go into the data schema, and the rest stay “open”. What does this give you? Data validation on insert, more compact storage, and a clear picture of what you have and where it lives.



Try it:



 create type TwitterUserType as open {
         screen-name: string,
         lang: string,
         friends_count: int32,
         statuses_count: int32,
         name: string,
         followers_count: int32
     }




We created the TwitterUserType data type; all the listed fields in it are fixed, but since the type is open, arbitrary other fields may be added. Now, based on this type, we can create an enclosing type:



 create type TweetMessageType as closed {
         tweetid: string,
         user: TwitterUserType,
         sender-location: point?,
         send-time: datetime,
         referred-topics: {{string}},
         message-text: string
     }




Here we see the "user" field, whose type is TwitterUserType. I have not dug into the code, but I suspect it is possible to describe nested structures inline, without assigning them to named types. And even if not, I am sure such functionality will appear soon.
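If inline nesting is supported, a declaration might look roughly like this (a hypothetical sketch, not checked against the actual grammar; the type and field names here are invented purely for illustration):

 create type BlogPostType as open {
         postid: string,
         author: {
             screen-name: string,
             lang: string
         },
         posted-at: datetime
     }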



Ah yes, the data types supported by AsterixDB: the usual primitives (boolean, int8/int16/int32/int64, float, double, string), temporal types (date, time, datetime, duration), and spatial types (point, line, rectangle, circle, polygon).

And the nested types: records { ... }, ordered lists [ ... ], and unordered lists {{ ... }} (we already saw {{string}} above).
2. Query language! Sooner or later in Mongo there comes a moment when you need to group and aggregate data, and produce some subset of the source fields supplemented by computed ones. That is when the serious headache begins. Well, AsterixDB has a full-fledged query language, with all the functionality of SQL, but designed specifically for semi-structured data. It is a simplification of the functional query language XQuery used in XML DBMSs. I will not send readers off to the W3C XQuery specification, although if you feel like it, go ahead! Instead, I'll try to write a mini-tutorial on AQL (Asterix Query Language).



The core of the query language is the FLWOR construct (pronounced almost like “flower”): For-Let-Where-OrderBy-Return. We'll add GroupBy to this as well, but a little later. Roughly translating to SQL: for ≈ FROM, let has no direct analogue (think of it as naming subexpressions), where ≈ WHERE, order by ≈ ORDER BY, and return ≈ SELECT.



For is practically the FROM clause in SQL; here we pick the collections to iterate over. After For we get a table of variables with their values, on top of which the other operations are applied, much as in SQL.



For example: after



for $x in users, $y in groups



we get records of the form:



($x : user1, $y : group1), ($x : user1, $y: group2), ...



That is, the usual cross product from SQL.



Let is an additional clause where you can introduce new variables. It does not add new tuples to the For result; it simply adds new variables and their values.
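A minimal illustration (assuming a FacebookUsers dataset like the one used in the examples further on):

 for $user in dataset FacebookUsers
 let $n := $user.name
 where $n != "anonymous"
 return $n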



OrderBy - everything is simple, the equivalent of SQL sorting.



Where - again, the usual filter, the full analogue of SQL Where.



Return - here we say what we want to get back. Unlike SQL, where we always return a list of columns, here you can construct any JSON structure you like. And right inside this clause you can drop in subqueries that generate different pieces of the result.



I hope all of the above has not put you off; let's look at a few examples. First, the most primitive:



 for $user in dataset FacebookUsers
 where $user.id = 8
 return $user




We just queried the FacebookUsers collection; this is equivalent to Mongo's db.FacebookUsers.find({"id": 8})



More typing, yes, but that is for simple queries. Once things get hairy, the AQL version becomes much easier to understand.



Now a more interesting query: here we will do a join and produce a new data type at the output:



 for $user in dataset FacebookUsers
 for $message in dataset FacebookMessages
 where $message.author-id = $user.id
 return
   {
     "uname": $user.name,
     "message": $message.message
   };




Everything seems obvious, doesn't it? The two fors run through user/message pairs, where sets the join condition, and return creates a new JSON object with the uname and message fields.



Now let's group all the messages of one user into one JSON object with a uname field:



 for $user in dataset FacebookUsers
 let $messages :=
     for $message in dataset FacebookMessages
     where $message.author-id = $user.id
     return $message.message
 return
   {
     "uname": $user.name,
     "messages": $messages
   };




Here we slipped a subquery into let and assigned the list of all the user's messages to the variable $messages. Another way to achieve the same:



 for $user in dataset FacebookUsers
 return
   {
     "uname": $user.name,
     "messages":
               for $message in dataset FacebookMessages
               where $message.author-id = $user.id
               return $message.message
   };




Personally, I like the second form better: instead of preparing the data “in advance” and inserting it into the object, we write the nested expression right in place.



AQL also has a GroupBy construct, but in practice it can be replaced by nested queries and is not strictly required (although Having does come in handy). On the other hand, query optimization will most likely be better with GroupBy, but that is a question of efficiency.
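For completeness, a sketch of the group by form, counting messages per author (based on my reading of the AQL syntax; the exact keywords may differ in current builds):

 for $message in dataset FacebookMessages
 group by $aid := $message.author-id with $message
 return
   {
     "author-id": $aid,
     "message-count": count($message)
   };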



In essence, we have covered the basics of AQL; let's write one last query:



  
   for $user in dataset TwitterUsers
   distinct by $user.name
   order by $user.followers_count desc
   limit 2
   return $user




Here we meet several standard SQL keywords: distinct, order by, limit. What the query does seems obvious: it removes duplicates by user name, sorts, takes the first 2 values, and forms the result.



What have we forgotten? Aggregates? Everything is quite simple here: an aggregate is just an ordinary function in AQL that takes a list of values as input. AsterixDB includes all the familiar aggregates, and in 2 variants at once: fully SQL-compatible and a more humane one (who remembers how SQL handles NULL inside aggregates?). And of course you can write your own.
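Since an aggregate is just a function over a list, it composes naturally with FLWOR; for example, the average follower count over the TwitterUsers dataset from the earlier example:

 avg(
     for $user in dataset TwitterUsers
     return $user.followers_count
 )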



What else is good about AsterixDB? The foundation of the query processing system, the Algebricks layer, is a flexible algebraic framework that will eventually make it possible to write a high-quality cost-based query optimizer for a distributed DBMS. For now the optimizer is at an early stage of development: it seems to pick indexes well, but there are still no statistics and no true cost-based optimization (although acceptable results can be achieved with hints in queries). There are several kinds of indexes, too: B-Tree; keyword, i.e. inverted lists for full-text search; R-Tree for geometry; and n-gram for similarity search.
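Creating an index is ordinary DDL; a sketch for the datasets used above (the index names are mine; the type keywords follow the B-tree and keyword index forms described in the AsterixDB documentation):

 create index fbAuthorIdx on FacebookMessages(author-id) type btree;
 create index fbMsgTextIdx on FacebookMessages(message-text) type keyword;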



AsterixDB is written in Java, and you can already add your own UDFs (User Defined Functions) in Java, or in Jython for those who love Python.



Consistency in Asterix is rather weak: ACID at the level of individual documents, as in MongoDB. There are no plans to support transactions spanning several documents or collections. Apparently that suits everyone.



Well, now, I hope, the spark of hope that it will be possible in principle to move off MongoDB to a saner platform has been planted; it remains to wait until the developers bring it to fighting shape.



What is left to finish?



Real replication with fault tolerance has not been finished yet. There is no streaming of results from the database API (internally it works fine, but for external requests the system first “prepares” the entire result and only then sends it via the REST API). There are no client libraries yet. A lot of query-optimization work remains, starting with collecting statistics. And, surely, a sea of small tasks.



So if you don't want to wait on the sidelines, you can roll up your sleeves and help the guys finish this cool thing.




Source: https://habr.com/ru/post/222083/


