
How I abandoned db4o in an industrial system



We are a department of a large company developing an important system in Java SE / MS SQL / db4o. Over several years the project moved from a prototype to industrial operation, db4o turned into a brake on the calculation, and I wanted to replace it with a modern NoSQL technology. Trial and error led far from the original plan: db4o was successfully abandoned, but at the cost of a compromise. Below the cut are my reflections and implementation details.


Is db4o technology dead?


There are not many publications about db4o on Habr. On Stack Overflow there is only residual activity: a fresh comment on an old question, or a fresh question with no answers. Wikipedia believes the latest stable version dates from 2011.


All this forms a general impression that the technology is no longer relevant. There is even official confirmation: Actian decided not to actively pursue and promote the commercial db4o product offering for new customers any longer.


How db4o got into the calculation


The article Introduction to Object Oriented Databases describes the main feature of db4o: the complete absence of a data schema. You can create any object


User user1 = new User("Vasya", "123456", 25); 

and then just write it to the database file


    db.store(user1);

The stored object can later be retrieved with the Query.execute() method in exactly the form in which it was saved.
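
For reference, a minimal sketch of such a round trip in Java, assuming a simple User POJO like the one above (the file name is illustrative):

    import com.db4o.Db4oEmbedded;
    import com.db4o.ObjectContainer;
    import com.db4o.ObjectSet;
    import com.db4o.query.Query;

    public class Db4oRoundTrip {
        public static void main(String[] args) {
            ObjectContainer db = Db4oEmbedded.openFile("users.db4o");
            try {
                db.store(new User("Vasya", "123456", 25)); // no schema, no tables
                Query query = db.query();
                query.constrain(User.class);               // SODA query: everything of class User
                ObjectSet<User> result = query.execute();  // objects come back as they were saved
                for (User u : result) {
                    System.out.println(u);
                }
            } finally {
                db.close();
            }
        }
    }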


At the start of the project this made it possible to quickly display the audit trail with all the submitted data, without worrying about the structure of relational tables. It helped the project survive. Back then, in the sandbox, resources were scarce, and as soon as today's calculation finished, data for tomorrow started loading into MS SQL. Everything kept changing - go figure out what exactly was fed in automatically overnight. But the db4o file could be opened in the debugger, a snapshot of the desired day extracted, and the question "we submitted all the data, and you ordered nothing" answered.


Over time the survival problem disappeared, the project took off, and the handling of user requests changed. Opening a db4o file in the debugger and digging into a tricky question is something only a developer can do, and the developer is always busy. Instead there is a crowd of analysts armed with a description of the ordering logic and able to use only the piece of data visible to the user. Soon db4o was being used only to display the calculation history. Just like Pareto: a small part of the capabilities carries the bulk of the load.


In production the history file takes ~35 GB per day, and unloading it takes about an hour. The file itself compresses well (1:10), but for some reason the com.db4o.ObjectContainer library performs no compression. On a CentOS server the com.db4o.query.Query library writes and reads the file strictly in a single thread. Speed is the bottleneck.


A schematic view of the system


The information model of the system is a hierarchy of objects A, B, C and D. The hierarchy is not a tree: back-links such as C1 -> B1 are required for operation.


    ROOT
     |==> A1
     |     |==> B1 <------
     |     |     |==> C1 |
     |     |     |    |===> C1.$D
     |     |     |==> C2
     |     |          |===> C2.$D
     |     |==> B2
     |          |==> C3
     |               |===> C3.$D
     |==> A2
          |==> B3
               |==> C4
                    |===> C4.$D

The user interacts with the server through a GUI served by com.sun.net.httpserver.HttpsServer; client and server exchange XML documents. On first display the server assigns each user-visible level an identifier, which never changes afterwards. If the user needs the history of some level, the GUI sends the server that identifier wrapped in XML. The server determines the key values for the database search, scans the db4o file of the desired day, and pulls the requested object into memory together with all the objects it references. It then builds an XML presentation of the extracted level and returns it to the client.


When scanning a file, db4o by default reads all child objects down to a certain depth, extracting a rather large hierarchy along with the desired object. Reading time can be reduced by setting a minimum activation depth for an unneeded class Foo with conf.common().objectClass(Foo.class).maximumActivationDepth(1).
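
A minimal configuration sketch, assuming Foo is a class whose children are not needed when it is read (the file name is illustrative):

    import com.db4o.Db4oEmbedded;
    import com.db4o.ObjectContainer;
    import com.db4o.config.EmbeddedConfiguration;

    public class ActivationDepthConfig {
        public static ObjectContainer open(String path) {
            EmbeddedConfiguration conf = Db4oEmbedded.newConfiguration();
            // do not eagerly activate anything below Foo when an object referencing it is read
            conf.common().objectClass(Foo.class).maximumActivationDepth(1);
            return Db4oEmbedded.openFile(conf, path);
        }
    }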


Using anonymous classes leads to the creation of an implicit this$0 reference to the enclosing class. db4o handles and restores such references correctly (but slowly).


0. Idea


So: the admins get a strange expression on their faces whenever supporting or administering db4o comes up, data extraction is slow, and the technology is barely alive. The task: replace db4o with a current NoSQL technology. The Spring Data + MongoDB pair caught my eye.


1. The head-on approach


My first thought was to use org.springframework.data.mongodb.core.MongoOperations and its save() method, since it looks just like com.db4o.ObjectContainer's db.store(user1). The MongoDB documentation says that documents are stored in collections, so it is logical to represent the necessary system objects as documents of corresponding collections. There are also @DBRef annotations, which let you build relationships between documents roughly in the spirit of 3NF. Go.
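
A minimal sketch of that first thought, reusing the User POJO from above (the connection string and database name are assumptions):

    import com.mongodb.client.MongoClients;
    import org.springframework.data.mongodb.core.MongoOperations;
    import org.springframework.data.mongodb.core.MongoTemplate;

    public class MongoSaveSketch {
        public static void main(String[] args) {
            MongoOperations ops =
                    new MongoTemplate(MongoClients.create("mongodb://localhost:27017"), "Frames");
            User user1 = new User("Vasya", "123456", 25);
            ops.save(user1); // lands in a collection named after the class, much like db.store(user1)
        }
    }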


1.1. Unloading. A reference type as a map key


The system consists of POJO classes designed long ago, with no thought given to all these new technologies. There are fields of type Map<POJO, POJO>, with branched logic built around them. I try to save such a field and get an error


 org.springframework.data.mapping.MappingException: Cannot use a complex object as a key value. 

On this topic I found only a mailing-list exchange from 2011, where the suggestion was to write a custom MappingMongoConverter. For now I mark the problem fields @Transient and move on. The save succeeds, and I study the result.
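
A minimal sketch of this temporary workaround; the class and field names here are illustrative, not the system's real ones:

    import java.util.HashMap;
    import java.util.Map;
    import org.springframework.data.annotation.Transient;

    public class B {
        private String title;

        @Transient // MongoOperations skips this field, so the rest of the object can be saved
        private Map<C, D> modesByVariant = new HashMap<C, D>();
    }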


Saving goes into a collection whose name matches the name of the saved class. I have not used @DBRef annotations yet, so there is only one collection, and its JSON documents are quite large and branched. I notice that when saving an object, MongoOperations walks all (including inherited) non-null references and writes each of them as an embedded document.


1.2. Unloading. Named field or array?


The system model is such that class C may reference the same instance of class D several times: in a separate defaultMode field and among other references in an ArrayList, something like this

    public class C {
        private D defaultMode;
        private List<D> listOfD = new ArrayList<D>();

        public class D {
            ..
        }

        public C() {
            this.defaultMode = new D();
            listOfD.add(defaultMode);
        }
    }

After unloading, the JSON document contains two copies of D: an embedded document named defaultMode and an unnamed element of a document array. In the first case the document can be addressed by name, in the second by the array name plus an index. MongoDB collections can be searched either way. Working only with Spring Data and MongoDB, I concluded that ArrayList can be used, if carefully; I noticed no restrictions on arrays. The quirks showed up later, at the level of the MongoDB Connector for BI.


1.3. Loading. Constructor arguments


I try to read the saved document with the MongoOperations.findOne() method. Loading object A from the database throws an exception


 "No property name found on entity class A to bind constructor parameter to!" 

It turns out the class has a corpName field while the constructor has a String name parameter, with this.corpName = name assigned in the constructor body. MongoOperations requires that field names in the classes match the names of the constructor arguments. If there are several constructors, one of them must be selected with the @PersistenceConstructor annotation. I bring field and parameter names into line.
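
A minimal sketch of the fix, with the class name taken from the exception above and everything else illustrative - the parameter is renamed to match the field, and the persistence constructor is marked explicitly:

    import org.springframework.data.annotation.PersistenceConstructor;

    public class A {
        private String corpName;

        @PersistenceConstructor
        public A(String corpName) {                     // parameter name now matches the field name
            this.corpName = corpName;
        }

        public A(String corpName, boolean legacyFlag) { // other constructors are ignored by the mapper
            this(corpName);
        }
    }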


1.4. Loading. $D and this$0


The nested inner class D encapsulates the default behavior of class C and makes no sense apart from it. An instance of D is created for every instance of C, and vice versa: for every instance of D there is an instance of C that produced it. Class D also has descendants that implement alternative behaviors and can be stored in listOfD. The constructors of D's descendant classes require an already existing C object.


In addition to nested inner classes, the system uses anonymous inner classes. As is well known, both contain an implicit reference to an instance of the enclosing class. That is, inside each instance of C.D the compiler creates a this$0 reference pointing to the parent object C.


Once again I try to read a saved document from the collection and get an exception


 "No property this$0 found on entity class $D to bind constructor parameter to!" 

I remember that the methods of class D make full use of C.this.fieldOfClassC references, and that the descendants of class D require an already instantiated C as a constructor argument. In other words, I would have to enforce a particular object-creation order in MongoOperations so that the parent object C could be passed to the D constructor. A custom MappingMongoConverter again?


Maybe drop the anonymous classes and turn the inner classes into ordinary ones? Reworking, or rather re-architecting, an already implemented system is quite the undertaking...


2. The 3NF / @DBRef approach


I try to come at it from the other side: save each class into its own collection and link them in the spirit of 3NF.


2.1. Unloading. @DBRef is beautiful


Class C contains several references to D. If the defaultMode and ArrayList references are marked @DBRef, the document size shrinks: instead of huge embedded documents there are neat links. The JSON document in collection C gets the field

 "defaultMode" : DBRef("D", ObjectId("5c496eed2c9c212614bb8176")) 

In the MongoDB database a collection D is created automatically, along with a document containing the field


 "_id" : ObjectId("5c496eed2c9c212614bb8176") 

Everything is simple and beautiful.
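
A minimal sketch of what the @DBRef markup might look like; for simplicity D is shown here as a top-level class, which the real system does not have:

    import java.util.ArrayList;
    import java.util.List;
    import org.springframework.data.annotation.Id;
    import org.springframework.data.mongodb.core.mapping.DBRef;
    import org.springframework.data.mongodb.core.mapping.Document;

    @Document(collection = "C")
    public class C {
        @Id
        private String id;

        @DBRef
        private D defaultMode;                          // stored as DBRef("D", ObjectId(...))

        @DBRef
        private List<D> listOfD = new ArrayList<D>();   // each element stored as a DBRef as well
    }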


2.2. Loading. Class D constructor


When working with references, the C object knows that the default D object is created exactly once. If you need to traverse all D objects except the default one, it is enough to compare references:


    private D defaultMode;
    private ArrayList<D> listOfD;

    for (D currentD : listOfD) {
        if (currentD == defaultMode) continue;
        doSomething(currentD);
    }

I call findOne() and study my class C. It turns out that MongoOperations reads the JSON document and calls the D constructor for every @DBRef annotation it encounters, creating a new object each time. I end up with a strange construct: two different D references in the defaultMode field and in the listOfD array, where the reference should be one and the same.


Learning from the community: "Dbref in my opinion should be avoided when work with mongodb." Another consideration in the same vein from the official documentation: to resolve DBRefs, your application must perform additional queries to return the referenced documents.


The same documentation page says at the very beginning: "For many use cases in MongoDB, the denormalized data model where related data is stored within a single document will be optimal." Is that written about me?


The trick with the constructor suggests that one should not think here the way one thinks in a relational DBMS. The choice comes down to:

- embed related data in a single document, as the documentation recommends, or
- keep the references, but resolve them manually with additional queries.



I note for myself: one can avoid relying on @DBRef and instead use a field of type ObjectId, filling it in manually. In that case, instead of


 "defaultMode" : DBRef("D", ObjectId("5c496eed2c9c212614bb8176")) 

json document will contain


 "defaultMode" : ObjectId("5c496eed2c9c212614bb8176") 

There is no automatic loading: MongoOperations does not know which collection to look up the document in. The document has to be loaded with a separate (lazy) query specifying the collection and the ObjectId. A single query of this kind should return quickly; besides, every collection automatically gets an index on its ObjectId.
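
A minimal sketch of such a lazy lookup, assuming D is stored in a collection named "D" (the class and method names are illustrative):

    import org.bson.types.ObjectId;
    import org.springframework.data.mongodb.core.MongoOperations;
    import org.springframework.data.mongodb.core.query.Criteria;
    import org.springframework.data.mongodb.core.query.Query;

    public class LazyRefLoader {
        private final MongoOperations ops;

        public LazyRefLoader(MongoOperations ops) {
            this.ops = ops;
        }

        // Resolve the manually stored ObjectId only when the referenced D is actually needed.
        public D loadDefaultMode(ObjectId defaultModeId) {
            return ops.findOne(Query.query(Criteria.where("_id").is(defaultModeId)), D.class, "D");
        }
    }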


2.3. So what now?


Interim results. It was not possible to reproduce the db4o functionality on MongoDB quickly and easily; the obstacles described above boil down to:

- fields of type Map<POJO, POJO> cannot be saved without a custom MappingMongoConverter;
- constructor parameter names must match the field names;
- inner and anonymous classes with this$0 references cannot be reconstructed without controlling the order in which objects are created;
- @DBRef breaks reference identity, restoring the same reference as different objects.



Lazy loading could be added. A custom MappingMongoConverter could be attempted. Existing constructors / fields / lists could be reworked. But there are many years of layered business logic here - a far from trivial rework, with the risk of never getting it properly tested.


The compromise solution: build a new data-saving mechanism for the problem at hand, while keeping the existing mechanism of interaction with the GUI.


3. The third attempt, drawing on the first two


Pareto suggests that solving the speed problem for users will mean success for the whole undertaking. So the task is: learn to quickly save and restore user presentation data without db4o.


This loses the ability to examine a saved object in the debugger. On the one hand, that is bad. On the other hand, such tasks are rare, and all production releases are tagged in git. For fault tolerance the system serializes the calculation to a file before every unload; if an object ever needs to be examined in the debugger, one can take that serialized file, check out the corresponding build, and restore the calculation.


3.1. User presentation data


To build presentations of user levels, the system has a dedicated Viewer class. The Viewer.getXML() method receives a level as input, extracts the necessary numeric and string values from it, and generates XML.


If the user asks to see a level of today's calculation, the level is found in RAM. To show a calculation from the past, the com.db4o.query.Query.execute() method finds the level in the file. A level read from the file is almost indistinguishable from a freshly created one, and Viewer builds the presentation without noticing the substitution.


To solve my problem I need an intermediary between a calculation level and its presentation: a presentation frame (Frame) that stores the data and builds XML from what it has stored. The chain of actions for building a presentation becomes longer - a frame is generated each time, and the frame generates the XML:


    before: <level> -> Viewer.getXML()
    after:  <level> -> Viewer.getFrame() -> Frame.getXML()

When saving the history, frames for all levels have to be built and written to the database.


3.2. Unloading


This task turned out to be relatively simple and caused no problems. Mirroring the structure of the XML presentation, the frame got a recursive design: a hierarchy of elements with String, Integer and Double fields. The frame asks all its elements for getXML(), assembles the results into a single document, and returns it. MongoOperations handled the recursive nature of the frame perfectly well and raised no new questions along the way.
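
A minimal sketch of such a recursive frame, assuming the simplest possible element structure; the real Frame of the system is certainly richer:

    import java.util.ArrayList;
    import java.util.List;

    public class Frame {
        private String name;
        private String stringValue;   // leaf values extracted from the calculation level
        private Integer intValue;
        private Double doubleValue;
        private List<Frame> children = new ArrayList<Frame>();

        public Frame(String name) {
            this.name = name;
        }

        public void addChild(Frame child) {
            children.add(child);
        }

        // Recursively assemble the XML presentation from the stored values.
        public String getXML() {
            StringBuilder sb = new StringBuilder();
            sb.append('<').append(name).append('>');
            if (stringValue != null) sb.append(stringValue);
            if (intValue != null) sb.append(intValue);
            if (doubleValue != null) sb.append(doubleValue);
            for (Frame child : children) {
                sb.append(child.getXML());
            }
            sb.append("</").append(name).append('>');
            return sb.toString();
        }
    }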


Finally, everything took off! The WiredTiger engine compresses MongoDB document collections by default; on the file system a day's unload takes ~3.5 GB. A tenfold reduction compared to db4o is not bad.


At first the unload was organized simply: a recursive traversal of the level tree, with MongoOperations.save() for each level. Such an unload took 5.5 hours, even though building the presentations only reads objects. I add multithreading: recursively traverse the level tree, split all the levels into batches of a certain size, create as many Callable.call() implementations as there are batches, hand each batch to its own Callable, and run it all through ExecutorService.invokeAll().
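
A minimal sketch of this batched unload, assuming a Level type with a getFrame() method; batch size, pool size and all names here are illustrative:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.springframework.data.mongodb.core.MongoOperations;

    public class HistoryUnloader {
        private final MongoOperations ops;

        public HistoryUnloader(MongoOperations ops) {
            this.ops = ops;
        }

        public void unload(List<Level> allLevels, int batchSize, int threads) throws InterruptedException {
            List<Callable<Void>> tasks = new ArrayList<Callable<Void>>();
            for (int from = 0; from < allLevels.size(); from += batchSize) {
                final List<Level> batch =
                        allLevels.subList(from, Math.min(from + batchSize, allLevels.size()));
                tasks.add(new Callable<Void>() {        // one Callable per batch
                    public Void call() {
                        for (Level level : batch) {
                            ops.save(level.getFrame()); // build the frame and store it
                        }
                        return null;
                    }
                });
            }
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                pool.invokeAll(tasks);                  // run all batches concurrently
            } finally {
                pool.shutdown();
            }
        }
    }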


MongoOperations again asked no questions and coped with the multithreaded mode perfectly well. The batch size giving the best unload speed was chosen empirically; it came out to 15 minutes per batch of 1000 levels.


3.3. MongoDB Connector for BI, or how ordinary people work with this


The MongoDB query language is large and powerful, and by the time I got this far I had inevitably gained experience with it. The console supports JavaScript, so you can write beautiful and powerful constructs. That is one side. On the other hand, I can melt the brains of a good half of my fellow analysts with a query like


 db.users.find( { numbers: { $in: [ 390, 754, 454 ] } } ); 

instead of the usual


 SELECT * FROM users WHERE numbers IN (390, 754, 454) 

MongoDB Connector for BI comes to the rescue: it can present collection documents in tabular form. MongoDB is a document database and does not itself know how to present a hierarchy of fields and documents as tables. For the connector to work, the structure of the future table must be described in a separate .drdl file, whose format is very close to YAML. The file must map each field of the output relational table to the path of a JSON-document field at the input.


3.4. Quirks of using arrays


It was said above that for MongoDB itself there is no particular difference between an array and a named field. From the connector's point of view, an array is very different from a named field; I even had to refactor the already finished Frame class. An array of documents should be used only when part of the information needs to go into a linked table.


If the JSON document is a hierarchy of named fields, any field can be reached by giving the dot-separated path from the document root, for example x.y. If the mapping x.y => fieldXY is specified in the DRDL file, the output table will have as many rows as there are documents in the input collection. If some document has no x.y field, the corresponding table row gets NULL.


Suppose we have a MongoDB database called Frames with a collection A in it, and MongoOperations has written two instances of class A into this collection. The resulting documents are as follows - the first


 { "_id": ObjectId("5cdd51e2394faf88a01bd456"), "x": { "y": "xy string value 1"}, "days": [{ "k": "0", "v": 0.0 }, { "k": "1", "v": 0.1 }], "_class": "A" } 

and the second (the ObjectId differs in the last digit):


 { "_id": ObjectId("5cdd51e2394faf88a01bd457"), "x": { "y": "xy string value 2"}, "days": [{ "k": "0", "v": 0.3 }, { "k": "1", "v": 0.4 }], "_class": "A" } 

The BI connector cannot address array elements by index, so it is simply impossible to pull, say, the days[1].v field out of the array into the table. Instead, the connector can turn each element of the days array into a row of a separate table using the $unwind operator. That separate table is linked to the original one by a one-to-many relationship through the row identifier. In our example, table tableA is defined for the collection documents and tableA_days for the documents of the days array. The .drdl file looks like this:


    schema:
    - db: Frames
      tables:
      - table: tableA
        collection: A
        pipeline: []
        columns:
        - Name: _id
          MongoType: bson.ObjectId
          SqlName: _id
          SqlType: objectid
        - Name: x.y
          MongoType: string
          SqlName: fieldXY
          SqlType: varchar
      - table: tableA_days
        collection: A
        pipeline:
        - $unwind:
            path: $days
        columns:
        - Name: _id
          MongoType: bson.ObjectId
          SqlName: tableA_id
          SqlType: objectid
        - Name: days.k
          MongoType: string
          SqlName: tableA_dayNo
          SqlType: varchar
        - Name: days.v
          MongoType: string
          SqlName: tableA_dayVal
          SqlType: varchar

The contents of the tables will be as follows. Table tableA:

    _id                         fieldXY
    5cdd51e2394faf88a01bd456    xy string value 1
    5cdd51e2394faf88a01bd457    xy string value 2

and table tableA_days:

    tableA_id                   tableA_dayNo    tableA_dayVal
    5cdd51e2394faf88a01bd456    0               0.0
    5cdd51e2394faf88a01bd456    1               0.1
    5cdd51e2394faf88a01bd457    0               0.3
    5cdd51e2394faf88a01bd457    1               0.4

Summing up


The task in its original formulation could not be accomplished: you cannot just take db4o and swap in MongoDB. MongoOperations cannot automatically restore an arbitrary object the way db4o does. It can probably be achieved, but the effort would not be comparable to simply calling the db4o store/query methods.


Audit trail. db4o is a very useful tool at the start of a project. You simply write an object and later restore it, with no worries and no tables. All this with one important caveat: if the class hierarchy has to change (say, a class E is added between A and B), all previously stored information becomes unreadable. But at the start of a project this matters little, as long as there is no large accumulated pile of old files.


Once there was enough experience with MongoOperations, writing the unload caused no problems. Writing new code for the frame is much easier than reworking old code that is already in production.



Source: https://habr.com/ru/post/461417/

