What is this post about
- This post is a review of the options for managing data models that are known to the author from experience, hearsay, and reading documentation.
- It is also an attempt to classify the existing approaches to managing data models.
- Finally, it presents the idea and first sketches of a data model management system that should be free of the flaws of the previous ones.
Definitions and limitations
It is assumed that the reader is (or will someday become) an Enterprise Application developer: someone who often needs to deliver quickly and efficiently, but who is not afraid to climb into the wilds of JPA / JTA / RMI to file down the rough edges by hand.
Data is what is stored in the application database: information about customers, users, orders, and so on.
Metadata is a description of the data structure: which types of objects are stored in the database, which fields (attributes, elements) they have, and how the objects depend on each other. In general, a type can inherit the attributes of its parent type, and the same attribute can appear in two or more types that are not related by inheritance.
An Enterprise Application (most often) runs on an Application Server (WebLogic, JBoss) and some RDBMS (Oracle, Informix, MySQL). The author sees nothing shameful in assembling an AS yourself from Tomcat / Hibernate / JOTM / DBCP / etc., and that is very, very interesting, but it is beyond the scope of this post.
As the RDBMS, any of the standard ones supported by Hibernate / OpenJPA is assumed.
The post uses terms from XML Schema: namespace, type, attribute. The last two roughly correspond to the Java concepts of a class (a bean) and a property (a get + set pair, sometimes just a field).
Introduction. The simplest case
Large applications are most often not just applications with a large amount of data. Most often they are applications that work with a large number of heterogeneous data items whose structure differs from the business logic point of view. (The last point matters: the data structure can differ at the DBMS level, at the application level, and even within the application.)
The simplest case is to define the data model as a set of classes and a matching set of tables in the database. Roughly speaking: one class, one table. Each property of an object is represented by a property of a bean class and a column in the database (a minimal sketch of this mapping follows the list of drawbacks below). However, this mechanism has drawbacks that show up in the development and use of Enterprise applications:
- Adding to or changing the data model requires changes both to the database structure and to the program code, followed by recompilation, etc.
- As a result, it is impossible to do this "on the fly"
- More involved changes, such as moving an attribute from a child type to its parent type, also require hand-written scripts (DDL + DML) to update the database structure.
- Changing the structure requires a specialist with SQL / Java knowledge.
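To make the baseline concrete, here is a minimal sketch of the "one class - one table" mapping with plain JPA annotations; the entity, table, and column names are purely illustrative.

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// One class maps to one table; each bean property maps to one column.
@Entity
@Table(name = "CLIENTS")
public class Client {

    @Id
    @Column(name = "ID")
    private Long id;

    @Column(name = "NAME")
    private String name;

    @Column(name = "CITY")
    private String city;

    public Long getId() { return id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getCity() { return city; }
    public void setCity(String city) { this.city = city; }
}
```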
It is necessary, however, to note the advantages of this approach:
- The best price-to-performance ratio. Literally out of the box we get the most transparent data storage structure, the most obvious one both for the JPA layer (Hibernate, etc.) and for the RDBMS (and its administrator).
- From the point of view of a business logic programmer who does not change the data structure, we also get the most convenient API out of the box.
Note the important qualification in the last sentence: "business logic". We are talking about code that describes how data structures interact and change, that is, code that should and does know about the data structure. But if, for example, we want to edit beans through a WEB interface (or in any other way), then to write an editor that can edit 80% of objects without knowing their structure (a so-called generic editor), we will have to deal with Reflection / Beans / etc. and other words that are, in principle, not very scary. (The scary ones come at the end of the post.)
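As a small, hedged illustration of what such a generic editor has to do, the following sketch uses standard java.beans introspection to list the properties of an arbitrary bean without knowing its class in advance:

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

public class GenericBeanEditor {

    /** Prints every readable property of an arbitrary bean, without compile-time knowledge of its class. */
    public static void dump(Object bean) throws Exception {
        BeanInfo info = Introspector.getBeanInfo(bean.getClass(), Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            if (pd.getReadMethod() != null) {
                System.out.println(pd.getName() + " = " + pd.getReadMethod().invoke(bean));
            }
        }
    }
}
```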
Modern design tools can automate part of this work, for example updating the database structure from the code, or the other way around, generating or updating the code from a description of the data structure. The author is not certain, but there are probably tools that create both the code and the database structure from some abstract data schema written, for example, as an XML Schema. (The code can certainly be generated this way; see XML Beans, etc.) However, all these tools work "offline" and do not affect the running application (unless, of course, you run the update directly against the "live" system, and nothing good comes of that).
By the way, some of these auxiliary utilities can even be made to generate input forms for each object type.
Flexible data structures
The most flexible structure is one where each object is stored as a single record in the database, for example as XML. That is, one big table with two columns: the object ID and its content as XML. As you can guess, the main drawback of this structure is very poor database performance the moment we need, for example, to select all the clients from the city of Moscow: to do that, every value in the database has to be parsed.
To keep the structure flexible while putting less load on the database, the object is split into pieces that go into separate tables, for example (a sketch of this layout in code follows the list):
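A rough sketch of this "one big table" layout as a JPA entity; the names are illustrative, and the XML column is just a LOB that the database cannot query by field:

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Lob;
import javax.persistence.Table;

// Two columns: the object ID and the whole object serialized as XML.
// Finding "all clients from Moscow" means loading and parsing every row.
@Entity
@Table(name = "OBJECTS")
public class XmlStoredObject {

    @Id
    private Long id;

    @Lob
    private String xmlContent;

    public Long getId() { return id; }
    public String getXmlContent() { return xmlContent; }
    public void setXmlContent(String xmlContent) { this.xmlContent = xmlContent; }
}
```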
- Objects: ID, mandatory field 1, mandatory field 2
- Values: object ID, attribute identifier, value
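A hedged sketch of this two-table layout as JPA entities; the names are illustrative, and a surrogate key is added to the value rows purely to keep the example short:

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// OBJECTS: the ID and the mandatory fields
@Entity
@Table(name = "OBJECTS")
class StoredObject {
    @Id private Long id;
    @Column(name = "MANDATORY_1") private String mandatory1;
    @Column(name = "MANDATORY_2") private String mandatory2;
}

// VALUES: one row per attribute value of an object
@Entity
@Table(name = "ATTRIBUTE_VALUES")
class AttributeValue {
    @Id private Long id;                            // surrogate key, added only to simplify the sketch
    @Column(name = "OBJECT_ID")    private Long objectId;
    @Column(name = "ATTRIBUTE_ID") private Long attributeId;
    @Column(name = "ATTR_VALUE")   private String value;
}
```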
You can go further and, without limiting flexibility, split attributes of different types into different tables or columns. A similar scheme is successfully used in an application (cut) that processes several terabytes of data.
More disadvantages:
Flexibility has its price. First, the data manipulation layer has to be written by hand. Second, there is a strong temptation to save effort and hand the business logic an API that mirrors the structure of the database:
- get the object with such-and-such ID
- get the attribute with such-and-such ID
- update the value
- write the attribute with such-and-such ID of such-and-such object
- update the object version (+1)
Of course, from the point of view of the programmer of a generic data editor, it is very convenient to have methods like getAllAttributes(). From the point of view of business logic, however, it is inconvenient, especially if you have to remember the IDs of all the attributes you need (and they may well be numeric).
It should be noted, however, that the API does not have to match the structure of the database. The main thing is that 80% of actions can be performed in the simplest and most obvious way. That is, if we have clients in the database, getting a client's name or address should be one line of code like client.getAddress(). But for flexible structures, first, writing such wrappers can seriously hurt performance and, second, the structures tend to change...
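As an illustration of this trade-off, here is a hypothetical low-level attribute API and a thin typed wrapper over it; the interface, the class, and the attribute IDs are all made up for the example:

```java
// Hypothetical low-level API that mirrors the flexible storage structure.
interface FlexibleStore {
    Object getAttribute(long objectId, long attributeId);
    void setAttribute(long objectId, long attributeId, Object value);
}

// Thin wrapper that gives business logic its "one line of code".
// The attribute IDs are illustrative constants, not taken from any real schema.
class Client {
    private static final long ATTR_NAME = 101;
    private static final long ATTR_ADDRESS = 102;

    private final FlexibleStore store;
    private final long id;

    Client(FlexibleStore store, long id) {
        this.store = store;
        this.id = id;
    }

    String getName()    { return (String) store.getAttribute(id, ATTR_NAME); }
    String getAddress() { return (String) store.getAttribute(id, ATTR_ADDRESS); }
}
```

Such a wrapper hides the IDs from business logic, but every call still goes through the generic store, which is exactly where the performance and maintenance concerns above come from.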
And if such a wrapper is not written by the person responsible for the data access layer, be prepared that in a couple of years you will have as many "simplified" data access wrappers as there are enterprising programmers working with the "standard" API.
Structures with limited capabilities
In this section I would like to describe one more approach, which is used in one little-known CMS.
From the code's point of view, access to an object's attributes looks the same as with flexible structures: through methods like getAttribute / getAllAttributes / etc. But for a CMS, whose main tasks are to edit objects individually (without relations between objects) and simply to output an object as XML for further processing, this API is enough.
Interestingly, the list of data types is stored in a configuration file, which also stores, for each type, the list of its attributes and their types. The table structure is created or updated at startup based on this configuration file; later, as the data structure changes, the tables are updated on the fly.
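A rough sketch, under stated assumptions, of what "updating the tables from the configuration" can look like with plain JDBC metadata; real systems also have to handle identifier case, type changes, and column removal, which are left out here:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SchemaSync {

    /**
     * For one type described in the configuration (column name -> SQL type),
     * adds any columns that are missing from the live table.
     */
    public static void addMissingColumns(Connection conn, String table,
                                         Map<String, String> configuredColumns) throws Exception {
        // Collect the columns the table currently has
        Set<String> existing = new HashSet<>();
        try (ResultSet rs = conn.getMetaData().getColumns(null, null, table, null)) {
            while (rs.next()) {
                existing.add(rs.getString("COLUMN_NAME").toLowerCase());
            }
        }
        // Add whatever the configuration declares but the table lacks
        try (Statement st = conn.createStatement()) {
            for (Map.Entry<String, String> col : configuredColumns.entrySet()) {
                if (!existing.contains(col.getKey().toLowerCase())) {
                    st.executeUpdate("ALTER TABLE " + table
                            + " ADD " + col.getKey() + " " + col.getValue());
                }
            }
        }
    }
}
```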
Pros:
- an obvious data model for the DBMS
- flexibility on the fly
Cons:
- from the business logic point of view, the API is too flexible (see the previous section)
- you have to write your own data access layer, which at the moment, unfortunately, unlike the one for system objects (users, groups, etc.), ignores transactions, caches, and other niceties
Classification... an attempt
To describe a meta model, we need to answer the following questions:
- Where is the starting point of the model description? (Naturally, there should be exactly one.) Where is the information about objects and their attributes stored?
- How is the data storage in the database organized?
- Are the requirements of the first normal form satisfied?
- Are two different simple (non-multivalued) attributes stored:
- as two different columns
- as two different rows
- How is data access organized at the Application Server level?
- the standard JPA methods are used (EntityManager, etc.)
- custom data access classes are used
- How is data access organized at the level of business logic?
- standard methods like getName(), getAddress(), etc. are used
- a non-standard API like getAttribute(ID...) is used
- Is there access to meta-data from the program?
- yes, and the metadata can even be changed
- yes, read-only
- only through Reflection / Hibernate mapping / etc.
I want... the ideal from the author's point of view
The requirements for an ideal (from the author's point of view) system for describing and operating on data models follow easily from the previous section:
- the description of the data structure should live in the database, which will allow the model description to be changed quickly, possibly through the application itself
- the data itself should be stored in a normalized (up to 3rd-4th normal form) database, where each type has its own table. The management system itself must keep the database schema in line with the meta-data.
- data access should be through standard JPA / EntityManager interfaces.
- from the business logic point of view, the main fields of the main object types should be accessible through a simple API without additional resolving / casting / narrowing (that is, immediately after loading from the EntityManager)
- but the system must also provide access to the metadata, including, for a specific object, getting the list of all its fields
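A hypothetical sketch of how such a system might feel from business logic; Client, AttributeDescriptor, and MetaModel are made-up stubs that only illustrate the requirements above, not an existing API:

```java
import javax.persistence.EntityManager;
import java.util.List;

// Stubs for illustration only; nothing below exists in any library.
interface Client { String getAddress(); }
interface AttributeDescriptor { String getName(); Object getValue(Object entity); }
interface MetaModel {
    static List<AttributeDescriptor> allAttributesOf(Object entity) {
        throw new UnsupportedOperationException("sketch only");
    }
}

class IdealUsageSketch {
    void example(EntityManager em, long clientId) {
        // Standard JPA access to a dynamically generated entity class
        Client client = em.find(Client.class, clientId);

        // Main fields available through a simple API, no resolving or casting
        String address = client.getAddress();

        // The same object also exposes its metadata for generic tools
        for (AttributeDescriptor attr : MetaModel.allAttributesOf(client)) {
            System.out.println(attr.getName() + " = " + attr.getValue(client));
        }
    }
}
```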
Currently, the author is engaged in writing such a system using:
- Hibernate - as a data access driver
- CGLIB / ASM - for dynamically constructing classes based on their description, including annotations for Hibernate
- XML Schema - for describing data types and their attributes
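Purely as a rough illustration of these ingredients (not the author's actual implementation), generating a minimal annotated entity class at runtime with plain ASM might look roughly like this:

```java
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.FieldVisitor;
import org.objectweb.asm.MethodVisitor;
import static org.objectweb.asm.Opcodes.*;

public class EntityClassGenerator {

    /** Builds the bytecode of a minimal entity with an @Id Long field; names are illustrative. */
    public static byte[] generate(String internalName) {
        ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES);
        cw.visit(V1_8, ACC_PUBLIC, internalName, null, "java/lang/Object", null);

        // @Entity on the class
        cw.visitAnnotation("Ljavax/persistence/Entity;", true).visitEnd();

        // @Id private Long id;
        FieldVisitor id = cw.visitField(ACC_PRIVATE, "id", "Ljava/lang/Long;", null, null);
        id.visitAnnotation("Ljavax/persistence/Id;", true).visitEnd();
        id.visitEnd();

        // @Column private String name;
        FieldVisitor name = cw.visitField(ACC_PRIVATE, "name", "Ljava/lang/String;", null, null);
        name.visitAnnotation("Ljavax/persistence/Column;", true).visitEnd();
        name.visitEnd();

        // public no-arg constructor, required by Hibernate
        MethodVisitor init = cw.visitMethod(ACC_PUBLIC, "<init>", "()V", null, null);
        init.visitCode();
        init.visitVarInsn(ALOAD, 0);
        init.visitMethodInsn(INVOKESPECIAL, "java/lang/Object", "<init>", "()V", false);
        init.visitInsn(RETURN);
        init.visitMaxs(1, 1);
        init.visitEnd();

        cw.visitEnd();
        return cw.toByteArray(); // the bytes still need to be loaded by a custom ClassLoader
    }
}
```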
But more about that next time.