Within one project, we faced the task of long-term storage of logically related data objects with multi-user access to their contents. There are various ways to meet this need with existing data management systems; nevertheless, a search was undertaken for a simple and productive solution, and its results are offered here for consideration.
This article discusses the general logic of data management without diving into the details of the software implementation, which are often self-evident.
Under the terms of the task, the object database management system, or rather the part of it responsible for multi-user access, operates on a homogeneous set of isolated objects. Note that the unified form of an object can accommodate a wide variety of informational entities: data, metadata, lists, transactions, script resources, documents, and others.
Data object
Initially, all that is known about the object is that it is serialized for long-term storage on disk and consists of two parts: the header and the actual content.
The object header has a fixed length and holds service information. In particular, the header stores the full length of the object in bytes, the object's own descriptor, and a state number.
A priori, an object contains a set of values identified by their ordinal numbers within the set. The object itself does not interpret its values in any way, but it "knows" that each value is characterized by a length in bytes, which makes it possible to calculate the size of the object. The value set is stored in a tuple format.
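For illustration, a minimal sketch of such a header in C; the field widths here are assumptions, since the text only names the fields themselves:

```c
#include <stdint.h>

/* Hypothetical fixed-length object header; the article names the
   fields but not their widths. */
typedef struct {
    uint32_t full_length;  /* full length of the object in bytes         */
    uint32_t descriptor;   /* the object's own descriptor                */
    uint32_t state_no;     /* number of the state that produced it       */
} ObjectHeader;

/* The content (a tuple of values) follows the header immediately. */
```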
Identification and access
To store objects, a conditionally infinite file space is used, logically divided into clusters. In the file storage, each object occupies one or more consecutive clusters. The sequence number of the first cluster serves as the FP (File Pointer) that locates an object in the storage.
For long-term storage of file pointers, the Data Allocation Table (DAT) is used: a simple, dynamically extended array of integer pointers. DAT cell indices serve as system identifiers of objects (IDO). When a new object is created, it is allocated the next free DAT cell, whose index becomes the permanent object identifier. This identifier is a unique and global handle to an object within a physical database.
When the system starts, the DAT is loaded from storage into RAM and is used to organize quick access to an object by its IDO according to the following scheme:

If the value extracted from the DAT is a file pointer, the object is loaded from storage into the object Cache, and the contents of the DAT cell are replaced with a memory pointer A*. The reverse replacement occurs when the object is evicted from memory.

Note that the memory pointer A* is not an absolute address but an offset from the beginning of the Cache, and it points directly at the contents of the object. The object header, as well as the service fields intended for temporary storage of the FP and for linking objects into chains, are located at negative offsets relative to A*. Notably, the value of A* also serves as the identifier of an object in memory.
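A sketch of this access scheme in C; how a DAT cell distinguishes an FP from an A* is not specified in the text, so a tag bit is assumed here purely for illustration:

```c
#include <stdint.h>

#define IN_CACHE_BIT 0x8000000000000000ULL /* assumed tag: cell holds A* */

extern uint64_t DAT[];        /* allocation table, loaded into RAM       */
extern char    *cache_base;   /* beginning of the object Cache           */

char *load_object(uint64_t fp);  /* reads clusters from storage to Cache */

/* Resolve a system identifier IDO to the object's contents in memory. */
char *resolve(uint64_t ido)
{
    uint64_t cell = DAT[ido];
    if (cell & IN_CACHE_BIT)              /* already cached: cell is A*  */
        return cache_base + (cell & ~IN_CACHE_BIT);

    char *a = load_object(cell);          /* cell is an FP: load it      */
    DAT[ido] = IN_CACHE_BIT | (uint64_t)(a - cache_base);
    return a;                             /* A* points at the contents   */
}
```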
Object Cache
The object Cache is a contiguous memory area statically allocated during system initialization. Its size is set as an option.

The main tasks of the Cache are to stay as full as possible and to quickly allocate the space required for object placement. For these purposes, the chains of free memory blocks are ranked, and eviction of long-unused objects occurs only when it is the only way to obtain free space of the required size. When allocating memory for an object, the necessary reserve for the service fields is automatically taken into account. The same mechanism is also used to organize free-memory management.
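The text does not describe the ranking itself; one conventional reading is free chains segregated by size class, sketched below under that assumption:

```c
#include <stddef.h>

#define RANKS 32

typedef struct FreeBlock {
    size_t            size;   /* block size in bytes                 */
    struct FreeBlock *next;   /* next block of the same rank         */
} FreeBlock;

static FreeBlock *free_chain[RANKS];  /* chains ranked by size class */

static int rank_of(size_t size)       /* rank = floor(log2(size))    */
{
    int r = 0;
    while (size >>= 1) r++;
    return r < RANKS ? r : RANKS - 1;
}

/* Take the first block large enough, starting from the matching rank.
   Only if every rank is empty would eviction of unused objects begin. */
FreeBlock *alloc_block(size_t need)
{
    for (int r = rank_of(need); r < RANKS; r++)
        for (FreeBlock **p = &free_chain[r]; *p; p = &(*p)->next)
            if ((*p)->size >= need) {
                FreeBlock *b = *p;
                *p = b->next;
                return b;
            }
    return NULL;  /* caller must evict objects to make space */
}
```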
States and Transactions
In the absence of external influences, the full set of objects that make up the database keeps its state unchanged indefinitely. Any action that retrieves the contents of objects without changing the state of the database is further referred to as a sample. An external action that changes the state of the database is treated as a transaction.
A transaction creates new objects or modifies the contents of existing ones. The following mechanism for making changes is used: a copy of the object is first created, acting as its newer version, and the changes are applied to that copy. The combination of newly created objects and modified copies forms the set of transaction objects. Accordingly, the new state of the base is the transaction objects plus the objects of the previous state, with the superseded older versions of modified objects ignored. De facto, the sequence number of the transaction is the number of the base state.
In a multi-user data access environment, certain efforts are required to preserve the logical integrity of the data, both during the execution of transactions and during sampling.
Data integrity
The concept of transactional integrity is well known: all or nothing.

The new state of the database is formed only upon successful completion of a transaction; at that moment the transaction objects become publicly available. The new state is committed when the user closes the transactional session, which he must open before starting the transaction. Until the transactional session is completed, the objects it creates or modifies are available exclusively to the user who opened the session. Any interruption of a transactional session, whatever the reason, entails simple destruction of the objects generated by the session.
In addition, one more obvious rule must be observed: a transaction started later than another cannot be completed before the completion of that earlier, higher-priority transaction.
The need to respect "consistent" data integrity is not nearly as obvious.

Suppose an operational financial report of the company is being formed, requiring a data sample that is very large in volume and lengthy in time, while the database continuously changes its state under a flow of transactions. Without restrictions, there is a non-zero chance that the report will not balance, because only part of an otherwise correct and successfully completed posting happened to fall into the sample (at the intersection of the time intervals). To prevent such a collision, a simple rule must be followed: any data sampling must be carried out against an unchanged base state. The user fixes the state by opening a sample session; within this session, all subsequent database states, that is, newer versions of objects, are ignored.
Thus, at any moment in time a single user either does nothing or is executing one of two sessions: transactional or sampling.
State objects
At least one database state, the background state, is always relevant. The background state is formed by the complete set of objects directly addressable from the DAT, part of which remains on disk and part of which is loaded into memory.
The dynamics of the multi-user process are such that users, executing transactions, generate a sequence of new temporary states of the database. Upon successful completion of the next transactional session, the temporary state generated by it becomes publicly available. When a sample session is opened, the user is presented with the highest-numbered available state. After existing for some time, temporary states that are no longer used for sampling are successively absorbed by the background state.
Any state, including the background one, owns some set of state objects, which are linked into a single chain. Note that the temporary state objects are the transaction objects mentioned above, which became relevant upon successful completion of the transactional session. In the chain of the background state, objects are ordered by decreasing time of non-use: accessing an object's contents automatically moves it to the end of the chain, so the objects at the head of the chain are the candidates for eviction from the Cache.
As mentioned earlier, an attempt to change an existing object automatically creates a new version of it. Thus, several versions of the same object can reside in memory simultaneously; these versions are linked into a single chain. The pointer (A*) to the first object in the version chain is stored in the DAT, and the chain itself allows the user to reach the "correct" version of the object for the desired state. The correct version (relevant from the user's point of view) is the one with the highest state number that does not exceed the required one.
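A sketch of this version lookup, assuming hypothetical service fields (a next-version link and a state number) and a chain ordered from newest to oldest:

```c
#include <stdint.h>

/* Hypothetical view of the service area preceding A*; the article
   only says that versions are linked into a chain. */
typedef struct Object {
    struct Object *next_version;  /* next (older) version in the chain */
    uint32_t       state_no;      /* state that produced this version  */
    /* header and content follow */
} Object;

/* The chain starts at the newest version (pointed to from the DAT).
   Pick the version with the highest state number not exceeding the
   state fixed by the user's session. */
Object *correct_version(Object *newest, uint32_t session_state)
{
    for (Object *v = newest; v; v = v->next_version)
        if (v->state_no <= session_state)
            return v;
    return 0;  /* no version visible in this state */
}
```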
The distribution of objects over states, linked into chains of states and versions, looks like this:

The absorption of a state is initiated at the end of a session by the last user of that state. When absorbing a temporary state, the background state (Background) first removes from memory the outdated versions of objects for which the absorbed state holds a newer version, and then simply appends the object chain of the absorbed state to its own.
The States Table (ST) is used to manage multi-user access, database states, and their objects.
State table
Each ST record contains a pointer (A*) to the first object of the state's object chain, the IDs of the user and of the lock object, and a count of the users using this state.
Three external pointers operate on the ST table, each holding an absolute state number of the database. If the table size is a power of two, the low-order bits of the absolute number are used as an index into the table, which gives the pointers circular movement over it.
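With a power-of-two table size, this indexing reduces to a single mask:

```c
/* The low bits of the absolute state number index the table, so the
   BS/LS/NS pointers move over it in a ring. The size is an assumption. */
#define ST_SIZE  256                      /* must be a power of two */
#define ST_INDEX(state_no)  ((state_no) & (ST_SIZE - 1))
```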
The Background State (BS) pointer contains the number of the background state. When the next temporary state is absorbed, the BS pointer is incremented. The condition for absorption is a zero usage counter in two states at once: the background state and the next one. The condition is checked whenever any session is closed, after the usage counter is decremented.
The Last Available State (LS) pointer contains the highest available state number; this is the number the user receives when opening a sample session. When the next transactional session is closed, the LS pointer is incremented, automatically acquiring the number of that session.
The Next State (NS) pointer provides the state number to a user opening a transactional session and is then incremented. If there are no open transactional sessions, the value of NS exceeds LS by 1; if there are no temporary states, the values of BS and LS coincide.
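A sketch of how the three pointers might move as sessions open and close; synchronization is omitted, and the exact bookkeeping is an assumption:

```c
#include <stdint.h>

#define ST_SIZE  256                      /* assumed, a power of two      */
#define ST_INDEX(n)  ((n) & (ST_SIZE - 1))

typedef struct {
    void    *chain;       /* A* of the first object in the state's chain */
    uint32_t user_count;  /* how many sessions are using this state      */
    int      lock_id;     /* lock object of an unfinished transaction    */
} StRecord;

StRecord ST[ST_SIZE];
uint64_t BS, LS, NS;      /* the three external pointers                 */

uint64_t open_sample_session(void)
{
    ST[ST_INDEX(LS)].user_count++;   /* pin the last available state     */
    return LS;
}

uint64_t open_transaction_session(void)
{
    return NS++;                     /* hand out the next state number   */
}

void close_transaction_session(void)
{
    LS++;                            /* the new state becomes public     */
}
```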
The state number received by the user when opening any session is stored in the corresponding record of the Client Table (CT). All user calls to the object service API include the client ID, and the rest of the data is retrieved from the corresponding CT record.
Client table
The client's system identifier is the ordinal number of the record allocated to it in the Client Table during authorization. This table registers both the system resources allocated to the client (the TCP socket and thread descriptors) and the resources it uses within the data management system, in particular the number of the state opened by the user, as well as various control flags.
Conflict resolution
Recall: transactions, regardless of their duration and result, must be completed strictly in the order in which they were started, in ascending order of their numbers. To organize such a sequence, a pool of named lock objects is used, created together with the ST table during system initialization.
Immediately upon opening a transactional session, a free lock object is requested from the pool; it is captured by the user's thread and held until the session is fully completed. The identifier of the captured lock object is stored in the ST record of the corresponding state. After that, the record of the previous state is checked for an incomplete transaction, indicated by a valid lock object identifier.
With parallel execution of transactions, there is always the unpleasant possibility that an earlier transaction will change the contents of an object after a later transaction has already used that object for its own purposes. The overhead of continuously tracking such conflicts is huge, and resolving one is possible only by re-executing all subsequent transactions.
If there is a previous incomplete transaction, the current one changes its execution logic to avoid conflict. Recall that disk access is the longest of all operations performed by the object service. Therefore, while the previous transaction is executing, the current one only simulates its own execution, without actually creating copies of objects or changing their contents; in the process, all objects the transaction touches are guaranteed to be loaded into the Cache. When the simulated execution is complete, the transaction re-checks the ST record of the previous state. If a lock object identifier is still there, the transaction "freezes" in an attempt to capture that object. After the lock object is released by the previous transaction, the current one continues execution, now in normal mode and with minimal execution time.
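The two-phase logic might be sketched as follows; the lock-pool API here is hypothetical:

```c
/* Hedged sketch of conflict avoidance between overlapping transactions. */
typedef struct Lock Lock;
Lock *pool_acquire(void);           /* take a free lock object from pool */
void  lock_wait(Lock *l);           /* block until the lock is released  */
void  lock_release(Lock *l);

Lock *prev_state_lock(void);        /* lock id from the previous ST record */
void  run_transaction(int simulate);/* simulate=1: only load objects      */

void transaction_session(void)
{
    Lock *my = pool_acquire();            /* held until the session ends */

    if (prev_state_lock() != 0) {
        run_transaction(1);               /* dry run: warm up the Cache  */
        Lock *prev = prev_state_lock();   /* re-check after simulation   */
        if (prev)
            lock_wait(prev);              /* "freeze" until predecessor  */
    }
    run_transaction(0);                   /* real run, objects in memory */
    lock_release(my);
}
```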
Abnormal situations
If something went badly wrong (for example, a fatal error in the object service or a hardware failure), auto-recovery from a checkpoint can save the database. In the milder case, when a client's thread has crashed without recovering, or has looped forever leaving its session open, a complete stall of the state pipeline must be prevented.
A hung transactional session is detected when a new transaction cannot obtain a free lock object from the pool, which for exactly this purpose contains half as many lock objects as the ST has records. In this case, the problem state has index [LS + 1].
A hung sample session is detected only when all free ST records have been exhausted, that is, when the BS and NS indices become equal. The hung state, with index [BS] or [BS + 1], will have a non-zero usage counter.
Regardless of the cause of the session failure, the consequences are handled identically: after the identifier of the "hung" user is obtained, its thread is forcibly stopped, all resources it used are released, the session is forcibly terminated, and the pipeline is then automatically drained in normal mode. All these actions are performed in the thread of the user who discovered the problem. Upon completion of the recovery operations, the hung user's thread is restarted and makes at least one attempt to retry the unsuccessful session. In case of a second failure, the user's thread is started in a state of waiting for external events. This entire process is governed by flags set for the user in the Client Table.
Tuple of values
The contents of an object, the set of its values, are stored in a tuple format. The properties of a tuple allow it to be used as a universal method of organizing data, in terms of both storage and access. It is worth mentioning that the memory manager (MM) of the object service, which supports the operation of all its parts including the object Cache, is designed from the start around the tuple format.
Logically, a tuple is a sequence of elements identified by their ordinal number in the tuple. A tuple element contains a value characterized by its length; the length is known a priori and is placed in the value header. Physically, the tuple is implemented as an array of relative pointers, each holding the offset of the beginning of its value relative to the beginning of the tuple. Both offsets and lengths are measured in bytes.
The tuple has a number of remarkable properties.
First of all, a value in a tuple may itself be another tuple. From this point of view, the entire contents of an object can be considered one value. The length of any value is known from its header, and hence the number of elements in a tuple is also known.
The order of elements in a tuple is strict and unchangeable: the "insert" operation is prohibited. New elements, however, can safely be appended to the existing set.
In a tuple, an uninitialized value has a zero offset, which distinguishes it from an "empty" value with zero length. An uninitialized value can be operated on without conflict, including treating it as an "empty" value.
A tuple can hold a data structure of arbitrary complexity. Logically, the tuple format resembles XML, but with indices instead of tags and the ability to operate on more than just text values. A separate value within a complex structure can be accessed directly, using a sequence of indices (a route) as its address, either from the outermost tuple or relative to the owning tuple.
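A sketch of this layout and of both access paths, direct and by route; the field widths are assumptions, the article fixes only the logic:

```c
#include <stdint.h>

/* Offsets are in bytes; zero means an uninitialized value, and
   negative offsets are possible for tuples scattered in RAM. */
typedef struct {
    uint32_t count;        /* number of elements in the tuple           */
    int32_t  offset[];     /* offset of each value from the tuple base  */
} Tuple;

typedef struct {
    uint32_t length;       /* value header: length of the value, bytes  */
    char     data[];
} Value;

/* Direct access to element i of a tuple. */
Value *tuple_value(Tuple *t, uint32_t i)
{
    if (i >= t->count || t->offset[i] == 0)
        return 0;                      /* uninitialized value           */
    return (Value *)((char *)t + t->offset[i]);
}

/* Access by a route: a sequence of indices into nested tuples. */
Value *route_value(Tuple *t, const uint32_t *route, int depth)
{
    Value *v = 0;
    for (int i = 0; i < depth; i++) {
        v = tuple_value(t, route[i]);
        if (!v)
            return 0;
        if (i + 1 < depth)             /* every non-final step must     */
            t = (Tuple *)v->data;      /* itself be a tuple             */
    }
    return v;
}
```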
A tuple can create instances of itself. An instance of a tuple differs from a copy in that the offsets of all its values that are not themselves tuples are zeroed. In other words, an instance is a copy of the structure.
A tuple of values can exist both in serialized form (a contiguous run of bytes for storage on disk) and in an arbitrary form in which individual values of the tuple are located in different places in RAM. An offset relative to the beginning of the tuple may then also be negative.
The last two of the listed features of the behavior of a tuple are actively used when modifying an object during a transactional session.
Object change
In fact, a change to an object is a modification of one or several of its values. Although it was said earlier that a copy of the object is created to receive the changes, there is actually no need to copy the object in full. Instead of a copy, an instance of the object with zeroed tuple pointers is created in the Cache. When such an uninitialized value is accessed, the value retrieved from the previous version of the object is returned, which saves memory.
A new value assigned to the object is also placed in the object Cache, after which the offset of the new value relative to the object instance is written into the corresponding element of the tuple.
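A sketch of assigning a new value to an instance, assuming a hypothetical cache_alloc helper:

```c
#include <stdint.h>
#include <string.h>

typedef struct {             /* the tuple layout from the earlier sketch */
    uint32_t count;
    int32_t  offset[];
} Tuple;

char *cache_alloc(uint32_t size);  /* allocation inside the object Cache */

/* Assign value i of an object instance created for modification: the new
   value is placed in the Cache and its offset written into the tuple.
   A negative offset appears when the value lands before the instance.   */
void set_value(Tuple *inst, uint32_t i, const void *data, uint32_t len)
{
    char *v = cache_alloc(sizeof(uint32_t) + len);
    memcpy(v, &len, sizeof(uint32_t));           /* value header: length */
    memcpy(v + sizeof(uint32_t), data, len);
    inst->offset[i] = (int32_t)(v - (char *)inst);
}
```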
Then, upon successful completion of the transaction, the transaction objects formed in this way must be unloaded to disk.
Saving objects
The file space allocated for storing objects is not only clustered but also divided into data banks of 2^N clusters each. Each bank occupies one file whose name is the sequence number of the bank. Thus, when accessing an object on disk, its FP is converted first to a file name and then to a cluster number within that file.
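A sketch of this conversion; the value of N and the file-naming scheme are assumptions:

```c
#include <stdio.h>
#include <stdint.h>

#define BANK_BITS 13   /* 2^13 clusters per bank: an assumed value of N */

/* An FP is the global sequence number of the object's first cluster. */
void fp_to_location(uint64_t fp, char *filename, size_t n, uint64_t *cluster)
{
    uint64_t bank = fp >> BANK_BITS;              /* bank = file name   */
    *cluster = fp & ((1ull << BANK_BITS) - 1);    /* cluster inside it  */
    snprintf(filename, n, "%llu.bank", (unsigned long long)bank);
}
```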
To minimize disk operations, as well as the time of automatic system recovery after a crash, all objects, regardless of their primary location, are saved into a single bank. For this purpose, a contiguous memory area of the appropriate size is reserved (32 MB is optimal), and transaction objects are written into this memory bank successively until it is full. Before writing, the lengths (in clusters) of all the objects are summed, and if there is not enough free space in the memory bank, a new bank is requested from the system and the full one is queued for the write thread.
Transaction objects are saved when the transactional session is closed. Writing an object into the bank begins at the start of the next free cluster, and a new FP is formed for the object. A new FP is not calculated if the previous version of the object is already present in this bank and its size allows the new version to be written in its place. A full version of the object is written, with all the values of its tuple, both the modified ones and those borrowed from the previous version. During the write, the object is serialized, with new pointers (offsets) calculated for the tuple.
After the session completes, the modified transaction objects in the Cache become publicly available as is, that is, with an incomplete tuple. These objects must eventually replace their previous versions.
Version merge
Naturally, if an object has a subsequent version in memory, it cannot be evicted from the Cache to free space, even if it sits at the very head of the eviction chain. A different eviction order is provided for such an object; until then, the next candidate on the waiting list is evicted instead.
The previous version of an object (hereinafter version [0]) is pushed out of memory by its subsequent version [+1], and this happens during the absorption of a temporary state by the background state. Here there is one subtlety of execution.
Version [0], usually loaded from disk, occupies a contiguous area in the Cache, while the tuple and new values of version [+1] lie in different places, aggravating memory fragmentation. The scenario of absorbing version [+1] therefore looks more attractive than that of displacing version [0]. If a new value of version [+1] does not exceed the existing size, it is simply copied into the body of version [0] and the memory it occupied is released. Otherwise, the new value remains outside the object as is, a new offset is calculated for it, and a flag is set on the object requiring the normal eviction process to check the object for fragments lying outside it.
Transaction formalization
During the execution of a transactional session, the transaction itself is formalized in the object format. The elements of this object's tuple are the elementary actions, such as creating an object or changing one of its values. The header of the object stores, among other things, the identifier of the user on whose behalf the transaction was executed and the date/time at which the transaction began. If the transactional session completes successfully, the transaction formalized in this way is placed in the write queue when the session is closed.
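A possible encoding of one elementary action inside such a tuple; the concrete layout is an assumption, the text fixes only the idea:

```c
#include <stdint.h>

/* Hypothetical encoding of one elementary action of a formalized
   transaction. */
typedef enum {
    ACT_CREATE_OBJECT,      /* create a new object                      */
    ACT_SET_VALUE           /* change one value of an existing object   */
} ActionKind;

typedef struct {
    uint32_t kind;          /* an ActionKind                            */
    uint64_t ido;           /* object the action applies to             */
    uint32_t value_index;   /* tuple element, for ACT_SET_VALUE         */
    uint32_t data_length;   /* length of the new value that follows     */
    /* value bytes follow */
} Action;
```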
It is worth noting that the order in which the transaction and the bank of objects formed while closing one session are queued depends on whether the transaction objects fit into that bank. If they did, the formalized transaction is queued first. If a new bank had to be opened to accommodate the objects of this transaction, the object bank is queued first.
A sequence of formalized transactions stored on disk forms a transaction log.
Transaction log
The complete transaction log is the primary, event-based form of the database's existence. Sequentially re-executing the contents of the log yields database contents completely identical to those obtained in the working multi-user mode. This property makes the log usable as an element of reliability.
In addition, the log has another function, a fiscal one, which has repeatedly proved its usefulness in disputes of the kind "...this program of yours messed up again."
Writing to disk
Data is saved to disk by the write thread, a background thread serving the file storage. The write thread extracts from the queue the information about the data to be saved: a pointer to the memory area, the size of the area, and the type of the data. Upon completing the write to the file, the thread itself frees the memory area handed to it.
The data type determines the target file to which the data will be written. There are only three main data types: formalized transactions, object banks, and the DAT.
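A sketch of a queue element carrying exactly this information; the concrete types are assumptions:

```c
#include <stddef.h>

/* An element of the write queue: where the data is, how much, and what
   kind it is (the kind selects the target file). */
typedef enum { DATA_TRANSACTION, DATA_BANK, DATA_DAT } DataKind;

typedef struct {
    void    *area;     /* memory area to write and then free */
    size_t   size;     /* size of the area in bytes          */
    DataKind kind;     /* selects the target file            */
} WriteJob;
```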
The write thread writes all transactions to a single, constantly open transaction log file, one after another, without alignment. Before a transaction is written, its checksum is calculated and fixed in its header, and at the end of the write a forced commit is performed.

When saving an object bank comes up in the queue, the write thread first creates the next checkpoint and only then writes the bank to its file. To create a checkpoint, the write thread closes the current log file, packs it into a backup copy, packs the contents of the bank into a separate file, packs a backup copy of the full DAT, and opens a new log file.

As a result of the write thread's actions, the file storage takes the following form:
The working area contains the DAT and the object banks. The backup area contains archived copies of the object banks and of the files that collectively form the complete transaction log. Each separate transaction log file, together with an archive copy of the DAT and archive copies of its own and previous object banks, forms a separate checkpoint.

Checkpoint
During initialization of the file storage at system start-up, its validity is checked. The storage is considered fit for work if the checksums match the values in the file headers and the contents of the working area correspond to the contents of the backup area.

If the server was not shut down properly, then at the very least the DAT in the working area and the last transaction log file will fail verification, and recovery from the most recent checkpoint starts automatically.

The recovery logic is fairly obvious: missing or damaged banks are restored from the backup copies of the object banks, the DAT is restored from the most recent checkpoint, and then the transactions from its log are re-executed sequentially. Since the integrity of the very last transaction in the log is not guaranteed, the process ends if a transaction's checksum does not match.

Garbage collection
We have only two physical resources at our disposal: memory and processor cycles. Since performance is the priority, the result is increased memory consumption. Thus, to reduce the volume of file operations, new and changed objects are saved to disk in bulk, into one file. Meanwhile, earlier versions of changed objects remain in their places in previous banks and will never be used again. To return disk space to the system, an internally "thinned" bank must periodically be compacted, reducing its size, not forgetting to update the DAT and overwrite the bank's archive copy. To facilitate the analysis of bank occupancy, the object service permanently keeps a cluster bitmap up to date.

Server architecture
All of the relatively simple functionality discussed above is grouped around the internal control data structures of the file storage and the object service. The file storage is responsible for the reliability of long-term data storage, providing both multiple backups of the database in its various forms, including the transaction log, and the auto-recovery mechanisms. The object service provides the minimal means for multi-user access to the contents of the database.
The data model, by means of structured metadata also stored in the object format, ensures the logical connectedness of data objects and the consistency of their values, in full accordance with the business logic of the application, which is integrated directly into the metadata.

Custom layer

Cursors, a kind of logical analogue of SQL cursors, are the server side of the interface resources used for user interaction with the contents of the database. Among other things, this very important layer ensures complete isolation of the interface from the internal system of identification of objects and their values.

The internal logic of the Model and the Cursors, as well as the ways to implement them, is a topic for a separate story.

Scalability
The potential for scalability is provided by two factors: the isolation of individual objects and the separation of user actions into samples and transactions.

If load capacity needs to be increased, the first thing that comes to mind is dedicating a master server out of the server pool. The master server stores the reference copy of the database and deals exclusively with the execution of transactions. Transactions come to it from the remaining servers, which service user requests and form samples. The results of execution, a stream of modified object versions, are broadcast by the master server to all the other servers serving data retrieval, simultaneously providing multiple replication of the database.

If the logical connectivity of the objects in the database is noticeably heterogeneous (objects can be grouped into domains with strong cohesion inside a domain and few links beyond it), the database can be distributed across several master servers, each serving its own group of domains.

A detailed analysis of the implementation specifics is beyond the scope of this article, but the principles themselves are a subject for discussion.

Behind the scenes
As is often the case when presenting fairly voluminous material, many relatively minor details have been omitted. For example, the following were not mentioned: segmentation and duplication of the DAT in memory; the special handling of "large" objects; the principles of mutual locking of user threads when accessing shared resources; the logic and implementation of garbage collection; reporting of the execution process to a log; collection and display of statistics; and much more.
It was important to show that multi-threaded management of objects rests on fairly trivial logic and is not such a difficult task to implement.

Summary
An implementation reduced to primitives is most likely to be the most efficient and productive, although opinions on this question may differ. An object presentation of data looks more natural than a tabular one. Simplified internal identification provides "step-by-step" accessibility of objects and promises a certain profit when implementing their logical connections.

The material is published in the hope that it will be useful to anyone interested in database architecture.