That the volume of data, the complexity of their structure, and the complexity of the connections between them are growing at an incredible pace has been written on every fence for many years. The question of what to do with this whole dump usually hangs in the air. Or, more precisely, it rests on the concept of a "data model".
Although formally there are many data models (C. Bachman's network model, E. Codd's relational model, P. Chen's ER model, various object models, specialized models for temporal and spatial data, multidimensional cubes, and so on), all of them except the first two are intended for presenting data to the end user and/or for analysis by application tools; at the data access level they rely on one of the two basic models, usually the relational data model (RMD).
Nobody really knows what a data model is. There are many definitions, but no universally accepted wording. The author of the term “data model”, Edgar Codd (who is also the author of the RMD), defines it as a combination of three components:
- A collection of data object types that form the basic building blocks of any database of this model.
- A collection of general integrity rules that restrict the set of instances of these object types.
- A collection of operations applicable to such instances.
Modern definitions differ little from the above. If we combine all the features that characterize the concept into one general formulation, the definition looks like this:
“A data model is a logical definition of objects, operators and rules, which together form an abstract data access machine for a user”.
Such a definition, in fact, tells nobody anything at all. Meanwhile, Christopher Date once put it wonderfully:
“The data model is a theory, or a modeling tool, while the database model (database schema) is the result of modeling. The relationship between these concepts is similar to the relationship between a programming language and a specific program in that language.”
As we see, Date unwittingly gave a brilliant definition of a data model, compact and understandable even to a child: a model is a language!
Now let's take a closer look at the two basic data models (Codd's and Bachman's), which, unlike the others, represent fundamentally different views of data.
Network model
A graph representation of data is (at first glance) easy to implement on a computer and, in many situations, more natural. Websites, XML documents, relational and textual data can all be modeled as a graph. The universally recognized advantages of the network model are flexibility, the variety of data structures it can store, and high access speed. Its disadvantages include the need for element-by-element data processing, the greater complexity of writing processing procedures and, oddly enough, the difficulty of modifying the database schema, since graph edges must be modified whenever the data or their structure changes.
Perhaps the main complaints about the network model are that the logic of data retrieval (supposedly) depends on the physical organization of the data, and that integrity control is (supposedly) weakened because arbitrary links between elements are allowed. These claims are valid only if the network model is taken in its reference version, described in the report of the database languages working group of CODASYL (COnference on DAta SYstems Languages). Meanwhile, the CODASYL approach is far from the only one: many other implementations of the network model can be designed, including ones free of most of the disadvantages attributed to it.
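To make the last point concrete, here is a minimal sketch of a network-style store that is not CODASYL. It is only an illustration under stated assumptions: the names `Node`, `Network`, `link` and `follow` are hypothetical, links are logical identifiers rather than physical addresses, and the only integrity rule enforced is that both ends of a link must exist.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A record in the network: a logical id, named fields, named link sets."""
    node_id: int
    payload: dict
    # link name -> list of target node ids (logical ids, not storage addresses)
    links: dict = field(default_factory=dict)


class Network:
    """Hypothetical in-memory network store (an illustration, not CODASYL)."""

    def __init__(self):
        self.nodes = {}

    def add(self, node_id, **payload):
        if node_id in self.nodes:
            raise ValueError(f"node {node_id} already exists")
        self.nodes[node_id] = Node(node_id, payload)

    def link(self, source, name, target):
        # Integrity rule: arbitrary links are not allowed, both ends must exist.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("dangling link rejected")
        self.nodes[source].links.setdefault(name, []).append(target)

    def follow(self, start, *path):
        """Navigate element by element along a sequence of link names."""
        current = [start]
        for name in path:
            current = [t for n in current
                       for t in self.nodes[n].links.get(name, [])]
        return [self.nodes[n].payload for n in current]


net = Network()
net.add(1, name="ACME")
net.add(2, name="Sales")
net.add(3, name="Alice")
net.link(1, "departments", 2)
net.link(2, "employees", 3)

staff = net.follow(1, "departments", "employees")
```

Nothing in `follow` depends on how or where the nodes are stored, and `link` refuses dangling references, so neither of the two classic complaints applies to this particular design.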
It's funny, but the definition of the network data model was given by its main competitor, the same Codd. Funnier still, to this day comparisons of the two models set the abstract RMD against the concrete CODASYL specifications (Codd himself called this “comparing apples with oranges”!). This is really a sleight of hand that belittles the real capabilities of the network model, which could be a far more serious competitor to the RMD than it is now. Unfortunately, Codd's “slander” against the network model has made its way into almost every textbook, and it stems directly from equating the network model with the CODASYL approach. If for the RMD “obtaining physical access to a tuple is not a matter of the data model”, then claiming of the network model that “changing the database structure requires a lot of effort and time, since the operations of modifying and deleting data require swapping the pointers” is an obvious distortion. Or one more: “Pointers do not have to be presented at the physical storage level as pointers, however, users must treat them as real pointers - such is the network model.” That is not the “network model”, only Codd's idea of it!
Relational model
Since almost all DBMSs today are relational, it is natural to assume that the RMD has a number of important advantages (above all, a high degree of data independence). All the more so since many customers find it convenient to represent data in simple ways, with clear data structures and the familiar Structured Query Language (SQL). But the RMD has many flaws too, and they are fairly obvious: there are no direct references to the fields of a tuple; the cost of creating or upgrading a database schema is indecently high; and changing the schema is sometimes impossible without abandoning old data, because the data carry no structural information. A situation that is widespread in real tabular data, where an element cannot be unambiguously identified by its attributes, is simply forbidden in the RMD. A join on one or more attributes, the main mechanism the relational model uses to link data from different relations, always applies to a whole tuple, never to a part of it. Queries over multiple tables can take a very long time: conceptually, relational algebra operates with the Cartesian product, which is too expensive to compute directly, so various query optimization techniques are used instead. Normal forms, hierarchical in nature, also limit the complexity of the supported data structures. Meanwhile, Codd's own idea was that "when choosing logical data structures, there should be only one consideration: the convenience of the majority of users." Such a rigid insistence on simple data presentation (while the data themselves have tended all this time to grow more complex), compounded by the appearance of normal forms, has led to a situation where not a single (!) real RDBMS actually supports the RMD. In particular, giving up table normalization has become commonplace.
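The remark about the Cartesian product can be illustrated with a small sketch. The tables and the function names `join_via_product` and `join_via_hash` are made up for illustration; the first function follows the conceptual definition of a join (filter the full product), the second is one common optimization (a hash join), not a description of how any specific DBMS works.

```python
from itertools import product

# Hypothetical sample relations: (name, dept_id) and (dept_id, dept_name).
employees = [("Alice", 10), ("Bob", 20), ("Carol", 10)]
departments = [(10, "Sales"), (20, "Research")]


def join_via_product(left, right):
    """Conceptual join: build every pair, keep the matching ones.

    Cost grows as len(left) * len(right).
    """
    return [
        (name, dept_name)
        for (name, dept_id), (d_id, dept_name) in product(left, right)
        if dept_id == d_id
    ]


def join_via_hash(left, right):
    """Optimized join: index the right side once, then probe it.

    Cost grows roughly as len(left) + len(right) plus the output size.
    """
    by_id = {}
    for d_id, dept_name in right:
        by_id.setdefault(d_id, []).append(dept_name)
    return [
        (name, dept_name)
        for name, dept_id in left
        for dept_name in by_id.get(dept_id, [])
    ]


naive = join_via_product(employees, departments)
fast = join_via_hash(employees, departments)
```

Both functions return the same pairs; only the amount of work differs. Note that in both cases whole tuples are matched and combined, which is exactly the limitation described above.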
Codd himself promoted three-valued logic (true, false, and NULL), and in 1990 even four-valued logic. However, because of the high complexity (!), C. Date and H. Darwen's Third Manifesto prohibits both undefined values and many-valued logic! I quote:
Third Manifesto
H. Darwen and C. Date, translation: M.R. Kogalovsky
RM Proscriptions
4. Each attribute of each tuple of each relation must have a value, which is a value from the corresponding domain.
Comments (ibid.):
In other words, no more undefined values and no more many-valued logic!
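For readers who have not met three-valued logic in practice, the following sketch shows what the prohibition removes. It uses Python's standard `sqlite3` module purely as a convenient SQL engine; the helper name `evaluate` is hypothetical, and SQLite reports the "unknown" truth value as NULL, which Python sees as `None`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def evaluate(expression):
    """Return the value SQLite assigns to a scalar SQL expression."""
    return conn.execute(f"SELECT {expression}").fetchone()[0]


# Each expression involves NULL; results follow three-valued logic.
expressions = ["NULL = NULL", "1 AND NULL", "0 AND NULL", "1 OR NULL", "NOT NULL"]
results = {expr: evaluate(expr) for expr in expressions}
```

Here `NULL = NULL`, `1 AND NULL` and `NOT NULL` evaluate to unknown (`None`), while `0 AND NULL` is false (`0`) and `1 OR NULL` is true (`1`). Under the proscription quoted above, none of these expressions could even be written, because no attribute may lack a value.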
The tendency to simplify runs like a red thread through the entire RMD, and it inevitably limits the complexity of the data themselves. In other words, the data in an RDB are chronically primitive: no complex structures can be kept there. And neither the DBMS nor the database designers can get around this limitation.
Codd's goal was to “move application programming to a level where relations are treated as operands, rather than processed element by element.” A remarkable goal: a language that lets you work with sets is indisputably convenient. The RMD, however, means working only with sets: there are no tuple operations, a single tuple is just a special case of a set, the concept of a cursor is prohibited, and so on. Even SQL, the closest relative of the RMD, finds such “short pants” too tight: in mathematics (and in SQL, which relational purists like Date fight to the death) tuples are ordered and duplicates are allowed. Moreover, Codd himself originally defined tuples that way! Date considers giving up ordering to be “the greatest contribution (!) of Codd to the RMD”. In OODBs, however, support for object identity is for some reason required again, i.e. "objects must have a unique identifier that does not depend on the values of their attributes". Why? Because the “greatest contribution” is complete fiction!
Suppose we have an RDB consisting of a single relation. Let us add an extra column whose values are the sequence numbers of the tuples (starting from one). Let the column names likewise be their sequence numbers (starting from zero). We declare column zero the primary key (a “surrogate” or “system-generated” key, as the authors of the Third Manifesto recommend “very strongly”). Now we can access columns and tuples by number (as in SQL), yet in strict accordance with the RMD. What has changed? We have merely lost direct access to fields and tuples (since the RM proscriptions prohibit any ordering of attributes or tuples) and automatically stopped fearing duplicate tuples and undefined values. And what do we gain from our relation now perfectly meeting the requirements of the Third Manifesto? Nothing! Outside the RMD we can refer to the tuples of this relation from other relations (or from itself) in exactly the same way, and we can compute the physical address of a tuple from its surrogate key value (and, if desired, verify it against the value of column zero), i.e. the foreign key really becomes a pointer. Thus the concepts embodied in SQL are at least as good as those of the RMD. So why “hopelessly follow the perversion of RDM, embodied in SQL”, as Date and Darwen wrote? Why is it that “in order to withstand the test of time, we should unambiguously reject SQL”? Something is wrong here!
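The thought experiment with the numbered column can be played out in a few lines. This is a sketch with made-up data; `rows`, `numbered`, `by_key` and `field_at` are hypothetical names, and a Python `set` stands in for a relation because it, too, has no order and no duplicates.

```python
# Three observations, two of them identical in every attribute.
rows = [("Smith", "NY"), ("Smith", "NY"), ("Jones", "LA")]

# As a relation (a set of tuples) the duplicate silently collapses.
as_relation = set(rows)

# Add column zero: a surrogate key holding the tuple's sequence number (from 1).
numbered = {(number, *row) for number, row in enumerate(rows, start=1)}

# The surrogate key now works like a pointer: it leads straight to one tuple.
by_key = {tuple_[0]: tuple_ for tuple_ in numbered}


def field_at(key, column):
    """Address a single field by tuple number and column number."""
    return by_key[key][column]
```

With the extra column the relation keeps all three tuples, and `field_at(2, 1)` reaches a field by two numbers, which is the positional access that SQL allows directly.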
We believe that abandoning order is an inevitable consequence of the “structural aspect”: RDB data are defined as a set of relations, i.e. the primary element is a group. Another completely inevitable and terrible consequence of the lack of order is the concept of a key. The concept of an identifier, removed from the data model by sleight of hand, cannot in principle be removed from the RDBMS: after all, even to read a key value, the tuple must first be accessed in some other way. As you can see, it is impossible to get rid of the concept of an identifier; one can only pretend that it does not exist during group data processing. Besides, to talk about data access speed while relying on search by key is a mockery of common sense. However, the RMD was never interested in questions of implementation or of obtaining physical access to a tuple: it prescribes associative (value-based) data access, and how this is implemented in a particular system is completely irrelevant to the model.
Database indexing not only fails to provide direct access to data but also causes a number of new problems. Index maintenance for unstructured data is much more complicated than for relational data. Attempting to index every element produces indexes several times larger than the original data. On strongly connected graphs, indexing is (in the worst case) too laborious. Thus indexes are too large to store, too resource-intensive to build, and too complex to maintain. Finally, and we see this as the main drawback, the data cease to be independent: the need to keep indexes up to date automatically brings back all the problems of early graph databases, because indexes are functionally equivalent to pointers.
There is only one type of data access that suits a real working DBMS, and it is called “direct” (and the key concept, by the way, explicitly prohibits it, unlike SQL).
The Third Manifesto: "RM prescriptions and proscriptions cannot be the subject of compromise." Even the unfortunate cursor is "strictly forbidden", and there is no more primitive form of navigation in nature! The RMD specifically forbids having such things, because the key concept immediately turns any navigation operation into a recursive SQL query: since tuples cannot have identifiers, we have to rummage through values, even if we call them a key. And the mandatory primary key must be produced no matter what. And identical tuples are not allowed, or else. No indexes and no crutches will save an ideologically defective design!
SQL
Attitudes toward SQL vary widely among users, sometimes to the point of being diametrically opposite. Personally, I like the following characterization: "SQL is one of the poorest languages, the crooked offspring of Donald Chamberlin's associates." More precisely, of their followers; the originators at least understood that it was not a programming language: “The development goal was to create a simple non-procedural language that could be used by any user, even without programming skills.” That is, it is a query language, and one for the end user! Back then nobody could have dreamed, even in a nightmare, of calling it a "programming language"! And now the wiki brazenly asserts that “SQL can be called a programming language,” sheepishly adding that “it is not Turing-complete”.
Another phrase from the wiki:
“SQL (structured query language) is a declarative programming language used to create, modify, and manage data in a relational database. This is the only communication mechanism between application software and the RDB.”
Go figure! The very expansion of the acronym says that it is a QUERY language, while the “explanation” says that it is a PROGRAMMING language!
The SQL standard is a song of its own. First, it is indecently bloated given the very weak functionality of the language (for example, the base part of the SQL:2003 standard runs to more than 1,300 pages of text). Even the Great and Mighty C needs no more than a dozen pages! And what do we have here? A single SELECT statement? What else? CREATE, INSERT, DELETE, UPDATE? GRANT, REVOKE? COMMIT, ROLLBACK, SAVEPOINT? Don't make me laugh!
There is a great multitude of versions of the standard. Attempts were made to squeeze in graph emulation (the concept of primary and foreign keys) and integrity control (indecently poor), and to extend the functionality somehow (regular expression support, recursive queries, triggers, non-scalar data types, some object-oriented features, extensions for working with XML data, window functions, the ability to combine SQL and XQuery in queries, and so on), but without much success. In addition, different DBMS vendors use different SQL dialects, which are generally incompatible with one another. At present, the whole burden of verifying a DBMS's compliance with the standard falls on its manufacturer.
SQL is not a truly relational language: it allows duplicate rows in tables, which within the relational data model (by the way, even more wretched than SQL) is impossible and unacceptable; it supports undefined values (NULL) and many-valued logic; it uses column order and references to columns by number; and it allows unnamed columns and duplicate column names. Theorists such as Date and Darwen curse it for this, which is, on the whole, understandable: as everyone knows from school biology, “intraspecific struggle is the cruelest.”
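Several of these departures from the relational model are easy to observe. The sketch below uses Python's standard `sqlite3` module as one convenient SQL dialect; the `visits` table and its contents are made up, and other DBMSs may differ in details such as how result columns are labeled.

```python
import sqlite3

sql_db = sqlite3.connect(":memory:")

# A table with no key at all: legal in SQL, impossible in the RMD.
sql_db.execute("CREATE TABLE visits (name TEXT, city TEXT)")
sql_db.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("Smith", "NY"), ("Smith", "NY"), ("Jones", None)],  # duplicate row, NULL
)

# ORDER BY 1 refers to the first result column by its number.
all_rows = sql_db.execute(
    "SELECT name, city FROM visits ORDER BY 1"
).fetchall()

# Only an explicit DISTINCT restores set semantics.
distinct_rows = sql_db.execute(
    "SELECT DISTINCT name, city FROM visits ORDER BY 1"
).fetchall()

# The same column name may appear twice in one result.
cursor = sql_db.execute("SELECT name, name FROM visits LIMIT 1")
column_names = [description[0] for description in cursor.description]
```

The table holds three rows although only two are distinct, one of them carries a NULL, the ordering clause addresses a column by position, and the last query returns two columns with the same name.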
SQL initially offered no means of manipulating even hierarchical structures, let alone general graphs. Even recursive queries (whose performance can degrade exponentially as the depth of the links grows) appeared in Microsoft SQL Server only in version 2005. SQL is not a conventional procedural programming language (it provides no means for building loops, branching, etc.), so DBMS vendors introduce various procedural extensions: stored procedures and procedural add-on languages. Practically every DBMS uses its own procedural language (PL/SQL in Oracle, PSQL in InterBase and Firebird, SQL PL in DB2, Transact-SQL in MS SQL Server, PL/pgSQL in PostgreSQL, etc.). So why standardize SQL, with its effectively single SELECT statement, which, moreover, can hardly do anything?
SQL's remaining quirks are inherited from the relational model: a rigid and immutable database schema, tables as the stored data structures, and no access speed to speak of (the key concept). The degree of data independence really is high, but it is precisely through this gap that defects creep into the data, most of which users and database owners do not even suspect.
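For reference, this is roughly what walking a hierarchy looks like once recursive queries are available. The sketch uses the recursive common table expression syntax as supported by SQLite through Python's standard `sqlite3` module; the `parts` table and its rows are made up for illustration.

```python
import sqlite3

tree_db = sqlite3.connect(":memory:")
tree_db.execute(
    "CREATE TABLE parts ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT,"
    " parent_id INTEGER REFERENCES parts(id))"
)
tree_db.executemany(
    "INSERT INTO parts VALUES (?, ?, ?)",
    [(1, "car", None), (2, "engine", 1), (3, "piston", 2), (4, "wheel", 1)],
)

# The recursive step joins the table to the rows found so far: every level
# of the hierarchy costs one more lookup by key value, not a pointer hop.
SUBTREE_QUERY = """
WITH RECURSIVE subtree(id, name, depth) AS (
    SELECT id, name, 0 FROM parts WHERE id = ?
    UNION ALL
    SELECT p.id, p.name, s.depth + 1
    FROM parts AS p JOIN subtree AS s ON p.parent_id = s.id
)
SELECT name, depth FROM subtree ORDER BY depth, name
"""

descendants = tree_db.execute(SUBTREE_QUERY, (1,)).fetchall()
```

The query returns the root and everything below it together with its depth. Every step down the tree is expressed as another join on `parent_id`, which is the emulation of graph edges by keys that the text objects to.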
NoSQL
The term “NoSQL” has recently become very fashionable: all sorts of software solutions are being actively developed and promoted under this banner, and all sorts of “smart words” like “linear scalability”, “clusters”, “fault tolerance” and “non-relational” are thrown around. NoSQL storage caught on as the main database for the social networks Instagram and Facebook, but no “NoSQL revolution” happened: relational databases firmly hold their dominant position. The point is not even the mighty "relational lobby", but that the product is still quite raw and lacks many basic things: universality, reliability, integrity and predictability. That is why the interpretation of the term “NoSQL” is increasingly shifting towards “Not Only SQL”. Initially “NonRel” was even suggested as an alternative, but that quickly died down, and now more than 90% of existing databases and DBMSs are built on the relational principle, which rests on a tabular layout of data, especially considering that the overwhelming majority of the other solutions, although called NoSQL, actually operate with tables anyway. The data in both camps are IDEOLOGICALLY tabular! In full accordance with the first of “Codd's 12 rules”, which says:
“All information in a relational database must be represented explicitly at the logical level and in exactly one way: as values in tables”.
Great Controversy
Today thoroughly forgotten, the Great Controversy between Codd and Bachman took place in 1974 at an ACM SIGMOD seminar, where each speaker sought to show the advantages of his approach. Contemporaries said at the time that the dispute ended in a draw, because nobody present (including the debaters themselves) understood anything. Now, in hindsight, Codd is considered to have won, because almost all actual DBMSs are relational. But to this day, as an echo of that dispute, discussions of data models oppose element-by-element and group data processing. In fact, such an opposition is simply nonsense.
A term used in all models (though not always explicitly defined) is the field: the smallest, indivisible, atomic data element of a given type. Here the term “data type” can carry a very diverse meaning: it may determine the amount of memory an element occupies, the set of acceptable values, the set of operations associated with the type, a device descriptor, a connection with other elements, and much more. Besides atomic elements there are data elements that are groups of fields, which different authors call "segments", "sets" and other names. There are two kinds of groups of homogeneous elements: the array (a group of elements of a given size) and the string (a group whose size is determined by a predefined terminator value). Groups of heterogeneous elements really differ from one another only in name: node (graph model), tuple (relational model), class (object approach), structure (programming languages).
With the terms agreed, the absurdity of the subject of the Great Controversy is easy to see. The problem is not in comparing the concrete CODASYL specifications with the abstract RMD. Nor is it in comparing the tabular and graph representations of data; in the end, those are just abstractions. Nor is it in the fundamental differences between the concepts of a key and an index. All of these are only direct consequences of the main mistake: comparing groups with atomic elements. In the RMD the primary element is the table, i.e. a group. But if we define a relation as an “array of tuples” and the tuple itself as a “collection of fields”, then it really becomes possible to compare the fields and tuples of the RMD with the nodes and edges of the network model.
Now let us repeat the attempt to compare the two models, bearing in mind that a node may contain not only an atomic element but also a structure, and that the term tuple is equivalent to struct (that is, a member of a group can itself be a group). We will then not only be able to compare the data models correctly, but will also automatically obtain an object data model in place of the relational one (albeit at the cost of destroying relational algebra). And such a comparison will not favor the RMD: we immediately see a set of unnecessarily stringent restrictions on the complexity of the data in a relation. Indeed: why may only homogeneous elements be gathered in a single relation? Why can't the internal elements of a tuple be addressed directly? Why may a tuple not have its own identifier independent of its data? Why must we be stuck with this unfortunate concept of a “key”? For what sins are we forbidden to have structures in a tuple more complex than a single-level tree? And why do we need this Procrustean bed of normal forms? For what? To simplify the formal apparatus of relational algebra and relational calculus? And who actually needs them? DBMS developers? Administrators? Users?
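What "tuple is equivalent to struct" buys can be shown in a few lines. This is a sketch only: the `Person` and `Address` types are hypothetical, and Python's built-in object identity stands in for a data-independent identifier.

```python
from dataclasses import dataclass


@dataclass
class Address:
    """A group of fields that will itself be a member of another group."""
    city: str
    street: str


@dataclass
class Person:
    name: str
    address: Address  # a structure nested inside a tuple


first = Person("Smith", Address("NY", "5th Ave"))
second = Person("Smith", Address("NY", "5th Ave"))

# Equal in every attribute, yet two distinct objects: identity does not
# depend on the values, so "duplicates" are not a problem.
same_value = first == second
same_object = first is second

# Direct access to an inner element of the tuple, with no key lookup.
first.address.city = "LA"
```

After the last line the two objects differ in value, and telling them apart never required a key: each had its own identity from the start.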
If, on the contrary, we bring the terminology of the network model into line with the relational one (the primary element is a set), we again get the opportunity to compare the two models correctly, and again the comparison will not favor the RMD! We will need to explicitly register links to the relevant metadata, since ( , , ) . , : , .. , () (, ). ( ) . Fairy tale!
Search for information
, , . , , (- FIND.CURRENT.ALL), ( - !) . . , — «», , , , , , . , — , .
: «». - ( ), : , , , , , , -, ( ). !
, , , . ( ), « » « » ( ), « » . , - , , , . : B ( b). !
. — :
« ( ) — , ». ? ! ? ! ( ).
: , , (, , , ) . , . , . — , !
Conclusion
« » « »
,
. , , , . « », , ? :
!
, , , , , , (), .. , «» . CODASYL, «--». , , , , , () . , , , . , . . , , / . . , (BOOLEAN, FLOAT, URL) — ( ) , . (TUPLE, ENUM, RELATION) — , ! ( ), ( ). , ADDRESS (, , , , ) ( -> -> -> -> ). , ADDRESS : , : « ».
, , , . «» () . : «» , , . , , , , «» .
: , . - . - . , ( ), CODASYL ( , ). . - , - ! !