Foreword
The normalization of relations (tables) is one of the fundamental parts of the theory of relational databases. Normalization is intended to get rid of redundancy in the relationship and modify their structure so that the process of working with them is not burdened by various extraneous difficulties. If you ignore this approach, the design efficiency rapidly decreases, which, together with other such liberties, can lead to critical consequences.
It is useful for any specialist, by the nature of his activity, one way or another connected with the design of relational databases, to understand and be able to normalize relations. And with this post I would like to start a small series of publications devoted to normal forms, aimed at giving those Habrahabr readers, who for various reasons have not yet mastered this topic, the ability to easily fill this gap in knowledge.
The article is not intended to provide a detailed and accurate statement of the principles of normalization, since this is obviously not possible within the framework of a blog due to the large amounts of information required for publication with this approach. In addition, for this purpose there is a large amount of literature written by excellent experts. My task, I believe, is to popularly demonstrate and explain the
basic principles.
')
Terms Used
An attribute is a property of some entity. Often called a table field.
An attribute domain is the set of valid values ​​that an attribute can take.
A tuple is a finite set of interrelated valid attribute values ​​that together describe an entity (a row of a table).
A relation is a finite set of tuples (table).
A relationship diagram is a finite set of attributes that define an entity. In other words, it is a table structure consisting of a specific set of fields.
Projection is a relation obtained from a given by deleting and (or) rearranging some attributes.
The functional dependency between attributes (attribute sets) X and Y means that for any valid set of tuples in this respect: if two tuples match on the value of X, then they match on the value of Y. For example, if the value of the attribute "Company Name" is Canonical Ltd , then the value of the “Headquarters” attribute in such a tuple will always be Millbank Tower, London, United Kingdom. Designation: {X} -> {Y}.
First normal form
The relationship is
in the first normal form (abbreviated 1NF) if all its attributes are atomic, that is, if none of its attributes can be divided into simpler attributes that correspond to some other properties of the described entity.
We will call the original relation the main one, and the value of a non-atomic attribute - subordinate.
In order to normalize the original relation, the attributes of which are non-atomic, it is necessary to combine the schemes of the main and subordinate relations. In addition, if, for example, the table corresponding to the unnormalized relation is already contained in the database and is filled with information, the task is complicated by the fact that the value of a non-atomic attribute can in turn contain several tuples.
It is necessary to clarify the above with an example. Consider a relationship that has the attributes "Employee ID", "Full Name", "Position", "Projects". Obviously, one employee can work on several projects. Suppose that a project is described by an identifier, a name and a date of delivery.
Employee code | Full name | Position | Projects |
one | Ivanov Ivan Ivanovich | Programmer | ID: 123; Name: Steam boiler control system; Delivery date: 09/30/2011 ID: 231; Name: PS for control and notification of the maximum permissible concentration of various gases in the room; Delivery date: 11/30/2011 ID: 321; Name: Face Recognition Module for the protection system; Delivery date: 12/01/2011 |
It is easy to see that not all the attributes of this relationship are atomic (indivisible). In particular, the “Projects” attribute can be divided into three simpler attributes: “Project code”, “Name”, “Date of delivery”, and the value of this attribute for an employee Ivan Ivanovich Ivanov contains several tuples — information about three projects.
Note: from a certain point of view, the “full name” attribute can also be considered non-atomic, and in this case it should also be divided into simpler ones, such as “Last Name”, “First Name”, “Middle Name”.Now it is time to consider the relationship normalization algorithm to 1NF.
- Create a new relationship whose scheme will be obtained by merging the main and subordinate schemes of the original relationship into one.
- For each tuple of the original relation, include in the new as many lines as there are tuples in the subject relation of this tuple.
- Fill in the values ​​of the attributes of the new relationship that match the attributes of the subordinate relationship.
- Fill the lines of the new relationship with the values ​​of the atomic attributes of the source.
Apply this algorithm to the above relation. The scheme of the new relationship will consist of 6 attributes: “Employee ID”, “Full Name”, “Position”, “Project ID”, “Name”, “Delivery Date”. For one single tuple of a given relation, we add three new lines to the new one for each project (by the number of tuples in the subordinate relation). Now you can fill in the values ​​of separated attributes with tuples from the subordinate relation. Then we transfer to each of these lines the values ​​of atomic attributes: “Employee ID”, “Full Name”, “Position” (as you guessed, all three lines will contain the same values ​​of these attributes).
The result will look like this:
Employee code | Full name | Position | Project code | Title | Date of delivery |
one | Ivanov Ivan Ivanovich | Programmer | 123 | Steam Boiler Control System | 09/30/2011 |
one | Ivanov Ivan Ivanovich | Programmer | 231 | PS for monitoring and notification of the exceedances of MPC of various gases in the room | 11/30/2011 |
one | Ivanov Ivan Ivanovich | Programmer | 321 | Face Recognition Module for Security System | 12/01/2011 |
Second normal form
It is clear that the relation in 1NF may also be redundant. To eliminate it is the second normal form. But before proceeding to its description, you must first identify the shortcomings of the first.
Let the original relation contain information about the supply of certain goods and their suppliers.
Supplier Code | City | City status | Product code | amount |
one | Moscow | 20 | one | 300 |
one | Moscow | 20 | 2 | 400 |
one | Moscow | 20 | 3 | 100 |
2 | Yaroslavl | ten | four | 200 |
3 | Stavropol | thirty | five | 300 |
3 | Stavropol | thirty | 6 | 400 |
four | Pskov | 15 | 7 | 100 |
It is known in advance that in this respect the following functional dependencies are contained:
{{Supplier Code, Product Code} -> {Quantity},
{Supplier Code} -> {City},
{Supplier Code} -> {Status},
{City} -> {Status}}
Primary key in relation to: {Supplier Code, Product Code}.
Obviously, the relationship is redundant: it describes two entities - the supply and the supplier. In this regard, the following anomalies occur:
- Anomaly insertion. Information about a supplier who has not yet delivered a single product can be added to an attitude.
- Anomaly removal. If there was only one delivery from the supplier, then deleting the information about it will also delete all information about the supplier.
- Anomaly updates. If you need to change any information about the supplier (for example, the supplier has moved to another city), then you will have to change the attribute values ​​in all delivery records from him.
The physical meaning of the redundancy of the original relationship lies in the fact that it describes
not one entity, but
two - the
supply and the
supplier .
To eliminate these anomalies, it is necessary to break the original relation into a projection:
- The first one should include the primary key and all non-key attributes that are clearly dependent on it.
- The remaining projections (in this case, it is one) will include non-key attributes that depend on the primary key implicitly , along with the part of the primary key on which these attributes depend explicitly.
As a result, two relationships will be obtained:
Supplier Code | Product code | amount |
one | one | 300 |
one | 2 | 400 |
one | 3 | 100 |
2 | four | 200 |
3 | five | 300 |
3 | 6 | 400 |
four | 7 | 100 |
The first relation now corresponds to the following functional dependencies:
{Supplier Code, Product Code} -> {Quantity}
Supplier Code | City | City status |
one | Moscow | 20 |
2 | Yaroslavl | ten |
3 | Stavropol | thirty |
four | Pskov | 15 |
The second relation is:
{{Supplier Code} -> {City},
{Supplier Code} -> {Status},
{City} -> {Status}}
Such a partition eliminated the anomalies described above: you can add information about the supplier who has not yet delivered the product, delete the delivery information without deleting information about the supplier, and easily update the information if the supplier moved to another city.
Now we can formulate the definition of the second normal form, to which, most likely, the reader has already been able to guess independently: the relation is
in the second normal form (abbreviated 2NF) if and only if it is in the first normal form and each of its non-key attributes is irreducibly dependent on primary key.
Literature
For a more in-depth and thorough study of this topic, we recommend the book “Introduction to Database Systems” by Chris J. Data, on the basis of which this article was written.