📜 ⬆️ ⬇️

The second normal form (in SQL terminology)

Since the first post has already ripped off the roof to several habrazhiteli in general and shook karma to me in particular, I decided to write a translation of the article in terms of the SQL language. It will be useful to me and, possibly, not only to me. In general, since my childhood I strive to ground the theory to practice with the help of various means, among which was alcohol, and it seems to me useless to spend time studying something that cannot be thought of in real life.

The only funny thing is that all this rubbish under the cut was born in Codd’s mind even before the emergence of SQL as a language, and now everything’s in terms of SQL ...


What is the second normal form or 2NF? So that the three-year-old child really understood ...
First, let's look at the goals pursued by normalization. There are no discrete terms under the cut ...

The goal of the first normal form of reduction (1NF) is to enable the WHERE clause to be used when selecting data with a SELECT query. Since all column values ​​are of the same and predefined type, they can be compared with each other and with constants.
')
For example, if in the table 'Family' there is a column 'Kids' of type VARCHAR, we can easily compare the two lines 'Vasya' and 'Anya' and determine their lexicographical order, for example, by the operator>

FamilyKids
IvanovsVasya
PetrovsAnya


If at some line in the 'Children' field is indicated 'Vanya, Sasha', we can no longer unambiguously determine the order of the children. To compare the lines "Vasya" and "Vanya, Sasha" is meaningless in this situation. since the first is a string, and the second is already a list. Suppose we want to find all the children in the letter 'C'.

FamilyKids
IvanovsVasya
PetrovsAnya
SidorovsVanya, Sasha


Request view

SELECT Kids FROM Family WHERE kids LIKE '%' 

will not work out in this situation as needed and will not find Sasha, since LIKE cannot parse lists, extract values ​​and treat them as arguments for comparison with the template. 'Vanya, Sasha' in this case is a non-atomic value of the type of a list of strings. To teach SQL to work with such data, you need to either expand the language or simplify the model to 1NF. Decomposition to 1NF is achieved by splitting the composite values ​​into atomic values:

FamilyKids
IvanovsVasya
PetrovsAnya
SidorovsVania
SidorovsSasha


That is, the first NF deals with the structure of column values.

The second (and the third, but not about it today) NF already deals with keys and dependencies between columns of a table. We list its goals with explanations.

  1. The main goal of bringing to the second normal form is the desire to get rid of the redundancy of data storage and, as a consequence, to avoid anomalies of modification of these data (anomalies of change, insertion and deletion)
  2. Second in order, but not by value, the goal of normalization in 2NF is to split the data model into separate tables as much as possible so that they can be combined and used in queries with new methods not originally intended.
  3. Minimize efforts to change tables if necessary. The fewer dependencies between columns of a table, the fewer changes in it will be required when the data model is changed.
  4. The clarity of the tables for the user. How to keep all the data in one large table, it is easier to present the data as several related and logically separated labels. It is easier to read, perceive, design and maintain. In the end, any data model starts on a blackboard or paper in the form of circles, blocks and lines that children and programmers love to draw.


For example, we have a table

IDCD_nameArtist
tenSix Degrees Of Inner TurbulenceDream theater
20Metropolis, pt. 2: Scenes From A MemoryDream theater
thirtyMaster of puppetsDream theater


where the primary key is the id . This scheme is in 2NF, because the Artist column, which is not included in the key, is determined only by the key entirely.

The table is in 2NF if any non-key column is defined only by the whole key and cannot be determined by its part.

In general, it is possible to raise the question of the 2NF mismatch only if there are composite keys in the table. Tables with simple keys, as in the example, always have 2NF. The specified table is just an example of such a case, since both keys in it (and this is the ID and the natural key CD_name ) are simple, and they have no parts.

The 2NF mismatch will look at the table

ArtistCD_nameTrackLyrics
Dream theaterSix Degrees Of Inner TurbulenceMisunderstoodPetrucci
Dream theaterMetropolis, pt. 2: Scenes From A MemoryOverture 1928(instrumental)
Dream theaterMaster of puppetsBatteryTtfield
MetallicaMaster of puppetsBatteryTtfield
EnsiferumTale of revengeBatteryTtfield


The same song can be included in several discs, albums of the same name with songs of the same name from different groups, such as tributes, are also theoretically possible. Therefore, the key will be {Artist, CD_name, Track} . The value of the Lyrics column, denoting the author of the words, is uniquely determined from the {Artist, Track} columns, which are part of the key. This is a violation of 2NF.

The consequence of this is the redundancy of values ​​in the Lyrics column for each disc that includes a song. In the field of music, these values ​​do not change, but in other domain domains, the careless change of such redundant data may lead to a contradictory state of the database, when not all values ​​will be updated. This is an example of modification anomaly.

Another consequence is that songs that have not yet been released on CDs, but simply broadcast by radio or released on other media, do not fit the specified data scheme. Accordingly, we will not be able to add a new song to the database until it is released on the CD. This is an example of insertion anomaly.

Similarly, if we want to remove any disc from the database, we will have to lose the information about the authors of all the songs that are included only in this disc, because in this model it is not possible to provide information about the author if the song is not included in any CD . For example, the desire to remove the Six Degrees Of Inner Turbulence disc will lead to the fact that the author of the song Misunderstood will be lost, which is unforgivable. This is an example of a deletion anomaly.

To avoid such anomalies and remove redundancy, we need to divide the table, that is, to decompose it into two:

ArtistCD_nameTrack
Dream theaterSix Degrees Of Inner TurbulenceMisunderstood
Dream theaterMetropolis, pt. 2: Scenes From A MemoryOverture 1928
Dream theaterMaster of puppetsBattery
MetallicaMaster of puppetsBattery
EnsiferumTale of revengeBattery


ArtistTrackLyrics
Dream theaterMisunderstoodPetrucci
Dream theaterOverture 1928(instrumental)
MetallicaBatteryTtfield


In a real database, for constructing queries, you also need to introduce semantic relations between tables, for example, to connect them with a foreign key, but for our example it is enough to understand that these tables are related by meaning.

Both tables have 2NF, the first is because all columns are in the key, and the second is because Lyrics is determined by the {Artist, Track} key and is not uniquely determined by any of the Artist or Track columns.

I’ll probably not be talking about the warehouse, I’m tired of typing tablets into html :)

That's all.
I hope it was clear right now, but I went to deal with 3NF!

Source: https://habr.com/ru/post/113274/


All Articles