Demo database for PostgreSQL

In this post I will talk about our demo database for PostgreSQL: why it is important for us and how it can be useful to you, how the scheme is arranged and what data it contains.

Immediately I will provide a link to the full description (it also says where to get the demo database and how to install it).

What for?

From our point of view, the need for a demonstration base is long overdue. To discuss almost any possibility of a DBMS, we need some data, we need a table or several tables. To reinvent this bike every time, to waste both the attention of the listener and their own time. Not without reason, every DBMS vendor has a base, which it uses every time it is necessary to demonstrate something.

What may need such a database?

First, to learn SQL yourself. Suppose you are a student, master SQL and read, say, about recursive queries. Is it necessary to practice on something?

On the other hand, in order for a student to read about recursive queries, you need someone to write about it. Our company cooperates with several authors and is currently working on a university course on database technology and a SQL tutorial — both books will use a demonstration base. It will be possible not just to read the book, but also to reproduce the examples given in it, to “play around” with them, to carry out practical tasks.

Another option is to practice on the database course at the university (or even reading the commercial course on SQL: the PostgreSQL license, under which the demo database is released, this permits). Examples of such use already exist.

It is useful to use the demo database for writing notes to a blog or articles about PostgreSQL and its capabilities. Instead of starting each time with the words “create a tablet and insert some data using generate_series,” you can get down to business right away.

We also think about processing PostgreSQL documentation over time, so that it also relies as much as possible on the schema and data of the demo database.

What do you need?

We have put forward several requirements for the demo database:

The data schema should be simple enough to be understood without much explanation.
At the same time, the data scheme must be sufficiently complex to allow it to build not the most trivial queries.
The database should be filled with data resembling real data that will be interesting to work with.

Of course, first of all we looked at which bases already exist , but none of them suited us. In no case do I want to say that they are “bad”, but were created for other tasks: in some too simple schemes, some too specialized, in some too primitive content.

Data schema

Therefore, in the end we made our own database. As you may have already guessed from the picture, air transportation was chosen as the subject area: we are talking about our subsidiary airline (so far, alas, still non-existent). The data scheme is shown in the figure:

The main essence here is booking (bookings).

It is possible to include several passengers in one reservation, each of which is issued a separate ticket (s). As such, the passenger is not a separate entity: for simplicity, we can assume that all passengers are unique.

The ticket includes one or more flights (ticket_flights). Several flights may be included in the ticket in several cases:

There is no direct flight connecting points of departure and destination (flight with transfers);
Take a ticket "back and forth."

There is no hard limit in the data scheme, but it is assumed that all tickets in the same booking have the same set of flights.

Each flight (flights) follows from one airport (airports) to another. Flights with one number have the same points of departure and destination, but will differ in the date of departure.

Upon check-in for the flight, the passenger is issued a boarding pass (boarding_passes), which indicates the seat in the aircraft. A passenger can only check in for the flight that he has on the ticket. The combination of the flight and the seat in the aircraft must be unique in order to prevent the issuance of two boarding passes for one seat.

The number of seats (seats) in the aircraft and their distribution by class of service depends on the model of the aircraft (aircrafts) performing the flight. It is assumed that each model has only one cabin layout. The data scheme does not control that the seats in the boarding passes correspond to those in the aircraft.

All schema objects are described in detail in the document to which I already referred at the beginning of the article. There is also a “guide” for tables in the form of simple queries.

What's inside?

To learn how to write queries, you need the database to already contain some data, and not a couple of lines, but a rather large array. Our demo database is available in three options, differing in data volume:

A small database contains data on flights for one month; it does not take up a lot of disk space, but it completely allows you to write queries.
The average base covers three months.
A large database of flights for the year will already allow you to directly feel the nuances associated with performance.

In general, the generation of test data is in itself a fascinating task, and then it will be about it. And what is so interesting here, because for a long time there are tools (for example, DataFiller ) that solve this problem? Yes, they do, but it all depends on the quality of the information that suits you.

For example, the ticket has the passenger's first and last name. How can I generate data for this field? You can think of several options.

The simplest is to form strings of a given length from random characters. Ray Bradbury could afford Mr. Aaa, but are you ready to meet with QDEMFI TGBSWAJVZH (this, by the way, is not a made-up example)?

You can select values from a predefined list of first and last names. It will be more like the truth, but there is also such a thing as data distribution. If you choose one of the names equally likely, then Alexandrov will be about the same in the database as Poluektov. It would seem, what's the difference? And there is a difference, and a big one: if you need to get all Alexandrov, in a real database you will have to select about 10% of all rows, and there can be no Poluektov at all. This means that the query plans in one and in the other cases should be different - for this, the DBMS collects statistics on the distribution of data in columns.

A more honest way is to use the frequency characteristics for each first name and each last name. That is what we did. (One could also take into account national peculiarities and the change in popularity of names over time, but here it is important to stop in time.)

Here is another example. Our database contains about a hundred airports. Direct flights connect by no means all airports, but from any one you can get to any other with several transfers. In other words: the link graph should be incomplete, but coherent. How to generate it? And again, it all depends on what kind of data quality suits us.

In the simple case, you can associate the first arbitrary airport with the second at least arbitrary airport, then associate the second with the next, and so on several times. If each time we give preference to yet unrelated airports, then formally we will get a suitable graph. Will he look like the truth? Not in the least. This is what we can get (the color of the lines depends on the passenger traffic: the darker, the more loaded the route):

If you look closely, you can see that all cities are connected to each other with a fairly uniform web. And this is what the real graph of flights in Russia looks like (according to OpenFlights.org ):

A characteristic feature here is that the bulk of connections are concentrated in a small number of nodes. Such graphs are called scale-free ; the link can be found and their generation algorithms.

In our case, however, it is necessary not only to generate the graph, but also to impose it on real cities (after all, it is clear that in any case Moscow will be the biggest hub in Russia). In fact, this simplifies the task if you go beyond the demo database itself and look a little wider: to describe each airport, we use not only coordinates, but also a few more characteristics. One of them is the volume of passenger traffic, and the graph generated with it you saw at the very beginning of the article.

And why not just take the routes of some existing airline? You can, of course, and so, but flexibility is lost: having an algorithm, you can generate a believable graph for any number of cities, or for a fictional country, or for intergalactic flights in general.

- By the way, what is the maximum number of transfers needed to get from any airport to any other? (Of course, the answer to this question should be a SQL query.)

Well, here we have generated the route graph, but it still needs to be turned into a regular flight schedule. Moreover, the flights between points A and B should be enough to transport everyone, but not too much, otherwise the planes will fly empty. And still need to take into account the type of aircraft. You can take a smaller plane, and make more flights.

- Are there any flights in the demo base that exceed the maximum range of the aircraft assigned to them?

And you can do the opposite - smaller flights, but a bigger plane. That's just not all airports can take heavy wide-bodied vessels; You can also check this if you wish, although we did not carry information about airport classes into the demo database itself.

Well, and so on. Here are some more questions that hint that data generation is not as trivial as it might seem at first glance:

- How is the actual flight time different from the planned?

- Usually, flights from west to east are long (we take off at night, arrive the next morning), and from east to west - short (we arrive on the same day almost at the same time). And what happens in the demo database?

- How is the booking time and check-in time distributed in relation to the date and time of the flight?

- How many people usually get in one booking?

- Are there passengers flying back and forth? Does the “there” route always coincide with the “back” route?

- Do all passengers have seats in the boarding passes that correspond to the class of service chosen when booking?

- Could it be that the passenger was issued a ticket to a seat that is not in the cabin? Can two passengers claim one seat?

- Do tickets for the same class of service on the same flight always cost the same (and why)?

At last

We hope that it will be no less interesting for you to work with this data, than we were interested in working on them. Further (although not immediate) plans include the development of a scheme to cover more “advanced” areas: full-text search, semistructured information, temporal data, various indexing strategies.

If you find any inconsistencies of the demonstration data with common sense (and this may well happen - it is difficult to foresee everything), do not hesitate to write us at edu@postgrespro.ru .

We are also very interested to hear about the actual use of the data scheme. Share your experience, and we, in turn, are open for communication and ready to share ours.

Source: https://habr.com/ru/post/316428/

All Articles