The Basics of Database Design - PostgreSQL, Cassandra and MongoDB Comparison

Hello, friends. Before leaving for the second part of the May holidays, we are sharing with you the material that we translated on the eve of the launch of a new stream in the “Relational DBMS” course.

Application developers spend a lot of time comparing several operational databases to choose the one that best suits their intended workload. Requirements may include simplified data modeling, transaction guarantees, read / write performance, scaling out, and fault tolerance. Traditionally, the choice begins with a database category, SQL or NoSQL, since each category provides a clear set of trade-offs. High performance in terms of low latency and high bandwidth is usually considered as a non-compromising requirement, and therefore is necessary for any database in the sample.
')
The purpose of this article is to help application developers make the right choice between SQL and NoSQL in the context of modeling application data. We will look at one SQL database, namely PostgreSQL and two NoSQL databases, Cassandra and MongoDB, to talk about the basics of database design, such as creating tables, filling them, reading data from a table, and deleting them. In the next article, we will definitely look at indices, transactions, JOINs, TTL directives, and JSON-based database design.

What is the difference between SQL and NoSQL?

SQL databases increase application flexibility through transactional ACID guarantees, as well as through its ability to query data using a JOIN in unexpected ways, over existing normalized relational database models.

Considering their monolithic / single node architecture and the use of the master-slave replication model for redundancy, traditional SQL databases do not have two important features - linear write scalability (i.e., automatic separation into several nodes) and automatic / zero data loss. This means that the amount of data received can not exceed the maximum throughput records of a single node. In addition, some temporary data loss should be taken into account in fault tolerance (in an architecture without resource sharing). Here we must keep in mind that recent commits have not yet been reflected in the slave copy. Upgrades without downtime are also difficult to achieve in SQL databases.

NoSQL databases are usually distributed by their nature, i.e. data in them is divided into sections and distributed over several nodes. They require denormalization. This means that the entered data must also be copied several times to respond to specific requests that you send. The overall goal is to achieve high performance by reducing the number of shards available during reading. From this follows the statement that NoSQL requires you to simulate your queries, while SQL requires you to simulate your data.

NoSQL focuses on achieving high performance in a distributed cluster and this is the main rationale for many database design compromises, which include the loss of ACID transactions, JOINs and consistent global secondary indices.

It is believed that, although NoSQL databases provide linear write scalability and high fault tolerance, the loss of transactional guarantees makes them unsuitable for mission-critical data.

The following table shows how data modeling in NoSQL differs from SQL.

SQL and NoSQL: Why do we need both?

Real-world applications with a large number of users, such as Amazon.com, Netflix, Uber, and Airbnb, are performing complex, multi-sorted tasks. For example, an e-commerce application like Amazon.com needs to store lightweight, highly critical data, such as information about users, products, orders, invoices, along with heavy, but less sensitive data, such as product reviews, support service messages. , user activity, reviews and user recommendations. Naturally, these applications rely on at least one SQL database along with at least one NoSQL database. In the interregional and global systems, the NoSQL database operates as a geo-distributed cache for data stored in a trusted source, an SQL database running in a single region.

How does YugaByte DB integrate SQL and NoSQL?

Built on log-based mixed storage engine, auto-sharding, sharding distributed consensus replication and ACID distributed transactions (inspired by Google Spanner), YugaByte DB is the world's first open source database that is also compatible with NoSQL (Cassandra & Redis) ) and SQL (PostgreSQL). As shown in the table below, YCQL, the YugaByte DB API compatible with Cassandra, adds the concepts of single- and multi-key ACID transactions and global secondary indexes to the NoSQL API, thereby opening the era of transactional NoSQL databases. In addition, YCQL, a PostgreSQL compatible YugaByte DB API, adds the notion of linear write scaling and automatic fault tolerance to the SQL API, presenting distributed SQL databases to the world. Since the YugaByte DB database is inherently transactional, the NoSQL API can now be used in the context of critical data.

As previously stated in the article “Introducing YSQL: A PostgreSQL Compatible Distributed SQL API for YugaByte DB” , the choice between SQL or NoSQL in YugaByte DB depends entirely on the characteristics of the main workload:

If the main workload is a multi-key operation with JOINs, then when choosing YSQL, understand that your keys can be distributed across several nodes, which will lead to a higher delay and / or lower throughput than in NoSQL.
Otherwise, choose either of the two NoSQL APIs, remembering that you will get better performance as a result of queries served from one node at a time. YugaByte DB can serve as a single operating database for real complex applications in which you need to manage multiple workloads at the same time.

The data section lab in the next section is based on the PostgreSQL and Cassandra API-compatible YugaByte DB APIs, unlike the original databases. This approach emphasizes the simplicity of interacting with two different APIs (on two different ports) of the same database cluster, as opposed to using completely independent clusters of two different databases.

In the following sections, we will introduce the data modeling laboratory to illustrate the difference and some common features of the databases in question.

Laboratory data modeling

Database installation

Given the emphasis on designing a data model (and not on complex deployment architectures), we will install the databases in the Docker containers on the local computer, and then we will interact with them using the corresponding command line shells.

Compatible with PostgreSQL & Cassandra, YugaByte DB database

mkdir ~/yugabyte && cd ~/yugabyte wget https://downloads.yugabyte.com/yb-docker-ctl && chmod +x yb-docker-ctl docker pull yugabytedb/yugabyte ./yb-docker-ctl create --enable_postgres

MongoDB

 docker run --name my-mongo -d mongo:latest

Access via command line

Let's connect to databases using the command line shell for the corresponding APIs.

PostgreSQL

psql is a command line shell for interacting with PostgreSQL. For ease of use, YugaByte DB comes with psql right in the bin folder.

 docker exec -it yb-postgres-n1 /home/yugabyte/postgres/bin/psql -p 5433 -U postgres

Cassandra

qlsh is a command line shell for interacting with Cassandra and its compatible databases through CQL (Cassandra query language). For ease of use, YugaByte DB comes with cqlsh in the bin .

Note that CQL was inspired by SQL and has similar concepts for tables, rows, columns, and indexes. However, as the NoSQL language, it adds a certain set of restrictions, most of which we will also cover in other articles.

 docker exec -it yb-tserver-n1 /home/yugabyte/bin/cqlsh

MongoDB

mongo is a command line shell for interacting with MongoDB. It can be found in the bin directory of the MongoDB installation.

 docker exec -it my-mongo bash cd bin mongo

Creating a table

Now we can interact with the database to perform various operations using the command line. Let's start by creating a table that stores information about songs written by different artists. These songs can be part of an album. Also optional attributes for the song - year of release, price, genre and rating. We need to take into account additional attributes that may be needed in the future through the "tags" field. It can store semi-structured data as key-value pairs.

PostgreSQL

 CREATE TABLE Music ( Artist VARCHAR(20) NOT NULL, SongTitle VARCHAR(30) NOT NULL, AlbumTitle VARCHAR(25), Year INT, Price FLOAT, Genre VARCHAR(10), CriticRating FLOAT, Tags TEXT, PRIMARY KEY(Artist, SongTitle) );

Cassandra

Creating a table in Cassandra is very similar to PostgreSQL. One of the main differences is the lack of integrity constraints (for example, NOT NULL), but this is the responsibility of the application, not the NoSQL database . The primary key consists of a partition key (the Artist column in the example below) and a clustering column set (the SongTitle column in the example below). The partition key defines in which partition / shard a line is to be placed, and the clustering columns indicate how the data inside the current shard should be organized.

 CREATE KEYSPACE myapp; USE myapp; CREATE TABLE Music ( Artist TEXT, SongTitle TEXT, AlbumTitle TEXT, Year INT, Price FLOAT, Genre TEXT, CriticRating FLOAT, Tags TEXT, PRIMARY KEY(Artist, SongTitle) );

MongoDB

MongoDB organizes data into databases (similar to Keyspace in Cassandra), where there are collections (similar to tables) containing documents (Documents) (similar to rows in a table). In MongoDB, in principle, no initial schema definition is required. The “use database” command shown below creates a database instance on the first call and changes the context for the newly created database. Even collections do not need to be created explicitly, they are created automatically, simply by adding the first document to a new collection. Please note that MongoDB uses a test database by default, so any collection-level operation without specifying a specific database will be executed in it by default.

 use myNewDatabase;

Getting information about the PostgreSQL table

 \d Music Table "public.music" Column | Type | Collation | Nullable | Default --------------+-----------------------+-----------+----------+-------- artist | character varying(20) | | not null | songtitle | character varying(30) | | not null | albumtitle | character varying(25) | | | year | integer | | | price | double precision | | | genre | character varying(10) | | | criticrating | double precision | | | tags | text | | | Indexes: "music_pkey" PRIMARY KEY, btree (artist, songtitle)

Cassandra

 DESCRIBE TABLE MUSIC; CREATE TABLE myapp.music ( artist text, songtitle text, albumtitle text, year int, price float, genre text, tags text, PRIMARY KEY (artist, songtitle) ) WITH CLUSTERING ORDER BY (songtitle ASC) AND default_time_to_live = 0 AND transactions = {'enabled': 'false'};

MongoDB

 use myNewDatabase; show collections;

Entering data into the PostgreSQL table

 INSERT INTO Music (Artist, SongTitle, AlbumTitle, Year, Price, Genre, CriticRating, Tags) VALUES( 'No One You Know', 'Call Me Today', 'Somewhat Famous', 2015, 2.14, 'Country', 7.8, '{"Composers": ["Smith", "Jones", "Davis"],"LengthInSeconds": 214}' ); INSERT INTO Music (Artist, SongTitle, AlbumTitle, Price, Genre, CriticRating) VALUES( 'No One You Know', 'My Dog Spot', 'Hey Now', 1.98, 'Country', 8.4 ); INSERT INTO Music (Artist, SongTitle, AlbumTitle, Price, Genre) VALUES( 'The Acme Band', 'Look Out, World', 'The Buck Starts Here', 0.99, 'Rock' ); INSERT INTO Music (Artist, SongTitle, AlbumTitle, Price, Genre, Tags) VALUES( 'The Acme Band', 'Still In Love', 'The Buck Starts Here', 2.47, 'Rock', '{"radioStationsPlaying": ["KHCR", "KBQX", "WTNR", "WJJH"], "tourDates": { "Seattle": "20150625", "Cleveland": "20150630"}, "rotation": Heavy}' );

Cassandra

In general, the INSERT in Cassandra looks very similar to that in PostgreSQL. However, there is one big difference in semantics. In Cassandra, the INSERT actually a UPSERT operation, where the last values are added to the string, in case the string already exists.

Data entry is similar to PostgreSQL INSERT above.

MongoDB

Although MongoDB is a NoSQL database, like Cassandra, its data entry operation has nothing to do with semantic behavior in Cassandra. In MongoDB, insert () does not have a UPSERT capability, which makes it look like PostgreSQL. Adding default data without _idspecified will add a new document to the collection.

db.music.insert( {
artist: "No One You Know",
songTitle: "Call Me Today",
albumTitle: "Somewhat Famous",
year: 2015,
price: 2.14,
genre: "Country",
tags: {
Composers: ["Smith", "Jones", "Davis"],
LengthInSeconds: 214
}
}
);
db.music.insert( {
artist: "No One You Know",
songTitle: "My Dog Spot",
albumTitle: "Hey Now",
price: 1.98,
genre: "Country",
criticRating: 8.4
}
);
db.music.insert( {
artist: "The Acme Band",
songTitle: "Look Out, World",
albumTitle:"The Buck Starts Here",
price: 0.99,
genre: "Rock"
}
);
db.music.insert( {
artist: "The Acme Band",
songTitle: "Still In Love",
albumTitle:"The Buck Starts Here",
price: 2.47,
genre: "Rock",
tags: {
radioStationsPlaying:["KHCR", "KBQX", "WTNR", "WJJH"],
tourDates: {
Seattle: "20150625",
Cleveland: "20150630"
},
rotation: "Heavy"
}
}
);

Query table

Perhaps the most significant difference between SQL and NoSQL in terms of query design is the use of the FROM and WHERE clauses. SQL allows you to select several tables after the FROM , and the WHERE can be of any complexity (including JOIN operations between tables). However, NoSQL tends to impose a hard constraint on the FROM , and to work with only one specified table, and in WHERE , the primary key must always be specified. This is due to the desire to improve the performance of NoSQL, which we talked about earlier. This desire leads to a complete reduction of any cross-tabular and cross-key interaction. It can lead to a large delay in inter-node communication when responding to a request and, therefore, it is best avoided in principle. For example, Cassandra requires that queries be limited to certain operators (only allowed =, IN, <, >, =>, <= ) on partition keys, except when requesting a secondary index (only operator = is allowed here).

PostgreSQL

Below are three examples of queries that can be easily executed by a SQL database.

Display all the songs of the artist;
Display all the songs of the artist, coinciding with the first part of the title;
Print all the songs of the artist with a certain word in the title and having a price less than 1.00.

 SELECT * FROM Music WHERE Artist='No One You Know'; SELECT * FROM Music WHERE Artist='No One You Know' AND SongTitle LIKE 'Call%'; SELECT * FROM Music WHERE Artist='No One You Know' AND SongTitle LIKE '%Today%' AND Price > 1.00;

Cassandra

Of the PostgreSQL queries listed above, only the first will work unchanged in Cassandra, since the LIKE operator cannot be applied to clustering columns, such as SongTitle . In this case, only the = and IN operators are allowed.

 SELECT * FROM Music WHERE Artist='No One You Know'; SELECT * FROM Music WHERE Artist='No One You Know' AND SongTitle IN ('Call Me Today', 'My Dog Spot') AND Price > 1.00;

MongoDB

As shown in the previous examples, the main method for creating queries in MongoDB is db.collection.find () . This method explicitly contains the name of the collection ( music in the example below), so a request for several collections is prohibited.

 db.music.find( { artist: "No One You Know" } ); db.music.find( { artist: "No One You Know", songTitle: /Call/ } );

Read all rows in a table

Reading all the lines is just a special case of the query template we considered earlier.

PostgreSQL

 SELECT * FROM Music;

Cassandra

Similar to the example in PostgreSQL above.

MongoDB

 db.music.find( {} );

Editing data in the table

PostgreSQL

PostgreSQL provides an UPDATE for modifying data. It does not have UPSERT capabilities, so the execution of this instruction will fail if the row is no longer in the database.

 UPDATE Music SET Genre = 'Disco' WHERE Artist = 'The Acme Band' AND SongTitle = 'Still In Love';

Cassandra

Cassandra has UPDATE similar to PostgreSQL. UPDATE has the same UPSERT semantics, like INSERT .

Similar to the example in PostgreSQL above.

MongoDB
The update () operation in MongoDB can completely update an existing document or update only certain fields. By default, it updates only one document with the UPSERT semantics UPSERT . Updating of several documents and behavior similar to UPSERT can be applied by setting additional flags for the operation. As for example, in the example below, the genre of a particular artist is updated according to his song.

 db.music.update( {"artist": "The Acme Band"}, { $set: { "genre": "Disco" } }, {"multi": true, "upsert": true} );

Deleting data from a table

PostgreSQL

 DELETE FROM Music WHERE Artist = 'The Acme Band' AND SongTitle = 'Look Out, World';

Cassandra

Similar to the example in PostgreSQL above.

MongoDB

In MongoDB, there are two types of operations for deleting documents — deleteOne () / deleteMany () and remove () . Both types delete documents, but return different results.

 db.music.deleteMany( { artist: "The Acme Band" } );

Deleting a table

PostgreSQL

 DROP TABLE Music;

Cassandra

Similar to the example in PostgreSQL above.

MongoDB

 db.music.drop();

Conclusion

The debate about choosing between SQL and NoSQL has been raging for more than 10 years. There are two main aspects of this dispute: the database core architecture (monolithic, transactional SQL versus distributed, non-transactional NoSQL) and the database design approach (data modeling in SQL versus modeling your queries in NoSQL).

With a distributed transactional database, such as YugaByte DB, debates about database architecture can be easily dispelled. As data volumes become larger than what can be written to a single node, a fully distributed architecture that supports linear scalability of recording with automatic sharding / rebalancing becomes necessary.

In addition, as stated in one of the articles of Google Cloud , transactional, strictly coordinated architectures are now more widely used to provide better development flexibility than non-transactional, ultimately consistent architectures.

Returning to the discussion of database design, it is fair to say that both design approaches (SQL and NoSQL) are necessary for any complex real-world application. The SQL “data modeling” approach allows developers to more easily meet changing business requirements, while the NoSQL “query modeling” approach allows the same developers to handle large amounts of data with low latency and high throughput. It is for this reason that YugaByte DB provides the SQL and NoSQL API in a common core, and does not promote any one of the approaches. In addition, ensuring compatibility with popular database languages, including PostgreSQL and Cassandra, YugaByte DB ensures that developers do not have to learn another language in order to work with a distributed, strictly consistent database engine.

In this article, we figured out how the basics of database design differ in PostgreSQL, Cassandra, and MongoDB. In the following articles, we will dive into advanced design concepts such as indices, transactions, JOINs, TTL directives, and JSON documents.

We wish you a great rest of the weekend and invite you to a free webinar , which will be held on May 14th.

Source: https://habr.com/ru/post/451042/

All Articles

The Basics of Database Design - PostgreSQL, Cassandra and MongoDB Comparison

More articles: