Neo4j VS MySQL

Foreword

As a third year student, I chose a topic for coursework: “Graph databases using the example of Neo4j”. Since until that time I had studied and used exclusively relational databases, I was wondering why the graph database would be at all, and when should its capabilities be better applied? After viewing many sites on the Internet, I found only a theory and no more. Since the theory is not always convincing and I would like to see some kind of statistics, I was curious and I decided that in my coursework I would do it, and I chose MySQL as the enemy of Neo4j.

So who will win this standoff?

Briefly about Neo4j and graph databases

Neo4j is an open-source graph database, the story of which began with the investments of Neo Technology in 2003. Since 2007, it has become publicly available. Neo4j contains all database features, including adhering to ACID, maintaining clustering, and recovering from a system crash. Today is the leader among database graphs.

ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that ensure reliable operation of transactions.
')
The components of a graph database are nodes and edges. They can be supplemented with their own set of fields. A model of such a database is schematically shown in the figure.

Saving data in Neo4j

The nodestore.db file contains a certain size of records containing information about the Nodes:
1. A label that indicates the record is active;
2. Pointer to the first relation that this node contains;
3. Pointer to the first property that this node contains.

The node does not contain its own identifier. Since each entry in nodestore.db occupies the same amount of space, you can calculate a pointer to a node.

The relationshipstore.db file also contains records of the same size that describe relationships, but they consist of the following elements:

1. A label that indicates the record is active;
2. Pointer to the node that contains this relation;
3. A pointer to the node to which this relationship is directed;
4. Type of relationship;
5. A pointer to the relationship that is ahead (within the limits of this node);
6. A pointer to the relationship that is behind (within this node);
7. A pointer to the relationship that is ahead (within the limits of the node in which this relationship is directed);
8. A pointer to the relation that stands behind (within the limits of the node in which this relation is directed);
9. A pointer to the first property of this relationship.

Filling data

In order to compare for efficiency, it is necessary to fill the database with the same content. It is worth noting that this data should be large. On small volumes, we will not see the difference. Therefore, I set myself the goal to generate for the relational model no less than 300 megabytes.

I have chosen a subject area - a social network. I built an ER diagram, on the basis of which I created a relational and graph model.

After that, I filled MySQL with data, creating a class for generating this data and methods that contributed new data to the database.
The amount of data that was entered in MySQL:

addUsers (100,000); - number of users.
addGroups (200,000); - number of groups.
addPhotos (300,000); - number of photos.
addAudio (250,000); - The number of audio recordings.
addFriendships (1,000,000); - number of friends.
addMessages (4,000,000); - number of messages.
addUserAudio (350,000); - The number of audio recordings.
addUserGroups (400,000); - the number of user groups.
addUserPhoto (400,000); - the number of photos of users.

Then I converted this data to Neo4j. It is worth noting that filling the database took me a lot of time. If you decide to do something similar on your own, get ready for the fact that you either spend a lot of time or do not use text fields (I think it took a lot of time in my case). After that, I compared the place that information occupies in Neo4j and MySQL. There was a small shock; for a relational database, 351.5 MB was required, and for a graph database, 3.45 GB. Now we will start experiments.

Experiment 1

Experiment Description: measuring user search time by user ID.

In this experiment, I searched for users by their ID a certain number of times. For this, I wrote methods whose code is presented below.

public static void main(String[] args){ // Database connection initialisation init(); int [] counts = {10, 100, 1000, 5000, 10000, 30000, 60000, 90000, 120000}; for(int i = 0; i < counts.length; i++){ findUserByIDTest(counts[i]); } } static void findUserByIDTest(int count){ System.out.println("__________________________________"); System.out.println("STATISTIC FOR COUNT: " + count); { Random r = new Random(count); long time = - System.currentTimeMillis(); for (int i = 0; i < count; i++) { int id = r.nextInt(100000); User u = MySQLDAO.getUserByID(id); } time += System.currentTimeMillis(); System.out.println("Time for MySQL:" + time); } { Random r = new Random(count); long time = - System.currentTimeMillis(); for (int i = 0; i < count; i++) { int id = r.nextInt(100000); User u = Neo4jDAO.getUserByID(id);; } time += System.currentTimeMillis(); System.out.println("Time for Neo4j:" + time); } System.out.println("__________________________________"); }

After executing the program code, I made a table and drew a diagram for it. The results pleased me. Neo4j did a great job.

User count = 100000
Number of queries	Time for MySQL, ms	Time for Neo4j, ms
ten	four	9
100	34	76
1000	286	510
5000	1034	1103
10,000	1340	1187
30,000	3390	1384
60,000	6876	2102
90000	10537	3175
120,000	14033	3858

Experiment 2

Experiment Description: measurement of the time spent by the user's friends with a change in the value of the interval of occurrence of identifiers for the search.

In this experiment, I selected the range of valid identifier values and searched for the user's friends with this identifier. Then I changed this range and conducted the experiment again. After that, I came out the corresponding table with the data on which I built the dependency graphs.

In any case, the time of the graph base for searching for big data was less than that which is relative.

Experiment 3

Experiment Description: measuring the time spent by the total number of photos for users who administer at least one group, depending on the range of values of user IDs.

This experiment is very similar to the previous one, but is more complex. If the time values in previous experiments with repetitions differed slightly, in this jumps much stronger. Therefore, I conducted this experiment three times and compiled an appropriate table.

User count = 100000, Photo count = 300000, User photo count = 400000
	experiment 1		experiment 2		experiment 3
id range <	Time for MySQL, ms	Time for Neo4j, ms	Time for MySQL, ms	Time for Neo4j, ms	Time for MySQL, ms	Time for Neo4j, ms
ten	299	11339	164	92594	456	56575
100	10652	2748	674	2400	663	2826
1000	7243	4931	5433	2481	5649	1942
5000	22538	5521	23747	2408	23522	3514
10,000	47755	5627	44917	2650	44992	1844
30,000	137572	8161	33856	3108	136592	3707
60,000	64002	13577	300814	5004	280672	6029
90000	482329	13475	438875	5102	429304	5631

After discussions on Habré, I tried changing the approach to the conditions of the experiment: corrected the query, for a simpler one, raised MySQL again before each experiment and maintained conditions that MySQL and Neo4j should not work at the same time. The following happened: the MySQL graph became more stable, and the graph database began to lose a lot of relational data on small amounts of data.

	experiment 1
id range <	Time for MySQL, ms	Time for Neo4j, ms
ten	120	10767
100	690	10706
1000	5879	10884
5000	24668	12245
10,000	52462	12280
30,000	154534	13352
60,000	296369	14545
90000	489830	18058

Modest conclusions

I can safely say that the graph database requires much less time in searching on large amounts of data, but it takes up much more space on the hard disk. Therefore, for small systems, it is better to use a relational database or look for an alternative among other NoSQL databases.

Thanks for attention.

Source: https://habr.com/ru/post/258179/

All Articles