Intro

In a comment on the article "Write the code every day" I said that I would soon show the project to which I devote one hour a day (except weekends). Since my recent work involves writing distributed Java applications that use an in-memory data grid (IMDG) as their data store, the project grew out of exactly that.
You can read more about IMDG in my previous articles (1, 2). In short, an IMDG is a clustered, distributed key-value store that keeps all of its data in memory, which is what gives it its high access speed. It lets you not only store data but also process it without pulling it out of the cluster.
And while every IMDG has its own data-processing interface, the data-access interface is usually identical to a hash table's.
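To make this concrete: access boils down to put/get by key, just like with a hash table (the calls below use the Sproot Grid API described later in this article):

$client->put('user-cache', 'user:42', $user);  // roughly: $table['user:42'] = $user
$user = $client->get('user-cache', 'user:42'); // roughly: $user = $table['user:42']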
What is this article about
Most IMDGs are written in Java and provide APIs for Java, C++, and C#, while APIs for the web programming languages (Python, Ruby, PHP) are not supported, and the wire protocol for writing your own client is very limited. This is what I consider the main brake on IMDG adoption: the lack of support for the most popular languages.
Since IMDG vendors do not yet support web languages, web programmers cannot scale their applications as easily as server-side Java developers can. So I decided to build something along these lines myself and release it as open source, using the open-source IMDG JBoss Infinispan as the engine (JBoss, owned by Red Hat, is well known among Java developers). My project is called
Sproot Grid. For now it is available only for PHP, but if the community is interested, I will add Ruby and Python integration as well.
In this article I will once again talk about the in-memory data grid and show how to configure, run, and use Sproot Grid.
Why do you need IMDG?
The biggest bottleneck in many high-load projects is the data store, in particular the relational database. Two approaches are mainly used to work around the shortcomings of traditional databases:
1) Caching
Advantages:
- high data access speed (the data is served from memory)
Cons:
- true cluster solutions are very rare; mostly the user has to distribute the data across servers himself and, when accessing data, work out which server holds it. Keeping all cluster nodes evenly filled is hard to achieve in such a system
- it forces a compromise between data freshness and access speed: data in the cache can become stale, and evicting old data and then caching new data adds extra latency and load on the system
- typically data is cached not as the domain objects used in the application but as BLOBs or strings, so objects obtained from the cache must first be reconstructed before use
2) NoSQL solutions
Advantages:
- good horizontal scalability
Cons:
- noticeably lower speed of retrieving results when the data resides on disk
- in-house software built around a specific relational database is almost impossible to make work with it
IMDG combines the advantages of both approaches and has several advantages of its own over the solutions above:
- good horizontal scalability
- high access speed
- true clustering (data can be put on any node and requested from any node of the cluster) with automatic balancing of data between nodes
- the cluster knows about all the fields of an object, so objects can be looked up not only by key but also by field values
- indexes can be created on a field or on a combination of fields
- with the read-through and write-behind (or write-through) mechanisms the data stays synchronized with the database, so other applications (or other modules of the same application) can keep using the traditional database (MySQL or Mongo, it does not matter)
- with the scheme from the previous point, the problem of stale data in the cache disappears, since the cache always holds the same data as the database
Let's take a closer look at these two interesting mechanisms: read-through and write-behind (write-through).
read-through
Read-through is a mechanism that pulls data from the database at the moment of a query.
Say you ask the cache for an object by the key 'key' and it turns out the cluster holds no object with that key. The object is then automatically read from the database (or any other persistent storage), put into the cache, and returned as the response to the request.
If there is no such object in the database either, null is returned to the user.
Naturally, the required SQL query, as well as mapping the query results onto an object, is the user's responsibility.
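In pseudocode the read-through logic looks roughly like this (a PHP sketch for illustration only; in reality these steps run in Java inside the cluster, and loadFromDatabase() is a stand-in for the user-supplied query and mapping):

// Read-through: the cache falls back to persistent storage on a miss.
// $cache, loadFromDatabase() and readThroughGet() are hypothetical names.
function readThroughGet($cache, $key) {
    $value = $cache->get($key);       // 1. look in the cluster first
    if ($value !== null) {
        return $value;                // cache hit: the database is not touched
    }
    $value = loadFromDatabase($key);  // 2. miss: user-supplied SQL + object mapping
    if ($value === null) {
        return null;                  // 3. not in the database either
    }
    $cache->put($key, $value);        // 4. put it into the cache for future reads
    return $value;                    // 5. and return it as the response
}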
write-behind (write-through)
To optimize write speed, you can write not to the database but directly to the cache. It sounds strange at first, but in practice it takes real load off the database and speeds up the application.
It looks like this:
- The user calls cache.put(key, value); the 'value' object is stored in the cache under the key 'key'
- In the cluster, an event handler fires for this event; it builds the SQL query that writes the data to the database and executes it
- Control returns to the user
This interaction scheme is called write-through. It updates the database at the same time as the cluster. As you can see, this approach does not speed up writes, but it guarantees consistency between the cache and the database. And since the data also lands in the cache, reading it back will still be faster than querying the database.
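Sketched the same way (hypothetical names again; the event handler really runs in Java inside the cluster, and writeToDatabase() stands for the user-supplied SQL):

// Write-through: the database is updated before control returns.
function writeThroughPut($cache, $key, $value) {
    $cache->put($key, $value);      // 1. the object lands in the cluster
    writeToDatabase($key, $value);  // 2. the handler writes it to the database synchronously
    // 3. only after both steps does control return to the user
}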
If writing to the database at the same moment is not critical, you can use the more popular write-behind mechanism, which defers the write to the database (or any other store). Like this:
- The user calls cache.put(key, value); the 'value' object is cached under the key 'key'
- Control returns to the user immediately
- After some time (configurable by the user), a cache-write event handler is triggered
- The handler collects the whole batch of objects modified since its previous run
- The batch is sent to the database to be written
With write-behind, writes are significantly faster: the user does not wait for the update to reach the database but simply puts the data into the cache, all updates of the same object are merged into one resulting update, and writes reach the database in batches, which also reduces the load on the database server.
Thus, you can configure your IMDG so that, say, every 3 seconds (or every 2 minutes, or every 50 ms) all data updates are asynchronously flushed to the database.
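A minimal sketch of the write-behind idea (all names hypothetical; the real implementation lives in Java inside the cluster, and writeBatchToDatabase() stands for the user-supplied batched SQL):

// Write-behind: writes land in memory and are flushed to the database later.
class WriteBehindCache {
    private $store = [];  // the in-memory cache itself
    private $dirty = [];  // objects modified since the last flush

    public function put($key, $value) {
        $this->store[$key] = $value;  // the write goes to memory only
        $this->dirty[$key] = $value;  // repeated updates of the same key merge here
        // control returns to the caller immediately, no database round-trip
    }

    // called by a timer every N ms/seconds/minutes (configurable)
    public function flush() {
        if (empty($this->dirty)) {
            return;
        }
        writeBatchToDatabase($this->dirty);  // one batched write instead of many small ones
        $this->dirty = [];
    }
}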
What of all this is in Sproot Grid?
In the first version I decided not to implement everything mentioned above at once, since that would take a lot of time, and I wanted to get user feedback quickly.
So, what is available in Sproot Grid 1.0.0:
- Horizontal scalability and true clustering with the amount of data balanced between cluster nodes
- The ability to store both built-in PHP types and domain objects
- The ability to build an index on a field and search on that index
Getting Started
First you need to download the distribution from here and unpack it.
Install the necessary software
Since JBoss Infinispan is a Java application, a way for Java and PHP to talk to each other was needed. Apache Thrift (the serialization and transport protocol that Cassandra used for its client API) was chosen as that link, so for Sproot Grid to work on your system you need to install the following:
- Java
- Thrift - not required in production; it only needs to be installed on the developer machine (see the Code generation section for details). When deploying to production you only need to copy the .php files of the Thrift library and the Java library packaged as a .jar
- PHP (if not already installed)
Installation instructions are on the project wiki.
Configuration
The configuration file must be located at $deploymentFolder/sproot-grid/config/definition.xml, where $deploymentFolder is the path to the directory where you unpacked the distribution.
Configuration example:

<?xml version="1.0" encoding="UTF-8"?>
<sproot-config>
  <dataTypes>
    <dataType type="some\package\User" cache-name="user-cache">
      <field name="id" type="integer" />
      <field name="name" type="string" indexed="true" />
      <field name="cars" type="array" key-type="string" value-type="some\package\Car"/>
    </dataType>
    <dataType type="some\package\Car" cache-name="car-cache">
      <field name="model" type="string" />
      <field name="isNew" type="boolean" />
    </dataType>
    <dataType type="string" cache-name="string-cache"/>
    <dataType type="array" value-type="some\package\Car" cache-name="list-car-cache"/>
  </dataTypes>
  <cluster name="Sproot">
    <multicast host="224.3.7.0" port="12345"/>
    <caches>
      <cache name="user-cache" backup-count="1">
        <eviction max-objects="1000" lifespan="2000" max-idle-time="5000" wakeup-interval="10000" />
      </cache>
      <cache name="car-cache" backup-count="1" />
      <cache name="string-cache" backup-count="1" />
      <cache name="list-car-cache" backup-count="1" />
    </caches>
    <nodes>
      <node id="1" role="service" thrift-port="34567" minThreads="5" maxThreads="100" />
      <node id="2" role="storage-only" />
    </nodes>
  </cluster>
</sproot-config>
More information about the configuration can be found on the project wiki.
A cache is a hash table distributed across the cluster; a cluster can hold any number of caches, and a single cache can only store objects of one type.
All caches must be described in the <caches /> section.
<dataTypes /> - describes the types that will be stored in your cluster. You can use both built-in PHP types and custom ones. For each object type you can specify the name of a cache (or omit it if you don't want to store objects of that type in a separate cache).
<cluster /> - describes the cluster structure and the list of caches stored in it.
<caches /> - describes the caches. Cache names must be unique. The backup-count parameter determines how many cluster nodes you can lose without losing data: the higher the backup-count, the more reliable your cluster, but the more memory it consumes. You can also configure eviction (automatic removal of objects from the cache); more about this on the wiki page.
<multicast /> - defines the multicast address used to build the cluster. As is known, only class D networks are available for multicast (224.0.0.0 - 239.255.255.255).
<nodes /> - describes the number and types of cluster nodes. There are currently only two node types:
storage-only - only stores data and executes internal requests
service - not only stores data but also serves external requests, so for nodes of this type you must specify the port on which requests from PHP clients will be received
Code generation for integration with your application
To work efficiently, the cluster needs code generated specifically for your application (your domain model), with its Java part compiled, since generated code is faster than accessing objects through reflection. To generate and compile everything you need:
1) cd $deploymentFolder/sproot-grid/scripts
2) build.sh (or build.cmd)
where $deploymentFolder is the directory into which you unpacked the distribution
Code generation only needs to be repeated when the description of the domain model changes, i.e. if your model is stable you perform this operation once: the generated PHP source files can then be kept in your code repository, and the Java part is compiled into a library. In other words, you do not regenerate anything every time you change your application; it is done once during the development phase.
After code generation completes, copy the folder with the .php files from $deploymentFolder/sproot-grid/php/org to the root of your application.
Launch
1) cd $deploymentFolder/sproot-grid/scripts
2) run.sh (or run.cmd) nodeId memorySize
where nodeId is the value of the node's id attribute in the configuration file,
memorySize - the amount of memory (in MB or GB) you want to allocate to the node
For example:
run.sh 1 256m
or
run.cmd 2 2g
Use inside the application
The code generation step produced everything you need for integration with your application. All that is left is to copy that code into your application: copy everything from the $deploymentFolder/sproot-grid/php folder to the root of your application.
That's it! Now you can use the cluster from your application.
Code example:

<?php
require_once 'org/sproot_grid/SprootClient.php';
require_once 'some/package/User.php';

use org\sproot_grid\SprootClient;
use some\package\User;

// connect to the thrift-port of a 'service' node (34567 in the configuration example above)
$client = new SprootClient('localhost', 34567);
A description of the API can be found here, but in brief, the API currently looks like this:
- get($cacheName, $key)
- getAll($cacheName, array $keys)
- cacheSize($cacheName)
- cacheKeySet($cacheName)
- containsKey($cacheName, $key)
- search($cacheName, $fieldName, $searchWord)
- remove($cacheName, $key)
- removeAll($cacheName, array $keys)
- put($cacheName, $key, $domainObject)
- putAll($cacheName, array $domainObjects)
- clearCache($cacheName)
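Putting this together with the configuration example above, a session with the user-cache might look like this (a sketch only: I assume the generated User class exposes setters matching the configured fields; adapt it to what the code generator actually produced):

$user = new User();
$user->setId(42);        // assumed generated setters for the configured fields
$user->setName('Alice');

$client->put('user-cache', 'user:42', $user);       // store the object in the cluster
$sameUser = $client->get('user-cache', 'user:42');  // read it back by key

// 'name' is declared indexed="true" in the configuration, so it can be searched
$found = $client->search('user-cache', 'name', 'Alice');

echo $client->cacheSize('user-cache');     // number of objects in the cache

$client->remove('user-cache', 'user:42');  // and remove the object again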
Conclusion
Sproot Grid is published under the MIT license.
Sources
Wiki
Distribution