“If you can cache everything in a very efficient way, then you can often change the rules of the game.”
As software developers, we often face problems that require distributing a set of data that doesn't quite qualify as "big data". Examples of this kind of problem include:
- Product metadata for an online store
- Document metadata for a search engine
- Movie and TV show metadata
Faced with this, we usually choose one of two approaches:
- Keep the data in a centralized store (for example, an RDBMS, a NoSQL data store, or a memcached cluster) that consumers access remotely
- Serialize it (as JSON, XML, etc.) and distribute it to consumers, each of which keeps a local copy
Each of these approaches has its own problems. Centralizing the data allows the dataset to grow almost without limit, but:
- Access to the data is subject to latency and bandwidth limitations.
- A remote data store is never as reliable as a local copy of the data.
On the other hand, serializing the data and keeping a local copy entirely in RAM gives much lower latency and higher access frequency, but this approach brings scaling problems that get worse as the dataset grows:
- The heap footprint of the dataset keeps growing.
- Fetching the dataset requires downloading more and more bits.
- Updating the dataset may require significant CPU resources or put pressure on the garbage collector.
Developers often choose a hybrid approach: frequently accessed data is cached locally, while rarely accessed data is fetched remotely. This approach has its own problems:
- The bookkeeping data structures can themselves consume a significant amount of cache heap.
- Cached objects live long enough to be promoted by the garbage collector, which hurts GC performance.
At Netflix, we realized that such a hybrid approach often offers only the illusion of a win. The size of the local cache is usually the result of carefully balancing the latency of remote access for many records against the heap required to hold more data locally. But if you can cache everything in a very efficient way, then you can often change the game: keep the entire dataset in memory while using less heap and less CPU than it would take to store only a portion of it. Enter Hollow, the latest open-source project from Netflix.
Hollow is a Java library and a comprehensive toolset for working with small to medium-sized in-memory datasets that are distributed from a single producer to many consumers for read-only access.
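To make the producer/consumer model concrete, here is a minimal sketch following the patterns in the Hollow quick start guide. It uses the filesystem-based publisher, announcer, retriever, and watcher for local experimentation; the Movie class is a hypothetical data model, and exact constructor signatures may vary between Hollow versions.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

import com.netflix.hollow.api.consumer.HollowConsumer;
import com.netflix.hollow.api.consumer.fs.HollowFilesystemAnnouncementWatcher;
import com.netflix.hollow.api.consumer.fs.HollowFilesystemBlobRetriever;
import com.netflix.hollow.api.producer.HollowProducer;
import com.netflix.hollow.api.producer.fs.HollowFilesystemAnnouncer;
import com.netflix.hollow.api.producer.fs.HollowFilesystemPublisher;

public class HollowSketch {

    // Hypothetical data model: a plain class describing one record type.
    static class Movie {
        int id;
        String title;
        int releaseYear;

        Movie(int id, String title, int releaseYear) {
            this.id = id;
            this.title = title;
            this.releaseYear = releaseYear;
        }
    }

    public static void main(String[] args) {
        Path publishDir = Paths.get("/tmp/hollow-publish");

        // Producer side: publish snapshots (and later deltas) of the dataset.
        HollowProducer producer = HollowProducer
                .withPublisher(new HollowFilesystemPublisher(publishDir))
                .withAnnouncer(new HollowFilesystemAnnouncer(publishDir))
                .build();

        List<Movie> movies = Arrays.asList(
                new Movie(1, "The Matrix", 1999),
                new Movie(2, "Blade Runner", 1982));

        producer.runCycle(state -> {
            for (Movie movie : movies)
                state.add(movie);            // Hollow infers the schema from the class
        });

        // Consumer side: pull the announced state into local memory, read-only.
        HollowConsumer consumer = HollowConsumer
                .withBlobRetriever(new HollowFilesystemBlobRetriever(publishDir))
                .withAnnouncementWatcher(new HollowFilesystemAnnouncementWatcher(publishDir))
                .build();

        consumer.triggerRefresh();           // load the latest announced data state
    }
}
```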
“Hollow changes the calculus ... Datasets for which such an approach could never even have been considered before can now be candidates for Hollow.”
Performance
Hollow focuses narrowly on its prescribed problem set: keeping an entire read-only dataset in memory on consumers. It sidesteps the consequences of updating and evicting data from a partial cache.
Because of this performance, Hollow changes the calculus of which dataset sizes are appropriate to solve with RAM. Datasets for which such an approach could never even have been considered before can now be candidates for Hollow. For example, Hollow may be entirely appropriate for datasets that, if represented as JSON or XML, would require more than 100 GB.
Agility
Hollow doesn’t just improve performance: the package greatly enhances a team’s agility when working on tasks related to its datasets.
Using Hollow is simple from the very first step. Hollow automatically generates a custom API based on your specific data model, so that consumers can interact with the data intuitively, with the benefit of IDE code completion.
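As a sketch of what working with such a generated API might look like, continuing the filesystem-based example above: MovieAPI, getAllMovie(), getTitle(), and getReleaseYear() are illustrative names that the generator would derive from your own data model, not identifiers taken from this article.

```java
// Continuing the sketch above: bind the consumer to an API class generated
// from the data model. The Movie type here is the generated read-only
// wrapper, not the producer-side POJO from the previous sketch.
HollowConsumer consumer = HollowConsumer
        .withBlobRetriever(new HollowFilesystemBlobRetriever(publishDir))
        .withAnnouncementWatcher(new HollowFilesystemAnnouncementWatcher(publishDir))
        .withGeneratedAPIClass(MovieAPI.class)   // hypothetical generated API class
        .build();
consumer.triggerRefresh();

MovieAPI api = (MovieAPI) consumer.getAPI();
for (Movie movie : api.getAllMovie())            // iterate records with typed accessors
    System.out.println(movie.getTitle() + " (" + movie.getReleaseYear() + ")");
```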
But the real gains come from using Hollow on an ongoing basis. If your data lives in Hollow, many possibilities open up. Imagine being able to pull the entire production dataset, either current or from any point in the recent past, down to a local development workstation, load it, and then exactly reproduce specific production scenarios.
Choosing Hollow also gives you a head start on tooling; Hollow ships with many off-the-shelf utilities for gaining insight into and working with your datasets.
Resilience
How many nines of reliability would you like? Three? Four? Five? Nine? As a local, in-memory data store, Hollow isn’t exposed to environmental issues such as network failures, disk faults, or noisy neighbors in a centralized store. If your data producer goes down or your consumer cannot reach the data store, you may end up working with stale data, but the data will still be there and your service will still be up.
Hollow has been hardened by more than two years of continuous, demanding use at Netflix. We use it to serve the essential datasets needed to power the Netflix experience, on servers that answer live customer requests at or near maximum throughput. Even though Hollow goes to great lengths to squeeze every last bit of performance out of the server hardware, enormous attention to detail has gone into strengthening this critical piece of our infrastructure.
Origins
Three years ago we announced Zeno, our then-current solution in this space. Hollow replaces Zeno but is in many ways its spiritual successor.
Zeno’s concepts of producer, consumer, data states, snapshots, and deltas carry over to Hollow. As before, the timeline of a changing dataset can be broken down into discrete data states, each of which is a complete snapshot of the data at a particular point in time. Hollow handles state transitions automatically; the effort required from users to keep an up-to-date state is minimal. Hollow also automatically deduplicates data to minimize the heap footprint of our consumers’ datasets.
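Continuing the earlier sketch (same hypothetical Movie model), the snippet below shows how each producer cycle yields one discrete data state, with Hollow computing the transition between consecutive states.

```java
// Each runCycle() call produces one discrete data state: a complete
// snapshot of the dataset at that point in time. runCycle() returns the
// version identifier of the state it produced.
long v1 = producer.runCycle(state -> {
    state.add(new Movie(1, "The Matrix", 1999));
    state.add(new Movie(2, "Blade Runner", 1982));
});

// Hollow works out what changed between consecutive states and ships
// consumers a small delta rather than the whole snapshot. Identical values
// within a state (e.g. repeated strings) are stored only once, which keeps
// the consumers' heap footprint small.
long v2 = producer.runCycle(state -> {
    state.add(new Movie(1, "The Matrix", 1999));
    state.add(new Movie(2, "Blade Runner", 1982));
    state.add(new Movie(3, "2001: A Space Odyssey", 1968)); // the only new record in the delta
});
```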
Evolution
Hollow takes these concepts and evolves them, improving almost every aspect of the solution.
Hollow avoids using POJOs as the in-memory representation, replacing them instead with a compact, fixed-length, strongly typed encoding of the data. This encoding is designed both to minimize the heap footprint of datasets and to reduce the CPU cost of accessing the data on the fly. All encoded records are packed into reusable slabs of memory that are pooled on the JVM heap to avoid impacting garbage collection on busy servers.
Example of the in-memory layout of OBJECT records.

Hollow datasets are self-contained: no use-case-specific code needs to accompany a serialized blob for the framework to be able to use it. In addition, Hollow is designed with backward compatibility in mind, so deployments can happen less frequently.
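As an illustration of that backward compatibility (a sketch based on the hypothetical Movie model above, not an excerpt from Hollow itself): adding a field to a record type is a compatible change, so producers and consumers don’t have to be redeployed in lockstep.

```java
// Illustrative only: the data model is declared with plain classes, so
// evolving the schema is just a code change on the producer side.
static class Movie {
    int id;
    String title;
    int releaseYear;

    // Adding a new field is a backward-compatible change: consumers still
    // running the older generated API simply don't see it and can be
    // redeployed later, on their own schedule.
    String maturityRating;
}
```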
"The ability to build powerful access systems , regardless of whether they were originally intended to develop a data model."
Because Hollow is entirely in-memory, tooling can be built on the assumption that random access across the full breadth of the dataset can be performed without ever leaving the Java heap. Many off-the-shelf tools ship with Hollow, and building your own on top of the basic building blocks the library provides is straightforward.
Core to using Hollow is the concept of indexing the data in various ways. This provides O(1) access to the relevant records, which makes it possible to build powerful access patterns regardless of whether they were originally anticipated when the data model was designed.
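As a hedged sketch of what such an index looks like in code, the snippet below builds a HollowPrimaryKeyIndex over the hypothetical Movie type from the earlier sketches; the type name and field path depend entirely on your own data model.

```java
import com.netflix.hollow.core.index.HollowPrimaryKeyIndex;
import com.netflix.hollow.core.read.engine.HollowReadStateEngine;

// Build an index over the consumer's in-memory state; lookups are O(1)
// and never leave the Java heap.
HollowReadStateEngine readEngine = consumer.getStateEngine();
HollowPrimaryKeyIndex idx = new HollowPrimaryKeyIndex(readEngine, "Movie", "id");

int ordinal = idx.getMatchingOrdinal(2);   // find the record whose id == 2
if (ordinal != -1) {
    // The ordinal identifies the record; with a generated API you would read
    // its fields via something like api.getMovie(ordinal) (illustrative).
}
```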
Benefits
The Hollow tooling is easy to set up and intuitive to use. You will be able to gain insights into your data that you never even suspected were there.
The change-tracking tool lets you follow how specific records change over time.

Hollow enhances your capabilities. If something in a record looks wrong, you can find out exactly what changed and when with a simple query in the change-tracking tool. If disaster strikes and a bad dataset is accidentally published, the dataset can be rolled back to the state just before the error, stopping production problems in their tracks. Because transitions between states happen quickly, this action can take effect across an entire fleet of servers within seconds.
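A hedged sketch of what rolling back might look like from the client side, reusing the consumer from the earlier sketches; the version value is purely illustrative and would normally come from your announcement or versioning mechanism.

```java
// Point a consumer at a known-good earlier data state. Because the
// transition between states is just an in-memory delta application,
// the change takes effect in seconds. The version value below is a
// placeholder for whatever your announcement mechanism reports.
long knownGoodVersion = 20161205123000L;
consumer.triggerRefreshTo(knownGoodVersion);
```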
“If your data lives in Hollow, many possibilities open up.”
Hollow has proven to be an enormously useful tool at Netflix: we have seen across-the-board reductions in server startup times and heap footprints, even as the demand for metadata keeps growing. Thanks to targeted data-modeling efforts, guided by the detailed heap-usage analysis that Hollow makes possible, we will be able to improve performance even further.
Beyond the performance wins, we have seen enormous productivity gains related to the distribution of our catalog data. This is due partly to the tooling that Hollow provides and partly to architectural choices that would have been impossible without it.
Conclusion
Everywhere we look, we see problems that Hollow can solve. Today, Hollow is available for anyone in the world to use.
Hollow is not designed for datasets of arbitrary size. If the data is very large, keeping the entire dataset in memory is not feasible. However, with the right framing and a bit of data modeling, that threshold is likely much higher than you think.
Documentation is available at http://hollow.how, and the code is on GitHub. We recommend starting with the quick start guide: it takes only a few minutes to walk through the demo and see it working, and about an hour to get familiar with a full production-grade implementation of Hollow. After that, you can plug in your data model and off you go.
Once you get started, you can get help directly from us or from other users via Gitter, or on Stack Overflow using the tag “hollow”.