One of the main news sensations of recent days is the signing of the so-called “Yarovaya package”. The highlight of the program is the requirement to store traffic for up to six months and metadata for three years. This legislative initiative has become a full-fledged law that will have to be complied with. In fact, we are talking about the ubiquitous creation of constantly updated archives, which requires resolving all sorts of issues, including the selection of suitable data storage systems.
Archiving traffic and metadata does not imply the need for quick access. This information will not be used in the daily activities of companies, it should only be stored in case of a request from the authorities. This greatly facilitates the task of creating archives, allowing you to simplify their architecture.
In general, the archive should perform five main functions:
- Preserving data for future use.
- Providing constant user access to the stored data.
- Ensuring the confidentiality of access.
- Reducing the load on production systems by moving static data into the archive.
- Applying data retention policies.
The traffic and metadata transmitted by users are extremely varied and largely unstructured. Add to this the sheer volume and growth rate of the information, especially typical of large companies and services, and you get the classic definition of big data. In other words, the changes in legislation require solving, at scale, the problem of storing large volumes of data that flow into archive systems continuously and at high speed. Archives must therefore be not only high-performance but also horizontally scalable, so that storage capacity can be increased relatively easily when needed.
Big Data Archiving
The EMC InfoArchive platform is intended for solving such problems; we wrote about it a year ago. This is a flexible and powerful enterprise-class archiving platform, combining a storage system (NAS or SAN) with an archiving software platform.
The benefits of InfoArchive include:
- Support for international standards, including the open XML and OAIS (Open Archival Information System) standards.
- High security of stored data.
- Easy management of large volumes of structured and unstructured data.
- Ample configuration and scaling options.
InfoArchive consists of the following components:
- Web application: the main application, providing easy access to most of the system's settings and functions.
- Server: archiving services for the web application.
- XML repository (xDB): storage services for the InfoArchive server. The xDB database is included in the InfoArchive distribution and is installed automatically as part of the system core.
- Shell [optional]: a command-line tool for administrative tasks, data ingestion, management, and object queries.
- Data ingestion framework [optional].
InfoArchive high-level architecture:
Depending on security, licensing, and other considerations, InfoArchive can be installed on a single host or distributed across multiple hosts. In general, though, it is recommended to keep the repository architecture as simple as possible, since this reduces data transfer latency.
The logical architecture of InfoArchive is as follows:
Servers can be scaled vertically or can “specialize” in particular REST service functions: data ingestion, search, administration, and so on. This makes it possible to achieve any required degree of scalability, growing the archive as needs dictate, while load distribution is easily handled by a “classic” HTTP balancer.
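To make this functional split more concrete, here is a minimal sketch of how a client might address such a deployment. The hostnames, paths, and request format below are assumptions rather than the actual InfoArchive REST API; the point is only that separate balancer endpoints can front differently specialized server pools.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative only: the hostnames and paths are hypothetical balancer endpoints,
// not the real InfoArchive REST API.
public class ArchiveRestClient {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Ingestion requests go to the balancer address that fronts the ingest-only servers.
    static void ingest(String xmlBatch) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://archive-ingest.example.com/ia/batches"))
                .header("Content-Type", "application/xml")
                .POST(HttpRequest.BodyPublishers.ofString(xmlBatch))
                .build();
        HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("ingest status: " + response.statusCode());
    }

    // Search requests go to a different balancer address that fronts the search-only
    // servers, so heavy ingestion never competes with interactive queries.
    static String search(String urlEncodedQuery) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://archive-search.example.com/ia/search?q=" + urlEncodedQuery))
                .GET()
                .build();
        return HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```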
xDB
One of the key components of InfoArchive is the automatically deployable xDB DBMS. Its properties largely determine the capabilities of the entire system, so let's take a closer look at this component.
In xDB, XML documents and other data are stored in an integrated, scalable, high-performance object-oriented database. This DBMS is written in Java and allows you to store and manipulate very large amounts of data at high speed. The xDB transaction system satisfies the ACID properties: atomicity, consistency, isolation, and durability.
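The ACID guarantees matter most on the write path, where a batch of traffic records must either land in the archive completely or not at all. Here is a minimal sketch of such a transactional ingest, written against a hypothetical session interface that mimics a generic transactional XML store; it is not the actual xDB Java API.

```java
import java.util.List;

// Hypothetical interface: it stands in for a generic transactional XML store,
// not the real xDB Java API, and exists only to illustrate the ACID write path.
interface ArchiveSession extends AutoCloseable {
    void begin();                                   // start a transaction
    void storeXml(String library, byte[] document); // stage one XML document
    void commit();                                  // atomically publish all staged writes
    void rollback();                                // discard everything on failure
    void close();
}

public class TransactionalIngest {
    // Either the whole batch of records becomes visible in the archive, or none of it does.
    static void ingestBatch(ArchiveSession session, String library, List<byte[]> batch) {
        session.begin();
        try {
            for (byte[] doc : batch) {
                session.storeXml(library, doc);
            }
            session.commit();   // durability: once commit returns, the data survives a crash
        } catch (RuntimeException e) {
            session.rollback(); // atomicity: a half-written batch never appears in the archive
            throw e;
        }
    }
}
```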
Physically, each database consists of one or more segments. Each segment is spread across one or more files, and each file consists of one or more pages.
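This hierarchy can be pictured as a simple containment model. The sketch below is purely illustrative; the type names, fields, and the idea of a flat page payload are assumptions, not xDB internals.

```java
import java.util.List;

// Purely illustrative containment model of the physical layout described above.
record Page(long pageId, byte[] payload) {}             // the unit the page server reads and writes
record SegmentFile(String path, List<Page> pages) {}    // each file holds one or more pages
record Segment(String name, List<SegmentFile> files) {} // each segment is spread across one or more files
record Database(String name, List<Segment> segments) {} // each database consists of one or more segments
```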
The relationship between physical and logical xDB structures:
The back-end server in xDB is the so-called page server, which transfers data pages to front-end (client) applications. In environments where the database is accessed from a single application server, performance is usually better if the page server runs inside the same JVM as that application server.
If the page server performs other tasks as well, it is called an internal server; if no additional tasks are assigned to it, it is called a dedicated server. A dedicated server combined with a TCP/IP connection to its clients scales better than an internal one. The larger the archive and the more front-end applications access the page server, the stronger the case for making it a dedicated server.
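The choice between the two topologies is essentially a deployment-time switch. The sketch below is hypothetical (it does not reflect xDB's real configuration mechanism or connection strings); it only contrasts the in-process and dedicated options described above.

```java
// Hypothetical configuration sketch: the enum, class, and connection strings are
// illustrative only and do not correspond to xDB's actual configuration API.
enum PageServerMode {
    INTERNAL,   // page server runs in the same JVM as the single application server
    DEDICATED   // page server runs in its own JVM; clients connect over TCP/IP
}

class PageServerTopology {
    static String connectionUrl(PageServerMode mode) {
        switch (mode) {
            case INTERNAL:
                // one front-end, lowest latency: keep the page server in-process
                return "inproc://local-data-directory";
            case DEDICATED:
                // many front-ends or a large archive: a dedicated process scales better
                return "tcp://pageserver.example.com:4242";
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }
}
```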
xDB clustering
xDB can be deployed either on a single node or as a cluster with a shared-nothing architecture based on Apache Cassandra. You do not have to be a Cassandra expert, although knowing its basics will make it easier to configure and manage the cluster.
xDB clustering relies on horizontal scaling, in which data is physically distributed across multiple servers (data nodes). These nodes do not share a file system and communicate with each other over the network. Clusters also use configuration nodes, which hold complete information about the cluster structure.
Both a single node and a cluster act as a logical database container. A page server works with a node's data directory — a structure containing one or more databases. Each node contains its own page server, and all of these servers are combined into a cluster through the configuration node.
Thanks to xDB's clustering capabilities, InfoArchive users can work with big data arriving at high speed. This is especially relevant for continuously recording the user traffic of large web services and telecom providers, where a single node can run up against CPU limits and/or the capacity of the disk subsystem.
Cluster example:
An xDB cluster consists of three types of components:
- Data nodes: the information repositories. Each node acts as a separate back-end server that listens for and accepts requests from other cluster members. The client driver(s) make the cluster look as if it consisted of a single node; a driver should not be pointed directly at data nodes, only at configuration nodes.
- Configuration nodes: they store metadata about all the data nodes: databases, segments, files, users, groups, and the distribution of data across the nodes. If the configuration nodes fail, the cluster becomes unusable, so their contents should be duplicated.
- Client drivers: remote xDB drivers initialized with a bootstrap URL pointing to one of the configuration nodes. Applications interact with the cluster through a client driver, for example to create new databases, move information between nodes, or add new XML documents. The client driver automatically routes data requests to the appropriate data nodes, guided by the information held on the configuration nodes (see the sketch after this list).
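To illustrate how an application might work with such a cluster, here is a minimal sketch written against a hypothetical client-driver facade. The interface, method names, and bootstrap URL format are illustrative stand-ins, not the real xDB driver API.

```java
import java.util.List;

// Hypothetical client-driver facade: names and semantics are illustrative only.
interface ClusterDriver extends AutoCloseable {
    void createDatabase(String name);
    void addXmlDocument(String database, String library, byte[] xml);
    List<String> query(String database, String xquery);
    void close();
}

public class ClusterClientExample {
    // The driver passed in here is assumed to have been bootstrapped with the address
    // of a configuration node (e.g. "bootstrap://confnode1:2910"), never with the
    // addresses of individual data nodes.
    static void archiveAndSearch(ClusterDriver driver) {
        driver.createDatabase("traffic-archive");
        driver.addXmlDocument("traffic-archive", "2016-08", "<session>...</session>".getBytes());

        // The driver routes the query to whichever data nodes hold the relevant data,
        // using the metadata kept on the configuration nodes.
        for (String hit : driver.query("traffic-archive", "/session[@user='42']")) {
            System.out.println(hit);
        }
    }
}
```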
Data Partitioning
Data distribution in xDB is done at the level of detachable libraries. A detachable library is a block of data that can be moved within the cluster; such libraries are typically used for partitioning. Each library is always stored on, and tied to, a single data node, although a node can also store libraries tied to other nodes.
Detachable libraries also help deal with cluster load imbalance: if some data nodes in a cluster are overloaded, some of their libraries can be moved to other nodes.
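A minimal sketch of such a rebalancing step is shown below. The node model, load metric, threshold, and moveLibrary() call are all hypothetical; they only illustrate the idea of re-attaching a detachable library to a less loaded node.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative rebalancing sketch: DataNode and ClusterAdmin are hypothetical,
// not part of the real xDB administration API.
record DataNode(String name, double loadFactor, List<String> libraries) {}

interface ClusterAdmin {
    // asks the cluster to re-attach a detachable library to a different data node
    void moveLibrary(String library, String fromNode, String toNode);
}

public class Rebalancer {
    // If the busiest node is much more loaded than the idlest one, move one of its libraries over.
    static void rebalanceOnce(ClusterAdmin admin, List<DataNode> nodes) {
        DataNode hottest = nodes.stream()
                .max(Comparator.comparingDouble(DataNode::loadFactor)).orElseThrow();
        DataNode coolest = nodes.stream()
                .min(Comparator.comparingDouble(DataNode::loadFactor)).orElseThrow();
        if (hottest.loadFactor() - coolest.loadFactor() > 0.3 && !hottest.libraries().isEmpty()) {
            admin.moveLibrary(hottest.libraries().get(0), hottest.name(), coolest.name());
        }
    }
}
```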
Possible application of data partitioning in InfoArchive:
Using EMC Isilon with EMC InfoArchive
As storage for InfoArchive, you can use EMC Isilon, a horizontally scalable network-attached storage system that provides enterprise-grade functionality and makes it possible to manage growing volumes of historical data effectively. The EMC Isilon cluster architecture combines ease of use with reliability and ensures linear growth of both capacity and performance.
Initially the system may be relatively small, but over time it can grow significantly. A solution based on EMC Isilon lets you start at 18 TB and grow to 50 PB within a single file system. As volumes increase, only the number of nodes in the cluster grows, while a single point of administration and a single file system are maintained. Thus, EMC Isilon can be used as a single consolidated storage system throughout the entire life cycle of the stored information. This approach avoids mixing different storage systems, which simplifies implementation, maintenance, expansion, and modernization. As a result, the cost of operating and maintaining a single EMC Isilon solution is significantly lower than that of a solution composed of several traditional systems.
Thus, the EMC Isilon storage system effectively complements the InfoArchive archive platform.
Main advantages:
- Integration with SmartLock software provides write-once, read-many (WORM) compliance at the database and storage levels, preventing unintended, premature, or malicious alteration or deletion of data.
- Cost reduction and infrastructure optimization. Isilon's horizontally scalable NAS delivers storage utilization of more than 80%, while Isilon SmartDedupe software reduces storage requirements by a further 35%.
- EMC Isilon, which underlies the EMC Federated Business Data Lake, supports the HDFS protocol, so all types of data in InfoArchive can be made available to any Hadoop distribution.
- EMC Isilon lets you quickly add storage without downtime, manual data migration, or reconfiguration of application logic, saving valuable IT resources and reducing operating costs.
Real-world examples
To wrap up, here are a couple of examples of InfoArchive being used to build mission-critical archiving systems.
Example 1. The company processes financial transactions. Data arrives in the archive as compressed XML in 12 parallel streams: about 20 million operations per day (about 320 GB of XML), 3.6 million operations / 56 GB per hour, 1017 operations / 16 MB per second.
Ingestion load and search performance do not depend on the number of objects already archived. Processing a selective (discriminant) query takes about 1 second (1 result); a non-selective query takes about 4 seconds (200 results).
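The per-hour and per-second figures are easy to sanity-check: each granularity implies roughly the same average payload per operation, about 16 KB of XML. A small check, assuming binary units (1 GB = 1024 MB, 1 MB = 1024 KB):

```java
// Sanity check of the Example 1 figures: each granularity implies roughly
// the same average payload per operation (~16 KB of XML).
public class ThroughputCheck {
    public static void main(String[] args) {
        double perDayKb  = 320.0 * 1024 * 1024 / 20_000_000;  // 320 GB over 20 million operations
        double perHourKb = 56.0 * 1024 * 1024 / 3_600_000;    // 56 GB over 3.6 million operations
        double perSecKb  = 16.0 * 1024 / 1017;                // 16 MB over 1017 operations
        System.out.printf("average operation size: %.1f KB (daily), %.1f KB (hourly), %.1f KB (per second)%n",
                perDayKb, perHourKb, perSecKb);
        // prints roughly 16.8, 16.3 and 16.1 KB respectively
    }
}
```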
Example 2. At another customer, data enters the archive as AFP documents together with structured and unstructured data.
Average / peak load per day:
- 0.5 / 4 million documents
- 20/60 million records
- 50,000 / 70,000 search operations
System performance when adding data to the archive:
- 1.5 million documents per hour in 12 simultaneous streams (about 60% of the time is spent converting from AFP to PDF),
- or 45 million structured entries per hour in 10 simultaneous streams.
This represents less than 0.5% of the theoretical maximum performance of the EMC Centera storage system used in the project.
- The average document search time is 0.5 seconds.
- The average time to receive a document is 1.5 seconds.
- The average search time for structured data among 1 billion records is 2.5 seconds.
- Up to 15,000 searches per hour.
Conclusion
The signing of the “Yarovaya package” is a difficult test for the entire IT and telecom sector. Nevertheless, the hard task of building numerous high-speed, high-capacity archives can be solved with EMC InfoArchive, a bundle of a storage system and an xDB-based software platform with ample scaling and configuration capabilities.