
Archiving as an answer to the “Yarovaya package”





One of the biggest news stories of recent days is the signing of the so-called “Yarovaya package”. Its headline requirement is to store traffic for up to six months and metadata for three years. This legislative initiative has become full-fledged law that companies will have to abide by. In effect, it mandates the ubiquitous creation of constantly updated archives, which raises all sorts of questions, including the selection of suitable data storage systems.



Archiving traffic and metadata does not imply the need for quick access. This information will not be used in companies’ daily operations; it only has to be stored in case of a request from the authorities. This considerably eases the task of creating archives and allows their architecture to be simplified.



In general, the archive should perform five main functions:




The traffic and metadata generated by users are highly varied and unstructured. Add the enormous volume and growth rate of this information, especially at large companies and services, and you get the classic definition of big data. In other words, the legislative changes require a large-scale solution for storing big data that flows into archive systems constantly and at high speed. Archives should therefore be not only high-performance but also horizontally scalable, so that storage capacity can be increased relatively easily when needed.



Big Data Archiving



For such problems EMC offers the InfoArchive platform, which we wrote about a year ago. It is a flexible and powerful enterprise-class archiving solution that combines a storage system (NAS or SAN) with an archiving software platform.



The benefits of InfoArchive include:





InfoArchive consists of the following components:







InfoArchive high-level architecture:







Depending on security, licensing, and other considerations, InfoArchive can be installed on a single host or distributed across multiple hosts. In general, though, the simplest possible architecture is recommended when creating a repository, since it reduces data-transfer latency.



The logical architecture of InfoArchive is as follows:







Servers can scale vertically or “specialize” in particular REST service functions: data ingestion, search, administration, and so on. This allows any degree of scalability, growing the archive in line with demand, while load distribution is easily handled by a “classic” HTTP balancer.
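As a toy illustration of the “classic” balancing just mentioned, here is a round-robin selector over pools of function-specialized REST hosts. The pool names and host addresses are invented for the sketch; in practice this job is done by an HTTP balancer such as nginx or HAProxy, not application code:

```python
from itertools import cycle

# Hypothetical pools of InfoArchive REST hosts, each "specialized"
# in one function; names and ports are illustrative only.
POOLS = {
    "ingest": ["ingest-1:8765", "ingest-2:8765"],
    "search": ["search-1:8765", "search-2:8765", "search-3:8765"],
}

# One round-robin iterator per pool -- the default scheme most
# HTTP balancers apply out of the box.
_iters = {name: cycle(hosts) for name, hosts in POOLS.items()}

def pick_backend(function: str) -> str:
    """Return the next backend host for a given REST function."""
    return next(_iters[function])
```

Adding capacity then amounts to appending a host to the relevant pool, which mirrors how the archive is meant to grow with demand.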



xDB



One of the key components of InfoArchive is the automatically deployable xDB DBMS. Its properties largely determine the capabilities of the entire system, so let's take a closer look at this component.



In xDB, XML documents and other data are stored in an integrated, scalable, high-performance object-oriented database. This DBMS is written in Java and allows you to save and manipulate very large amounts of data with high speed. The xDB transaction system satisfies the ACID rules: atomicity, consistency, isolation, and durability.
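To make the ACID point concrete, here is a minimal demonstration of atomicity using Python's sqlite3 as a stand-in for any ACID-compliant store. Nothing below is xDB-specific (xDB exposes a Java API); it only shows the guarantee the article refers to: a failed transaction leaves no partial state behind.

```python
import sqlite3

# Open an in-memory database and create a tiny "archive" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive (id INTEGER PRIMARY KEY, doc TEXT)")

try:
    # "with conn" opens a transaction: commit on success,
    # rollback if an exception escapes the block.
    with conn:
        conn.execute("INSERT INTO archive (doc) VALUES ('record-1')")
        raise RuntimeError("simulated crash mid-transaction")
except RuntimeError:
    pass

# Atomicity: the interrupted transaction was rolled back entirely.
count = conn.execute("SELECT COUNT(*) FROM archive").fetchone()[0]
```

After the simulated crash, `count` is 0: either every statement in the transaction applies, or none does.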



Physically, each database consists of one or more segments. Each segment spans one or more files, and each file consists of one or more pages.
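A schematic way to picture the segment → file → page hierarchy is to map a byte offset within a segment to a file and a page index. The 4 KiB page size and the layout below are illustrative assumptions for the sketch, not xDB's actual on-disk format:

```python
PAGE_SIZE = 4096  # illustrative page size; not taken from xDB internals

def locate(byte_offset: int, file_sizes: list[int]) -> tuple[int, int]:
    """Map a byte offset within a segment to (file index, page index).

    Purely schematic: a segment is modeled as a sequence of files,
    and each file as a sequence of fixed-size pages.
    """
    for i, size in enumerate(file_sizes):
        if byte_offset < size:
            return i, byte_offset // PAGE_SIZE
        byte_offset -= size
    raise ValueError("offset beyond segment")
```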



The relationship between physical and logical xDB structures:







The back-end application server in xDB is the so-called page server, which transfers data pages to front-end (client) applications. In environments where the database is accessed from a single application server, performance is usually better if the page server runs within the same JVM as the application server.



If the page server also performs other tasks, it is called an internal server; if no additional tasks are assigned to it, it is called a dedicated server. A dedicated server combined with a TCP/IP connection between it and the clients scales better than an internal server. The larger the archive and the more front-end applications access the page server, the stronger the case for making it a dedicated server.



xDB clustering



xDB can be deployed on a single node or on a cluster with a shared-nothing architecture built on Apache Cassandra. You don't have to understand Cassandra, although knowing its basics will make the cluster easier to configure and manage.



xDB clustering relies on horizontal scaling, in which data is physically distributed across multiple servers (data nodes). The nodes share no common file system and communicate with each other over the network. Clusters also use configuration nodes that hold complete information about the cluster structure.



Both a single node and a cluster act as a logical database container. A page server works with a node's data directory — a structure containing one or more databases. Each node contains its own page server, and all of these servers are combined into a cluster through the configuration node.



Thanks to xDB clustering, InfoArchive users can handle big data arriving at high speed. This is especially relevant for the continuous recording of user traffic at large web services and telecom providers, where a single node can run up against CPU limits and/or the capacity of the disk subsystem.



Cluster example:







An xDB cluster consists of three types of components:





Data Partitioning



Data distribution in xDB is done at the level of detachable libraries. A detachable library is a block of data that can be moved within a cluster; such libraries are typically used for partitioning. Each library is always stored on, or tied to, a single data node, although a node can also store libraries tied to other nodes.



Detachable libraries also help to combat cluster load imbalance: if some data nodes are overloaded, part of their libraries can be transferred to other nodes.
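The rebalancing idea above can be sketched as a simple greedy loop: repeatedly move a library from the most loaded node to the least loaded one until the spread falls under a threshold. Node and library names are invented for illustration; real xDB administration works through its own tooling:

```python
def rebalance(placement: dict[str, list[str]], max_skew: int = 1) -> dict[str, list[str]]:
    """Greedily even out library counts across data nodes.

    placement maps node name -> list of library ids; mutated in place.
    Stops once the busiest and idlest nodes differ by at most max_skew.
    """
    while True:
        busiest = max(placement, key=lambda n: len(placement[n]))
        idlest = min(placement, key=lambda n: len(placement[n]))
        if len(placement[busiest]) - len(placement[idlest]) <= max_skew:
            return placement
        # Detach one library from the overloaded node and
        # attach it to the underloaded one.
        placement[idlest].append(placement[busiest].pop())
```

This uses library count as a stand-in for load; a real policy would weigh library size or query traffic instead.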



Possible application of data partitioning in InfoArchive:







Using EMC Isilon with EMC InfoArchive



As storage for InfoArchive you can use EMC Isilon, a horizontally scalable network storage system that provides enterprise-grade functionality and makes it possible to manage growing volumes of historical data effectively. The EMC Isilon cluster architecture combines ease of use and reliability in one system while ensuring linear growth of both capacity and performance.



Initially the system may be relatively small, but over time it can grow significantly. A solution based on EMC Isilon lets you start at 18 TB and grow to 50 PB within a single file system. As volumes increase, only the number of nodes in the cluster grows; a single point of administration and a single file system are maintained. EMC Isilon can therefore serve as a single consolidated storage system throughout the entire information life cycle. This approach avoids juggling multiple storage systems, which simplifies implementation, maintenance, expansion, and modernization. As a result, the cost of operating and maintaining a single EMC Isilon solution is significantly lower than that of a solution assembled from several traditional systems.



Thus, the EMC Isilon storage system effectively complements the InfoArchive archive platform.



Main advantages:





Real-world examples



To close out the article, here are a couple of examples of using InfoArchive to build mission-critical archiving systems.



Example 1. A company processes financial transactions; the data arrives in the archive as compressed XML in 12 parallel streams: about 20 million operations per day (about 320 GB of XML), 3.6 million operations / 56 GB per hour, 1017 operations / 16 MB per second.
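The figures above are easy to cross-check. The input numbers below come straight from the example; only the arithmetic is added. Note the per-hour and per-second rates agree with each other, while the daily total implies the day's volume is ingested in roughly 5.7 hours at that rate:

```python
# Figures from Example 1.
ops_per_day, gb_per_day = 20_000_000, 320
ops_per_sec, mb_per_sec = 1017, 16

avg_ops_per_sec = ops_per_day / 86_400          # ~231 op/s averaged over 24 h
kb_per_op = gb_per_day * 1024**2 / ops_per_day  # ~16.8 KB of XML per operation
peak_ops_per_hour = ops_per_sec * 3600          # ~3.66 M, matching "3.6 million / hour"
peak_gb_per_hour = mb_per_sec * 3600 / 1024     # ~56.3 GB, matching "56 GB / hour"
hours_to_ingest_day = gb_per_day / peak_gb_per_hour  # ~5.7 h at the stated rate
```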







Ingest load and search performance do not depend on the number of objects already archived. Processing a discriminant query takes about 1 second (1 result); a non-discriminant query takes about 4 seconds (200 results).



Example 2. At another customer, data arrives in the archive in the form of AFP documents plus structured and unstructured data.



Average / peak load per day:





System performance when adding data to the archive:





This represents less than 0.5% of the theoretical maximum performance of the EMC Centera storage system used in the project.









Conclusion



The signing of the “Yarovaya package” is a difficult test for the entire IT and telecom sector. Nevertheless, the hard task of creating numerous high-speed, high-volume archives can be solved with EMC InfoArchive: a bundle of a storage system and a software platform based on xDB, with broad options for scaling and configuration.

Source: https://habr.com/ru/post/305940/


