
HBase: loading large data sets through bulk load

Hello, colleagues.
I want to share my experience with HBase, specifically with bulk loading. It is another way to load data, and it is fundamentally different from the usual approach (writing to the table through the client). Bulk load is said to load huge data sets very quickly, and that is what I decided to check.

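So, first things first. Loading through bulk load occurs in three stages: the source data files are placed in HDFS, a MapReduce job converts them into HFiles, and the bulk-load function attaches the resulting HFiles to the HBase table.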




In this case, I wanted to get a feel for the technology and understand it in numbers: how fast it is and how the speed depends on the number and size of the files. The numbers depend heavily on external conditions, but they help to see the order-of-magnitude difference between regular loading and bulk load.

Initial data:


A cluster running Cloudera CDH4, HBase 0.94.6-cdh4.3.0.
Three virtual hosts (on a hypervisor), each configured as CentOS / 4 CPU / 8 GB RAM / 50 GB HDD.
Test data stored in CSV files of various sizes, with total volumes of 2 GB, 3.5 GB, 7.1 GB and 14.2 GB.
First, about the results:

Bulk loading


Speed:


Size of one record (row): 0.5 KB
MapReduce job initialization time: 70 sec
Uploading files to HDFS from the local file system:


Loading via clients:


Loading was carried out from 2 hosts, with 8 threads on each.
The clients were started by cron at the same time; the CPU load did not exceed 40%.
The size of one record (row), as in the previous case, was 0.5 KB.



What is the result?




I ran this test after hearing bulk load described as a method of ultrafast data loading. It should be said that the official documentation only talks about reducing the load on the network and the CPU. Either way, I do not see a big gain in speed. The tests show that bulk load is only about one and a half times faster, and that is without counting the initialization of the MapReduce job. In addition, the data first has to be delivered to HDFS, which also takes time.
I think bulk load should be treated simply as another, architecturally different way to load data (in some cases a very convenient one).
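To make this trade-off concrete, here is a rough sketch of the arithmetic. Only the 70-second job initialization and the 1.5x ratio come from the tests; the client rate, the HDFS upload rate, and the class and method names are made-up assumptions for illustration, so treat the output as an example of the reasoning and not as a measurement.

```java
final class BulkLoadBreakEven {

    /** Seconds needed to load sizeMb through regular clients. */
    static double clientSeconds(double sizeMb, double clientMbPerSec) {
        return sizeMb / clientMbPerSec;
    }

    /** Seconds needed for bulk load: job init + upload to HDFS + the load itself. */
    static double bulkSeconds(double sizeMb, double bulkMbPerSec,
                              double hdfsMbPerSec, double jobInitSec) {
        return jobInitSec + sizeMb / hdfsMbPerSec + sizeMb / bulkMbPerSec;
    }

    /**
     * Data size (MB) above which bulk load finishes first.
     * Returns positive infinity if bulk load never catches up.
     */
    static double breakEvenMb(double clientMbPerSec, double bulkMbPerSec,
                              double hdfsMbPerSec, double jobInitSec) {
        double gainPerMb = 1.0 / clientMbPerSec
                - 1.0 / hdfsMbPerSec
                - 1.0 / bulkMbPerSec;
        return gainPerMb > 0 ? jobInitSec / gainPerMb : Double.POSITIVE_INFINITY;
    }

    public static void main(String[] args) {
        double jobInitSec = 70.0;            // measured in the test above
        double clientRate = 10.0;            // ASSUMPTION: MB/s, not a measured value
        double bulkRate = 1.5 * clientRate;  // "one and a half times faster"
        double hdfsRate = 50.0;              // ASSUMPTION: MB/s, not a measured value

        // Data set sizes from the test, taking 1 GB as 1000 MB.
        for (double sizeMb : new double[] {2000, 3500, 7100, 14200}) {
            System.out.printf("%6.0f MB: client %6.0f s, bulk load %6.0f s%n",
                    sizeMb,
                    clientSeconds(sizeMb, clientRate),
                    bulkSeconds(sizeMb, bulkRate, hdfsRate, jobInitSec));
        }
        System.out.printf("break-even at about %.0f MB%n",
                breakEvenMb(clientRate, bulkRate, hdfsRate, jobInitSec));
    }
}
```

With these assumed rates, client loading is still ahead at 2 GB and 3.5 GB, and bulk load only starts to win once the data set passes 5250 MB. With a slow enough upload to HDFS it would never win at all.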

And now about the implementation


Theoretically, everything is quite simple, but in practice there are several technical nuances.

    // Create the job
    Job job = new Job(configuration, JOB_NAME);
    job.setJarByClass(BulkLoadJob.class);

    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    job.setMapperClass(DataMapper.class);
    job.setNumReduceTasks(0);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(HFileOutputFormat.class);

    FileInputFormat.setInputPaths(job, inputPath);
    HFileOutputFormat.setOutputPath(job, new Path(outputPath));

    HTable dataTable = new HTable(jobConfiguration, TABLE_NAME);
    HFileOutputFormat.configureIncrementalLoad(job, dataTable);

    // Run the job
    ControlledJob controlledJob = new ControlledJob(job, null);
    JobControl jobController = new JobControl(JOB_NAME);
    jobController.addJob(controlledJob);

    Thread thread = new Thread(jobController);
    thread.start();
    . . .
    // Grant permissions on the job output
    setFullPermissions(JOB_OUTPUT_PATH);

    // Run the bulk load
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(jobConfiguration);
    loader.doBulkLoad(new Path(JOB_OUTPUT_PATH), dataTable);




The bulk-load step moves the generated HFiles on behalf of the hbase user, so you must either run the job as the hbase user or grant permissions on the output files (which is what I did).



    // Create the table, pre-split into regions
    HTableDescriptor descriptor = new HTableDescriptor(Bytes.toBytes(tableName));
    descriptor.addFamily(new HColumnDescriptor(Constants.COLUMN_FAMILY_NAME));

    HBaseAdmin admin = new HBaseAdmin(config);

    byte[] startKey = new byte[16];
    Arrays.fill(startKey, (byte) 0);

    byte[] endKey = new byte[16];
    Arrays.fill(endKey, (byte) 255);

    admin.createTable(descriptor, startKey, endKey, REGIONS_COUNT);
    admin.close();
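The table creation above relies on createTable(descriptor, startKey, endKey, numRegions) to spread the regions over the key range. As far as I understand the HBaseAdmin documentation, startKey becomes the end key of the first region, endKey becomes the start key of the last region, and the range between them is divided evenly. The sketch below imitates that with plain BigInteger arithmetic so that it runs without a cluster. EvenKeySplits and its methods are hypothetical names, and the rounding may differ from what HBase itself produces.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EvenKeySplits {

    /**
     * Returns numRegions - 1 boundary keys: startKey, evenly spaced
     * intermediate keys, and endKey. Keys are treated as unsigned
     * big-endian numbers of the same width as startKey.
     */
    static List<byte[]> splitKeys(byte[] startKey, byte[] endKey, int numRegions) {
        if (numRegions < 3) {
            throw new IllegalArgumentException("need at least 3 regions");
        }
        BigInteger lo = new BigInteger(1, startKey);
        BigInteger hi = new BigInteger(1, endKey);
        BigInteger range = hi.subtract(lo);
        int steps = numRegions - 2; // number of intervals between startKey and endKey

        List<byte[]> keys = new ArrayList<>();
        for (int i = 0; i <= steps; i++) {
            BigInteger offset = range.multiply(BigInteger.valueOf(i))
                                     .divide(BigInteger.valueOf(steps));
            keys.add(toFixedWidth(lo.add(offset), startKey.length));
        }
        return keys;
    }

    /** Converts a non-negative value to exactly width bytes, left-padded with zeros. */
    static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte or be shorter
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] startKey = new byte[16];
        Arrays.fill(startKey, (byte) 0);
        byte[] endKey = new byte[16];
        Arrays.fill(endKey, (byte) 255);

        // 10 is an arbitrary stand-in for REGIONS_COUNT.
        for (byte[] key : splitKeys(startKey, endKey, 10)) {
            System.out.println(hex(key));
        }
    }
}
```

One consequence of this scheme is that the first region (keys below startKey) and the last region (keys from endKey upward) receive almost no rows when the row keys are 16 bytes long, so the data is effectively spread over the regions in between.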




That is all, in general. I should say that this is a rather rough test without any tricky optimizations, so if you have something to add, I will be glad to hear it.

All project code is available on GitHub: github.com/2anikulin/hbase-bulk-load

Source: https://habr.com/ru/post/195040/

