
Salesforce Data Migration with Pentaho Data Integration

Hi, Habr.

My name is Ilya Grebtsov, I work as a Java / JS Developer at DataArt. I want to share something useful with those who work with Salesforce.

In Salesforce, the task of massively creating / changing / deleting groups of related records in several objects (the analogs of tables in a relational database) comes up often. For example, the frequently used standard objects Account (information about the client's company) and Contact (information about the client himself). The problem is that when you save a Contact record, you must specify the Id of the related Account record, that is, the Account must already exist at the moment the Contact record is added.
In reality, the relationships can be even more complicated: for example, the Opportunity object refers to both Account and Contact. References to any custom (non-standard) objects are also possible. In any case, the referenced record must be created before the record that refers to it.

Let us look at the ways to solve this problem:

Anonymous APEX


You need to prepare an APEX script and then execute it in the Salesforce Developer Console. In the script, the related objects are filled in sequence. In the example below, a test Account is inserted first, then a Contact. When inserting the Contact, the Account Id obtained after inserting the Account is used.
Account[] accounts = new List<Account>();
accounts.add(new Account(
    Name = 'test'
));
insert accounts;

Contact[] contacts = new List<Contact>();
contacts.add(new Contact(
    AccountId = accounts[0].Id,
    FirstName = 'test',
    LastName = 'test'
));
insert contacts;

Pros:
  - quick to write and run: no deployment and no unit-test coverage are required.

Cons:
  - the script is executed manually from the Developer Console;
  - Salesforce governor limits apply, so only a small number of records can be processed in one run.

Thus, the method is only suitable for small, simple, manual changes.

Batch APEX


When you need to change a set of records that already exist inside Salesforce, you can use Batch APEX. Unlike the previous method, it allows you to process up to 10,000 records, according to Salesforce limits. A batch is a custom class written in APEX that implements the Database.Batchable interface.

You can run the class manually from the Developer Console:
Database.Batchable<sObject> batch = new myBatchClass();
Database.executeBatch(batch);

Or you can create a Job so that the process starts at a specific time.
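
For reference, a minimal sketch of what such a batch class might look like; the query and the field change inside it are arbitrary placeholders, not real migration logic:

global class myBatchClass implements Database.Batchable<sObject> {

    global Database.QueryLocator start(Database.BatchableContext bc) {
        // Select the records to process (placeholder query)
        return Database.getQueryLocator('SELECT Id, Name FROM Account');
    }

    global void execute(Database.BatchableContext bc, List<sObject> scope) {
        // Modify each chunk of records and save the changes (placeholder logic)
        for (sObject record : scope) {
            record.put('Name', (String) record.get('Name') + ' (updated)');
        }
        update scope;
    }

    global void finish(Database.BatchableContext bc) {
        // Post-processing: notifications, logging, etc.
    }
}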

Thus, the method is suitable for large-scale data changes within Salesforce, but it is quite laborious. When deploying from a sandbox to production, the batch class, like any other APEX code, must be covered by unit tests.

Data Loader


Data Loader is a standard Salesforce utility that is installed locally. It allows you to process up to 5 million records. Migration with the Data Loader is considered best practice and is the most popular way of processing a large number of records. Records are downloaded and uploaded through the Salesforce API.

The utility allows you to select an object in Salesforce and export its data to a CSV file, and, vice versa, to load data from a CSV file into a Salesforce object.

Processing of the data that already exists in Salesforce looks as follows:
  1. Export the required data from Salesforce to CSV files.
  2. Modify the data in the CSV files.
  3. Load the data back into Salesforce from the CSV files.


Point 2 here is a bottleneck that the Data Loader itself cannot handle: you have to create third-party procedures for processing the CSV files.

As an example, in order to insert several Contact records, data has to go into the related Account and Contact objects. The algorithm is as follows:
  1. Prepare a CSV file with the list of new Account records. Load it into Salesforce using the Data Loader. The result is a list of Account IDs.
  2. Prepare a CSV file with the list of new Contact records. In its AccountId field you must put an ID from the list obtained in step 1 (see the illustration after this list). This can be done manually or with any programming language.
  3. Load the resulting CSV with the Contact list into Salesforce.
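
To make step 2 concrete, here is roughly what the two files could look like; the names and the Account IDs below are made-up examples:

Account.csv (loaded in step 1):
Name
Acme Corp
Global Media

Contact.csv (loaded in step 3, with the IDs returned by step 1 in the AccountId column):
FirstName,LastName,AccountId
John,Smith,001500000012345AAA
Jane,Doe,001500000012346AAA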


Thus, the method is suitable for large-scale data changes, both within Salesforce and with external data. But it is very laborious, especially if the records need to be modified along the way.

Pentaho Data Integration


Pentaho Data Integration, also known as Kettle, is a general-purpose ETL utility, not a specialized Salesforce tool. It comes with Salesforce Input and Output steps, which lets it process Salesforce data as transparently as data from any other source: relational databases, SAP, XML files, CSV and others.

The utility works with Salesforce through the Salesforce API, so it can process up to 5 million records, just like the Data Loader, only in a more convenient way.

The main distinguishing feature is the graphical interface. The whole transformation is broken down into separate simple steps: read the data, sort, join, write the data. The steps are displayed as icons connected by arrows, so it is easy to see what data comes from where and where it goes.

There are at least two versions of the utility: a paid one with guaranteed support and a free one. The free Community Edition (Apache License v2.0) can be downloaded at http://community.pentaho.com/.

Developing a transformation in the simplest case does not require programming skills. If you want, though, you can use steps that embed routines written in Java or JavaScript.

The specifics of data migration with Pentaho Data Integration deserve a closer look. Below I describe my own experience and the difficulties I ran into.

Connection parameters for Salesforce should be specified in the Transformation Properties. Once set, they are available as variables in every step that needs them.
I recommend specifying:
  - the Salesforce web service URL;
  - the login;
  - the password.

For security reasons, you can leave the login and password fields empty; in that case Data Integration will prompt for them when the transformation is started.



Data is read with the Salesforce Input step. In the settings of this step you specify the connection parameters (here the variables created earlier come in handy), then either select an object and the list of fields to read, or provide an explicit query in the SOQL query language (similar to the SQL used in relational databases).
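
For example, a query for the Salesforce Input step could look like this; the object, fields and filter are arbitrary and only illustrate the syntax:

SELECT Id, Name, Industry
FROM Account
WHERE CreatedDate = LAST_N_DAYS:30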



Data insertion is performed by one of several Output steps:
  - Salesforce Insert;
  - Salesforce Update;
  - Salesforce Upsert.

It is also possible to delete records with the Salesforce Delete step.

As in the input step, you need to specify the connection parameters; here again the transformation variables are used. There is also finer tuning: the connection time-out parameters, after which the transformation fails, and the Salesforce-specific Batch Size parameter, the number of records transferred in one transaction. Increasing the Batch Size slightly speeds up the transformation, but it cannot exceed 200 (a Salesforce restriction). Besides, if there are triggers that do additional processing after insertion, a large Batch Size may lead to unstable behavior. The default is 10.



These two steps fully cover the capabilities of the Data Loader utility. Everything in between is the data processing logic, and it can be implemented directly in Pentaho Data Integration.

For example, one of the most frequently needed steps is joining two data streams: the very JOIN from SQL that is so sorely missed in SOQL. Here it is.
In the settings you can select the join type (Inner, Left Outer, Right Outer, Full Outer) and specify the key fields.



A mandatory requirement: the input data for this step must be sorted by the key fields. In Data Integration a separate Sorter step is used for that.

Sorting is done in RAM; however, if there is not enough memory, the data is spilled to an intermediate file on disk. Most of the sorter settings deal with exactly this case. Ideally you should avoid swapping to disk: it is ten times slower than sorting in memory. To do that, adjust the Sort size parameter: specify the upper limit of the number of rows that can theoretically pass through the sorter.

In the settings you can also select one or more key fields to sort by, in ascending or descending order of their values. If a field is a string, it makes sense to specify case sensitivity. The Presorted flag indicates that the rows are already sorted by this field.



Join and Sorter form a combination that appears in almost every transformation.



Sorting and joining affect transformation performance the most. Avoid unnecessary sorting when the data was already sorted a few steps earlier and its order has not changed since. But be careful: if unsorted data reaches the Join step, Data Integration does not stop and does not report an error; the result will simply be incorrect.

As a key you should always choose a short field. Data Integration lets you select several key fields for sorting and joining, but that noticeably slows down processing. As a workaround it is better to generate a surrogate key, so that only one field is used for the join. In the simplest case a surrogate key can be obtained by concatenating strings: for example, instead of joining on the FirstName and LastName fields, it is better to join on FirstName + ' ' + LastName. Going further, you can compute a hash (md5, sha2) of the resulting string. Unfortunately, Data Integration has no built-in step for hashing a string, but you can write one yourself with a User Defined Java Class.
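
A minimal sketch of the hashing logic, in plain Java of the kind you would embed in a User Defined Java Class (the step scaffolding is omitted, and the field names are just an assumption):

import java.math.BigInteger;
import java.security.MessageDigest;

public class SurrogateKey {

    // Builds one short join key out of several source fields
    public static String md5Key(String firstName, String lastName) throws Exception {
        String concatenated = firstName + " " + lastName;
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(concatenated.getBytes("UTF-8"));
        // Hex-encode the 16-byte digest, padded to 32 characters
        return String.format("%032x", new BigInteger(1, digest));
    }
}

The same approach works for sha2: request the "SHA-256" algorithm instead of "MD5".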

In addition to the steps above, Data Integration includes many others: filters, switch, union, row-processing steps, lookups into relational tables and web services, and more. There are also two universal steps that let you execute code in Java or JavaScript. I will not dwell on them in detail.

An unpleasant feature of working with Salesforce from Data Integration is the slow speed of inserting records through the Salesforce API: about 50 records per second (the same as with the standard Data Loader; calling the web service is itself a slow operation), which makes processing thousands of rows difficult. Fortunately, Data Integration makes it possible to insert in several streams. There is no standard solution, so here is what I came up with:



Here a JavaScript procedure generates a random stream number. The next step, Switch, distributes the rows by that number. Four separate Salesforce Insert steps then insert the records, raising the overall throughput to about 200 records per second. Finally, all inserted records, with the Id field filled in, are saved to a CSV file.

Parallel insertion speeds up data processing, but you cannot spawn streams indefinitely: according to Salesforce limitations, one user may have no more than 25 open connections.

The resulting transformation can be run right away on the local machine. Progress is displayed in Step Metrics: you can see which steps are running, how many records each step has read and passed on, and the processing speed of each step, which makes it easy to find the transformation's bottleneck.



For regular runs, Data Integration lets you create a job that starts on a condition or on a schedule, on the local machine or on a dedicated server.

Thank you for your attention. I hope Salesforce developers will add this useful tool to their arsenal.

Source: https://habr.com/ru/post/258927/

