Generator of large transaction graphs with criminal activity patterns

Good day.

A couple of years ago, our team (compliance at a Swiss bank) faced a very interesting task: we had to generate a large graph of transactions between customers, companies and ATMs, add patterns resembling money laundering and other criminal activity to it, and attach a minimal amount of information to the nodes of this graph (names, addresses, times, and so on). Of course, all the data had to be generated from scratch, without using any existing customer data.

To solve this problem we wrote a generator, which I would like to share with you. Below you will find the story of why we needed it and a description of the generator itself. For the impatient: the code is linked in the original post (see the source link at the end). I would be glad if someone benefits from our experience.

Why did we bother with all this?

Our team decided to take part in the LauzHack hackathon as a sponsor. One of the conditions for sponsors was to provide a real business task for the participants. Just at that time we had a very interesting project on automating the search for financial crimes and money laundering among our clients' transactions, and without thinking twice we decided to offer the same task to the hackathon participants.

For obvious reasons we could not use real data, so we had to create it. To make the task as close to reality as possible, we looked at the statistics of the real data and tried, as best we could, to bring the generated data close to the real distributions. We also did not skimp on the amount and complexity of the data: we did not need a solution that works on a graph of 100 nodes and 200 connections. We were looking for a solution capable of processing graphs with millions of nodes and billions of connections, and of taking into account all the available information about the nodes and connections.

What we did

What we ended up with is a generator that is quite fast (given the amount of data), interesting and configurable. Let's go through it in detail.

Data types

We want a graph of financial transactions, so the possible elements of this graph are:

Customer - the account of an abstract bank customer, so to speak. Described by name, email, age, occupation, political views, nationality, education and address.
Company - a business entity in the financial system. Described by company type, name and country.
ATM - roughly speaking, the point where money leaves the part of the graph under our control. Described by geographic coordinates.
Transaction - the transfer of money from one node of the graph to another. Described by start and end node, amount, currency and time.

To create this data we use Mimesis, an excellent library for generating fake data.
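As a rough illustration, the four record types described above could be modelled with plain dataclasses. This is only a sketch: the class and field names are hypothetical and are not taken from the generator's code, and in practice the field values would be filled in by a fake-data library such as Mimesis.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Customer:
    id: int
    name: str
    email: str
    age: int
    occupation: str
    political_views: str
    nationality: str
    education: str
    address: str


@dataclass
class Company:
    id: int
    type: str
    name: str
    country: str


@dataclass
class Atm:
    id: int
    latitude: float
    longitude: float


@dataclass
class Transaction:
    source: str      # identifier of the sending node (hypothetical format)
    target: str      # identifier of the receiving node
    amount: float
    currency: str
    time: datetime
```

Keeping the node identifiers as strings (for example "customer:1" or "atm:7") is just one way to refer to nodes of different types from a single transaction file.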

Creating a graph: basic entities

First you need to create all the basic entities: customers, companies and ATMs. The script takes the number of customers you want to create and derives the number of companies and ATMs from it. According to our data, the number of companies that have any significant number of transactions with customers is about 2.5% of the number of customers, and the number of ATMs is about 0.05% of the number of customers. These values are very rough and not configurable (they are hard-coded in the generator).
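The arithmetic is simple enough to show in a few lines. This is a sketch, not the generator's actual code: the function and constant names are made up, and rounding up to at least one company and one ATM for tiny inputs is my own assumption.

```python
# Ratios quoted in the text; everything else here is illustrative.
COMPANY_RATIO = 0.025   # companies are about 2.5% of customers
ATM_RATIO = 0.0005      # ATMs are about 0.05% of customers


def node_counts(n_customers: int) -> dict:
    """Derive the number of companies and ATMs from the number of customers."""
    return {
        "customers": n_customers,
        # Assumption: never return zero nodes of a type, even for tiny graphs.
        "companies": max(1, round(n_customers * COMPANY_RATIO)),
        "atms": max(1, round(n_customers * ATM_RATIO)),
    }
```

For one million customers this gives 25,000 companies and 500 ATMs.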

All information is saved in .csv files. The files are written in batches of k lines at a time; k is set via a script argument. The three types of nodes are generated in parallel.
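Batched writing could look roughly like this. It is a minimal sketch using only the standard library; the function name and signature are hypothetical, and the real generator may organise its output differently.

```python
import csv
from itertools import islice


def write_in_batches(path, header, rows, batch_size):
    """Write rows to a CSV file, batch_size lines at a time.

    rows can be any iterable (for example a generator), so only one
    batch needs to be held in memory at once. Returns the row count.
    """
    rows = iter(rows)
    written = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            writer.writerows(batch)
            written += len(batch)
    return written
```

Since each node type goes to its own file, three such writers could run in separate processes without sharing any state.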

Creating a graph: connections between entities

After creating the basic entities, we start connecting them to each other. At this stage we do not yet generate the transactions themselves, only the fact that two nodes are connected. This is done to speed up the generation of the whole graph and works roughly as follows: if two nodes are connected, we later generate a certain number of transactions between them, scattered over time. If they are not connected, there are no transactions between them.

The probability of a connection between two nodes is configured through arguments; the default values are listed below.

Possible types of connections:

Customer -> Customer (p = 0.4%)
Customer -> Company (p = 1%)
Customer -> ATM (p = 3%)
Company -> Customer (p = 0.5%)

Like the nodes, all types of connections are generated in parallel and written to their own files in batches.
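One straightforward way to get such connections is an independent coin flip for every ordered pair of nodes. The sketch below does exactly that; it is not necessarily how the generator does it, the names are hypothetical, and skipping self-links is my own assumption. Note that this naive version looks at every pair, so for millions of nodes a smarter sampling scheme would be needed.

```python
import random

# Default probabilities from the list above, as fractions.
DEFAULT_P = {
    ("customer", "customer"): 0.004,
    ("customer", "company"): 0.01,
    ("customer", "atm"): 0.03,
    ("company", "customer"): 0.005,
}


def generate_connections(sources, targets, p: float, rng: random.Random):
    """Yield (source, target) pairs, each kept independently with probability p."""
    for s in sources:
        for t in targets:
            # Assumption: a node is never connected to itself.
            if s != t and rng.random() < p:
                yield (s, t)
```

Each of the four connection types is independent of the others, which is what makes it easy to run them in parallel.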

Creating a graph: transactions

Now that we have the graph nodes and connections between them that follow the desired distribution, we can start generating transactions. The process itself is quite simple, but it is rather hard to parallelize. Therefore, at this stage there are only two independent parallel streams: transactions originating from customers and transactions originating from companies.

Nothing particularly interesting happens at this stage: the script runs through the list of connections and generates a random number of transactions for each one. All of this is written in exactly the same way, to .csv files in batches.
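In code, this step might look something like the sketch below. The number of transactions per connection, the amount range, the single currency and the time window are all placeholder assumptions of mine; the text does not specify the distributions the generator actually uses.

```python
import random
from datetime import datetime, timedelta


def generate_transactions(connections, rng: random.Random,
                          start: datetime, end: datetime, max_per_link: int = 10):
    """Yield a random number of transactions for every (source, target) link."""
    span = (end - start).total_seconds()
    for source, target in connections:
        # Assumption: between 1 and max_per_link transactions per connection.
        for _ in range(rng.randint(1, max_per_link)):
            yield {
                "source": source,
                "target": target,
                "amount": round(rng.uniform(1, 5000), 2),   # placeholder range
                "currency": "CHF",                          # placeholder currency
                # Scatter the transaction uniformly over the time window.
                "time": start + timedelta(seconds=rng.uniform(0, span)),
            }
```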

Creating a graph: patterns

This is where it gets interesting. The types of behavior patterns we wanted to have in the final graph:

Flow - a large amount goes from one node to m others, each of these m nodes passes the money on to the next level of n nodes, and so on, until the last level sends all the money to a single recipient.
Circular - an amount of money travels in a circle and returns to its source.
Time - a certain amount of money moves from one node to another at some fixed frequency.

Let's look at each of these patterns in more detail:

Flow

To begin with, the number of levels the money has to pass through is chosen. In our implementation this is a random number between 2 and 6; it is not configurable and is hard-coded. Next, two nodes of the graph are chosen: the sender and the recipient. A random amount that the sender will send to the recipient is also chosen (according to a clever formula: 50000 * random() + 50000 * random()).

Each member of this network takes a fee for its services. In our implementation, the total price of passing money through the network is at most 10% of the amount transferred by the sender.

The generated transactions are shifted in time relative to the transactions of the previous network level: the money first arrives at level n-1 and only then moves on to level n. The delays are chosen randomly in the range of 4-5 days. The generated transactions also have pseudo-random amounts (bounded by the initial amount and taking into account the fees paid to each node).
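To make the Flow description more concrete, here is a minimal sketch of how such a pattern could be produced. Only the numbers mentioned above (2 to 6 levels, the amount formula, the 10% cap on total fees, the 4-5 day delays) come from the text. Everything else is my assumption: the function names, the level width of 2 to 4 nodes, the even per-level fee, and the way each node splits its money across the whole next level.

```python
import random
from datetime import timedelta


def split(total, parts, rng):
    """Split total into `parts` pseudo-random positive shares that sum to total."""
    weights = [rng.random() + 0.1 for _ in range(parts)]
    scale = total / sum(weights)
    return [w * scale for w in weights]


def flow_pattern(sender, recipient, pool, rng: random.Random, start):
    """Return (amount, transactions); each transaction is (src, dst, amount, time)."""
    n_levels = rng.randint(2, 6)
    amount = 50000 * rng.random() + 50000 * rng.random()
    # Total fee is at most 10%; assume it is spread evenly over the levels.
    keep = (1 - rng.uniform(0.0, 0.10)) ** (1 / n_levels)
    layers = [[sender]]
    # Assumption: every intermediate level has 2 to 4 nodes drawn from pool.
    layers += [rng.sample(pool, rng.randint(2, 4)) for _ in range(n_levels)]
    layers.append([recipient])

    holdings = {sender: amount}
    when = start
    txs = []
    for depth, (prev, nxt) in enumerate(zip(layers, layers[1:])):
        incoming = dict.fromkeys(nxt, 0.0)
        for src in prev:
            # Intermediate nodes keep their fee; the original sender does not.
            out = holdings[src] * (keep if depth > 0 else 1.0)
            for dst, part in zip(nxt, split(out, len(nxt), rng)):
                txs.append((src, dst, part, when))
                incoming[dst] += part
        holdings = incoming
        # The next level only moves the money on 4-5 days later.
        when += timedelta(days=rng.uniform(4, 5))
    return amount, txs
```

Here pool is a list of candidate intermediate nodes (at least four of them) that should not contain the sender or the recipient. With this scheme the recipient ends up with between 90% and 100% of the original amount.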

Circular

This pattern is generated on the same principle as Flow, except that instead of a separate sender and recipient with several levels between them, the money travels in a circle and returns to the original node. All intermediate nodes charge a fee, as in Flow, and the transactions are also shifted in time.
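A Circular pattern can be sketched as a single chain that ends where it started. The names below are hypothetical, and I am assuming that the 10% fee cap and the 4-5 day delays described for Flow apply here as well, which the text only hints at.

```python
import random
from datetime import timedelta


def circular_pattern(origin, intermediaries, amount, rng: random.Random,
                     start, max_total_fee=0.10):
    """Return transactions (src, dst, amount, time) forming a closed ring."""
    if not intermediaries:
        raise ValueError("need at least one intermediary")
    ring = [origin, *intermediaries, origin]
    # Assumption: the total fee is spread evenly over the intermediaries.
    keep = (1 - rng.uniform(0.0, max_total_fee)) ** (1 / len(intermediaries))
    txs, current, when = [], amount, start
    for hop, (src, dst) in enumerate(zip(ring, ring[1:])):
        if hop > 0:
            current *= keep                              # intermediary takes its fee
            when += timedelta(days=rng.uniform(4, 5))    # delay before passing on
        txs.append((src, dst, current, when))
    return txs
```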

Time

The simplest pattern. A certain amount is sent from the sender to the recipient a random number of times (from 5 to 50, not configurable) with pseudo-random shifts in time.
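A sketch of the Time pattern follows. The 5 to 50 repetitions come from the text; the weekly period, the jitter of up to 12 hours and all the names are placeholder assumptions.

```python
import random
from datetime import timedelta


def time_pattern(sender, recipient, amount, rng: random.Random, start,
                 period=timedelta(days=7), jitter=timedelta(hours=12)):
    """Return periodic transactions (src, dst, amount, time) with small time shifts."""
    n = rng.randint(5, 50)
    txs = []
    for i in range(n):
        # Pseudo-random shift of at most `jitter` around the regular schedule.
        shift = timedelta(seconds=rng.uniform(-1, 1) * jitter.total_seconds())
        txs.append((sender, recipient, amount, start + i * period + shift))
    return txs
```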

All new transactions are written to .csv files in batches, in the same way as before.

Randomization of the graph and collection of all transactions in one file

At this stage, we have several .csv files:

3 files with nodes (customers, companies and ATMs)
4 files with transactions: one with regular transactions and 3 with patterns.

An additional script mixes the pattern transactions in with the regular ones, so that the patterns cannot be spotted simply from the order in which the transactions are written to the file.
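The mixing step can be sketched as a plain merge and shuffle. This version is deliberately naive: it assumes all transaction files share the same header and that everything fits in memory, which will not hold for billions of rows (there one would shuffle in chunks or on disk). The function name is hypothetical.

```python
import csv
import random


def merge_and_shuffle(input_paths, output_path, rng: random.Random):
    """Merge several CSV files with identical headers into one, in random row order."""
    header = None
    rows = []
    for path in input_paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            file_header = next(reader, None)
            if header is None:
                header = file_header      # keep the header of the first file
            rows.extend(reader)
    rng.shuffle(rows)
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```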

And what can you do with all this?

In the end, we have 4 neat files with the nodes of the graph and the transactions between them. You can import them into Neo4j, serve them over REST, or do whatever else your heart desires with them.

As for us, we received very positive feedback from the hackathon participants, along with some very interesting solutions for finding patterns in massive graphs.

Source: https://habr.com/ru/post/447626/
