
CRISP-DM: Proven Methodology for Data Scientists

Mathematically, machine learning tasks are formulated quite simply: any classification, regression, or clustering problem is essentially a constrained optimization problem. Even so, the sheer variety of algorithms and methods for solving them makes data analytics one of the most creative IT professions. To keep a project from turning into an endless hunt for a "golden" solution and to make it a predictable process, you need to follow a fairly clear sequence of steps. Methodologies such as CRISP-DM describe that sequence.

The CRISP-DM data analysis methodology is mentioned in many posts on Habr, but I could not find a detailed Russian-language description of it, so I decided to fill this gap with my article. My material is based on the original description and an adapted description from IBM. An overview lecture on the benefits of using CRISP-DM can be found, for example, here.


* "Crisp" in English means crunchy; also a potato chip.

I have been working as a data scientist at CleverDATA (part of the LANIT group) since 2015. We work on big data and machine learning projects, mainly in data-driven marketing (that is, marketing built on deep analysis of customer data). We are also developing the 1DMP data management platform and the 1DMC data exchange. Our typical machine learning projects involve developing and deploying predictive and prescriptive (recommending the best action) models to optimize key business metrics for our customers. In a number of such projects we used the CRISP-DM methodology.
CRoss Industry Standard Process for Data Mining (CRISP-DM) is a standard that describes general processes and approaches to data analytics used in industrial data-mining projects, regardless of a specific task or industry.

The well-known analytics portal kdnuggets.com periodically publishes a poll (for example, here), according to which CRISP-DM regularly ranks first among data analysis methodologies, followed by SEMMA, with KDD Process used the least.

Source: kdnuggets.com

In general, these three methodologies are very similar to one another (it is hard to come up with something fundamentally new here). However, CRISP-DM has earned its popularity as the most complete and detailed. Compared to it, KDD is more general and theoretical, while SEMMA is essentially just an organization of the functionality of the SAS Enterprise Miner tool and covers only the technical aspects of modeling, with no reference to the business formulation of the problem.

About the methodology


The methodology was developed in 1996 on the initiative of three companies (now DaimlerChrysler, SPSS, and Teradata) and was later refined with input from 200 companies from various industries with experience in data mining projects. All of these companies used different analytical tools, but their processes were very similar.

The methodology is actively promoted by IBM. For example, it is integrated into IBM SPSS Modeler (formerly SPSS Clementine).

An important feature of the methodology is its attention to the company's business goals. This allows management to perceive data analysis projects not as a "sandbox" for experiments, but as a full-fledged element of the company's business processes.

The second feature is fairly detailed documentation of each step. According to the authors, a well-documented process allows management to better understand the essence of the project and gives analysts more influence on decision-making.

According to CRISP-DM, an analytical project consists of six main steps, carried out sequentially:


The methodology is not rigid. It allows variation depending on the specific project: you can go back to previous steps, or skip steps that are not important for the task at hand:

Source: Wikipedia

Each of these stages is in turn divided into tasks, and each task is expected to produce a specific deliverable. The tasks are as follows:


Source: CRISP-DM documentation

In describing the steps, I deliberately do not delve into mathematics and algorithms, since the article focuses on the process. I assume the reader is familiar with the basics of machine learning, but just in case, the next section covers the basic terms.

Also note that the methodology applies equally to internal projects and to projects carried out by consultants.

A few basic concepts of machine learning


As a rule, the main result of an analytical project is a mathematical model. What is a model?

Suppose a business has some quantity of interest, y (for example, the probability of customer churn), and data x (for example, the customer's calls to support) on which y may depend. The business wants to understand exactly how y depends on x so that, by acting on x, it can later influence y. Thus, the goal of the project is to find the function f that best models the dependence y = f(x).

By a model we mean the formula f(x) or the program that implements it. Any model is described, first, by its learning algorithm (regression, decision tree, gradient boosting, etc.) and, second, by a set of parameters (which differ from algorithm to algorithm). Model training is the process of finding the parameters at which the model best approximates the observed data.

A training set is a table containing pairs of x and y. The rows of this table are called cases, and the columns are attributes. Attributes with sufficient predictive power are called predictors. In unsupervised learning (for example, clustering problems), the training set consists of x only. Scoring is applying the found function f(x) to new data for which y is not yet known. For example, in credit scoring, a model of the probability of late payment is trained first and then applied to new applicants to assess their creditworthiness.
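For readers who prefer code to prose, here is a minimal sketch of training and scoring; the file and column names are invented for illustration and are not from the original article:

```python
# A minimal sketch of training and scoring; file and column names are invented.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")                    # historical cases: attributes x and a known target y
X, y = train.drop(columns=["churn"]), train["churn"]

model = LogisticRegression(max_iter=1000)           # the learning algorithm
model.fit(X, y)                                     # training: find the parameters of f(x)

new_clients = pd.read_csv("new_clients.csv")        # new cases: x only, y unknown
scores = model.predict_proba(new_clients)[:, 1]     # scoring: estimated probability of churn
```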

Step-by-step description of the methodology




1. Business Analysis (Business Understanding)


In the first step, we need to determine the objectives and the scope of the project.

1.1 Project objectives (Business objectives)


First of all, we get to know the customer and try to understand what they really want (or tell them). It helps to get answers to the following questions.


1.2 Current situation (Assessing current solution)




Once we and the customer have figured out what we want, we need to assess what we can actually offer, given current realities.

We evaluate whether there are enough resources for the project.


It is necessary to describe the probable risks of the project, as well as to determine a plan of action to reduce them.

Typical risks are as follows.


It is important that the customer and the contractor speak the same language, so before the project starts it is best to compile a glossary and agree on the terminology to be used. For example, if we are building a churn model for a telecom company, we must agree up front on what exactly counts as churn - say, no significant charges on the account for four weeks in a row.
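Such an agreed definition translates almost directly into code. A rough sketch of this churn rule; the table, column names, and threshold here are hypothetical:

```python
# A rough sketch of the agreed churn definition: no significant charges
# on the account for four consecutive weeks. Table, columns and threshold are hypothetical.
import pandas as pd

charges = pd.read_csv("charges.csv", parse_dates=["date"])   # client_id, date, amount
SIGNIFICANT = 10.0                                           # "significant charge" threshold agreed with the customer
WINDOW = pd.Timedelta(weeks=4)

last_significant = (charges[charges["amount"] >= SIGNIFICANT]
                    .groupby("client_id")["date"].max()
                    .reindex(charges["client_id"].unique()))  # clients with no significant charges at all -> NaT
observation_date = charges["date"].max()
churned = last_significant.isna() | (last_significant < observation_date - WINDOW)
```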

Next, the ROI should be estimated, at least roughly. In machine learning projects a sound estimate of payback can often be obtained only after the project (or a pilot model) is completed, but an understanding of the potential benefit is a good motivator for everyone.

1.3 Tasks to be solved from the point of view of analytics (Data Mining goals)


After the task is set in business terms, it is necessary to describe it in technical terms. In particular, we answer the following questions.


1.4 Project Plan


Once the answers to all the basic questions have been received and the purpose of the project is clear, it is time to make a project plan. The plan should contain an assessment of all six phases of implementation.

2. Data Analysis (Data Understanding)


We start the implementation of the project and first look at the data. At this step there is no modeling, only descriptive analytics is used.

The goal of this step is to understand the strengths and weaknesses of the available data, determine whether they are sufficient, propose ideas for how to use them, and better understand the customer's processes. To do this, we build plots, pull samples, and compute statistics.

2.1 Data collection




First you need to understand what data the customer has. Data can be:


All the sources the customer provides access to need to be analyzed. If the customer's own data is not enough, it may be worth buying third-party data or setting up collection of new data.

2.2 Data description


Next, look at the data available to us.


2.3 Data exploration


Using plots and tables, we explore the data in order to form hypotheses about how it can help solve the problem.

In a mini-report we record what is interesting in the data, along with a list of potentially useful attributes.
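In practice this exploration usually starts with a handful of standard pandas calls. A sketch, with illustrative file and column names:

```python
# A first look at the data: basic statistics, class balance and distributions.
# File and column names are illustrative.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("clients.csv")

print(df.describe(include="all"))                         # basic statistics per attribute
print(df["churn"].value_counts(normalize=True))           # balance of the target classes
print(df.corr(numeric_only=True)["churn"].sort_values())  # rough linear relations with the target

df.hist(figsize=(12, 8))                                  # distributions of numeric attributes
plt.show()
```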

2.4 Data quality


It is important to assess data quality before modeling, since any inconsistencies can affect the course of the project. What can be wrong with the data?
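Whatever the specific issues, many of the typical checks can be automated. A minimal sketch of the kind of data quality summary we would look at; the file name is illustrative:

```python
# A minimal data quality check: missing values, duplicates, attributes with no variation.
import pandas as pd

df = pd.read_csv("clients.csv")   # illustrative file name

missing_share = df.isna().mean().sort_values(ascending=False)               # share of missing values per attribute
duplicate_rows = df.duplicated().sum()                                      # fully duplicated cases
constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]  # attributes with a single value

print(missing_share.head(10))
print(f"duplicates: {duplicate_rows}, constant columns: {constant_cols}")
```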


3. Data Preparation




Data preparation is traditionally the most time-consuming stage of a machine learning project (the methodology description puts it at 50-70% of project time; in our experience it can be even more). The goal of the stage is to prepare the training set for modeling.

3.1 Data Selection


First, we select the data that will be used to train the model.

Both attributes and cases are selected.

For example, if we make product recommendations to site visitors, we limit ourselves to analyzing only registered users.
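In code, this selection of cases (rows) and attributes (columns) is usually just a filter. A sketch with made-up table and column names:

```python
# Selecting cases (rows) and attributes (columns) for the training set.
# Table and column names are made up for illustration.
import pandas as pd

visits = pd.read_csv("site_visits.csv")

cases = visits[visits["is_registered"] == 1]          # keep only registered users
attributes = ["user_id", "pages_viewed", "time_on_site", "purchases_last_month"]
training_data = cases[attributes]
```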

When selecting data, the analyst answers the following questions.


3.2 Data Cleaning


Having selected the potentially interesting data, we check its quality.


The output is three lists of attributes: good-quality attributes, corrected attributes, and rejected ones.

3.3 Constructing new data


Often, feature engineering is the most important step in data preparation: a well-composed feature can significantly improve the quality of a model.

Generated data typically include attributes derived from existing fields and aggregates computed over related records.
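A sketch of such constructed features, built from a hypothetical orders table:

```python
# Constructing new attributes from existing ones; table and column names are hypothetical.
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])  # client_id, order_date, amount

features = orders.groupby("client_id").agg(
    orders_count=("amount", "size"),
    total_spent=("amount", "sum"),
    avg_check=("amount", "mean"),
    last_order=("order_date", "max"),
)
snapshot = orders["order_date"].max()
features["days_since_last_order"] = (snapshot - features["last_order"]).dt.days
features["is_big_spender"] = (features["total_spent"] > features["total_spent"].median()).astype(int)
```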


3.4 Integrating data


It is good when the data comes from a corporate data warehouse or a pre-built data mart. Often, however, data has to be pulled from several sources and integrated to prepare the training set. By integration we mean both a "horizontal" join (merge) and a "vertical" union (append), as well as data aggregation. The output is usually a single analytical table suitable for feeding into the analytical software as a training sample.
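In pandas terms these three operations map directly onto concat, groupby, and merge. A sketch with hypothetical tables:

```python
# Integration: "vertical" append, aggregation, and "horizontal" merge.
# Table and column names are hypothetical.
import pandas as pd

clients = pd.read_csv("clients.csv")                 # one row per client
calls_q1 = pd.read_csv("calls_q1.csv")
calls_q2 = pd.read_csv("calls_q2.csv")

calls = pd.concat([calls_q1, calls_q2], ignore_index=True)         # "vertical" append
call_stats = calls.groupby("client_id", as_index=False).agg(
    calls_count=("call_id", "count"),
    total_minutes=("duration_min", "sum"),
)                                                                  # aggregation to client level
analytic_table = clients.merge(call_stats, on="client_id", how="left")  # "horizontal" merge
```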

3.5 Formatting Data


Finally, the data needs to be brought into a format suitable for modeling (this matters for algorithms that expect a specific data format). For example, if we are dealing with time series analysis, say, forecasting the monthly sales of a retail chain, the data may need to be sorted chronologically.
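A tiny sketch of such formatting for a hypothetical monthly sales table:

```python
# Bringing the data into the shape a time-series algorithm expects:
# one row per store per month, in chronological order. Names are illustrative.
import pandas as pd

sales = pd.read_csv("monthly_sales.csv", parse_dates=["month"])  # store_id, month, revenue
sales = sales.sort_values(["store_id", "month"]).reset_index(drop=True)
```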

4. Modeling


At the fourth step the most interesting part finally begins: model training. It is usually done iteratively: we try different models, compare their quality, tune the hyperparameters, and choose the best combination. This is the most enjoyable stage of the project.

4.1 Selecting the modeling technique


We need to decide which models to use (fortunately, there are plenty). The choice of model depends on the problem being solved, the attribute types, and complexity requirements (for example, if the model will later be implemented in Excel, RandomForest and XGBoost are obviously not an option). When choosing, pay attention to the following.


4.2 Generating a test design


Next, we need to decide what to train our model on and what to test it on.

The traditional approach is to split the sample into three parts (training, validation, and test) in roughly a 60/20/20 proportion. The training sample is used to fit the model parameters, while the validation and test samples provide a quality estimate free of overfitting effects. More complex strategies involve various forms of cross-validation.

Here we also decide how we will optimize the models' hyperparameters: how many iterations to allow for each algorithm, and whether to use grid search or random search.
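A minimal sketch of such a test design with scikit-learn; the synthetic data, split proportions, and parameter grid are all illustrative:

```python
# A 60/20/20 split plus a randomized hyperparameter search on the training part.
# The synthetic data, proportions and parameter grid are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)  # stand-in for the prepared training table

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={
        "n_estimators": [100, 300, 500],
        "max_depth": [2, 3, 5],
        "learning_rate": [0.01, 0.05, 0.1],
    },
    n_iter=10,
    cv=5,                    # cross-validation within the training part
    scoring="roc_auc",
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)   # validation/test parts stay untouched for the final estimate
```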

4.3 Building the models


We start the training loop and record the result after each iteration. The output is several trained models.

In addition, for each trained model we record the following.


4.4 Assessing the model




Once the pool of models has been formed, they need to be analyzed again in detail and the winning models selected. The output should ideally be a list of models ranked by objective and/or subjective criteria.

The tasks of this step:


If the success criterion is not achieved, then you can either improve the current model or try a new one.

Before proceeding to the implementation you need to make sure that:


5. Evaluation of the result (Evaluation)




The result of the previous step is a built mathematical model along with the findings. At the fifth step we evaluate the project's results.

5.1 Evaluating the modeling results (Evaluating the results)


If at the previous stage we evaluated the modeling results from a technical point of view, here we evaluate the results from the point of view of achieving business goals.

We address the following questions:


5.2 Review the process


It is worth getting together at a table (perhaps over a beer), reviewing how the project went, and formulating its strengths and weaknesses. To do this, we go through all the steps:


5.3 Determining the next steps


Next, you need to either implement the model if it suits the customer, or, if there is a potential for improvement, try to improve it further.

If at this stage we have several satisfactory models, we select the ones to be deployed.

6. Deployment




Before the project starts, the delivery format of the model is always agreed with the customer. In one case it may be just a scored customer base, in another an SQL formula, and in a third a full analytical solution integrated into the customer's information system.

At this step the model is deployed (if the project includes a deployment phase). Deployment can mean both physically adding new functionality and initiating changes in the company's business processes.

6.1 Planning Deployment


So, we have finally pulled all the results together. Now what?


6.2 Setting up model monitoring (Planning Monitoring)


Model support work is very often included in the project. Here is what gets agreed on.


6.3 Report on the modeling results (Final Report)


At the end of the project a report on the modeling results is usually written, covering the outcomes of each step, from the initial data analysis to model deployment. The report may also include recommendations for further development of the model.

The finished report is presented to the customer and all stakeholders. In the absence of a formal technical specification, this report is the project's main document. It is also important to talk to the people involved in the project (on both the customer's and the contractor's side) and gather their opinions on it.

What about practice?


It is important to understand that the methodology is not a universal recipe. This is simply an attempt to formally describe a sequence of actions that, to one degree or another, is performed by any analyst involved in data analysis.

At CleverDATA, following the methodology on data mining projects is not a strict requirement, but as a rule, when we draw up a project plan, our breakdown fits this sequence of steps quite closely.

The methodology applies to very different tasks. We followed it in a number of marketing projects, including predicting the likelihood that a retail chain's customer would respond to an advertising offer, building a model for assessing a borrower's creditworthiness for a commercial bank, and developing a product recommendation service for an online store.

One hundred and one reports




According to the authors, a report should be written after each step. In practice, however, this is not very realistic. Like everyone else, we have projects where the customer sets very tight deadlines and needs results quickly. Clearly, under such conditions it makes no sense to spend time on detailed documentation of every step. In these cases we jot down any intermediate information that is needed "on a napkin." This lets us move on to implementing the model quickly and meet the deadlines.

In practice, many things are done far less formally than the methodology requires. For example, we usually do not spend time selecting and approving the models to use, but instead immediately try all available algorithms (resources permitting). We do the same with attributes: we prepare several variants of each attribute at once so that as many options as possible can be tested. With this approach, irrelevant attributes are eliminated automatically by feature selection algorithms, which estimate the predictive power of each attribute.
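A sketch of this "try everything first" approach, comparing several off-the-shelf algorithms by cross-validated ROC AUC; the model list and synthetic data are illustrative:

```python
# Comparing several candidate algorithms at once with cross-validation.
# The model list, metric and synthetic data are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, n_features=30, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300),
    "gbm": GradientBoostingClassifier(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: ROC AUC = {scores.mean():.3f} ± {scores.std():.3f}")
```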

I suspect the formality of the methodology is explained by the fact that it was written back in the 1990s, when computing power was scarce and every action had to be carefully planned. Today the availability and low cost of hardware simplifies many things.

About the importance of planning


There is always a temptation to rush through the first two stages and jump straight to implementation. Practice shows that this is not always justified.

At the business understanding stage it is important to discuss the proposed solution with the customer as much as possible and make sure your expectations match. It happens that a business expects to receive some kind of "magic" robot that will instantly solve all its problems and double revenue. So that no one is disappointed with the project's results, it is always worth stating clearly what result the customer will get and what it will give the business.

In addition, the customer may not always be able to judge the model's accuracy correctly. For example, suppose we are analyzing the response to an online advertising campaign. We know that roughly 10% of clients follow the link. The model we develop selects the 1000 clients most likely to respond, and among them every fourth one follows the link, giving a precision of 25%. The model performs well (2.5 times better than a random one), but for the customer 25% is too low (they expect numbers around 80-90%). Conversely, a completely useless model that assigns everyone to a single class will show an accuracy of 90% and formally satisfy the stated success criterion. That is, it is important to agree with the customer on the right quality metric and to interpret it correctly.
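The arithmetic of this example is easy to reproduce in code; a sketch using the same made-up numbers:

```python
# Reproducing the example: a useful model with 25% precision versus a useless
# "everyone is negative" model with 90% accuracy. Numbers are the made-up ones from the text.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

n_clients, base_rate = 10_000, 0.10
y_true = np.zeros(n_clients, dtype=int)
y_true[: int(n_clients * base_rate)] = 1             # 10% of clients actually respond

# Model A: selects 1000 clients, 250 of whom respond -> precision 25%.
selected = np.zeros(n_clients, dtype=int)
selected[:250] = 1                                    # 250 true responders among the selected
selected[1000:1750] = 1                               # 750 non-responders among the selected
print("precision:", precision_score(y_true, selected))        # 0.25

# Model B: predicts "no response" for everyone -> accuracy 90%, but useless.
always_negative = np.zeros(n_clients, dtype=int)
print("accuracy:", accuracy_score(y_true, always_negative))   # 0.9
```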

The data understanding stage is important because it lets both us and the customer understand the customer's data better. There have been cases when, after we presented the results of this step, we agreed on new projects in parallel with the main one, because the customer saw potential in the patterns found at this stage.

As another example, in one of our projects we relied on the dialogue with the customer, limited ourselves to a superficial study of the data, and only at the modeling stage discovered that some of the data was unusable due to numerous missing values. So it always pays to examine the data you will be working with in advance.

Finally, I want to note that, despite its thoroughness, the methodology is still fairly general. It says nothing about the choice of specific algorithms and offers no ready-made solutions. Perhaps that is a good thing: it leaves room for creative exploration, because, I repeat, the data scientist profession remains one of the most creative in IT.

Source: https://habr.com/ru/post/328858/

