This article is based on the master class “DVC for automating ML experiments and data versioning”, which took place on June 18 at an ML REPA (Machine Learning REPA: Reproducibility, Experiments and Pipelines Automation) meetup hosted at our bank.
Here I will talk about the features of the internal work of DVC and how to use it in projects.
The code examples used in the article are available here. The code was tested on macOS and Linux (Ubuntu).
Content
Part 1
Part 2
DVC setup
Data Version Control is a tool designed to manage model and data versions in ML projects. It is useful both at the experimentation stage and for deploying your models to production.

DVC allows you to version models, data, and pipelines in DS projects.
The source is here.
Let's look at how DVC operates using the problem of classifying iris species as an example. For this, I will use the well-known Iris Data Set. Other examples of working with DVC are shown in Jupyter Notebooks.
What should be done: we clone the repository, create a virtual environment, and install the necessary packages. Installation and launch instructions are also in the repository README.
1. Clone this repository
git clone https://gitlab.com/7labs.ru/tutorials-dvc/dvc-1-get-started.git
cd dvc-1-get-started
2. Create and activate virtual environment
pip install virtualenv
virtualenv venv
source venv/bin/activate
3. Install python libraries (including dvc)
pip install -r requirements.txt
To install DVC, use the pip install dvc command. After installation, you need to initialize DVC in the project folder with dvc init, which will generate a set of folders for DVC's further work.
4. Check out a new branch in the demo repository (to avoid wiping the content of the master branch)
git checkout -b dvc-tutorial
5. Initialize DVC
dvc init
git commit -m "Initialize DVC"
DVC runs on top of Git, uses its infrastructure, and has a similar syntax.
As it works, DVC creates metafiles that describe pipelines and versioned files, and these need to be saved in the Git history of your project. Therefore, after executing dvc init you need to run git commit to record all the settings made.
A .dvc folder will appear in your repository, containing cache and config.
The contents of .dvc will look like this:
./
../
.gitignore
cache/
config
config is the DVC configuration file, and cache is the system folder where DVC stores all the data and models you version.
DVC will also create a .gitignore file listing the files and folders that should not be committed to the repository. When you hand a file over to DVC for versioning, its version and metadata are saved in Git, while the file itself is stored in the cache.
Now you need to install all the dependencies, check out the new dvc-tutorial branch in which we will work, and download the Iris dataset.
Get data
wget -P data/ https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
DVC Features
Versioning Models and Data
The source is here.
Let me remind you: once you hand some data over to DVC's control, it begins to track all changes. We can work with this data in much the same way as with Git: save a version, send it to a remote repository, get the right version of the data, change it, and switch between versions. The DVC interface is very simple.
Enter the dvc add command and specify the path to the file we need to version. DVC will create the iris.csv.dvc metafile and move the data itself into the cache folder. Let's commit these changes so that the start of versioning is recorded in the Git history.
dvc add data/iris.csv
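Along with the metafile, dvc add writes iris.csv into data/.gitignore, so only the metafile ends up in Git. A minimal sketch of the commit step (the commit message here is arbitrary):

git add data/.gitignore data/iris.csv.dvc
git commit -m "Start versioning iris.csv with DVC"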
The generated .dvc file stores the hash of the data along with standard parameters.
The outs section holds the path to the file we put under DVC control. The system takes the data, puts it in the cache, and creates a link to the cache in the working directory. The metafile can be added to the Git history and thus versioned; DVC takes over management of the data itself. The first two characters of the hash are used as the name of a folder inside the cache, and the remaining characters as the name of the file created there.
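You can check this layout for yourself. In the sketch below, the hash value is just a placeholder for whatever md5sum actually prints on your machine:

$ md5sum data/iris.csv        # suppose it prints 4ac0c7... (a placeholder)
$ ls .dvc/cache/4a/           # the folder name is the first two characters
c0c7...                       # the file name is the rest of the hash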
Automation of ML Pipelines
In addition to data version control, we can create pipelines: chains of computations with dependencies defined between them. Here is a standard pipeline for training and evaluating a classifier:

At the input we have data that must be preprocessed, split into train and test sets, and turned into features; only then do we train the model and evaluate it. This pipeline can be broken into separate pieces: for example, we can separate the stages of loading and preprocessing the data, splitting the data, evaluation, and so on, and then connect these pieces into a chain.
To do this, DVC has the handy dvc run command, to which we pass certain parameters and the Python module that we need to run.
Now let's take the feature extraction stage as an example. First, let's look at the contents of the featurization.py module:
import pandas as pd

def get_features(dataset):
    features = dataset.copy()
    # additional feature calculations are left commented out here for now
    return features
This code takes the dataset, computes features, and saves them to iris_featurized.csv. The lines that compute additional features are left commented out for a later stage of the experiment.
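For completeness, here is a sketch of how the rest of the module might read its input and save the result. The paths match the dependencies and outputs declared in the dvc run command below, but the __main__ block itself is an assumption about the module's layout:

if __name__ == '__main__':
    # read the raw dataset, compute features, save them for the next stage
    dataset = pd.read_csv('data/iris.csv')
    features = get_features(dataset)
    features.to_csv('data/iris_featurized.csv', index=False)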
To create a pipeline, you need to execute the dvc run command for each stage of computation.

First, in the dvc run command we specify the name of the metafile, stage_feature_extraction.dvc, into which DVC will write the necessary metadata about this computation stage. With the -d argument we specify the dependencies: the featurization.py module and the iris.csv data file. With -o we specify the output, the iris_featurized.csv file in which the features are saved. Finally, we give the launch command itself: python src/featurization.py.
dvc run -f stage_feature_extraction.dvc \
    -d src/featurization.py \
    -d data/iris.csv \
    -o data/iris_featurized.csv \
    python src/featurization.py
DVC will create the metafile and will track changes in the Python module and the iris.csv file.
If changes occur in them, DVC will restart this computation stage of the pipeline.
The resulting stage_feature_extraction.dvc file will contain its hash, the launch command, the dependencies, and the outputs (each with additional parameters that can be found in the metadata).
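The metafile is plain text, so you can simply inspect it. A rough sketch of what it contains (the field layout follows DVC 0.x, and all hashes are placeholders):

$ cat stage_feature_extraction.dvc
cmd: python src/featurization.py
deps:
- md5: <hash of src/featurization.py>
  path: src/featurization.py
- md5: <hash of data/iris.csv>
  path: data/iris.csv
md5: <hash of the stage itself>
outs:
- cache: true
  md5: <hash of data/iris_featurized.csv>
  path: data/iris_featurized.csv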
Now you need to save this file in the Git commit history. We can create a new branch for this and push it to the Git repository. You can commit to the Git history either stage by stage, as each one is created, or all stages at once.
When we build such a chain for our entire experiment, DVC builds a computation graph (a DAG), from which it can restart either the entire pipeline or some part of it. The hashes of the outputs of one stage go into the inputs of another; from them, DVC tracks dependencies and builds the graph of computations. If you change the code somewhere in split_dataset.py, DVC will not re-download the data or recompute the features; it will restart only this stage and the subsequent training and evaluation stages.
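For reference, here is a sketch of how the remaining stages might be declared. The stage names match those in the dvc repro output shown later; however, the module names src/train.py and src/evaluate.py and the intermediate file names (train.csv, test.csv, model.joblib) are assumptions about the repository layout:

dvc run -f stage_split_dataset.dvc \
    -d src/split_dataset.py \
    -d data/iris_featurized.csv \
    -o data/train.csv \
    -o data/test.csv \
    python src/split_dataset.py

dvc run -f stage_train.dvc \
    -d src/train.py \
    -d data/train.csv \
    -o data/model.joblib \
    python src/train.py

dvc run -f stage_evaluate.dvc \
    -d src/evaluate.py \
    -d data/model.joblib \
    -d data/test.csv \
    -m data/eval.txt \
    python src/evaluate.py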
Metrics tracking
Using the dvc metrics show command, you can display the metrics of the current run in the branch we are in. If we pass the -a option, DVC will show all the metrics that are in the Git history. For DVC to start tracking metrics, when creating the evaluate stage we pass data/eval.txt via the -m parameter. The evaluate.py module writes metrics to this file, in this case f1 and the confusion matrix. In the outs section of this stage's .dvc file, cache and metric are set to true. As a result, the dvc metrics show command prints the contents of the eval.txt file to the console. With the arguments of this command, you can also show only f1_score or only confusion_matrix.
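The two basic invocations look like this:

dvc metrics show        # metrics for the current run in the current branch
dvc metrics show -a     # metrics for all branches in the Git history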

In this example, we got a model with f1 = 0.78.
Pipeline reproducibility
Those who have worked with this dataset know that it is very difficult to build a good model on it.
Now we have a pipeline created with DVC. The system tracks the history of the data and the model, can restart the pipeline in whole or in part, and can display metrics. We have completed all the necessary automation.
We had a model with f1 = 0.78. We want to improve it by changing some parameters, and to do that, ideally, we restart the entire pipeline with just one command. In addition, if you work in a team, you may want to hand off the model and code to colleagues so that they can continue working on them.
The dvc repro command allows you to restart pipelines or individual stages (in the latter case, specify the stage to reproduce after the command). If we run dvc repro stage_evaluate.dvc, DVC will try to restart the entire pipeline up to and including that stage. But if we do this in the current state, DVC will not see any changes and will not restart anything. If we change something, it will find the change and restart the pipeline from that point onward.
$ dvc repro stage_evaluate.dvc
Stage 'data/iris.csv.dvc' didn't change.
Stage 'stage_feature_extraction.dvc' didn't change.
Stage 'stage_split_dataset.dvc' didn't change.
Stage 'stage_train.dvc' didn't change.
Stage 'stage_evaluate.dvc' didn't change.
Pipeline is up to date. Nothing to reproduce.
In this case, DVC found no changes in the dependencies of the stage_evaluate stage and refused to restart. If we specify the -f option, it will restart all the preceding stages and show a warning that it is deleting the previous versions of the data it was tracking. Each time DVC restarts a stage, it deletes the previous cache and overwrites it so as not to duplicate data. When a stage's .dvc file is run, its hash is checked; if it has changed, the pipeline restarts and overwrites all of that pipeline's outputs. If you want to avoid this, first push the specific version of the data to a remote storage.
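For example, to force the evaluate stage and everything before it to rerun even though nothing has changed:

dvc repro -f stage_evaluate.dvc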
The ability to restart pipelines and track the dependencies of each stage allows you to experiment with models faster.
For example, you can change the features (uncomment the feature-computation lines in featurization.py). DVC will see these changes and restart the entire pipeline.
Saving data to a remote repository
DVC can work with more than just local version storage. If you execute the dvc push command, DVC will send the current version of the model and data to a pre-configured remote storage. If a colleague then runs git clone on your repository and performs dvc pull, they will receive the version of the data and models intended for that branch. The main thing is that everyone has access to this storage.

In this case, we simulate the “remote” storage in the temp/dvc folder. Remote storage in the cloud is created in much the same way. Commit this change so that it remains in the Git history. Now we can run dvc push to send the data to this storage, and a colleague can simply run dvc pull to receive it.
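A sketch of the whole exchange; the remote name local_storage is arbitrary, and a cloud remote (for example, an S3 bucket) is added the same way with a different URL:

dvc remote add -d local_storage temp/dvc
git add .dvc/config
git commit -m "Configure DVC remote storage"
dvc push                      # send the current versions of data and models

# on a colleague's machine, after git clone:
dvc pull                      # fetch the data and models for this branch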
So, we have examined three situations in which DVC and its basic functionality are useful:
- Versioning data and models. If you do not need pipelines and remote storage, you can version the data for a specific project while working on a local machine. DVC lets you work quickly with data in the tens of gigabytes.
- Exchanging data and models between teams. You can use cloud solutions to store the data. This is a convenient option if you have a distributed team, or if there are restrictions on the size of files sent by mail. This technique also helps in situations when you send each other Notebooks that then will not run.
- Organizing teamwork on a large server. The team can work with a local copy of big data, say several tens or hundreds of gigabytes, without copying it back and forth, using one shared remote storage that sends and saves only the critical versions of the models or data.
Part 2
How to implement DVC in your projects?
To ensure the reproducibility of a project, certain requirements must be met.
Here are the main ones:
- all pipelines are automated;
- control of launch parameters of each stage of calculations;
- version control of code, data and models;
- environment control;
- documentation.
If all this is done, the project is much more likely to be reproducible. DVC allows you to fulfill the first three requirements on this list.
When trying to introduce DVC in your company, you may encounter reluctance: “Why do we need this? We have Jupyter Notebook.” Perhaps some of your colleagues work only in Jupyter Notebook, and writing such pipelines and code in an IDE is much harder for them. In this case, you can take a step-by-step approach.
- The easiest way to start is by versioning the code and models, and then move on to automating the pipelines.
- First automate the steps that are restarted and changed most often, and then the whole pipeline.
If you have a new project and a couple of enthusiasts on the team, then it is better to use DVC right away. That is how it happened in our team, for example: when starting a new project, my colleagues supported me, and we began using DVC on our own. Then we began to share it with other colleagues and teams, and some picked up on our initiative. Today DVC is not yet a universally accepted tool in our bank, but it is used in several projects.