
A Tour of PyTorch

Hi, Habr!

Before the end of May we will publish a translation of François Chollet's book Deep Learning with Python (with examples using the Keras and TensorFlow libraries). Don't miss it!


Meanwhile, we are naturally looking ahead and starting to take a close look at the even more innovative PyTorch library. Today we offer you a translation of an article by Peter Goldsborough, who is ready to take you on a long tour of this library. Below the cut there is a lot of interesting material.

For the last two years I have worked intensively with TensorFlow: I wrote articles about the library, lectured on extending its backend, and used it in my own deep learning research. Through this work I learned well both its strengths and its weaknesses, and I became acquainted with architectural decisions that leave room for competition. With that background, I recently joined the PyTorch team at Facebook AI Research (FAIR), perhaps the strongest competitor to TensorFlow at the moment. PyTorch is quite popular in the research community today; I will explain why in the following paragraphs.

In this article I want to give a quick overview of the PyTorch library, explain why it was created and introduce you to its API.

General picture and philosophy

First, let's consider what PyTorch is at a fundamental level, what programming model you use when working with it, and how it fits into the ecosystem of modern deep learning tools:

In essence, PyTorch is a Python library that provides GPU-accelerated tensor computation, much like NumPy. On top of this, PyTorch offers a rich API for neural network applications.

PyTorch differs from other machine learning frameworks in that it does not use static computation graphs, defined in advance, once and for all, as in TensorFlow, Caffe2 or MXNet. Instead, computation graphs in PyTorch are dynamic and defined on the fly: each invocation of a PyTorch model's layers defines a new computation graph. The graph is created implicitly; that is, the library itself records the flow of data through the program and links the function calls (nodes) together (with edges) into a computation graph.

Comparison of dynamic and static graphs

Let's take a closer look at how static graphs differ from dynamic ones. Generally, in most programming environments, adding two variables x and y that hold numbers produces their sum. For example, in Python:

    In [1]: x = 4
    In [2]: y = 2
    In [3]: x + y
    Out[3]: 6

But not in TensorFlow. In TensorFlow, x and y are not numbers per se but handles to graph nodes that represent those values without explicitly containing them. Moreover (and more importantly), adding x and y does not produce the sum of these numbers, but a handle to a computation graph, which yields the desired value only once it is executed:

    In [1]: import tensorflow as tf
    In [2]: x = tf.constant(4)
    In [3]: y = tf.constant(2)
    In [4]: x + y
    Out[4]: <tf.Tensor 'add:0' shape=() dtype=int32>

In principle, writing TensorFlow code is not really programming but metaprogramming: we write a program (our code) that in turn creates another program (the TensorFlow computation graph). Naturally, the first programming model is much simpler than the second. It is much more convenient to speak and reason in terms of actual things rather than representations of them.

The most important advantage of PyTorch is that its execution model is much closer to the former paradigm than to the latter. At its core, PyTorch is plain Python with support for tensor computation (like NumPy), but with GPU-accelerated tensor operations and, most importantly, built-in automatic differentiation (AD). Since most contemporary machine learning algorithms rely heavily on linear-algebra data types (matrices and vectors) and use gradient information to improve their estimates, these two pillars of PyTorch are sufficient to tackle machine learning workloads of arbitrary scale.

Returning to the simple case above, we can see that programming with PyTorch feels like “natural” Python:

    In [1]: import torch
    In [2]: x = torch.ones(1) * 4
    In [3]: y = torch.ones(1) * 2
    In [4]: x + y
    Out[4]:
     6
    [torch.FloatTensor of size 1]

PyTorch does depart from ordinary Python programming in one particular respect: it records the execution of the running program. That is, PyTorch quietly "spies" on the operations you perform on its data types and, behind the scenes, once again assembles a computation graph. This computation graph is needed for automatic differentiation: to compute derivatives by reverse-mode AD, it must walk backwards along the chain of operations that produced the output value. The serious difference between this computation graph (or rather, the way it is assembled) and the TensorFlow or MXNet version is that the new graph is assembled eagerly, on the fly, as each code fragment is interpreted.

By contrast, in TensorFlow the computation graph is constructed only once, by the metaprogram (your code). Moreover, while PyTorch dynamically walks the graph backwards every time you request the derivative of a value, TensorFlow simply inserts additional nodes into the graph that (implicitly) compute these derivatives and are evaluated like all other nodes. This is where the difference between dynamic and static graphs shows most clearly.

The choice between static and dynamic graphs seriously affects how easy it is to program in each of these environments. Control flow is the aspect affected most by this choice. In a static-graph environment, control flow must be represented at graph level as specialized nodes. For example, to provide branching, TensorFlow has the tf.cond() operation, which takes three subgraphs as input: a condition subgraph and two subgraphs for the if and else branches of the condition. Similarly, loops in TensorFlow graphs must be represented as tf.while_loop() operations, taking a condition subgraph and a body subgraph as input. With a dynamic graph, all of this is simplified. Since the graph is traced from the Python code as written on each run, control flow can be expressed natively in the language, with ordinary if clauses and while loops, just as in any other program. Thus the clumsy and confusing TensorFlow code:

    import tensorflow as tf

    x = tf.constant(2, shape=[2, 2])
    w = tf.while_loop(
        lambda x: tf.reduce_sum(x) < 100,
        lambda x: tf.nn.relu(tf.square(x)),
        [x])

turns into natural, understandable PyTorch code:

    import torch.nn
    from torch.autograd import Variable

    x = Variable(torch.ones([2, 2]) * 2)
    while x.sum() < 100:
        x = torch.nn.ReLU()(x**2)

Naturally, the advantages of dynamic graphs for ease of programming do not end there. Simply being able to inspect intermediate values with print statements (rather than tf.Print() nodes) or in a debugger is already a big plus. Of course, as much as dynamism helps programmability, it can hurt performance: such graphs are harder to optimize. The differences and trade-offs between PyTorch and TensorFlow are therefore much like those between a dynamic interpreted language, such as Python, and a static compiled language, such as C or C++. The former is easier and faster to work with, while the latter can be turned into well-optimized artifacts. It is a trade-off between flexibility and performance.
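
To make this point concrete, here is a small sketch (reusing the while-loop example above) of inspecting intermediate values with an ordinary Python print call, with no special graph nodes required:

    import torch
    import torch.nn
    from torch.autograd import Variable

    x = Variable(torch.ones([2, 2]) * 2)
    while x.sum() < 100:
        x = torch.nn.ReLU()(x**2)
        print(x.sum())  # plain Python print, not a tf.Print() node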

A note on the PyTorch API

I want to make a general comment about the PyTorch API, particularly its API for neural network computation as compared with other libraries such as TensorFlow or MXNet: it comes with "batteries included". As one of my colleagues remarked, the TensorFlow API never really rose above "assembly" level, in the sense that it provides only the most basic instructions needed to build computation graphs (addition, multiplication, pointwise functions and so on). It lacks a "standard library" for the most common program fragments, which programmers otherwise have to reproduce thousands of times. To build higher-level APIs on top of TensorFlow, one therefore has to rely on the community.
Indeed, the community has created such high-level APIs. Unfortunately, not one but a dozen, all competing with one another. Thus, on a bad day you can read five papers in your field and find five different frontends for TensorFlow. These APIs usually have very little in common, so in effect you would have to learn five different frameworks, not just TensorFlow. Some of the most popular of these APIs include:

Keras
TFLearn
Sonnet
TF-Slim

PyTorch, in turn, already ships with the most common building blocks needed for day-to-day deep learning research. It essentially has a "native" Keras-like API in its torch.nn package, which allows chaining high-level neural network modules.

PyTorch's place in the broader ecosystem

Having explained how PyTorch differs from static graph frameworks such as MXNet, TensorFlow or Theano, I should say that PyTorch is not actually unique in its approach to neural network computation. Before PyTorch there were already libraries, such as Chainer and DyNet, that provided a similar dynamic API. Today, however, PyTorch is more popular than these alternatives.

In addition, PyTorch is not the only framework used at Facebook. The bulk of our production workloads currently run on Caffe2, a static graph framework born out of Caffe. To marry the flexibility that PyTorch gives researchers with the benefits static graphs bring to optimized production deployments, Facebook is also developing ONNX, a kind of interchange format between PyTorch, Caffe2 and other libraries, such as MXNet or CNTK.

Finally, a small historical digression: before PyTorch there was Torch, a rather old scientific computing library (dating back to the early 2000s) programmed in the Lua language. Torch wraps a C codebase, which makes it fast and efficient. PyTorch essentially wraps that same C codebase (albeit with an additional layer of abstraction in between) while exposing a Python API to the user. Let's talk about that Python API next.

Working with PyTorch

Next, we will discuss the basic concepts and key components of the PyTorch library: its fundamental data types, its automatic differentiation machinery, its neural-network-specific functionality, and its utilities for loading and processing data.

Tensors

The most fundamental data type in PyTorch is the tensor. The tensor data type is very similar in meaning and function to NumPy's ndarray. Moreover, since PyTorch aims for sensible interoperability with NumPy, the tensor API also resembles (but is not identical to) the ndarray API. PyTorch tensors can be created with the torch.Tensor constructor, which takes the tensor's dimensions as input and returns a tensor occupying an uninitialized region of memory:

    import torch

    x = torch.Tensor(4, 4)

In practice, one most often uses one of the PyTorch functions that return tensors initialized in a particular way, for example:

torch.rand : values initialized from a random uniform distribution,
torch.randn : values initialized from a random normal distribution,
torch.eye(n) : an n × n identity matrix,
torch.from_numpy(ndarray) : a PyTorch tensor built from a NumPy ndarray,
torch.linspace(start, end, steps) : a 1-D tensor with steps values spaced linearly between start and end,
torch.ones : a tensor of all ones,
torch.zeros_like(other) : a tensor of all zeros with the same shape as other,
torch.arange(start, end, step) : a 1-D tensor with values from the range [start, end) taken in increments of step.

Similar to NumPy's ndarray, PyTorch tensors provide a very rich API for combining them with other tensors as well as for in-place modification. As in NumPy, unary and binary operations can usually be performed either via functions in the torch module, such as torch.add(x, y), or directly via methods on the tensor objects, such as x.add(y). The most common operations also have operator overloads, e.g. x + y. Moreover, many functions have in-place alternatives that do not create a new tensor but modify the receiving instance instead. They are named the same as the standard variants but carry a trailing underscore, e.g. x.add_(y). A short sketch follows the list below.

Selected operations:

torch.add(x, y) : elementwise addition
torch.mm(x, y) : matrix multiplication (not matmul or dot ),
torch.mul(x, y) : elementwise multiplication
torch.exp(x) : elementwise exponent
torch.pow(x, power) : elementwise exponentiation
torch.sqrt(x) : elementwise square root
torch.sqrt_(x) : in-place elementwise square root
torch.sigmoid(x) : elementwise sigmoid
torch.cumprod(x, dim) : cumulative product of values along a dimension
torch.sum(x) : the sum of all values
torch.std(x) : standard deviation of all values
torch.mean(x) : average of all values
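
As a quick illustration, here is a minimal sketch of a few of these operations, including an in-place variant (the values printed will vary, since torch.rand draws random numbers):

    import torch

    x = torch.rand(2, 2)  # uniform random values in [0, 1)
    y = torch.ones(2, 2)

    z = torch.add(x, y)   # same as x + y; allocates a new tensor
    z = x.mm(y)           # matrix multiplication
    x.add_(y)             # in-place: modifies x itself
    print(torch.mean(x))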

Tensors largely support the semantics familiar from NumPy's ndarray, such as broadcasting, fancy indexing (x[x > 5]) and elementwise relational operators (x > y). PyTorch tensors can also be converted directly to NumPy ndarrays with the torch.Tensor.numpy() function. Finally, since the main advantage of PyTorch tensors over ndarrays is GPU acceleration, there is also the torch.Tensor.cuda() method, which copies the tensor's memory to a CUDA-capable GPU device, if one is available.
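
Here is a brief sketch of these interoperability features; the .cuda() call is guarded with torch.cuda.is_available(), since it assumes a CUDA-capable device:

    import torch

    x = torch.rand(4, 4)
    big = x[x > 0.5]      # fancy indexing yields a 1-D tensor of matching values
    array = x.numpy()     # convert to a NumPy ndarray
    if torch.cuda.is_available():
        x = x.cuda()      # copy tensor memory to the GPU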

Autograd

At the heart of most modern machine learning techniques is the calculation of gradients. This is especially true for neural networks, where the backpropagation algorithm is used to update the weights. That is why PyTorch has strong native support for computing gradients of functions and variables defined within the framework. The technique by which gradients are computed automatically for arbitrary computations is called automatic (sometimes algorithmic) differentiation.

In frameworks that use static computation graphs, automatic differentiation is implemented by analyzing the graph and adding extra computation nodes to it, which compute the gradient of one value with respect to another step by step; the chain rule is assembled piece by piece by connecting these additional gradient nodes with edges.
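
As a rough TensorFlow-side illustration (not from the original article), tf.gradients() is the call that extends a static graph with such derivative nodes:

    import tensorflow as tf

    x = tf.constant(3.0)
    y = tf.square(x)             # y = x^2
    grad = tf.gradients(y, [x])  # inserts graph nodes computing dy/dx = 2x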

However, PyTorch has no static computation graphs, so it cannot afford the luxury of adding gradient nodes after the rest of the computation has been defined. Instead, PyTorch must record, or trace, the flow of values through the program as they occur, that is, dynamically build the computation graph. Once such a graph is recorded, PyTorch has the information needed to walk this computation flow backwards and calculate gradients of outputs with respect to inputs.

A PyTorch Tensor by itself does not yet have the machinery to participate in automatic differentiation. For a tensor to be recordable, it must be wrapped in a torch.autograd.Variable. The Variable class provides almost the same API as Tensor, but augments it with the ability to interact with torch.autograd.Function, which is precisely what automatic differentiation needs. More precisely, a Variable records the history of operations on a Tensor.

Using torch.autograd.Variable is very simple. You just pass it a Tensor and tell torch whether this variable requires gradients to be recorded:

 x = torch.autograd.Variable(torch.ones(4, 4), requires_grad=True) 

The requires_grad argument may need to be set to False, for example for input data or labels, since such values are usually not differentiated. However, they still need to be Variables to be usable in automatic differentiation. Note that requires_grad defaults to False, so it must be set to True for any trainable parameters.

To calculate gradients and perform automatic differentiation, the backward() function is called on a Variable. This computes the gradient of that tensor with respect to the leaves of the computation graph (all input values that influenced it). These gradients are then accumulated in the grad member of the Variable:

    In [1]: import torch
    In [2]: from torch.autograd import Variable
    In [3]: x = Variable(torch.ones(1, 5))
    In [4]: w = Variable(torch.randn(5, 1), requires_grad=True)
    In [5]: b = Variable(torch.randn(1), requires_grad=True)
    In [6]: y = x.mm(w) + b # mm = matrix multiply
    In [7]: y.backward() # perform automatic differentiation
    In [8]: w.grad
    Out[8]: Variable containing:
     1
     1
     1
     1
     1
    [torch.FloatTensor of size (5,1)]
    In [9]: b.grad
    Out[9]: Variable containing:
     1
    [torch.FloatTensor of size (1,)]
    In [10]: x.grad
    None

Since every Variable except input values is the result of an operation, each such Variable is associated with a grad_fn, which is the torch.autograd.Function used to compute the backward step. For input values it is None:

    In [11]: y.grad_fn
    Out[11]: <AddBackward1 at 0x1077cef60>
    In [12]: x.grad_fn
    None

torch.nn

The torch.nn module provides users with functionality specific to neural networks. One of its most important members is torch.nn.Module, which represents a reusable block of operations with associated (trainable) parameters, most commonly used for the layers of a neural network. Modules may contain other modules, and they implicitly get a backward() function for backpropagation. An example of a module is torch.nn.Linear(), which represents a linear (dense/fully connected) layer, i.e. the affine transformation Wx + b:

    In [1]: import torch
    In [2]: from torch import nn
    In [3]: from torch.autograd import Variable
    In [4]: x = Variable(torch.ones(5, 5))
    In [5]: x
    Out[5]: Variable containing:
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
    [torch.FloatTensor of size (5,5)]
    In [6]: linear = nn.Linear(5, 1)
    In [7]: linear(x)
    Out[7]: Variable containing:
     0.3324
     0.3324
     0.3324
     0.3324
     0.3324
    [torch.FloatTensor of size (5,1)]

During training, you usually call backward() on a module to compute gradients for its variables. Since gradients accumulate in the grad member of its Variables with each call to backward(), there is also an nn.Module.zero_grad() method that resets the grad member of all Variables to zero. Your training loop typically calls zero_grad() at the very beginning, or just before calling backward(), to reset the gradients for the next optimization step.
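
To make the order of these calls concrete, here is a minimal sketch of a typical training loop, assuming that model, loss_function, optimizer and data_loader have been constructed as described in the surrounding sections:

    from torch.autograd import Variable

    num_epochs = 10  # an arbitrary choice for this sketch

    for epoch in range(num_epochs):
        for examples, labels in data_loader:
            examples, labels = Variable(examples), Variable(labels)
            model.zero_grad()  # reset gradients accumulated by earlier backward() calls
            loss = loss_function(model(examples), labels)
            loss.backward()    # compute fresh gradients
            optimizer.step()   # let the optimizer update the parameters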

When writing your own neural network models, you often have to write module subclasses to encapsulate common functionality that you want to integrate with PyTorch. This is very simple: subclass torch.nn.Module and give the subclass a forward method. For example, here is a module that I wrote for one of my models, which adds Gaussian noise to its input:

    class AddNoise(torch.nn.Module):
        def __init__(self, mean=0.0, stddev=0.1):
            super(AddNoise, self).__init__()
            self.mean = mean
            self.stddev = stddev

        def forward(self, input):
            noise = input.clone().normal_(self.mean, self.stddev)
            return input + noise

To chain or hook modules together into full-featured models, you can use the torch.nn.Sequential() container, which is passed a sequence of modules and itself acts as an independent module, invoking the modules passed to it in order on each call. For example:

    In [1]: import torch
    In [2]: from torch import nn
    In [3]: from torch.autograd import Variable
    In [4]: model = nn.Sequential(
       ...:     nn.Conv2d(1, 20, 5),
       ...:     nn.ReLU(),
       ...:     nn.Conv2d(20, 64, 5),
       ...:     nn.ReLU())
    In [5]: image = Variable(torch.rand(1, 1, 32, 32))
    In [6]: model(image)
    Out[6]: Variable containing:
    (0 ,0 ,.,.) =
      0.0026  0.0685  0.0000  ...   0.0000  0.1864  0.0413
      0.0000  0.0979  0.0119  ...   0.1637  0.0618  0.0000
      0.0000  0.0000  0.0000  ...   0.1289  0.1293  0.0000
               ...             ⋱             ...
      0.1006  0.1270  0.0723  ...   0.0000  0.1026  0.0000
      0.0000  0.0000  0.0574  ...   0.1491  0.0000  0.0191
      0.0150  0.0321  0.0000  ...   0.0204  0.0146  0.1724
    ...

Losses

torch.nn also provides a number of loss functions, which are naturally important for machine learning applications. Examples of such functions:

torch.nn.MSELoss : a mean squared error loss,
torch.nn.BCELoss : a binary cross-entropy loss,
torch.nn.KLDivLoss : a Kullback-Leibler divergence loss,
torch.nn.CrossEntropyLoss : a cross-entropy loss.

In PyTorch parlance, loss functions are often referred to as criteria. In essence, criteria are very simple modules that are parameterized upon creation and used as ordinary functions from then on:

    In [1]: import torch
    In [2]: import torch.nn
    In [3]: from torch.autograd import Variable
    In [4]: x = Variable(torch.randn(10, 3))
    In [5]: y = Variable(torch.ones(10).type(torch.LongTensor))
    In [6]: weights = Variable(torch.Tensor([0.2, 0.2, 0.6]))
    In [7]: loss_function = torch.nn.CrossEntropyLoss(weight=weights)
    In [8]: loss_function(x, y)
    Out[8]: Variable containing:
     1.2380
    [torch.FloatTensor of size (1,)]

Optimizers

After the "primary elements" of neural networks ( nn.Module ) and loss functions, it remains to consider only the optimizer, which triggers a stochastic gradient descent (variant). PyTorch torch.optim , , :


Each of these optimizers is constructed with a list of parameter objects, usually retrieved via the parameters() method of an nn.Module subclass, which determines the values the optimizer will update. Besides this parameter list, each optimizer takes a number of additional arguments that configure its optimization strategy, such as the learning rate. For example:

    In [1]: import torch
    In [2]: import torch.optim
    In [3]: from torch.autograd import Variable
    In [4]: x = Variable(torch.randn(5, 5))
    In [5]: y = Variable(torch.randn(5, 5), requires_grad=True)
    In [6]: z = x.mm(y).mean() # Perform an operation
    In [7]: opt = torch.optim.Adam([y], lr=2e-4, betas=(0.5, 0.999))
    In [8]: z.backward() # Calculate gradients
    In [9]: y.data
    Out[9]:
    -0.4109 -0.0521  0.1481  1.9327  1.5276
    -1.2396  0.0819 -1.3986 -0.0576  1.9694
     0.6252  0.7571 -2.2882 -0.1773  1.4825
     0.2634 -2.1945 -2.0998  0.7056  1.6744
     1.5266  1.7088  0.7706 -0.7874 -0.0161
    [torch.FloatTensor of size 5x5]
    In [10]: opt.step() # Update y according to the Adam update rule
    In [11]: y.data
    Out[11]:
    -0.4107 -0.0519  0.1483  1.9329  1.5278
    -1.2398  0.0817 -1.3988 -0.0578  1.9692
     0.6250  0.7569 -2.2884 -0.1775  1.4823
     0.2636 -2.1943 -2.0996  0.7058  1.6746
     1.5264  1.7086  0.7704 -0.7876 -0.0163
    [torch.FloatTensor of size 5x5]



Data loading

For convenience, PyTorch provides a number of utilities for loading, preprocessing and feeding data to your models. They live in the torch.utils.data module. Its two key concepts are:

  1. Dataset , which encapsulates a source of data,
  2. DataLoader , which is responsible for loading a dataset, possibly in parallel.

Custom datasets are created by subclassing torch.utils.data.Dataset and overriding two methods: __len__, which reports the total number of examples in the dataset, and __getitem__, which returns the example at a given index. For example, here is a minimal dataset representing a range of integers:

    import math
    import torch.utils.data

    class RangeDataset(torch.utils.data.Dataset):
        def __init__(self, start, end, step=1):
            self.start = start
            self.end = end
            self.step = step

        def __len__(self):
            return math.ceil((self.end - self.start) / self.step)

        def __getitem__(self, index):
            value = self.start + index * self.step
            assert value < self.end
            return value

Here, __init__ simply stores the bounds of the range; in a real dataset this is where you would configure file paths or select which examples end up in the dataset. __len__ returns the total number of examples, and __getitem__ returns the example at the given index. Note that __getitem__ is expected to validate the index itself, which is what the assert above is for.

We could iterate over such a dataset by hand, as in for i in range(len(dataset)), calling __getitem__ for each index; since __len__ and __getitem__ also make the dataset behave like a sequence, even for sample in dataset works. In practice, however, it is far more convenient to use a DataLoader. A DataLoader pulls samples out of a dataset and can do so in parallel, spawning the number of worker processes given by its num_workers argument. It can also shuffle the data and collect individual samples into batches of size batch_size. A simple example:

    dataset = RangeDataset(0, 10)
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=4, shuffle=True, num_workers=2, drop_last=True)

    for i, batch in enumerate(data_loader):
        print(i, batch)

Here batch_size is set to 4, so the returned tensors contain four values each. shuffle=True makes the loader traverse the indices in random order, so the samples arrive in random batches. drop_last=True means that if the total number of examples is not divisible by batch_size, the last, incomplete batch is dropped. Finally, num_workers specifies how many worker processes should fetch data in parallel. Once the DataLoader has been created, iterating over it yields batches one by one, and it can be iterated over anew for each epoch.

One detail worth noting is how exactly the DataLoader turns the individual samples returned by __getitem__ into batches. If __getitem__ returns a number or a tensor, the DataLoader stacks them into a single tensor per batch. If __getitem__ returns something more structured, for example dict(example=example, label=label), the DataLoader helpfully collates the values per key, producing one batch dictionary of the form dict(example=[example1, example2, ...], label=[label1, label2, ...]). To override this behaviour, you can pass your own collate_fn to the DataLoader, as sketched below.
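
As a sketch of such an override (the helper name collate_to_tensor is made up for illustration), reusing the RangeDataset from above:

    import torch
    import torch.utils.data

    def collate_to_tensor(samples):
        # samples is a list of whatever __getitem__ returns; here, plain numbers
        return torch.Tensor(samples)

    data_loader = torch.utils.data.DataLoader(
        RangeDataset(0, 10), batch_size=4, collate_fn=collate_to_tensor)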

A final note: the torchvision package provides a number of ready-made datasets, such as torchvision.datasets.CIFAR10. The torchaudio and torchtext packages provide the same for audio and text data.
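
For instance, here is a minimal sketch of loading CIFAR10 through torchvision (download=True fetches the data on first use; transforms.ToTensor() converts the PIL images to tensors):

    import torch.utils.data
    import torchvision
    import torchvision.transforms as transforms

    dataset = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True,
        transform=transforms.ToTensor())
    loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)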

Conclusion

In this article we got acquainted with PyTorch: its philosophy, its API, and what distinguishes it from other libraries. By now you should know enough about PyTorch to start exploring it in your own work. A good next step is to pick a model you know well, as I did with an LSGAN I had previously implemented in TensorFlow, and reimplement it in PyTorch to compare the experience for yourself. Good luck!

Source: https://habr.com/ru/post/354912/

