
Deep learning. Federated learning

Hi, Habr readers! We have recently sent to print a book by Andrew W. Trask that lays the foundation for mastering deep learning technology. It begins with the basics of neural networks and then examines additional layers and architectures in detail.

We invite you to read the excerpt "Federated Learning".

The idea of federated learning arose from the fact that much of the data containing information useful for solving problems (for example, diagnosing cancer from MRI scans) is difficult to obtain in quantities sufficient to train a powerful deep learning model. Besides the useful information needed to train the model, such datasets also contain other information that is irrelevant to the problem at hand but whose disclosure could potentially cause harm.
Federated learning is a technique for uploading a model into a secure environment and training it there, without the data ever moving anywhere. Consider an example.
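Before diving into the book's code, here is a minimal sketch of the idea itself (the helper names and the toy "training" function are my own illustration, not the book's framework): copies of the model travel to the data owners, train locally, and only the updated weights come back to be averaged.

import numpy as np

# Toy illustration of one federated round (hypothetical names, not the book's API):
# each participant trains a copy of the weights on data that never leaves their
# machine; only the resulting weights are sent back and averaged.
def federated_round(weights, local_datasets, local_train):
    local_weights = [local_train(weights.copy(), data) for data in local_datasets]
    return np.mean(local_weights, axis=0)   # only weights travel, never the data

# Toy usage: "training" here is just a dummy update for demonstration.
datasets = [np.ones(3), 2 * np.ones(3), 3 * np.ones(3)]
dummy_train = lambda w, d: w + d
print(federated_round(np.zeros(3), datasets, dummy_train))   # -> [2. 2. 2.]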

import numpy as np
from collections import Counter
import random
import sys
import codecs

np.random.seed(12345)

# dataset: http://www2.aueb.gr/users/ion/data/enron-spam/
with codecs.open('spam.txt', "r", encoding='utf-8', errors='ignore') as f:
    raw = f.readlines()

vocab, spam, ham = (set(["<unk>"]), list(), list())

for row in raw:
    spam.append(set(row[:-2].split(" ")))
    for word in spam[-1]:
        vocab.add(word)

with codecs.open('ham.txt', "r", encoding='utf-8', errors='ignore') as f:
    raw = f.readlines()

for row in raw:
    ham.append(set(row[:-2].split(" ")))
    for word in ham[-1]:
        vocab.add(word)

vocab, w2i = (list(vocab), {})
for i, w in enumerate(vocab):
    w2i[w] = i

def to_indices(input, l=500):
    indices = list()
    for line in input:
        if(len(line) < l):
            line = list(line) + ["<unk>"] * (l - len(line))
            idxs = list()
            for word in line:
                idxs.append(w2i[word])
            indices.append(idxs)
    return indices

Learning to detect spam


Suppose we need to train a model to identify spam from people's emails.

In this case we are talking about email classification. We will train our first model on a publicly available dataset called the Enron corpus. This is a huge body of emails released during the Enron hearings (and it has since become a standard corpus for email analytics). A fun fact: I used to know people who, as part of their jobs, had to read and annotate this dataset, and they note that people sent each other all kinds of information in these emails (often very personal). But because the corpus was made public during the court proceedings, it can now be used without restriction.

The code in the previous and in this section implements only the preparatory operations. The input files (ham.txt and spam.txt) are available on the book's webpage, www.manning.com/books/grokking-deep-learning , and in the GitHub repository: github.com/iamtrask/Grokking-Deep-Learning . We need to preprocess the data to prepare it for the Embedding class from chapter 13, where we built our own deep learning framework. As before, all words in the corpus are converted into lists of indices. We also bring all emails to the same length of 500 words, either truncating them or padding them with <unk> tokens. This gives us a rectangular dataset.
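As a side note, the to_indices() helper above only pads emails shorter than 500 words; a variant that also truncates longer ones (my own sketch, not the book's code) could look like this:

def to_indices_fixed_len(input, l=500):
    # pad short emails with "<unk>" and cut long ones, so every row has length l
    indices = list()
    for line in input:
        line = list(line)[:l] + ["<unk>"] * max(0, l - len(line))
        indices.append([w2i[word] for word in line])
    return indices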

spam_idx = to_indices(spam)
ham_idx = to_indices(ham)

train_spam_idx = spam_idx[0:-1000]
train_ham_idx = ham_idx[0:-1000]
test_spam_idx = spam_idx[-1000:]
test_ham_idx = ham_idx[-1000:]

train_data = list()
train_target = list()
test_data = list()
test_target = list()

for i in range(max(len(train_spam_idx), len(train_ham_idx))):
    train_data.append(train_spam_idx[i%len(train_spam_idx)])
    train_target.append([1])
    train_data.append(train_ham_idx[i%len(train_ham_idx)])
    train_target.append([0])

for i in range(max(len(test_spam_idx), len(test_ham_idx))):
    test_data.append(test_spam_idx[i%len(test_spam_idx)])
    test_target.append([1])
    test_data.append(test_ham_idx[i%len(test_ham_idx)])
    test_target.append([0])

def train(model, input_data, target_data, batch_size=500, iterations=5):
    n_batches = int(len(input_data) / batch_size)
    for iter in range(iterations):
        iter_loss = 0
        for b_i in range(n_batches):
            # keep the padding token's weight at zero so it never affects predictions
            model.weight.data[w2i['<unk>']] *= 0
            input = Tensor(input_data[b_i*batch_size:(b_i+1)*batch_size], autograd=True)
            target = Tensor(target_data[b_i*batch_size:(b_i+1)*batch_size], autograd=True)
            pred = model.forward(input).sum(1).sigmoid()
            loss = criterion.forward(pred, target)
            loss.backward()
            optim.step()
            iter_loss += loss.data[0] / batch_size
            sys.stdout.write("\r\tLoss:" + str(iter_loss / (b_i+1)))
        print()
    return model

def test(model, test_input, test_output):
    model.weight.data[w2i['<unk>']] *= 0
    input = Tensor(test_input, autograd=True)
    target = Tensor(test_output, autograd=True)
    pred = model.forward(input).sum(1).sigmoid()
    return ((pred.data > 0.5) == target.data).mean()

Having defined the helper functions train() and test(), we can initialize the neural network and train it with just a few lines of code. After three iterations, the network classifies the test set with an accuracy of 99.45% (the test set is well balanced, so this result can be considered excellent):

model = Embedding(vocab_size=len(vocab), dim=1)
model.weight.data *= 0
criterion = MSELoss()
optim = SGD(parameters=model.get_parameters(), alpha=0.01)

for i in range(3):
    model = train(model, train_data, train_target, iterations=1)
    print("% Correct on Test Set: " + \
          str(test(model, test_data, test_target)*100))
______________________________________________________________________________
Loss:0.037140416860871446
% Correct on Test Set: 98.65
Loss:0.011258669226059114
% Correct on Test Set: 99.15
Loss:0.008068268387986223
% Correct on Test Set: 99.45

Make the model federated


Up to this point it was perfectly ordinary deep learning. Now let's add privacy.

In the previous section we implemented an example of email classification with all the emails gathered in one place. This is the good old way of doing things (and it is still widely used all over the world). To begin, let's simulate a federated learning environment with several different collections of letters:

bob = (train_data[0:1000], train_target[0:1000])
alice = (train_data[1000:2000], train_target[1000:2000])
sue = (train_data[2000:], train_target[2000:])

So far, nothing complicated. Now we can perform the same training procedure as before, but on three separate datasets. After each iteration we will average the weights of Bob's, Alice's, and Sue's models and evaluate the result. Note that some federated learning methods average after each batch (or group of batches); I decided to keep the code as simple as possible:

import copy

for i in range(3):
    print("Starting Training Round...")
    print("\tStep 1: send the model to Bob")
    bob_model = train(copy.deepcopy(model), bob[0], bob[1], iterations=1)

    print("\n\tStep 2: send the model to Alice")
    alice_model = train(copy.deepcopy(model), alice[0], alice[1], iterations=1)

    print("\n\tStep 3: Send the model to Sue")
    sue_model = train(copy.deepcopy(model), sue[0], sue[1], iterations=1)

    print("\n\tAverage Everyone's New Models")
    model.weight.data = (bob_model.weight.data + \
                         alice_model.weight.data + \
                         sue_model.weight.data)/3

    print("\t% Correct on Test Set: " + \
          str(test(model, test_data, test_target)*100))
    print("\nRepeat!!\n")


Below is a fragment of the results. This model reaches almost the same accuracy as the previous one, and in theory we had no access to the training data - or did we? After all, each person changes the model in the course of training, don't they? Are we really unable to extract anything from their datasets?

Starting Training Round...
    Step 1: send the model to Bob
    Loss:0.21908166249699718
......
    Step 3: Send the model to Sue
    Loss:0.015368461608470256
    Average Everyone's New Models
    % Correct on Test Set: 98.8

Hacking into the federated model


Consider a simple example of how to extract information from a training dataset.

Federated learning suffers from two big problems, both of which are especially hard to solve when each person has only a small handful of training examples: speed and privacy. As it turns out, if someone has only a few training examples (or the model update sent to you was trained on only a few examples: a single batch), you can still learn quite a lot about their data. And if you imagine 10,000 people (each with a very small amount of data), most of the time is spent shuttling the model back and forth rather than training (especially if the model is very large).
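To put the speed problem in perspective, here is a rough back-of-the-envelope sketch; all the numbers in it are made-up assumptions for illustration, not measurements from the book:

# Back-of-the-envelope estimate of communication cost per federated round.
# All numbers below are hypothetical assumptions for illustration only.
n_users = 10000            # participants, each with only a handful of examples
examples_per_user = 5
model_params = 50000       # roughly the size of a vocabulary-sized embedding
bytes_per_param = 8        # float64 weights

bytes_per_round = n_users * 2 * model_params * bytes_per_param  # send + receive
print("GB transferred per round: ", bytes_per_round / 1e9)       # -> 8.0 GB
print("training examples touched:", n_users * examples_per_user) # -> 50000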

But let's not get ahead of ourselves. Let's see what we can learn after a user performs a weight update on a single batch:

import copy

bobs_email = ["my", "computer", "password", "is", "pizza"]

bob_input = np.array([[w2i[x] for x in bobs_email]])
bob_target = np.array([[0]])

model = Embedding(vocab_size=len(vocab), dim=1)
model.weight.data *= 0

bobs_model = train(copy.deepcopy(model),
                   bob_input, bob_target,
                   iterations=1, batch_size=1)

Bob creates and trains the model on the emails in his inbox. But it so happens that he saved his password by sending himself an email with the text: "My computer password is pizza". Naive Bob! By looking at which weights have changed, we can recover the vocabulary (and infer the meaning) of Bob's email:

for i, v in enumerate(bobs_model.weight.data - model.weight.data):
    if(v != 0):
        print(vocab[i])

In this simple way we learned Bob's super-secret password (and, perhaps, his culinary preferences). What can be done? How can we trust federated learning if it is so easy to tell which training data caused the weights to change?

is
pizza
computer
password
my
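Why does this attack work? With an Embedding of dim=1, the prediction is just the sigmoid of the sum of one weight per word in the email, so a single SGD step only touches the rows for words that actually appeared. Here is a tiny numpy illustration of that gradient pattern (my own sketch, not the book's framework):

import numpy as np

# pred = sigmoid(sum of the selected weight rows); the gradient of the MSE loss
# is therefore zero for every vocabulary row whose word is absent from the email.
weights = np.zeros(10)               # toy "vocabulary" of 10 words
email = np.array([2, 5, 7])          # indices of the words Bob used
target = 0.0

pred = 1 / (1 + np.exp(-weights[email].sum()))
delta = 2 * (pred - target) * pred * (1 - pred)   # MSE + sigmoid chain rule
grad = np.zeros_like(weights)
grad[email] = delta                  # only the used rows get a nonzero gradient
print(np.nonzero(grad)[0])           # -> [2 5 7]: exactly Bob's words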

» More information about the book can be found on the publisher's website.
» Table of Contents
» Excerpt

Habr readers get a 30% discount on pre-orders of the book with the coupon code: Grokking Deep Learning

Source: https://habr.com/ru/post/458800/

