Recently, experts have increasingly raised concerns about the security of machine learning models and proposed various defenses. It is high time to examine in detail the potential vulnerabilities and defenses in the context of popular, traditional modeling systems, such as linear and tree-based models trained on static datasets. Although the author of this article is not a security expert, he follows topics such as debugging, explanations, fairness, interpretability, and privacy in machine learning very closely.
In this article, we present several probable attack vectors against a typical machine learning system in a typical organization, suggest tentative defenses, and discuss some common problems and the most promising practices.
1. Data poisoning attacks
Data poisoning means that someone systematically alters your training data in order to manipulate your model's predictions (such attacks are also called "causative" attacks). To poison data, an attacker must have access to some or all of your training data. In many companies that lack proper controls, various employees, consultants, and contractors may have such access. An attacker outside the security perimeter could also gain unauthorized access to some or all of the training data.
A straightforward data poisoning attack might involve changing the labels in a training dataset. Whatever the commercial application of your model, an attacker can steer its predictions by changing labels, so that the model learns to grant large loans, large discounts, or small insurance premiums to attackers. Forcing a model to make false predictions in the attacker's interest is sometimes called a violation of the model's "integrity".
An attacker can also use data poisoning to train your model to deliberately discriminate against a group of people, depriving them of the large loan, large discount, or low insurance premium they rightfully deserve. In essence, this attack resembles a denial-of-service attack against those individuals. Forcing a model to make false predictions in order to harm others is sometimes called a violation of the model's "availability".
Although it may seem easier to poison data by changing the values in existing rows of a dataset, poisoning can also be carried out by adding seemingly harmless or superfluous columns to a dataset. Altered values in those columns can then trigger changes in model predictions.
Now let's look at some possible defensive and forensic measures against data poisoning:
- Disparate impact analysis. Many banks already perform disparate impact analysis for fair lending purposes, to determine whether their models discriminate against different categories of people. Many other organizations, however, have not come this far. Several excellent open source tools exist for detecting discrimination and conducting disparate impact analysis, for example Aequitas, Themis, and AIF360 (a minimal sketch follows below).
- Fair or private models. Models such as learning fair representations (LFR) and private aggregation of teacher ensembles (PATE) try to rely less on individual demographic attributes when making predictions, and so may be less susceptible to discriminatory poisoning attacks.
- Reject on negative impact (RONI). RONI is a technique for removing rows from the training data that degrade prediction accuracy. For more about RONI, see the references in section 8.
- Residual analysis. Look for strange, prominent patterns in the residuals of your model's predictions, especially those involving employees, consultants, or contractors.
- Self-reflection. Score your models on your own employees, consultants, and contractors to look for anomalously favorable predictions.

Disparate impact analysis, residual analysis, and self-reflection can be carried out at training time and as part of real-time model monitoring.
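Below is a minimal sketch of a disparate impact check in Python, assuming a pandas DataFrame of scored rows with hypothetical column names (protected_group, prediction); the dedicated tools named above are far more complete, and this only illustrates the basic ratio behind the common "four-fifths" rule.

```python
import pandas as pd

def disparate_impact_ratio(scored: pd.DataFrame,
                           group_col: str = "protected_group",
                           pred_col: str = "prediction",
                           reference_group: str = "reference") -> pd.Series:
    """Positive prediction rate per group, divided by the reference group's rate."""
    positive_rates = scored.groupby(group_col)[pred_col].mean()
    return positive_rates / positive_rates[reference_group]

# Hypothetical usage: flag groups below the common four-fifths (0.8) threshold.
scored = pd.DataFrame({
    "protected_group": ["reference", "reference", "group_a", "group_a"],
    "prediction":      [1, 1, 0, 1],
})
ratios = disparate_impact_ratio(scored)
print(ratios[ratios < 0.8])  # groups that may be receiving disparately poor outcomes
```

Checks like this can run both at training time and against recent scoring data as part of monitoring.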
2. Watermark attacks
"Watermark" is a term borrowed from the deep learning security literature, where it often refers to adding special pixels to an image in order to trigger a desired outcome from a model. It is entirely possible to do the same with customer or transaction data.
Consider a scenario in which an employee, consultant, contractor, or outside attacker has access to the production scoring code of your model, which makes predictions in real time. Such a person could change that code so that it recognizes a strange or unlikely combination of input variable values and returns the desired prediction. Like data poisoning, watermark attacks can be used to violate your model's integrity or availability. For example, to violate integrity, an attacker could insert a "payload" into the production scoring code so that it recognizes the combination of an age of 0 and 99 years at an address and then produces some kind of prediction that benefits the attacker. To attack the model's availability, he could insert an artificially discriminatory rule into the scoring code that prevents the model from producing positive outcomes for a certain group of people.
Defensive and forensic approaches to watermark attacks may include:
- Anomaly detection. Autoencoders are a fraud-detection model that can identify input data that is complex and strange, or unlike the rest of the data. Autoencoders could potentially detect any watermark used to trigger a malicious mechanism.
- Data integrity constraints. Many databases do not allow strange or unrealistic combinations of input variables, which could potentially prevent watermark attacks. Applying similar integrity constraints to data streams scored in real time can have the same effect (a minimal sketch follows below).
- Disparate impact analysis: see section 1.
- Version control. The production scoring code of a model should be versioned and managed like any other critical software asset.

Anomaly detection, data integrity constraints, and disparate impact analysis can be used during training and as part of real-time model monitoring.
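Below is a minimal sketch of a real-time integrity check, with hypothetical field names; it simply refuses to score rows containing unrealistic combinations such as the age of 0 and 99 years at an address mentioned above.

```python
def violates_integrity_constraints(row: dict) -> bool:
    """Return True for combinations of input values that should never occur."""
    age = row.get("age", 0)
    years_at_address = row.get("years_at_address", 0)
    if age < 0 or age > 120:           # implausible age on its own
        return True
    if years_at_address > age:         # longer at an address than alive
        return True
    return False

# Hypothetical usage inside a scoring service: refuse or escalate suspicious rows.
row = {"age": 0, "years_at_address": 99}
if violates_integrity_constraints(row):
    print("Row rejected and routed for manual review instead of being scored.")
```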
3. Inversion via surrogate models
Usually, "inversion" refers to extracting unauthorized information from a model rather than injecting information into it. Inversion can also be an example of an "exploratory reverse-engineering attack". If an attacker can obtain many predictions from your model's API or another endpoint (a website, an application, etc.), he can train his own surrogate model. Simply put, this is a simulation of your predictive model! In theory, the attacker can train a surrogate model between the inputs he used to generate the predictions and the predictions themselves. Depending on how many predictions he can obtain, the surrogate model may become a fairly accurate simulation of your model. Once the surrogate model is trained, the attacker has a sandbox in which to plan impersonation (i.e., mimicry) or adversarial example attacks against your model's integrity, or a potential starting point for reconstructing some aspects of your confidential training data. Surrogate models can also be trained on external data sources that are somehow aligned with your predictions, as ProPublica, for example, did with the COMPAS recidivism model.
To protect your model against inversion via a surrogate model, consider the following approaches:
- Authorized access. Require additional authentication (for example, two-factor authentication) in order to obtain a prediction.
- Throttle predictions. Limit high volumes of rapid predictions from individual users; consider artificially increasing prediction latency.
- "White-hat" surrogate models. As a white-hat exercise, try the following: train your own surrogate models between your input data and the predictions of your production model (a minimal sketch follows below), and carefully observe:
- the accuracy bounds of different types of white-hat surrogate models; try to understand to what extent a surrogate model can really be used to extract unwanted information about your model;
- the kinds of data trends that can be learned from your white-hat surrogate model, for example linear trends represented by the coefficients of a linear model;
- the kinds of segments or demographic distributions that can be learned by analyzing the number of individuals assigned to specific nodes of a white-hat surrogate decision tree;
- the rules that can be learned from a white-hat surrogate decision tree, for example how to reliably impersonate someone who would receive a favorable prediction.
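Below is a minimal white-hat surrogate sketch. The production model and data here are stand-ins (hypothetical names); the point is that the surrogate is fitted to the production model's predictions, exactly as an attacker would fit it, and then inspected for the rules and node counts it exposes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-ins for your production model and input data.
X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
prod_model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Fit a shallow surrogate tree to (inputs, production predictions).
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, prod_model.predict(X))

# How faithful is the surrogate, and what rules and segment sizes does it expose?
print("surrogate fidelity:", surrogate.score(X, prod_model.predict(X)))
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(X.shape[1])]))
print("observations per node:", surrogate.tree_.n_node_samples)
```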
4. Adversarial example attacks
In theory, a determined attacker can learn, say by trial and error (i.e., "exploration" or "sensitivity analysis"), by inversion via a surrogate model, or by social engineering, how to game your model in order to obtain a desired prediction or avoid an undesired one. Attempting to achieve such goals with a specially crafted row of data is called an adversarial example attack (sometimes also an exploratory integrity attack). An attacker could use an adversarial example to obtain a large loan or a low insurance premium, or to avoid denial of parole due to a high criminal risk score. Some people call the use of adversarial examples to avoid an undesired prediction "evasion".
Try the methods below to defend against, or detect, an adversarial example attack:
- Activation analysis. Activation analysis requires benchmarking the internal mechanisms of your predictive models, such as the average activation of neurons in your neural network or the proportion of observations assigned to each leaf node in your random forest, and then comparing that information against the model's behavior on real incoming data streams. As one of my colleagues put it: "It's like seeing one leaf node in a random forest that corresponds to 0.1% of the training data but matches 75% of the scoring rows in an hour." (A sketch follows at the end of this section.)
- Anomaly detection: see section 2.
- Authorized access: see section 3.
- Benchmark models. When scoring new data, use a highly transparent benchmark model alongside the more complex model. Interpretable models are harder to hack because their mechanisms are transparent. When scoring new data, compare the newer model against a trusted transparent model, or a model trained on verified data via a trusted process. If the difference between the more complex, opaque model and the interpretable (or verified) model is too large, fall back to the conservative model's predictions or send the row for manual review. Record such incidents; they may be adversarial example attacks.
- Throttle predictions: see section 3.
- White-hat sensitivity analysis. Use sensitivity analysis to run your own exploratory attacks and understand which variable values (or combinations of values) can cause large swings in predictions. Then look for those values, or combinations of values, when scoring new data (a minimal sketch follows right after this list). The open source package cleverhans can also be used for this kind of white-hat exploratory analysis.
- White-hat surrogate models: see section 3.
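Below is a minimal white-hat sensitivity analysis sketch, assuming a fitted binary classifier prod_model and a feature matrix X (hypothetical names); it sweeps one feature across its observed range and records how far the predicted probability can be pushed. This is far simpler than the attacks cleverhans implements for deep learning, but the idea is the same.

```python
import numpy as np

def sensitivity_sweep(model, X: np.ndarray, feature: int, n_steps: int = 25) -> float:
    """Largest swing in predicted probability obtainable by forcing one feature to a value."""
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), n_steps)
    base = model.predict_proba(X)[:, 1]
    max_swing = 0.0
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # force the feature to this value for every row
        swing = np.abs(model.predict_proba(X_mod)[:, 1] - base).max()
        max_swing = max(max_swing, swing)
    return max_swing

# Hypothetical usage: rank features by how easily they move the prediction,
# then watch for those values in incoming scoring data.
# swings = {j: sensitivity_sweep(prod_model, X, j) for j in range(X.shape[1])}
# print(sorted(swings.items(), key=lambda kv: -kv[1]))
```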
Activation analysis and benchmark models can be used during training and as part of real-time model monitoring.
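Below is a minimal sketch of activation analysis for a tree ensemble, assuming a fitted scikit-learn random forest prod_model, training data X_train, and a batch of recent scoring rows X_recent (all hypothetical names); it compares the share of training rows and recent scoring rows landing in each leaf of one tree and prints leaves like the 0.1%-of-training versus 75%-of-scoring case described above.

```python
import numpy as np

def leaf_traffic_report(forest, X_train, X_recent,
                        tree_index: int = 0, ratio_threshold: float = 10.0) -> None:
    """Flag leaves that receive far more scoring traffic than their share of training data."""
    train_leaves = forest.apply(X_train)[:, tree_index]    # leaf id per training row
    recent_leaves = forest.apply(X_recent)[:, tree_index]  # leaf id per recent scoring row
    for leaf in np.unique(recent_leaves):
        train_share = np.mean(train_leaves == leaf)
        recent_share = np.mean(recent_leaves == leaf)
        if train_share > 0 and recent_share / train_share > ratio_threshold:
            print(f"leaf {leaf}: {train_share:.1%} of training rows vs "
                  f"{recent_share:.1%} of recent scoring rows")

# Hypothetical usage as part of real-time monitoring:
# leaf_traffic_report(prod_model, X_train, X_recent)
```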
5. Impersonation
A determined attacker can figure out, again by trial and error, via a surrogate model, or by social engineering, what kind of input, or which specific individuals, receive the desired prediction. The attacker can then impersonate that input or individual in order to benefit from the prediction. Impersonation attacks are sometimes called "mimicry" attacks and, from the model's point of view, resemble identity theft. As with an adversarial example attack, the input data is artificially altered to suit your model. But unlike an adversarial example attack, which may use a seemingly random combination of values to fool the model, impersonation uses information associated with another modeled entity (for example a convict, customer, employee, financial transaction, patient, product, etc.) to obtain the prediction associated with that type of entity. Suppose an attacker learns which characteristics your model rewards with large discounts or benefits. He can then falsify the information you use so that he receives such a discount. He can also share his strategy with others, potentially causing large losses for your company.
If you use a two-stage model, beware of an "allergy" attack: an attacker can mimic a row of ordinary input data for the first stage of your model in order to attack its second stage.
Defensive and forensic approaches to impersonation attacks may include:
- Activation analysis: see section 4.
- Authorized access: see section 3.
- Duplicate checking. At scoring time, track how many similar records your model is exposed to. This can be done in a reduced-dimensional space built with autoencoders, multidimensional scaling (MDS), or similar dimensionality reduction techniques. If there are too many similar rows within a given period of time, take corrective action.
- Threat notification features. Keep a num_similar_queries feature in your pipeline. It may be useless when the model is first trained or deployed, but it can be populated at scoring time (or during future retraining) to alert the model or pipeline to threats. For example, if num_similar_queries is greater than zero at scoring time, the scoring request can be routed for manual review. Later, when you retrain the model, you can teach it to give negative predictions to input rows with high num_similar_queries values (a minimal sketch follows below).

Activation analysis, duplicate checking, and threat notification features can be used during training and as part of real-time model monitoring.
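Below is a minimal sketch of a duplicate check that populates a num_similar_queries style feature, with hypothetical names throughout; it counts near neighbors among recently scored rows in a PCA-reduced space, with PCA standing in for the autoencoder or MDS reduction mentioned above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

class SimilarQueryCounter:
    """Count how many recently scored rows lie close to each new scoring row."""

    def __init__(self, recent_rows: np.ndarray, n_components: int = 2, radius: float = 0.1):
        self.reducer = PCA(n_components=n_components).fit(recent_rows)
        self.index = NearestNeighbors(radius=radius).fit(self.reducer.transform(recent_rows))
        self.radius = radius

    def num_similar_queries(self, row: np.ndarray) -> int:
        reduced_row = self.reducer.transform(row.reshape(1, -1))
        neighbors = self.index.radius_neighbors(reduced_row, radius=self.radius,
                                                return_distance=False)
        return len(neighbors[0])

# Hypothetical usage: escalate rows that look like repeated probing or impersonation.
# counter = SimilarQueryCounter(recent_rows=X_recent)
# if counter.num_similar_queries(new_row) > 10:
#     print("Route this scoring request for manual review.")
```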
6. Common problems
Some common patterns of machine learning use also entail more general security problems.
Black boxes and unnecessary complexity. Although recent advances in interpretable models and model explanations make it possible to use accurate and transparent nonlinear classifiers and regressors, many machine learning workflows still revolve around black-box models. Such models are only one kind of often unnecessary complexity in a typical commercial machine learning workflow. Other examples of potentially harmful complexity include overly exotic engineering choices or a large number of package dependencies. This can be a problem for at least two reasons:
- A persistent, motivated attacker may eventually come to know more about your overly complex black-box modeling system than you or your team do (especially in today's overheated and fast-moving data science job market). To do so, the attacker can use a variety of newer model-agnostic explanation techniques and classical sensitivity analysis, alongside many more common hacking tools. This knowledge imbalance can potentially be exploited to carry out the attacks described in sections 1-5, or other, as yet unknown, types of attacks.
- Machine learning in research and development environments depends heavily on a diverse ecosystem of open source packages. Some of these packages have many contributors and users; others are highly specialized and serve a small circle of researchers and practitioners. Many packages are maintained by brilliant statisticians and machine learning researchers who focus on mathematics or algorithms, not on software engineering, and certainly not on security. It is not uncommon for a machine learning pipeline to depend on dozens or even hundreds of external packages, any one of which could be compromised to conceal a malicious payload.
Distributed systems and models. For better or worse, we live in the age of big data, and many organizations now use distributed data processing and machine learning systems. Distributed computing presents a large attack surface for insiders and outsiders alike. Data can be poisoned on just one or a few worker nodes of a large distributed storage or processing system. A backdoor for watermarks can be coded into just one model of a large ensemble. Instead of debugging one simple dataset or model, practitioners now have to examine data and models scattered across large computing clusters.
Distributed denial-of-service (DDoS) attacks. If predictive modeling services play a key role in your organization's operations, make sure you account for even the most mundane distributed denial-of-service attacks, in which attackers flood the prediction service with an extremely large number of requests in order to delay or halt predictions for legitimate users.
7. General solutions
You can apply several general best practices, old and new, to reduce your security vulnerabilities and to increase fairness, accountability, transparency, and trust in machine learning systems.
Authorized access and prediction throttling. Standard safeguards such as additional authentication and prediction throttling can be very effective at blocking a number of the attack vectors described in sections 1-5. A minimal throttling sketch follows.
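Below is a minimal sliding-window throttling sketch in plain Python (all names are hypothetical); a production deployment would more likely rely on an API gateway or a dedicated rate-limiting service, but the logic is the same.

```python
import time
from collections import defaultdict, deque

class PredictionThrottle:
    """Allow at most max_requests predictions per user within a sliding time window."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)   # user_id -> timestamps of recent requests

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        window = self.history[user_id]
        while window and now - window[0] > self.window_seconds:
            window.popleft()                # drop requests outside the window
        if len(window) >= self.max_requests:
            return False                    # throttle: too many recent predictions
        window.append(now)
        return True

# Hypothetical usage in a scoring endpoint:
# if not throttle.allow(request_user_id):
#     return "429 Too Many Requests"
```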
Benchmark models. You can use an older, proven modeling pipeline, or another highly transparent and interpretable predictor, as a benchmark model to determine whether a prediction has been manipulated, whether by data poisoning, a watermark attack, or an adversarial example attack. If the difference between the prediction of your proven benchmark model and the prediction of the more complex, opaque model is too large, record the case and refer it to analysts, or take other steps to investigate or correct the situation. Serious precautions must be taken to ensure that the benchmark model and its pipeline remain secure and unchanged from their original, trusted state. A minimal comparison sketch follows.
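Below is a minimal sketch of a benchmark comparison, assuming both models are already fitted and expose scikit-learn style predict_proba (the names complex_model and benchmark_model are hypothetical); it flags rows where the two predicted probabilities diverge by more than a threshold.

```python
import numpy as np

def flag_discrepancies(complex_model, benchmark_model, X: np.ndarray,
                       threshold: float = 0.25) -> np.ndarray:
    """Return indices of rows where the complex and benchmark predictions disagree strongly."""
    complex_probs = complex_model.predict_proba(X)[:, 1]
    benchmark_probs = benchmark_model.predict_proba(X)[:, 1]
    gap = np.abs(complex_probs - benchmark_probs)
    return np.where(gap > threshold)[0]

# Hypothetical usage: fall back to the benchmark model or route the row for manual review.
# suspicious_rows = flag_discrepancies(complex_model, benchmark_model, X_new)
```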
Interpretable, fair, or private models. Techniques now exist (for example, monotonic GBMs (M-GBM), scalable Bayesian rule lists (SBRL), and explainable neural networks (XNN)) that offer both accuracy and interpretability. These accurate and interpretable models are easier to document and debug than classic machine learning black boxes. Newer kinds of fair and private models (for example, LFR and PATE) can also be trained to pay less attention to outwardly visible demographic characteristics, which are available for observation and for social engineering in adversarial example or impersonation attacks. Considering building a new machine learning workflow? Consider building it on lower-risk interpretable, private, or fair models. They are easier to debug and potentially more robust to changes in the characteristics of individual entities. A minimal monotonic constraint sketch follows.
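Below is a minimal sketch of a monotonically constrained GBM using XGBoost's monotone_constraints parameter (the data, the constraint string, and a reasonably recent XGBoost version are all assumptions here); constraining the direction of each feature's effect makes the model's behavior easier to document and harder to game with implausible input values.

```python
import numpy as np
import xgboost as xgb

# Hypothetical data: the first feature is assumed to push risk up, the second down.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# +1 enforces a non-decreasing effect, -1 a non-increasing effect, 0 leaves a feature free.
model = xgb.XGBClassifier(n_estimators=100, max_depth=3,
                          monotone_constraints="(1,-1)")
model.fit(X, y)
print(model.predict_proba(X[:5]))
```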
Model debugging for security. The emerging field of model debugging is devoted to finding and fixing errors in the mechanisms and predictions of machine learning models. Debugging tools such as surrogate models, residual analysis, and sensitivity analysis can be used in white-hat exercises to find your own vulnerabilities, or in forensic exercises to detect any potential attacks that may be occurring or may already have occurred.
Model documentation and explanation techniques. Model documentation is a risk-mitigation strategy that has been used in banking for decades. It captures and transfers knowledge about complex modeling systems as the owners of a model change over time. Documentation has traditionally been applied to highly transparent linear models. But with the advent of powerful, accurate explanation tools (such as tree SHAP and derivative-based local feature attributions for neural networks), pre-existing black-box model workflows can be at least somewhat explained, debugged, and documented. Documentation should now obviously include all security goals, including any known, remediated, or anticipated vulnerabilities. A minimal explanation sketch follows.
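Below is a minimal sketch of generating per-row explanations for documentation with the shap package's TreeExplainer (the model and data are stand-ins, and the exact shape of the returned attributions can differ between shap versions).

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for a production tree-based model.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Tree SHAP: local feature attributions for each scored row.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])

# Attributions for the first row can be stored alongside its prediction in the documentation.
first_row = shap_values[1][0] if isinstance(shap_values, list) else shap_values[0]
print(first_row)
```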
Model monitoring and management with security in mind. Serious practitioners understand that most models are trained on static snapshots of reality in the form of datasets, and that their prediction accuracy degrades in real time as the present drifts away from the information collected in the past. Today, most model monitoring is aimed at detecting this kind of drift in the distributions of input variables, which will eventually degrade accuracy. Model monitoring should also be designed to track the attacks described in sections 1-5, as well as any other potential threats uncovered while debugging your model. Although it is not always directly related to security, models should also be evaluated for disparate impact in real time. Along with model documentation, all modeling artifacts, source code, and associated metadata should be managed, versioned, and audited for security like the valuable commercial assets they are.
Threat notification features. Features, rules, and pre- or post-processing stages can be built into your models or pipelines and instrumented to alert you to possible threats: for example, the number of similar rows seen by the model, whether the current row represents an employee, contractor, or consultant, or whether the values in the current row resemble those found in white-hat adversarial example attacks. These features may or may not be needed when a model is first trained; but reserving space for them may one day prove very useful when scoring new data or retraining the model.
System anomaly detection. Train an autoencoder-based anomaly detection meta-model on the operating statistics of your entire predictive modeling system (the number of predictions over a given period, latency, CPU, memory, and disk usage, the number of concurrent users, and so on), and then closely monitor that meta-model for anomalies. An anomaly can signal that something is going wrong; follow-up investigation or dedicated mechanisms will be needed to trace the exact cause of the problem. A minimal sketch follows.
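Below is a minimal sketch of the idea using scikit-learn's MLPRegressor as a small stand-in autoencoder over hypothetical operational metrics; rows with high reconstruction error are flagged as anomalous. A real deployment would likely use a proper deep learning framework and far richer telemetry.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical operating statistics: [predictions_per_min, latency_ms, cpu_pct, mem_pct]
rng = np.random.default_rng(0)
normal_ops = rng.normal(loc=[200, 50, 40, 60], scale=[20, 5, 5, 5], size=(1000, 4))

scaler = StandardScaler().fit(normal_ops)
X = scaler.transform(normal_ops)

# A narrow hidden layer forces the network to learn a compressed representation of "normal".
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), max_iter=2000, random_state=0)
autoencoder.fit(X, X)

def reconstruction_error(rows: np.ndarray) -> np.ndarray:
    scaled = scaler.transform(rows)
    return np.mean((autoencoder.predict(scaled) - scaled) ** 2, axis=1)

threshold = np.percentile(reconstruction_error(normal_ops), 99)
suspicious = np.array([[2000.0, 300.0, 95.0, 90.0]])  # e.g., a flood of slow, heavy requests
print(reconstruction_error(suspicious) > threshold)   # expected to flag this row
```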
8. References and further reading
Much of the current academic literature on machine learning security focuses on adversarial learning, deep learning, and encryption. So far, however, the author does not know of practitioners who actually do all of this in practice. Therefore, in addition to recently published articles and posts, papers from the 1990s and early 2000s on network intrusions, virus detection, spam filtering, and related topics were also useful sources. If you would like to learn more about the fascinating topic of securing machine learning models, the main references, past and present, used to write this post are listed below.
Conclusion
Those concerned with the science and practice of machine learning worry that the threat of machine learning hacks, combined with the growing threats of privacy violations and algorithmic discrimination, could amplify growing public and political skepticism about machine learning and artificial intelligence. We should all remember the difficult times for AI in the not-so-distant past. Security vulnerabilities, privacy violations, and algorithmic discrimination could potentially combine to reduce funding for machine learning research or to provoke draconian regulation of the field. Let's keep discussing and solving these important problems in order to prevent a crisis, rather than having to clean up after one.