
Building safe AI: specification, robustness and assurance

The authors of this article are members of DeepMind's safety research team.

Building a rocket is hard. Each component requires careful design and rigorous testing, with safety and reliability at the core. Rocket scientists and engineers come together to design every system, from navigation to control, engines and landing gear. Only once all the parts are assembled and the systems are tested can we put astronauts on board with confidence that everything will go well.

If artificial intelligence (AI) is a rocket, then someday we will all get tickets on board. And, as with rockets, safety is a crucial part of building AI systems. Ensuring safety requires careful design of the system from the ground up, so that the various components work together as intended, as well as building the tools to oversee the system's operation after it is deployed.
At a high level, safety research at DeepMind focuses on designing systems that reliably behave as intended, while discovering and mitigating possible near-term and long-term risks. Technical AI safety is a relatively new but rapidly developing field, whose content ranges from high-level theory to empirical and concrete research. The goal of this blog is to contribute to the development of the field and to encourage substantive conversation about technical ideas, thereby advancing our collective understanding of AI safety.

In this first article, we discuss three areas of technical AI safety: specification, robustness and assurance. Future articles will broadly follow the boundaries outlined here. Although our views will inevitably evolve over time, we believe these three areas cover a sufficiently wide spectrum to provide a useful categorization for current and future research.


Three problem areas of AI safety. Each block lists some relevant problems and approaches. The three areas are not isolated: they interact with each other. In particular, a given safety problem may involve issues from more than one block.

Specification: defining the purpose of the system


Specification ensures that the behavior of an AI system is consistent with the true intentions of its operator.


Perhaps you know the myth of King Midas and the golden touch. In one version, the Greek god Dionysus promised Midas any reward he wished, in gratitude for the hospitality and kindness the king had shown to Dionysus's friend. Midas asked that everything he touched turn into gold. He was overjoyed with his new power: an oak branch, a stone and the roses in the garden all turned to gold at his touch. But he soon discovered the folly of his wish: even food and drink turned to gold in his hands. In some versions of the story, even his daughter fell victim to the blessing that turned out to be a curse.

This story illustrates the problem of specification: how do we state what we actually want? Specification ensures that an AI system is incentivized to act in accordance with its creator's true wishes, rather than optimizing a poorly defined or simply wrong objective. Formally, we distinguish three types of specification:

- the ideal specification ("wishes"): a hypothetical description of an ideal AI system that is fully aligned with the operator's desires;
- the design specification ("blueprint"): the specification we actually use to build the AI system, for example the reward function that a reinforcement learning agent maximizes;
- the revealed specification ("behavior"): the specification that best describes what actually happens, for example the policy the trained agent ends up following.


The specification problem arises when there is a mismatch between the ideal specification and the revealed specification, that is, when the AI system does not do what we want it to. Studying this problem from a technical AI safety perspective means asking: how do we design more principled and general objective functions, and how do we help agents work out when their goals are mis-specified? Problems that create a mismatch between the ideal and the design specification fall into the "design" subcategory; problems that create a mismatch between the design and the revealed specification fall into the "emergent" subcategory.

For example, in our paper AI Safety Gridworlds (which uses different definitions of the specification and robustness problems than this article), we give agents a reward function to optimize, but then evaluate their actual behavior with a separate "safety performance" function that is hidden from the agents. This setup models the distinctions above: the safety performance function is the ideal specification, which is imperfectly encoded as a reward function (the design specification) and then implemented by the agents, producing a specification that is implicitly revealed through their resulting policy.
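
The core of this evaluation setup can be sketched in a few lines. The corridor, reward and safety function below are invented for illustration and are not DeepMind's Gridworlds code; the point is simply that the agent only ever sees `reward()`, while we score it on a separate `safety_performance()`:

```python
# Toy 1-D corridor: start at cell 0, goal at cell 5, cell 3 is "unsafe"
# (say, a hazard the reward designer forgot to penalize).
GOAL, UNSAFE = 5, 3

def reward(state):
    # Design specification: the only signal the agent is trained on.
    return 1.0 if state == GOAL else 0.0

def safety_performance(trajectory):
    # Ideal specification, hidden from the agent: reach the goal,
    # but lose a point for every visit to the unsafe cell.
    return reward(trajectory[-1]) - sum(s == UNSAFE for s in trajectory)

def run_episode(policy, steps=10):
    state, trajectory = 0, [0]
    for _ in range(steps):
        state = max(0, min(GOAL, state + policy(state)))
        trajectory.append(state)
        if state == GOAL:
            break
    return trajectory

# A policy that greedily maximizes the training reward walks straight
# through the unsafe cell; only the hidden evaluation notices.
trajectory = run_episode(lambda s: 1)
print("training reward:          ", reward(trajectory[-1]))          # 1.0
print("hidden safety performance:", safety_performance(trajectory))  # 0.0
```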


From OpenAI's Faulty Reward Functions in the Wild: a reinforcement learning agent discovers an unintended strategy for scoring more points.

As another example, consider the game CoastRunners, analyzed by our colleagues at OpenAI (see the animation above from Faulty Reward Functions in the Wild). For most of us, the goal of the game is to finish the lap quickly and ahead of the other players; that is our ideal specification. However, translating this goal into an exact reward function is difficult, so CoastRunners instead rewards players (the design specification) for hitting targets laid out along the route. Training an agent to play the game with reinforcement learning leads to surprising behavior: the agent drives the boat in circles to collect the re-spawning targets, repeatedly crashing and catching fire rather than finishing the race. From this behavior we infer (the revealed specification) that something is wrong with the game's balance between the immediate reward for hitting targets and the reward for completing the lap. There are many more examples like this of AI systems finding loopholes in their objective specification.
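
Why does the agent prefer circling? A back-of-the-envelope sketch (all numbers below are invented for illustration, not measured from CoastRunners) shows how a discounted-return comparison can rank a looping policy above a finishing policy under the proxy reward, while the intended objective ranks them the other way around:

```python
# Compare two policies under a proxy reward vs the intended objective.
# All rewards and horizons are made up purely to illustrate the ranking flip.
GAMMA = 0.99

def discounted_sum(rewards):
    return sum(r * GAMMA**t for t, r in enumerate(rewards))

# Policy A: finish the race in 100 steps, earning a one-off finish bonus.
finish_proxy    = discounted_sum([0.0] * 99 + [10.0])    # proxy: small finish bonus
finish_intended = discounted_sum([0.0] * 99 + [100.0])   # intended: winning matters a lot

# Policy B: loop forever, hitting a re-spawning target every 5 steps.
loop_proxy    = discounted_sum([1.0 if t % 5 == 0 else 0.0 for t in range(1000)])
loop_intended = 0.0   # never finishes, so the intended objective gives nothing

print(f"proxy reward:    finish={finish_proxy:.1f}  loop={loop_proxy:.1f}")
print(f"intended reward: finish={finish_intended:.1f}  loop={loop_intended:.1f}")
# Under the proxy reward, looping wins; under the intended objective, finishing wins.
```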

Robustness: designing systems that withstand perturbations


Robustness ensures that an AI system continues to operate safely in the face of perturbations.


In the real-world conditions in which AI systems operate, a certain level of risk, unpredictability and volatility is always present. AI systems must be robust to unforeseen events and to adversarial attacks that could damage or manipulate them. Research on the robustness of AI systems aims to ensure that our agents stay within safe limits regardless of the conditions they encounter. This can be achieved by avoiding risks (prevention) or by self-stabilization and graceful degradation (recovery). Safety problems arising from distributional shift, adversarial inputs and unsafe exploration can be classified as robustness problems.

To illustrate the problem of distributional shift, consider a household cleaning robot that usually cleans rooms without pets. The robot is then deployed in a house with a pet, which it encounters while cleaning. A robot that has never seen cats or dogs before would proceed to wash them with soap, leading to undesirable outcomes (Amodei and Olah et al., 2016). This is an example of a robustness problem that arises when the data distribution at test time differs from the distribution during training.
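
As a minimal numerical sketch of distributional shift (toy Gaussian data rather than the cleaning-robot setting), the snippet below fits a classifier on one data distribution and shows how its accuracy collapses when the test distribution moves:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shift=0.0):
    # Two Gaussian classes in 2-D; `shift` moves both clusters at test time.
    x0 = rng.normal(loc=0.0 + shift, scale=1.0, size=(n, 2))
    x1 = rng.normal(loc=2.0 + shift, scale=1.0, size=(n, 2))
    X = np.vstack([x0, x1])
    y = np.array([0] * n + [1] * n)
    return X, y

X_train, y_train = make_data(500)
clf = LogisticRegression().fit(X_train, y_train)

# In-distribution test set vs a shifted test set.
X_iid, y_iid = make_data(500, shift=0.0)
X_ood, y_ood = make_data(500, shift=3.0)

print("accuracy, same distribution:   ", clf.score(X_iid, y_iid))  # high
print("accuracy, shifted distribution:", clf.score(X_ood, y_ood))  # near chance
```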


From AI Safety Gridworlds. During training the agent learns to avoid the lava, but when tested in a new situation where the location of the lava has changed, it fails to generalize and runs straight into the lava.

Adversarial inputs are a specific case of distributional shift where the input data is designed specifically to fool the AI system.


An adversarial input, overlaid on an ordinary image, can cause a classifier to mistake a sloth for a race car. The two images differ by at most 0.0078 in each pixel. The first is classified as a three-toed sloth with more than 99% confidence. The second is classified as a race car with more than 99% confidence.
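
One common recipe for constructing such perturbations, not necessarily the one behind the image above, is the fast gradient sign method (Goodfellow et al., 2015): nudge every pixel by at most a small epsilon in the direction that increases the classifier's loss. A sketch in PyTorch, assuming a hypothetical `model` that maps a batch of images in [0, 1] to class logits and the corresponding true `labels`:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, images, labels, epsilon=0.0078):
    """One-step L-infinity attack: move each pixel by at most `epsilon`
    in the direction that increases the classification loss."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Usage (placeholder objects): adv = fgsm_perturb(classifier, batch, true_labels)
```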

Unsafe exploration can occur in a system that seeks to maximize its performance and achieve its goal without any guarantee that safety will be preserved while it learns and explores in its environment. An example is a cleaning robot that sticks a wet mop into an electrical outlet while learning optimal mopping strategies (García and Fernández, 2015; Amodei and Olah et al., 2016).
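
One simple family of mitigations, sketched below under the assumption that some actions can be hand-labeled as forbidden in advance (real safe-exploration methods, such as those surveyed by García and Fernández, are considerably richer), is to restrict random exploration to a whitelist of actions:

```python
import random

ACTIONS = ["mop_floor", "vacuum", "dust_shelf", "mop_electrical_outlet"]
FORBIDDEN = {"mop_electrical_outlet"}   # hand-labeled unsafe actions

def safe_epsilon_greedy(q_values, epsilon=0.1):
    """Epsilon-greedy action selection that never explores forbidden actions."""
    allowed = [a for a in ACTIONS if a not in FORBIDDEN]
    if random.random() < epsilon:
        return random.choice(allowed)                       # explore, but only safely
    return max(allowed, key=lambda a: q_values.get(a, 0.0)) # exploit

print(safe_epsilon_greedy({"vacuum": 1.0, "mop_floor": 0.5}))
```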

Assurance: monitoring and controlling system activity


Assurance guarantees that we are able to understand and control AI systems during operation.


Although careful safety engineering can rule out many risks, it is hard to get everything right from the start. Once AI systems are deployed, we need tools to continuously monitor and adjust them. Our last category, assurance, addresses these problems from two sides: monitoring and enforcement.

Monitoring comprises all the methods for inspecting systems in order to analyze and predict their behavior, both through human inspection (of summary statistics) and through automated inspection (to sweep through vast numbers of activity records). Enforcement, on the other hand, involves designing mechanisms for controlling and restricting the behavior of systems. Problems such as interpretability and interruptibility fall into the monitoring and enforcement subcategories respectively.

AI systems are unlike us, both in their embodiment and in the way they process data. This creates problems of interpretability. Well-designed measurement tools and protocols allow us to assess the quality of the decisions made by an AI system (Doshi-Velez and Kim, 2017). For instance, a medical AI system would ideally issue a diagnosis together with an explanation of how it reached that conclusion, so that doctors could check its reasoning process from beginning to end (De Fauw et al., 2018). Furthermore, to understand more complex AI systems, we might even employ automated methods for building models of their behavior using machine theory of mind (Rabinowitz et al., 2018).


ToMNet detects two subspecies of agents and predicts their behavior (from "Machine Theory of Mind")

Finally, we want to be able to switch off an AI system if necessary. This is the problem of interruptibility. Designing a reliable off-switch is very difficult: for example, because a reward-maximizing AI system usually has strong incentives to prevent it from being switched off (Hadfield-Menell et al., 2017), and because such interruptions, especially frequent ones, ultimately change the original task, leading the AI system to draw the wrong conclusions from its experience (Orseau and Armstrong, 2016).


The problem with interruptions: human intervention (i.e. pressing the stop button) can change the task itself. In the figure, the interruption adds a transition (in red) to the Markov decision process that changes the original problem (in black). See Orseau and Armstrong, 2016.
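
To make the caption concrete, here is a toy sketch of how an interruption rule effectively rewrites the transition matrix of a Markov decision process; the states, probabilities and interruption rule are invented and are not taken from Orseau and Armstrong's paper:

```python
import numpy as np

# Toy MDP with three states: 0 = working, 1 = near the stop button, 2 = goal.
# Rows are current states, columns are next states, for one fixed action.
P_original = np.array([
    [0.1, 0.6, 0.3],   # from "working"
    [0.0, 0.2, 0.8],   # from "near the button"
    [0.0, 0.0, 1.0],   # goal is absorbing
])

def with_interruptions(P, interrupted_state=1, halt_state=0, p_interrupt=0.5):
    """Add a 'red' transition: from `interrupted_state`, with probability
    `p_interrupt` a human presses the button and the agent is sent to
    `halt_state` instead of following the original dynamics."""
    P = P.copy()
    P[interrupted_state] *= (1.0 - p_interrupt)      # scale the original transitions
    P[interrupted_state, halt_state] += p_interrupt  # add the interruption transition
    return P

P_interrupted = with_interruptions(P_original)
print(P_interrupted)   # rows still sum to 1, but the dynamics the agent learns from have changed
```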

Looking to the future


We are building the foundations of a technology that will be used for many important applications in the future. It is worth bearing in mind that design decisions that are not safety-critical when a system is first deployed may become so once the technology is widespread. Although such choices may have been integrated into the system for convenience at the time, the resulting problems would be hard to fix without a complete redesign.

Two examples from the history of computer science come to mind: the null pointer, which Tony Hoare called his "billion-dollar mistake", and the gets() routine in C. If early programming languages had been designed with security in mind, progress might have been slower, but it would very likely have had a hugely positive effect on modern information security.

With careful forethought and planning now, we can avoid building in analogous problems and vulnerabilities. We hope the categorization of problems in this article will serve as a useful framework for such methodical planning. We want future AI systems to be not just "hopefully safe" but robustly and verifiably safe, because we built them that way!

We look forward to continuing to make exciting progress in these areas, in close collaboration with the broader AI research community, and we encourage people across disciplines to consider contributing to AI safety research.

Resources


For further reading on this topic, below is a selection of other articles, research agendas and taxonomies that helped us compile our categorization or that offer a useful alternative view on technical AI safety problems:

Source: https://habr.com/ru/post/425387/

