Dangerous pickles - malicious serialization in Python

Hello!

Panta rhei, and the launch of the updated “Python Web Developer” course is approaching, yet we still have some material that we found very interesting and want to share with you.

What are dangerous pickles?

These pickles are extremely dangerous. I cannot even begin to explain how dangerous. Just trust me. This is important, you understand?

“Explosive Disorder” Pan Telare
Before diving into opcodes, let's cover the basics. The Python standard library has a module called pickle, which is used to serialize and deserialize objects. Except that it is not called serialization and deserialization, but pickling and unpickling.

As someone who still has nightmares about using Boost Serialization in C++, I can say that pickle is excellent. Whatever you throw at it, it Just Works. And not only with built-in types: in most cases you can serialize your own classes without writing any serialization methods. Even objects such as recursive data structures (which would crash the similar marshal module) cause no problems.
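As a small illustration of the recursive case, here is a minimal sketch (the variable names are made up for this example) that round-trips a list containing itself:

```python
import pickle

# A list that contains itself: a very simple recursive data structure.
recursive = [1, 2]
recursive.append(recursive)

# pickle keeps track of objects it has already seen, so the round trip terminates.
restored = pickle.loads(pickle.dumps(recursive))

# The restored list refers to itself, just like the original one.
print(restored[2] is restored)
```

The identity check at the end shows that the self-reference is preserved rather than copied.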

Let's give a quick example for those who are not familiar with the pickle module:

    import pickle

    # Start with any instance of a Python type
    original = {'a': 0, 'b': [1, 2, 3]}

    # Turn it into a byte string
    pickled = pickle.dumps(original)

    # Turn it back into an identical object
    identical = pickle.loads(pickled)

That is enough in most cases. Pickling really is cool... but darkness lurks somewhere in its depths.

One of the first lines of the pickle module documentation reads:
Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

I have read this warning many times and often wondered what malicious data might look like. Recently I decided it was time to find out, and it was well worth it.
My quest to create malicious data taught me a lot about how the pickle protocol works, showed me some cool methods for debugging pickles, and turned up a couple of cheeky comments in the Python source code. If you keep reading, you will get the same benefits (and soon you too will be sending people your malicious pickle files). Warning: technical details ahead. The only prerequisite is basic knowledge of Python, although a passing familiarity with assembly will not hurt.

Unreal Pickle Bomb

I started by reading the pickle module documentation, hoping to find clues on how to become an elite hacker, and came across this line:
The pickletools module contains tools for analyzing data streams generated by pickle. The pickletools source code has extensive comments about opcodes used by pickle protocols.

Opcodes? I certainly did not expect the pickle implementation to look like this:

    def dumps(obj):
        return obj.__repr__()

    def loads(pickled):
        # Warning: the pickle module is not secure ...
        return eval(pickled)

But I also did not expect it to define its own low-level language. Fortunately, the second part of that line is true: the pickletools module is very helpful for understanding how the protocols work. Plus, the comments in the code turned out to be very funny.

Take, for example, the question of which protocol version we should focus on. Python 3.6 has five in total, numbered 0 through 4. Protocol 0 is the obvious choice because the documentation calls it “human-readable”, and the pickletools source code offers additional information:

Pickle opcodes never go away, not even when better ways to do a thing get invented. The repertoire of the PM just keeps growing over time... “Opcode bloat” isn't so much a subtle design goal as a source of wearying complication.

It turns out that each new protocol is a superset of the previous one. Even setting aside that protocol 0 is “human-readable” (which hardly matters, since we will be disassembling the instructions anyway), it also contains the smallest possible number of opcodes. That is perfect if the goal is to understand how malicious pickles are created.
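A quick way to see the difference between the text-based protocol 0 and the later binary protocols is to pickle the same object twice. This is only a sketch; the dictionary is the same illustrative value used in the earlier example:

```python
import pickle

data = {'a': 0, 'b': [1, 2, 3]}

# Protocol 0 output consists of printable ASCII characters and newlines only.
text_pickle = pickle.dumps(data, protocol=0)
print(text_pickle.decode('ascii'))

# Protocol 2 and later are binary: they start with the PROTO opcode (0x80)
# followed by the protocol number.
binary_pickle = pickle.dumps(data, protocol=2)
print(binary_pickle[:2])
```

Both byte strings unpickle to the same dictionary; only the encoding of the instructions differs.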

If opcodes confuse you, don't worry. We will return to Python for now, and later I will explain in detail how opcodes relate to Python code. Let's write a simple Python class, no opcodes involved.

    class Bomb:
        def __init__(self, name):
            self.name = name

        def __getstate__(self):
            return self.name

        def __setstate__(self, state):
            self.name = state
            print(f'Bang! From, {self.name}.')

    bomb = Bomb('Evan')

The __setstate__() and __getstate__() methods are used by the pickle module to serialize and deserialize class instances. Often you do not need to define them yourself, because the default implementations simply serialize the instance's __dict__. As you can see, I defined them explicitly here in order to hide a small surprise that goes off the moment a Bomb object is deserialized.
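To see the default behaviour without any custom methods, one option is to ask an object directly how it wants to be pickled. The sketch below uses a hypothetical `Point` class (not part of the article) and calls `__reduce_ex__(0)`, the hook that pickle consults for protocol 0:

```python
import copyreg

class Point:
    """A hypothetical class that defines no __getstate__ or __setstate__."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

point = Point(1, 2)

# Ask the object what protocol 0 should record about it.
constructor, constructor_args, state = point.__reduce_ex__(0)

print(constructor is copyreg._reconstructor)
print(constructor_args)
print(state)
```

The third element is just the instance `__dict__`, which is what gets restored on the other side when no `__setstate__()` is defined.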

Let's check that the surprise in the deserialization code works. We will pickle and unpickle the object like this:

    import pickle

    pickled_bomb = pickle.dumps(bomb, protocol=0)
    unpickled_bomb = pickle.loads(pickled_bomb)

We get:

    Bang! From, Evan.

Exactly according to plan! There is only one problem: if we try to deserialize the pickled_bomb string in a context where Bomb is not defined, it will fail. Instead of our surprise, we get an error:

 AttributeError: Can't get attribute 'Bomb' on <module '__main__'>
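The same kind of failure can be reproduced in a single process with a hand-written pickle that refers to a name nobody defines. This is only a sketch: `NoSuchClassHere` is a made-up name, and the byte string simply says “look up `collections.NoSuchClassHere`, then stop”:

```python
import pickle

# A tiny protocol 0 pickle that asks for an attribute which does not exist.
orphaned_pickle = b'ccollections\nNoSuchClassHere\n.'

try:
    pickle.loads(orphaned_pickle)
except AttributeError as error:
    # The pickle only carries the module and attribute name, never the code itself.
    failure = error
    print(type(error).__name__)
```

The pickle stores a reference to the class, not the class body, so the lookup has to succeed on the unpickling side.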

It turns out that our custom __setstate__() method can only run if the unpickling context already has access to the code containing our malicious print statement. And if we already control code that the victim runs, why bother with pickle at all? We could simply put the malicious code into any other method the victim uses. That is true; I just wanted to demonstrate it clearly.

After all, it would not be unreasonable to suspect that pickle might support pickling the bytecode of an object's methods. For example, the marshal module can serialize code objects, and many pickle alternatives such as marshmallow, dill, and pyro also support function serialization.

However, that is not what the ominous warning in the pickle documentation is about. We need to dive a little deeper to discover the dangers of deserialization.

Decompile Pickle

It is time to try to understand how pickling really works. Let's start by looking at the object from the previous section, pickled_bomb.

 b'ccopy_reg\n_reconstructor\np0\n(c__main__\nBomb\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\nVEvan\np5\nb.'

Wait... didn't we use protocol 0? The “human-readable” one?

But that's okay, we still have the “extensive comments about opcodes used by pickle protocols” in the pickletools source code. They should help us figure this out!

I despair of documenting this accurately and comprehensibly - you really have to read the pickle code to find all the special cases.

- a comment in the source code of pickletools

Oh god, what have we gotten ourselves into?

Jokes aside, the pickletools source code really is well commented. And the tools themselves are no less useful. For example, there is a function for disassembling pickles called pickletools.dis(). It will help translate our pickle into a more understandable language.

To disassemble our pickled_bomb string, simply run the following:

    import pickletools

    pickletools.dis(pickled_bomb)

        0: c    GLOBAL     'copy_reg _reconstructor'
       25: p    PUT        0
       28: (    MARK
       29: c        GLOBAL     '__main__ Bomb'
       44: p        PUT        1
       47: c        GLOBAL     '__builtin__ object'
       67: p        PUT        2
       70: N        NONE
       71: t        TUPLE      (MARK at 28)
       72: p    PUT        3
       75: R    REDUCE
       76: p    PUT        4
       79: V    UNICODE    'Evan'
       85: p    PUT        5
       88: b    BUILD
       89: .    STOP
    highest protocol among opcodes = 0

If you have dealt with languages like x86, Dalvik, or CLR, all of the above may look familiar. But even if you have not, it does not matter; we will go through it step by step. For now it is enough to know that the capitalized words like GLOBAL, PUT, and MARK are opcodes: instructions that are interpreted much like functions in high-level languages. Everything to their right is the arguments of those functions, and everything to their left shows how they were encoded in the original “human-readable” string.

But before starting the step-by-step walkthrough, let's introduce another useful thing from pickletools: pickletools.optimize(). This function removes unused opcodes from a pickle. The output is a simplified but equivalent pickle. We can disassemble an optimized version of pickled_bomb by running the following:

    pickled_bomb = pickletools.optimize(pickled_bomb)
    pickletools.dis(pickled_bomb)

And we get a simplified version of a series of instructions:

        0: c    GLOBAL     'copy_reg _reconstructor'
       25: (    MARK
       26: c        GLOBAL     '__main__ Bomb'
       41: c        GLOBAL     '__builtin__ object'
       61: N        NONE
       62: t        TUPLE      (MARK at 25)
       63: R    REDUCE
       64: V    UNICODE    'Evan'
       70: b    BUILD
       71: .    STOP
    highest protocol among opcodes = 0

You may notice that this differs from the original only in that all the PUT opcodes are gone. That leaves us with 10 instructions to understand. We will soon go through them one at a time and manually translate them into Python.
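The effect of `pickletools.optimize()` can also be checked programmatically. The sketch below uses a plain dictionary rather than the Bomb object so that it runs anywhere; `opcode_names` is a helper invented for this example, built on `pickletools.genops()`:

```python
import pickle
import pickletools

data = {'a': 0, 'b': [1, 2, 3]}
raw = pickle.dumps(data, protocol=0)
optimized = pickletools.optimize(raw)

def opcode_names(pickled):
    """List the opcode names of a pickle without executing it."""
    return [opcode.name for opcode, arg, pos in pickletools.genops(pickled)]

print('PUT' in opcode_names(raw))        # PUT opcodes are present before optimizing
print('PUT' in opcode_names(optimized))  # and gone afterwards, since none was used
print(pickle.loads(optimized) == data)   # the result still unpickles to the same value
```

Because nothing in this pickle ever reads the memo back, every PUT is unused and can be dropped safely.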

During unpickling, opcodes are interpreted by an entity called the Pickle Machine (PM). Each pickle is a program that runs on the PM, just as compiled Java code runs on the Java Virtual Machine (JVM). To make sense of our pickle code, we need to understand how the PM works.

The PM has two areas for storing and interacting with data: the memo and the stack. The memo is meant for long-term storage and is similar to a Python dictionary that maps integers to objects. The stack is like a Python list that most operations interact with by pushing things onto it and popping them off. We can emulate these data areas in Python as follows:

    # The PM's long-term memory/storage
    memo = {}
    # The PM's stack, which most opcodes interact with
    stack = []

During unpickling, the PM reads the pickle program and executes each instruction in sequence. It stops as soon as it reaches the STOP opcode; whatever object is on top of the stack at that point is the final result of unpickling. Using our emulated memo and stack, let's try translating our pickle into Python, instruction by instruction.

GLOBAL pushes a class or function onto the stack, given its module and name as arguments. Note that the disassembly is slightly misleading here, because in Python 3 copy_reg was renamed to copyreg.

MARK pushes a special markobject onto the stack so that we can later use it to delimit a section of the stack. We will use the string “MARK” to represent the markobject.
```
# Push markobject onto the stack.
# 25: (    MARK
stack.append('MARK')
```

GLOBAL again. But this time with the __main__ module, so we do not need to import.
```
# Push a global object (module.attr) on the stack.
# 26: c    GLOBAL     '__main__ Bomb'
stack.append(Bomb)
```

GLOBAL once more. We do not need to import object explicitly either.

    # Push a global object (module.attr) on the stack.
    # 41: c    GLOBAL     '__builtin__ object'
    stack.append(object)

NONE just pushes None onto the stack.

    # Push None on the stack.
    # 61: N    NONE
    stack.append(None)

TUPLE is a little trickier. Remember how we pushed “MARK” onto the stack earlier? This operation gathers everything on the stack after “MARK” into a tuple. It then removes “MARK” and puts the tuple in its place.

    # Build a tuple out of the topmost stack slice, after the markobject.
    # 62: t    TUPLE      (MARK at 25)
    last_mark_index = len(stack) - 1 - stack[::-1].index('MARK')
    mark_tuple = tuple(stack[last_mark_index + 1:])
    stack = stack[:last_mark_index] + [mark_tuple]

    # The stack before TUPLE:
    # [<function copyreg._reconstructor>, 'MARK', __main__.Bomb, object, None]
    # The stack after TUPLE:
    # [<function copyreg._reconstructor>, (__main__.Bomb, object, None)]

REDUCE pops the last two things off the stack. It then calls the second-to-last one, unpacking the last one as its positional arguments, and pushes the result onto the stack. It is hard to explain in words, but the code makes everything clear.
```
# Push an object built from a callable and an argument tuple.
# 63: R    REDUCE
args = stack.pop()
callable = stack.pop()
stack.append(callable(*args))
```

UNICODE just pushes a Unicode string onto the stack (a very nice Unicode string, by the way!)
```
# Push a Python Unicode string object.
# 64: V    UNICODE    'Evan'
stack.append(u'Evan')
```

BUILD pops the last object off the stack and then passes it as the argument to __setstate__() of whatever is now the last thing on the stack.
```
# Finish building an object, via __setstate__ or dict update.
# 70: b    BUILD
arg = stack.pop()
stack[-1].__setstate__(arg)
```

STOP simply means that any item at the top of the stack is our final result.
```
# Stop the unpickling machine.
# 71: .    STOP
unpickled_bomb = stack[-1]
```
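Putting the individual steps together, a toy Pickle Machine might look like the sketch below. It is not how the pickle module is implemented: `run_pickle_machine`, `KNOWN_GLOBALS`, and the lookup-table approach to GLOBAL are assumptions made for this example, and only the handful of protocol 0 opcodes seen so far are supported.

```python
import copyreg

MARK = object()  # stand-in for the PM's special markobject


def run_pickle_machine(pickled, find_global):
    """Interpret the few protocol 0 opcodes used in this article."""
    memo, stack = {}, []
    pos = 0

    def read_line():
        # Protocol 0 arguments run up to the next newline character.
        nonlocal pos
        end = pickled.index(b'\n', pos)
        line = pickled[pos:end]
        pos = end + 1
        return line

    while True:
        opcode = pickled[pos:pos + 1]
        pos += 1
        if opcode == b'c':    # GLOBAL: push module.attr
            module = read_line().decode('ascii')
            name = read_line().decode('ascii')
            stack.append(find_global(module, name))
        elif opcode == b'(':  # MARK: push the markobject
            stack.append(MARK)
        elif opcode == b'N':  # NONE: push None
            stack.append(None)
        elif opcode == b'V':  # UNICODE: push a string
            stack.append(read_line().decode('raw-unicode-escape'))
        elif opcode == b'p':  # PUT: remember the top of the stack in the memo
            memo[int(read_line().decode('ascii'))] = stack[-1]
        elif opcode == b't':  # TUPLE: collapse everything after the last mark
            mark_index = max(i for i, item in enumerate(stack) if item is MARK)
            stack[mark_index:] = [tuple(stack[mark_index + 1:])]
        elif opcode == b'R':  # REDUCE: call a callable with an argument tuple
            args = stack.pop()
            func = stack.pop()
            stack.append(func(*args))
        elif opcode == b'b':  # BUILD: hand the state to __setstate__
            state = stack.pop()
            stack[-1].__setstate__(state)
        elif opcode == b'.':  # STOP: the top of the stack is the result
            return stack[-1]
        else:
            raise ValueError(f'unsupported opcode {opcode!r} at position {pos - 1}')


class Bomb:
    def __setstate__(self, state):
        self.name = state
        print(f'Bang! From, {self.name}.')


# Only these names can be resolved; the real pickle would import whatever is named.
KNOWN_GLOBALS = {
    ('copy_reg', '_reconstructor'): copyreg._reconstructor,
    ('__main__', 'Bomb'): Bomb,
    ('__builtin__', 'object'): object,
}

pickled_bomb = (b'ccopy_reg\n_reconstructor\np0\n(c__main__\nBomb\np1\n'
                b'c__builtin__\nobject\np2\nNtp3\nRp4\nVEvan\np5\nb.')
unpickled_bomb = run_pickle_machine(
    pickled_bomb, lambda module, name: KNOWN_GLOBALS[(module, name)])
```

Resolving GLOBAL through an explicit table keeps the toy machine from importing anything on its own, which makes it convenient for experimenting with hand-written pickles.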

Whew, we're done! I am not sure our code is especially Pythonic... but it does emulate the PM. You may notice that we never used the memo. Remember all those PUT opcodes that were removed by pickletools.optimize()? They would have interacted with the memo, but our simple example did not need it.
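For contrast, here is a sketch of a pickle that does need the memo: a list that contains the same object twice. The names are illustrative only. The second occurrence is written as a GET that reads back what an earlier PUT stored:

```python
import pickle
import pickletools

shared = ['x']
pickled = pickle.dumps([shared, shared], protocol=0)

# List the opcode names without executing the pickle.
names = [opcode.name for opcode, arg, pos in pickletools.genops(pickled)]
print('GET' in names)

# The PUT that GET depends on survives optimization; the unused ones do not.
optimized_names = [opcode.name
                   for opcode, arg, pos in pickletools.genops(pickletools.optimize(pickled))]
print(optimized_names.count('PUT') < names.count('PUT'))

# Thanks to the memo, both elements are the very same object after unpickling.
restored = pickle.loads(pickled)
print(restored[0] is restored[1])
```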

Let's try to simplify the code to make it clearer what it does. Apart from shuffling data around, only three things really happen: importing _reconstructor in instruction 1, calling _reconstructor in instruction 7, and calling __setstate__() in instruction 9. If you do the data shuffling in your head, all of it can be expressed in three lines of Python.

    # Instruction 1, where `_reconstructor` was imported
    from copyreg import _reconstructor
    # Instruction 7, where `_reconstructor` was called
    unpickled_bomb = _reconstructor(cls=Bomb, base=object, state=None)
    # Instruction 9, where `__setstate__` was called
    unpickled_bomb.__setstate__('Evan')

A look inside the source code of copyreg._reconstructor() reveals that we are simply calling object.__new__(Bomb). Knowing this, we can simplify everything down to two lines.

    unpickled_bomb = object.__new__(Bomb)
    unpickled_bomb.__setstate__('Evan')

Congratulations, you just decompiled pickle!

Real Pickle Bomb

I am not a pickle expert, but I already have a rough idea of how to construct a malicious pickle. You can use the GLOBAL opcode to import any function; os.system and __builtin__.eval look like suitable candidates. Then REDUCE can be used to call it with an arbitrary argument. But hold on... wait, what is this?

If not isinstance(callable, type), REDUCE complains unless the callable has been registered with the copyreg module's safe_constructors dict, or the callable has a magic '__safe_for_unpickling__' attribute with a true value. I'm not sure why it does this, but I've sure seen this complaint often enough when I didn't want to <wink>.

Wink right back. It looks like the pickletools documentation suggests that REDUCE can only call explicitly allowed callables. That had me worried for a moment, but a search for safe_constructors quickly turned up PEP 307 from 2003.

In previous versions of Python, unpickling would do a “safety check” on certain operations, refusing to call functions or construct classes that weren't marked as “safe for unpickling” by either having an attribute __safe_for_unpickling__ set to 1, or by being registered in a global registry, copy_reg.safe_constructors.

This feature gives a false sense of security: nobody has ever done the necessary, extensive code audit to prove that unpickling untrusted pickles cannot invoke unwanted code, and in fact bugs in the Python 2.2 pickle.py module make it easy to circumvent these security measures.

We firmly believe that, on the Internet, it is better to know that you are using an insecure protocol than to trust a protocol to be secure whose implementation hasn't been thoroughly checked. Even high quality implementations of widely used protocols are routinely found flawed; Python's pickle implementation simply cannot make such guarantees without a much larger time investment. Therefore, as of Python 2.3, all safety checks on unpickling are officially removed, and replaced with this warning:
Warning: Do not unpickle data received from an untrusted or unauthenticated source.

Hello darkness, my old friend. This is where it all began.

That's it: we have found the key ingredient, and no false sense of security stands in the way of what we are planning. Let's start by writing our bomb:

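    # Add a function to the stack that can execute arbitrary Python
    GLOBAL '__builtin__ eval'
    # Mark the start of our argument tuple
    MARK
        # Add the Python code that we want to execute to the stack
        UNICODE 'print("Bang! From, Evan.")'
        # Wrap that code into a tuple so that REDUCE can consume it
        TUPLE
    # Call `eval()` with our Python code as its argument
    REDUCE
    # STOP is required for the PM program to be valid
    STOP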

To turn this into a real pickle, each opcode has to be replaced with its corresponding ASCII code: c for GLOBAL, ( for MARK, V for UNICODE, t for TUPLE, R for REDUCE, and . for STOP. Note that these are the same values that appeared to the left of the opcodes in the pickletools.dis() output earlier. The arguments are parsed after each opcode based on a combination of position and newline delimiters. Each argument starts either immediately after its opcode or after the previous argument, and is read until a newline character is found. Translated into machine pickle code, the program looks like this:

    c__builtin__
    eval
    (Vprint("Bang! From, Evan.")
    tR.
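The translation rule described above is mechanical enough to sketch as a tiny assembler. Both `OPCODES` and `assemble` are names invented for this example; the function only builds the byte string and never unpickles it:

```python
# ASCII codes of the protocol 0 opcodes used by the bomb.
OPCODES = {
    'GLOBAL': b'c',
    'MARK': b'(',
    'UNICODE': b'V',
    'TUPLE': b't',
    'REDUCE': b'R',
    'STOP': b'.',
}

def assemble(program):
    """Turn (opcode, *arguments) tuples into protocol 0 pickle bytes."""
    output = b''
    for name, *args in program:
        output += OPCODES[name]
        for arg in args:
            # Every argument is terminated by a newline character.
            output += arg.encode('ascii') + b'\n'
    return output

program = [
    ('GLOBAL', '__builtin__', 'eval'),
    ('MARK',),
    ('UNICODE', 'print("Bang! From, Evan.")'),
    ('TUPLE',),
    ('REDUCE',),
    ('STOP',),
]
assembled_bomb = assemble(program)
print(assembled_bomb)
```

GLOBAL takes two newline-terminated arguments (module and name), UNICODE takes one, and the remaining opcodes take none.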

Finally, we can check it out:

    import pickle

    # Unpickle the hand-written pickle.
    pickled_bomb = b'c__builtin__\neval\n(Vprint("Bang! From, Evan.")\ntR.'
    pickle.loads(pickled_bomb)

Aaand...

    Bang! From, Evan.

I know you have no reason to believe me, but it really did work on the first try.
It is easy to see how someone could come up with a more malicious argument for eval(). The PM can be made to do literally anything that Python code can do, including running system commands through os.system().
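Since pickletools can walk over a pickle without executing it, one way to peek at what a suspicious pickle would import is to list its GLOBAL arguments. This is only a sketch with a made-up helper name, `list_globals`, and it is not a security measure: newer protocols can reference globals through other opcodes, and the warning about untrusted data still stands.

```python
import pickletools

def list_globals(pickled):
    """Return the arguments of every GLOBAL opcode without running the pickle."""
    return [arg
            for opcode, arg, pos in pickletools.genops(pickled)
            if opcode.name == 'GLOBAL']

pickled_bomb = b'c__builtin__\neval\n(Vprint("Bang! From, Evan.")\ntR.'
print(list_globals(pickled_bomb))
```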

All good things come to an end

I set out to learn how to make a dangerous pickle, and along the way I accidentally learned how pickle works. I admit, I enjoyed digging into the Pickle Machine. The pickletools source code helped a lot, and I recommend it if you are interested in learning more about the pickle protocol and the PM.

THE END

As always, we welcome your suggestions and questions, which you can post here or ask Ilya Lebedev in person at the Open Day.

Source: https://habr.com/ru/post/353480/
