Information hiding in PDF documents

There are many ways to hide one piece of data inside another. The example most people recall is steganography in images, audio, and video.

However, these are not the only possible containers. Together with ~~two slackers~~ two very talented students (namely lancerx and PavelBatusov), we decided to build a simple just-for-fun tool for hiding information in electronic documents.

Link to the result (don't judge too harshly): pdf.stego.su
(Sample PDFs can be found here)
The interface, complete with a satisfied user, is shown in this kawaii picture:

What is all this about?

Once, talking about steganography over a cup of coffee, we asked ourselves: "Is it possible to embed arbitrary extra information into electronic text documents in such a way that the documents themselves do not visually change?" That is how our little "steganography club" came about.

It turns out it is.

Below is a far from complete list of document formats in which this can be done.

OpenDocument Format (ODT), also known as ISO/IEC 26300-1:2015, which is, by the way, no less than a state standard (sic!): GOST R ISO/IEC 26300-2010. Put simply, the format is a zip archive of XML files. If you don't believe it, install LibreOffice, create an arbitrary document "example.odt", rename it to "example.zip", and see for yourself. There is plenty of room for creativity in embedding extraneous information.
Office Open XML (OOXML, aka DOCX, aka ISO/IEC 29500:2008) is Microsoft's counter-move ("our answer to Chamberlain"). From the information-hiding point of view it is more of the same: DOCX is also a zip archive of XML files, just organized differently.
DjVu (from the French "déjà vu") is a very interesting format for hiding. DjVu uses the JB2 algorithm, which looks for repeated characters and stores their image only once. This suggests several ideas:
- Take the set of all similar characters and choose one of them using hash steganography.
- Choose two character images instead of one. The first is taken to transmit 0 and the second to transmit 1; by alternating them you can transmit hidden information.
- Hide the data inside the actual image that represents the character in DjVu, using LSB.
FictionBook (fb2) is XML. However, it may contain a binary tag holding a picture, and the data can then be hidden in the picture itself. You can also try inserting spaces and other characters outside the tags or inside the tags themselves.
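
The "zip archive of XML files" claim for ODT and DOCX can also be checked in code rather than by renaming the file. Below is a minimal Python sketch using only the standard library. The helper names `list_members` and `build_toy_container` are made up for this illustration, and the toy archive only imitates the layout; it is not a valid ODT document.

```python
import io
import zipfile


def list_members(container: bytes) -> list[str]:
    """Return the names of the files stored inside a zip-based document."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        return archive.namelist()


def build_toy_container() -> bytes:
    """Build a tiny stand-in archive laid out like a zip-based document.

    This is NOT a valid ODT file; it only imitates the "zip of XML" layout
    so the sketch runs without LibreOffice installed.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", "<document>visible text</document>")
        archive.writestr("meta.xml", "<meta/>")
    return buffer.getvalue()


if __name__ == "__main__":
    print(list_members(build_toy_container()))
    # For a real file: list_members(open("example.odt", "rb").read())
```

Any extra member added to such an archive, or any extra attribute inside one of the XML files, is a candidate place for hidden data, provided the office suite ignores it when rendering.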

The list could go on for a long time, because mankind has invented a great many formats for storing text.
For our hiding experiments we chose PDF, since it has the following "advantages":

this format is "not editable", so there are no problems with the file being re-saved (well... in fact we do edit it ourselves, but PDF is still mostly used as an "uneditable" format)
this format is quite simple (more on this below)
this format is quite popular

How does it work?

We called the system SHP, which stands for Simple Hide to PDF. Simple because it is simple; Hide because it hides; and "to PDF" because it works only with PDF documents.

For background, a couple of paragraphs about the ISO 32000-1:2008 standard, which is PDF.
A document consists of objects (the so-called obj). At the end of the document there is an xref table that lists all the objects needed. Each object has a number and a revision... Yes, exactly: PDF supports different revisions! PDF is actually a mini version control system! ;)) It just never really caught on in practice...

A PDF document is formed by objects of different types:

boolean variables
numbers (integer and real)
strings
arrays
dictionaries
streams
comments

Roughly speaking, the PDF structure is as follows:

header
objects (obj data)
xref table
trailer (tells the reader which objects to start reading the file from)
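
To make the four parts above concrete, here is a small Python sketch that locates them in a hand-written toy file. Everything in it is an assumption made for illustration: `TOY_PDF` is not a complete document (it has no pages), `describe_layout` is a made-up helper, and the naive regex scan only works on simple, uncompressed files. Real PDFs may use cross-reference streams and object streams, which need a proper parser.

```python
import re

# Hand-written toy file with a header, one object, an xref table and a trailer.
# It has no pages, so viewers will not display it; it is meant for parsing only.
TOY_PDF = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n45\n%%EOF\n"
)


def describe_layout(pdf: bytes) -> dict:
    """Naively locate the header, objects, xref table and trailer."""
    header = pdf.split(b"\n", 1)[0].decode("ascii")
    # Every object starts with "<number> <revision> obj".
    objects = re.findall(rb"(\d+) (\d+) obj\b", pdf)
    # "startxref" holds the byte offset of the xref table.
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF", pdf)
    if match is None:
        raise ValueError("no startxref entry found")
    xref_offset = int(match.group(1))
    return {
        "header": header,
        "objects": [(int(num), int(rev)) for num, rev in objects],
        "xref_offset": xref_offset,
        "xref_found": pdf[xref_offset:xref_offset + 4] == b"xref",
        "has_trailer": b"trailer" in pdf,
    }


if __name__ == "__main__":
    print(describe_layout(TOY_PDF))
```

Note that the xref table stores byte offsets, so anything inserted in the middle of a file shifts the objects after it and the table has to be rebuilt.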

After studying the PDF standard a little, one can suggest the following hiding methods.

1. The objects can be reordered in a particular way, which changes the structure of the document. The guys and I called this "structural steganography", because you change the structure of the document without changing its content. If you have n objects, there are n! possible orderings, so you can transmit no more than log₂(n!) bits of data. The idea is interesting, but we have put it off until better times.
2. You can play with the revisions inside the file: put the hidden information into an old (unused) revision. However, we looked at 1000 different PDFs and not a single one contained a revision greater than 0...
3. You can look for ways, provided by the format itself, to store data that is not displayed to the user.
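
The log₂(n!) bound for the reordering idea is easy to make tangible. The Python sketch below computes the capacity and maps a number to an ordering and back using the factorial number system (Lehmer code). This is a standard textbook technique chosen for the illustration, not something SHP implements, and the integers stand in for hypothetical object numbers.

```python
import math


def capacity_bits(n: int) -> int:
    """Whole bits that fit into the ordering of n distinct objects."""
    return math.factorial(n).bit_length() - 1  # floor(log2(n!))


def encode(value: int, items: list) -> list:
    """Return the ordering of `items` whose index in factorial base is `value`."""
    pool = sorted(items)
    if not 0 <= value < math.factorial(len(pool)):
        raise ValueError("value does not fit into this many objects")
    ordering = []
    for remaining in range(len(pool), 0, -1):
        index, value = divmod(value, math.factorial(remaining - 1))
        ordering.append(pool.pop(index))
    return ordering


def decode(ordering: list) -> int:
    """Recover the number hidden in an ordering produced by `encode`."""
    pool = sorted(ordering)
    value = 0
    for item in ordering:
        index = pool.index(item)
        value += index * math.factorial(len(pool) - 1)
        pool.pop(index)
    return value


if __name__ == "__main__":
    print(capacity_bits(10))          # bits available with 10 objects
    hidden = encode(5, [1, 2, 3])     # ordering that carries the number 5
    print(hidden, decode(hidden))
```

Even a modest document has enough objects for a short label, but the ordering is destroyed as soon as any tool rewrites the file.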

The easiest way to implement point 3 is... comments. I honestly don't know whom they were left for; perhaps this is a legacy of PostScript, which is formally a programming language (like LaTeX) and therefore has comment lines in its syntax, like any programming language. From the point of view of "pure" steganography this is, of course, not secure. However, the supposed adversary first has to know that something is hidden at all...
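
In PDF a comment starts with `%` and runs to the end of the line, so the comment approach can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SHP's actual code. The `%SHPDEMO:` marker is invented, the payload is hex-encoded so it cannot contain a line break, and the comment is appended after the final `%%EOF` so that the byte offsets in the xref table stay valid. It also assumes the viewer tolerates a short trailing line after `%%EOF`.

```python
import binascii
from typing import Optional

MARKER = b"%SHPDEMO:"  # hypothetical tag, not the marker used by the real SHP


def embed(pdf: bytes, payload: bytes) -> bytes:
    """Append the payload as a single PDF comment line at the end of the file."""
    comment = MARKER + binascii.hexlify(payload) + b"\n"
    if not pdf.endswith(b"\n"):
        pdf += b"\n"
    return pdf + comment


def extract(pdf: bytes) -> Optional[bytes]:
    """Return the payload from the last marked comment line, or None."""
    for line in reversed(pdf.splitlines()):
        if line.startswith(MARKER):
            return binascii.unhexlify(line[len(MARKER):])
    return None


if __name__ == "__main__":
    original = b"%PDF-1.4\n%%EOF\n"   # stand-in for a real document
    stego = embed(original, b"secret")
    print(extract(stego))
```

A fixed marker like this is trivially detectable, which is exactly the weakness described above: the scheme relies on nobody looking.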

There are also cases where there is no point in concealing the very existence of the message. That is information hiding, but not steganography.

Embedding data:

the user submits the PDF document itself, the message to hide, and a password to the SHP system;
SHP uses the password to compute a stego key and a crypto key. The message is compressed and encrypted with the crypto key;
using the stego key, the information is embedded in the PDF document;
at the output of the SHP system, the user receives a PDF document with the embedded data.

Extracting data:

the user submits the PDF and the password to the system;
the system computes the stego key and the crypto key from the password in the same way;
the system retrieves the data using the stego key;
decrypts the data with the crypto key and decompresses it;
returns the message to the user.

That's all.

If the user enters the wrong password, SHP will compute the wrong stego key and crypto key. So the user can be sure that nobody who does not know the password will be able to recover the information from the PDF.
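
The embedding and extraction steps can be sketched roughly as follows. The post does not say which primitives SHP uses, so everything here is an assumption: PBKDF2 derives the two keys, `zlib` does the compression, a SHA-256 counter keystream stands in for a real cipher, and the stego key is used only to tag the payload so that extraction can recognize its own data. The fixed salt and the toy cipher are for illustration only; real use would call for a vetted authenticated cipher and a random salt.

```python
import hashlib
import hmac
import zlib
from typing import Optional

SALT = b"shp-demo-salt"  # hypothetical constant; a real system needs a random salt


def derive_keys(password: str) -> tuple[bytes, bytes]:
    """Derive a (stego_key, crypto_key) pair from one password."""
    material = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), SALT, 100_000, dklen=64
    )
    return material[:32], material[32:]


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode); a stand-in, not for real use."""
    out = bytearray()
    for start in range(0, len(data), 32):
        pad = hashlib.sha256(key + (start // 32).to_bytes(8, "big")).digest()
        out.extend(a ^ b for a, b in zip(data[start:start + 32], pad))
    return bytes(out)


def protect(message: bytes, password: str) -> bytes:
    """Compress and encrypt the message, then tag it with the stego key."""
    stego_key, crypto_key = derive_keys(password)
    body = keystream_xor(crypto_key, zlib.compress(message))
    tag = hmac.new(stego_key, body, hashlib.sha256).digest()[:8]
    return tag + body


def recover(blob: bytes, password: str) -> Optional[bytes]:
    """Reverse `protect`; return None if the password does not match."""
    stego_key, crypto_key = derive_keys(password)
    tag, body = blob[:8], blob[8:]
    expected = hmac.new(stego_key, body, hashlib.sha256).digest()[:8]
    if not hmac.compare_digest(tag, expected):
        return None
    return zlib.decompress(keystream_xor(crypto_key, body))


if __name__ == "__main__":
    blob = protect(b"meet at noon", "correct horse")
    print(recover(blob, "correct horse"), recover(blob, "wrong password"))
```

With a wrong password both derived keys differ, so the tag check fails and nothing is returned, which mirrors the behaviour described above.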

For those who missed it at the beginning of this long read, here is the link to our hastily thrown-together web platform once again: pdf.stego.su
~~(As you can see, instead of Django's standard black we chose the color of a toad in love. Yes, we are simply design geniuses!)~~

What is it for?

At first it was just for fun for me, and a way for my Padawan students to gain skills and experience. Later, however, we came up with a number of ideas. That is why we are publishing this post: we want to hear the opinion of the professional IT community, especially security people.

Maybe everything we have written is nonsense. In that case, if you have not yet given up reading this post, we ask you to spend another 5-10 minutes on criticism in the comments.

In one of my past posts, I talked about 15 practical uses of steganography (and information hiding).
In fact, steganography in documents (PDF documents in particular) is applicable to all of those tasks to some extent.

However, only 4.5 of the tasks are really interesting here.

0.5 Covert transfer of information & covert information storage.

As already mentioned, this is not secure! However, it will certainly work against a casual, non-technical observer. For more serious steganography you need to come up with a good embedding algorithm in the first place. That is why we count this task as 0.5, not 1.

In addition, hiding in electronic documents cannot be considered robust steganography, because the information is lost on conversion (for example, pdf -> odt).

The only place where the idea of covert transmission may be in demand is closed protocols: a kind of "security through obscurity", only in steganography.

1.5 Protection of exclusive rights.

Sales of electronic magazines, various analytics, and other paid subscriptions are gaining momentum. The question arises: can the content being sold be protected somehow, so that certain characters are discouraged from publishing it online?..

You can try embedding information about the recipient into the document being sent: for example, e-mail, payment card number, IP address, online store login, mobile phone number, etc. For security and compliance, you can embed it as salted hashes, or simply embed some number (an ID in the system).
That way the number will mean something only to the owner of the system.
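
The "hash + salt" label could look like the Python sketch below. It is an illustration only. It uses HMAC with a secret key as a keyed variant of a salted hash (a substitution made here, not something the post specifies), and the secret, the e-mail addresses, and the helper names are all hypothetical.

```python
import hashlib
import hmac
from typing import Optional

SECRET_SALT = b"known-only-to-the-shop"  # hypothetical secret held by the seller


def recipient_label(email: str) -> str:
    """Short label that identifies a buyer only to whoever holds the secret."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(SECRET_SALT, normalized, hashlib.sha256).hexdigest()[:16]


def find_leaker(label: str, customers: list[str]) -> Optional[str]:
    """Match a label recovered from a leaked file against the customer list."""
    for email in customers:
        if hmac.compare_digest(recipient_label(email), label):
            return email
    return None


if __name__ == "__main__":
    customers = ["alice@example.com", "bob@example.com"]
    label = recipient_label("bob@example.com")   # embedded into Bob's copy
    print(label, find_leaker(label, customers))
```

Because the label depends on a secret, an outsider who extracts it learns nothing about the buyer and cannot compute a valid label for somebody else's address.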

If a protected document gets published, you can determine exactly who leaked it.

Of course, a number of questions arise.

Can the label be removed?
Is it possible to forge a label and frame another user?

If you use SHP, then this task should also be counted as 0.5, not 1.0 ...

However, you can try to find better and more reliable data hiding algorithms.
For example, using several hiding algorithms that "do not interfere with each other" lets you build a single steganographic construction, so to speak "multifactorial steganography" (also a term that is not really a term).

The essence of "multifactorial steganography" is as follows: if at least one label survives, ~~we can grab the culprit by the balls,~~ we can determine exactly who published the paid content. In Japan, this is relevant.

2.5 Protecting the authenticity of the document.

The idea is very simple. We sign the document, certifying our authorship. The difference from the huge zoo of similar systems is that our signature is inalienable (inseparable) from the file itself.

However, there is already a standard mechanism that does the same thing (at least in the PDF format)!
So we are late >__<
But can similar reasoning be applied to other formats?

3.5 Decentralized EDMS.

In principle, the "inalienability" of hidden data can be used
for decentralized electronic document management systems (EDMS).

But is it needed?
Clearly it is very convenient: peer-to-peer and, in general, fashionable!
The main principle is that the signature is inalienable from the document.
In modern EDMSs, a signed document counts as signed only while it is inside the EDMS.
If you extract it and e-mail it to a third-party organization that does not have your EDMS product, you are simply transferring a file.

EDMSs on today's market resemble messengers. If you are on Skype and Vasya is on Telegram, then either you have to install Telegram or Vasya has to install Skype... But imagine an embedding protocol (or a set of embedding protocols, one per electronic document format).

One for everyone! Universal!

If there were a single protocol for embedding and extracting signatures, just as SMTP and IMAP are the same for all mail clients, that would be much more convenient.

That said, I am not an EDMS specialist. If there are specialists here, please take a moment and write in the comments what you think about this.

Is this idea relevant?

4.5 Watermark in DLP systems.

Imagine that you run a restricted-access or "semi-restricted" facility (yes, those exist). You have information that you would not want to leak outside, for example internal documentation for some product. You embed a certain label (or a label from a certain set). If a document goes outside the system, the DLP (Data Leak Prevention) checks for the label. If there is no label, the document passes; if there is one, the system raises an alert.

Of course, this is not a panacea. But if the benefit from information hiding is much greater than the cost of developing such a system, why not introduce it as an optional (that is, additional) measure?

Besides, it will definitely help against one kind of leak: the unintentional one. There are cases when documents that should not have been sent are sent by mistake (I hope this sad property is characteristic only of "semi-restricted" facilities, not restricted ones...)
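
The label check itself is tiny. In the Python sketch below the label strings and the function names are invented for the example, and the labels are assumed to be stored as plain PDF comments. Such labels are easy to strip deliberately, so this only addresses the unintentional leaks mentioned above.

```python
# Hypothetical set of labels embedded into internal documents as PDF comments.
LABELS = {b"%DLP-LABEL:internal-docs", b"%DLP-LABEL:product-x"}


def should_block(outgoing: bytes) -> bool:
    """Return True if the outgoing file carries any known label."""
    return any(label in outgoing for label in LABELS)


def gateway_decision(outgoing: bytes) -> str:
    """Decision a DLP gateway would take for a file leaving the perimeter."""
    return "alert" if should_block(outgoing) else "pass"


if __name__ == "__main__":
    labelled = b"%PDF-1.4\n%%EOF\n%DLP-LABEL:product-x\n"
    clean = b"%PDF-1.4\n%%EOF\n"
    print(gateway_decision(labelled), gateway_decision(clean))
```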

Summing up


We have satisfied ourselves that hiding data in documents is quite feasible.
~~...and we learned a lot of new things, since we had to dig around in many places...~~

Of course, there are a number of questions.
Is it possible to make this hiding steganographically robust? What happens if the user converts everything from PDF to, say, JPEG?.. Will the hidden information be destroyed? How critical is that? Will multifactorial steganography solve this problem?

Is a statistical approach applicable to assessing the quality of the system? That is, if the system protects in 90% of cases and fails in 10%, is it reasonable to say (as in cryptography) that it does not protect at all? Or are there business cases where even 90% is enough to get some benefit?..

Your point of view, dear reader, is most welcome in the comments: that is what this long read was written for.

Once again, the link to the portal: pdf.stego.su
(+ sample PDFs for experiments, for those too lazy to look for their own)
(we apologize in advance for a possible Habr effect)

Source: https://habr.com/ru/post/301346/
