VoIP telephony is gradually gaining ground with traditional copper-wire telephone systems, as it provides higher throughput at a lower deployment cost. In 2013, the number of VoIP subscribers amounted to more than 150 million , which is a lot in itself; and in 2017, almost 1 billion . But what about the privacy of VoIP calls? Is the end-to-end encryption used in VoIP software able to provide this same privacy? Such issues became especially topical after the revelations of Snowden , who told the world about the total wiretapping, which is carried out by government intelligence agencies like the NSA (National Security Agency) and the CPS (government communications center) using spyware PRISM and BULLRUN, which also listens for encrypted conversations .
- What an attacker can extract from an encrypted audio stream
- Attack on VoIP over bypass channels
- A few words about the DTW algorithm
- The principle of operation of HMM machines
- The principle of operation of PHMM-machines
- From theory to practice: recognition of the language of the conversation
- Listening to Skype's encrypted audio stream
- And if you turn off the VBR mode?
How does PRISM, BULLRUN and other similar software extract information from the voice stream transmitted over encrypted channels? In order to understand the answer to this question, you must first understand how voice traffic is transmitted in VoIP. The data channel in VoIP systems is usually implemented over the UDP protocol, and most often works using the SRTP protocol (Secure Real-time Transport Protocol; real-time secure data transfer protocol), which supports packaging (via audio codecs) and audio stream encryption. In this case, the encrypted stream that is obtained at the output is the same size as the input audio stream. As will be shown below, such seemingly insignificant information leaks can be used to listen to “encrypted” VoIP conversations .
Most of the audio codecs used in VoIP systems are based on the CELP algorithm (Code-Excited Linear Prediction; code linear prediction), the functional blocks of which are shown in Figure 1. In order to achieve better sound quality, without increasing the load on a data channel, VoIP software usually uses audio codecs in VBR mode (Variable bit-rate; audio stream with a variable bit rate). By this principle, for example, Speex audio codec works.
Figure 1. Functional blocks of the CELP algorithm
What does this lead to in terms of confidentiality? A simple example ... Speex, working in VBR mode, packs hissing consonants with a lower bit rate than vowels; moreover, even certain vowels and consonants are packed with a specific bitrate for them (see Fig. 2.a). The graph in Figure 2.b shows the distribution of packet lengths - for a phrase that has hissing consonants: “Speed ​​skaters sprint to the finish”. The deep hollows of the graph fall precisely on the hissing fragments of this phrase. Figure 2.c shows the dynamics of the input audio stream, bit rate and size of the output (encrypted) packets, - superimposed on a common time scale; striking similarity of the second and third graphs - can be seen with the naked eye.
Figure 2. How hissing sounds affect packet size
Plus, if you look at Figure 2 through the prism of the mathematical apparatus of digital signal processing (which is used in speech recognition problems), such as a PHMM machine (Profile Hidden Markov Models; an expanded version of the hidden Markov model), you can see much more than just difference between vowels and consonants. Including , to identify the gender, age, language and emotions of the speaker.
A PHMM-machine does a very good job of processing numerical chains, comparing them with each other and finding patterns between them. That is why the PHMM-machine is widely used in solving speech recognition problems.
In addition, the PHMM machine is also useful in the task of listening to an encrypted audio stream. But not directly, but through bypass channels. In other words, a PHMM-machine cannot directly answer the question: “What phrase is in this chain of encrypted audio packets?”, But it can accurately answer the question: “Is such a phrase contained in such and such a place is the encrypted audio stream? ”
T.O. A PHMM machine can only recognize phrases that it was originally trained on. However, modern deep learning technologies are so powerful that they are able to train a PHMM machine to such an extent that for it the line between the two questions voiced just above is blurred. To appreciate the full power of this approach, you need to dive slightly into the materiel.
The DTW algorithm (Dynamic Time Warping; dynamic transformation of the timeline) until recently has been widely used in solving problems of speaker identification and speech recognition. He is able to find similarities between two numerical chains generated according to the same law - even when these chains are generated at different speeds and are located in different places on the time scale. This is exactly what happens when digitizing the audio stream: for example, the speaker can utter the same phrase with the same accent, but at the same time faster or slower with different background noise. But this does not prevent the DTW algorithm from finding similarities between the first and second options. To illustrate this point with an example, consider two integer chains:
0 0 0 4 7 14 26 23 8 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 6 13 25 24 9 4 2 0 0 0 0 0
If we compare these two chains “on the forehead”, then they are apparently very different from each other. However, if we compare their characteristics, we will see that the chains definitely have some similarities: both of them consist of 8 integers; both have a similar peak value (25-26). The “head-on” comparison, starting from their entry points, ignores these important characteristics of them. But the DTW algorithm, comparing the two chains, takes into account these and their other characteristics. However, we will not focus much on the DTW algorithm, since today there is a more effective alternative - PHMM-machines. It was experimentally established that PHMM-machines "recognize" phrases from an encrypted audio stream with 90 percent accuracy; whereas the DTW algorithm gives only an 80 percent guarantee. Therefore, the DTW algorithm (which in its heyday was a popular tool in solving speech recognition problems) is mentioned only in order to show how much better, in comparison with it, PHMM machines (in particular in solving the problem of recognizing an encrypted audio stream) . Of course, the DTW algorithm, in comparison with PHMM machines, learns much faster. This advantage is undeniable. However, with modern computing power, this advantage is not fundamental.
HMM (just HMM, not PHMM) is a statistical modeling tool that generates numerical chains following the system defined by a deterministic finite state machine, each of whose transient functions is a so-called "Markov process." The operation of this automaton (see Fig. 3) always begins with state “B” (begin) and ends with state “E” (end). The choice of the next state to which the transition from the current will be carried out is carried out in accordance with the transition function of the current state. As you move between the states, the HMM-machine at each step produces one number, from which the output chain of numbers is formed step by step. When the HMM machine is in the “E” state, the chaining ends. Using an HMM machine, one can find patterns in chains that look random at the outset. For example, here this advantage of the HMM machine is used to find patterns between the chain of packet lengths and the target phrase, the presence of which we check in an encrypted VoIP stream.
Figure 3. An example of an HMM machine
Although there are a large number of possible ways that the HMM-machine can go from point “B” to point “E” (in our case, when packing a single audio fragment), it’s still for each specific case (even for such a random one as “ Markov process ”) is one single best way , one single best chain. She is the most likely candidate, who is most likely to choose an audio codec when packing the corresponding audio fragment (because her uniqueness is expressed also in the fact that she lends herself better to packing than others). Such “best chains” can be found using the Viterbi algorithm (as is done here for example).
In addition, in speech recognition tasks (including from an encrypted data stream, as in our case), in addition to being able to find the best path for the observed chain, it is also useful to be able to calculate how likely it is that the selected chain will be generated by the HMM-machine. A concise solution to this problem is given here ; it relies on the Forward-Back algorithm and the Baum-Welsh algorithm .
Here, on the basis of the HMM-automaton, a method for identifying the language in which the conversation is being developed is developed; with an accuracy of 66%. But such low accuracy is not very impressive, therefore there is a more advanced modification of the HMM-machine - PHMM, which draws much more patterns from the encrypted audio stream. So, for example, here it is described in detail how to identify words and phrases using encrypted traffic using a PHMM-machine (and this task will be more difficult than simply identifying the language in which the conversation is being conducted); with an accuracy of 90%.
PHMM is an improved modification of the HMM machine, in which (see Fig. 4.a), in addition to the states of “correspondence” (squares with the letter M), there are also states of “insertion” (rhombs with the letter I) and “delete” (circles with the letter D). Due to these two new states, PHMM-automata, in contrast to HMM-automata, are able to recognize the hypothetical chain “ABCD”, even if it is not fully present (for example, “ABD”) or an insert is made into it (for example, “ABXCD”). In solving the problem of recognizing an encrypted audio stream, these two innovations of the PHMM machine are particularly useful. Because the output of the audio codec rarely matches, even when the audio inputs are very similar (when, for example, the same person says the same phrase). T.O. The simplest model of a PHMM machine consists of three interconnected chains of states (“correspondence”, “insertion”, and “deletion”) that describe the expected length of network packets at each position of the chain (encrypted VoIP traffic packets for the selected phrase).
Figure 4. Example of a PHMM machine
However, since the network packets in which the target phrase is packaged in the encrypted audio stream are usually surrounded by other network packets (the rest of the conversation), we need an even more advanced PHMM machine. One that can isolate the target phrase from other sounds surrounding it. Here, for this, 5 new states are added to the original PHMM machine (see Fig. 4.b). The most important of these five added states is “random” (a diamond with the letter R). A PHMM-machine (after completion of the training stage) goes into this state when packets that are not part of the phrase we are interested in get to it. PS (Profile Start) and PE (Profile End) states - provide a transition between a random state and the profile part of the model. Such an improved modification of the PHMM machine is able to recognize even those phrases that the machine “did not hear” at the training stage (see Fig. 5).
Figure 5. PHMM-machine solves the problem of encrypted audio stream recognition
Here is an experimental setup based on a PHMM machine (see Fig. 6), which was used to analyze encrypted audio streams with speech from 2,000 native speakers from 20 different language groups. After completing the training process, the PHMM-machine identified the language of conversation with an accuracy of 60 to 90%: for 14 out of 20 languages, the accuracy of identification exceeded 90%, for the rest - 60%.
The experimental setup shown in Figure 6 includes two Linux PCs with OpenSource VoIP software. One of the machines works as a server and listens for SIP calls on the network. After receiving the call, the server automatically answers the subscriber, initializing the speech channel to the “Speex over RTP” mode. It should be mentioned here that the control channel in VoIP systems is usually implemented over the TCP protocol, and works either on some of the publicly available protocols with an open architecture (SIP, XMPP, H.323), or has a closed architecture specific for a particular applications (as in Skype, for example ).
Figure 6. An experimental setup for working with a PHMM machine
When the voice channel is initialized, the server plays the file to the caller and then terminates the SIP connection. The subscriber, which is another machine on our local network, makes a SIP call to the server, and then, using the sniffer, “listens” to the file that the server plays: it listens to the chain of network packets with encrypted audio traffic that come from the server. Further, the subscriber either trains the PHMM-machine to identify the language of conversation (using the mathematical apparatus described in the previous sections), or “asks” the PHMM-machine: “What language is the conversation in?” As already mentioned, this experimental setup ensures the accuracy of language identification - up to 90%. The process of training a PHMM machine will be described in detail in the next section (in the example with Skype).
It demonstrates how to solve an even more complex problem using a PHMM machine: recognize the encrypted audio stream generated by Skype (which uses the Opus / NGC audio codec in VBR mode; and 256-bit AES encryption). The presented development works on the principle shown in Figure 5. For this, it uses an experimental setup similar to that shown in Figure 6. But only with the Skype codec Opus.
To train their PHMM-machine, the researchers used the following sequence of steps: 1) first they collected a set of sound tracks, including all the phrases of interest to them; 2) then installed the network packet sniffer, and initiated a voice conversation between two Skype accounts (this manipulation led to the generation of encrypted UDP traffic between two machines, in P2P mode); 3) then they played each of the collected sound tracks in a Skype session using a media player; with five second intervals of silence between tracks; 4) in the meantime, the packet sniffer was configured to register all traffic entering the second machine of the experimental setup (see Fig. 6). After collecting all the training data, the UDP packet length chains were extracted using an automatic parser for PCAP files. The resulting chains, consisting of the lengths of the payload packets, were then used to train the PHMM model, using the Baum-Welsh algorithm .
It would seem that the problem of such leaks can be solved by switching the audio codecs to the constant bitrate mode (although what kind of solution is it - the bandwidth from this decreases sharply), but even in this case, the security of the encrypted audio stream still leaves much to be desired. Indeed, the exploitation of packet lengths of VBR traffic is just one example of an attack on bypass channels. But there are other examples of attacks, such as tracking pauses between words .
The task is of course non-trivial, but quite solvable . Why is it non-trivial? Because in Skype, for example, in order to coordinate the operation of the UDP protocol and NAT (network address translation; network address translation); and also to increase the quality of the transmitted voice - the transmission of network packets does not stop even when there are pauses in the conversation. This complicates the task of identifying pauses in speech.
However , an adaptive threshold value algorithm has been developed here that allows one to distinguish silence from speech with an accuracy of more than 80%; the proposed method is based on the fact that speech activity strongly correlates with the size of encrypted packets: more information is encoded in a voice packet when the user speaks than during the user's silence. And here (with an emphasis on Google Talk, Lella and Bettati), the speaker is identified, even when no leakage occurs through the size of the packets (even when the VBR mode is disabled). Here, researchers rely on measuring time intervals between packet receptions. The described method relies on the phases of silence, which are encoded into smaller packets, with longer time intervals - to separate words from each other.
T.O. even the most modern cryptography is not able to protect encrypted VoIP communications from listening, even if this cryptography is properly implemented, which is unlikely in itself. And it is also worth noting that in this article only one mathematical model of digital signal processing (PHMM-automata) is analyzed in detail, which turns out to be useful in recognizing encrypted audio stream (in such spy software of government special services as PRISM and BULLRUN). But there are dozens and hundreds of such mathematical models. So if you want to keep up to date - look at the world through the prism of higher mathematics.
Source: https://habr.com/ru/post/461759/
All Articles