Exactly a month after the release of the iPhone 4S, the team at the French company Applidium reverse-engineered the protocol of the Siri personal assistant. Below is a translation of their article, which describes the reverse-engineering process and the interesting facts the researchers uncovered.
On October 14, 2011, Apple introduced the new iPhone 4S. One of its new features was Siri, a personal assistant that handles requests from the user in natural language.
Apple explained that Siri sends data to a remote server (which is probably why Siri only works over 3G or WiFi). As soon as we got our hands on a brand new iPhone 4S, we decided to find out how this thing works.
Today, we managed to crack open the Siri protocol. As a result, we can now use the Siri recognition engine from any device. Yes, this means that anyone can now write an Android application that uses the real Siri! Or use Siri on an iPad. And we want to share what we learned with you.
Demonstration
The best demo is probably speech-to-text conversion through Siri. We made a simple sound recording of the phrase “autonomous demo of Siri”, and we got a wonderful result!
Sample_Siri_speech_to_text.zip (70.78K)
This sample never passed through an iPhone, yet we still got Siri to analyze it for us.
Digging into the protocol - a brief technical introduction
At Applidium we make mobile apps. The usual way to communicate with a remote server is HTTP, since this protocol works almost everywhere.
The easiest way to intercept HTTP traffic is to set up a proxy server that you control, configure the iPhone to use it, and look at what passes through the proxy. That did not show us Siri's traffic, so we ran tcpdump on the network gateway instead and found that the Siri traffic was sent over TCP to port 443 on the server 17.174.4.4.
Connecting to 17.174.4.4 from a desktop machine, we noticed that this server presents a certificate for guzzoni.apple.com. So it turned out that Siri was talking to the guzzoni.apple.com server over HTTPS.
As you know, the “S” in HTTPS stands for “secure”: all traffic between the client and the server is encrypted, so we could not simply read it with a sniffer. In this case, the simplest solution is to set up a fake HTTPS server, point the phone at it with a fake DNS server, and look at what arrives. Unfortunately, the people behind Siri did things right: the client checks that the guzzoni certificate is valid, so a server cannot simply pretend to be guzzoni. Well... it checks that the certificate is valid, but you can add your own root certificate, which lets you sign any certificate and have it accepted as valid.
So all we needed to do was set up a custom SSL certificate authority, add it to our iPhone 4S, and use it to sign our own fake certificate for “guzzoni.apple.com”. And it worked: Siri sent its commands to our own HTTPS server! Looks like someone at Apple missed that detail.
That is when we realized how opaque the Siri protocol is. Let's look at a Siri HTTP request. The request body is binary (more on that later), and here are the headers:
ACE /ace HTTP/1.0
Host: guzzoni.apple.com
User-Agent: Assistant(iPhone/iPhone4,1; iPhone OS/5.0/9A334) Ace/1.0
Content-Length: 2000000000
X-Ace-Host: 4620a9aa-88f4-4ac1-a49d-e2012910921
Some interesting things you can learn from this:
- The request uses the custom “ACE” method, rather than the more familiar GET.
- The requested URL is “/ace”.
- Content-Length is almost 2 GB, which obviously does not conform to the HTTP standard.
- X-Ace-Host looks a bit like a GUID. After experimenting with several iPhone 4S units, we concluded that this value is most likely tied to the actual device (much like a UDID).
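To make the observations above concrete, here is a minimal Python sketch that rebuilds the header block of such a request as raw bytes. The function name `build_ace_headers` is our own invention, the identifier in the usage example is a made-up placeholder rather than a real device value, and we assume the usual CRLF line endings of HTTP. It only assembles the bytes and does not talk to any server.

```python
def build_ace_headers(ace_host: str, host: str = "guzzoni.apple.com") -> bytes:
    """Return the header block of an ACE request as raw bytes."""
    lines = [
        "ACE /ace HTTP/1.0",           # custom method instead of GET or POST
        f"Host: {host}",
        "User-Agent: Assistant(iPhone/iPhone4,1; iPhone OS/5.0/9A334) Ace/1.0",
        "Content-Length: 2000000000",  # the oversized value observed in the capture
        f"X-Ace-Host: {ace_host}",     # identifier that appears to be tied to the device
    ]
    # Assumption: header lines end with CRLF and an empty line closes the block.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


if __name__ == "__main__":
    # Hypothetical placeholder, not a real device identifier.
    print(build_ace_headers("00000000-0000-0000-0000-000000000000").decode("ascii"))
```

The binary body would follow these bytes on the same connection.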
Now let's move on to the body. It contains raw binary data. When we first looked at it in a hex editor, we noticed that it always started with 0xAACCEE. That looked like a header! Unfortunately, we could not make sense of anything that came after it.
That's when we took some time to think. As mobile application developers, we know one thing that matters a lot when working with the network: compression. Bandwidth is often limited, so it is generally a very good idea to compress whatever you transmit. And what is the most common data compression library? zlib (zlib.net). This library is really efficient and powerful (and, of course, it is half French!). So we tried to run our binary data through zlib. It did not work: the zlib header was missing. That's when we thought: “Hmm, we already have this AACCEE header in the request body. Maybe there is something more to it?” We developers like to keep data aligned, and three bytes is not a good length for a header. Let's make it four. So we tried to decompress the data starting after the fourth byte. And it worked!
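The decompression step can be sketched in a few lines of Python with the standard `zlib` module. This is an illustration under assumptions, not the tool we wrote: the names `ACE_MAGIC`, `HEADER_LEN` and `inflate_ace_body` are hypothetical, and the value of the fourth header byte is simply skipped without being interpreted.

```python
import zlib

ACE_MAGIC = bytes.fromhex("aaccee")  # first three bytes seen in every request body
HEADER_LEN = 4                       # the three magic bytes plus one more byte


def inflate_ace_body(body: bytes) -> bytes:
    """Skip the 4-byte header and inflate the zlib data that follows."""
    if not body.startswith(ACE_MAGIC):
        raise ValueError("body does not start with 0xAACCEE")
    # A decompressobj returns whatever it can decode so far, so a capture
    # that was cut off mid-session still yields its leading data.
    inflater = zlib.decompressobj()
    return inflater.decompress(body[HEADER_LEN:])
```

Given a captured request body in `body`, `inflate_ace_body(body)` returns the inflated stream. Passing `body[3:]` straight to zlib fails, because the stream does not begin with a valid zlib header at that offset.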
Once the data was decompressed, we were left with more binary data. Most of it was still unclear, but some parts were text. Among them, one caught our eye: bplist00. Hooray! That had to be some kind of binary plist data. After playing with this binary stream for a while, we worked out that it was made up of the following parts:
- Parts starting with 0x020000xxxx are “plist” packets, where xxxx is the size of the binary plist data that follows the header.
- Parts starting with 0x030000xxxx are “ping” packets sent by the iPhone to the Siri servers to keep the connection alive. Here xxxx is the ping sequence number.
- Parts starting with 0x040000xxxx are “pong” packets sent by the Siri server in response to ping packets. Similarly, xxxx is the pong sequence number.
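The layout above can be walked with a short Python loop. This sketch rests on two assumptions that go beyond what we observed: it reads each 5-byte header as one type byte followed by a 4-byte big-endian number (whose upper two bytes were zero in the patterns above), and it treats ping and pong packets as carrying no payload. The name `iter_chunks` and the constants are ours.

```python
import struct
from typing import Iterator, Tuple

PLIST, PING, PONG = 0x02, 0x03, 0x04
HEADER = struct.Struct(">BI")  # 1 type byte + 4-byte big-endian value = 5 bytes


def iter_chunks(stream: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (kind, value, payload) for each part of an inflated stream."""
    offset = 0
    while offset + HEADER.size <= len(stream):
        kind, value = HEADER.unpack_from(stream, offset)
        start = offset
        offset += HEADER.size
        if kind == PLIST:
            # value is the plist size; the plist bytes follow the header
            payload = stream[offset:offset + value]
            if len(payload) < value:
                break  # truncated capture: stop before the incomplete part
            offset += value
            yield kind, value, payload
        elif kind in (PING, PONG):
            # value is the sequence number; assumed to have no payload
            yield kind, value, b""
        else:
            raise ValueError(f"unknown part type 0x{kind:02x} at offset {start}")
```

Iterating with `for kind, value, payload in iter_chunks(data)` gives one tuple per packet, and the `payload` of each plist packet can then be handed to a plist decoder.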
Decoding binary plist content is very simple: on Mac OS X you can use the “plutil” command-line tool, and on any other platform you can do it in Ruby with the CFPropertyList gem.
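For readers without a Mac or Ruby at hand, Python's standard `plistlib` module can do the same job. A minimal sketch follows; the function name `decode_plist_chunk` is hypothetical, and the signature check is our own addition.

```python
import plistlib


def decode_plist_chunk(payload: bytes):
    """Decode one binary plist payload into plain Python objects."""
    if not payload.startswith(b"bplist00"):
        raise ValueError("payload is not a binary plist")
    # plistlib.loads detects the binary format on its own.
    return plistlib.loads(payload)
```

The result is built from ordinary dictionaries, lists, strings and numbers, so it can be pretty-printed or searched like any other Python data.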
What we learned
We learned a few interesting things about how the iPhone 4S communicates with Apple's servers:
Audio data
The iPhone 4S really does send the audio itself to the server. It is compressed with the Speex audio codec, which makes sense because this codec is specifically designed for VoIP.
Identification
The iPhone 4S sends identifiers everywhere. So if you want to use Siri from a different device, you still need the identifier of at least one iPhone 4S. Of course, we will not publish ours, but it is very easy to retrieve one with the tools we wrote. Apple could in theory blacklist an identifier, but as long as you keep it to personal use, everything should be fine.
Actual content
The protocol is, in fact, very, very chatty. Your iPhone sends tons of things to Apple's servers, and those servers reply with an incredible amount of information. For example, when you use speech-to-text, the Apple server even sends back a confidence score and a timestamp for each word.
What's next?
Here is a collection of tools that we wrote to help us understand the protocol. They are written mainly in Ruby (because it is an amazingly simple language), with some parts in C and some in Objective-C. They are not really finished, but they should be enough for anyone with the technical skills to write a Siri application.
Let's see what fun things you do with Siri! And let's see how long it takes Apple to change their security policy.