Our company's work revolves around development for Microsoft Outlook and Exchange Server, so we love digging into them. Today we will dig into a fairly new Microsoft Outlook feature: the forgotten attachment reminder. What could be simpler, it would seem? We publish our “excavations” in English on the company blog, and in Russian exclusively for Habr. Let's go!

Starting with version 2013, the user can enable an automatic forgotten attachment reminder in Microsoft Outlook:

When a message is sent with this option enabled, the following warning may be displayed:

After playing with the message text for a while, you will find that:
- only English is supported (perhaps other languages will be added later);
- if a “key” word contains a typo, there will be no reminder (to be exact, Microsoft Outlook knows exactly one typo, the word “ATACHMENT”);
- not every word you would consider a keyword is treated as one by the system;
- the algorithm is not simple, but it is clearly not perfect.
Looking ahead, let's say right away that the dictionary and the algorithm are hard-coded, and you cannot correct them or add your own keywords or exceptions. All you can do is turn the feature on or off.
As you play with the new feature, questions will arise. Why does the system react to “see picture” or “see gif” in the body of the message, but not to “see photo” or “see pdf”? Needless to say, the system did not respond to “file not attached”, but reacted to “file attached”. How does it work?
How it works
The algorithm is implemented in the MSFAD.DLL library (we studied the version of the file dated November 1, 2013), which is located in the Microsoft Office folder. The library contains a single function, “HasAttachments”, which receives the subject and body of the message and returns a decision on whether to warn the user. The library is more than 300 kilobytes in size. That is a lot for simply finding one string inside another. Entire large programs used to fit into 300 kilobytes. Does it really do nothing but check the text for keywords?
It really does nothing else. 86 kilobytes of the library are occupied by data directly related to text analysis. But you will not see the keywords in the body of the library, even with a hex editor: the dictionary is stored compressed and contains about 650 keywords. And if the words, even in decoded form, occupy a little more than 5 kilobytes, what takes up the other 80 KB?
The answer is suggested by the names of functions that can be found in the library code: ChunkGrammarRule, ChunkGrammarLevel, CompoundAnalyzer, StringAnalyzer, TemplateLexiconBasedStringAnalyzer, FlatLexiconStringAnalyzer, MorphLayerStringAnalyzer, and others. Those 80 kilobytes are data for a natural language processing system!
What ambition! Almost artificial intelligence! But is it appropriate for this problem?
How others do it
Many plugins for Microsoft Outlook have been able to show forgotten attachment reminders for 15 years already. For example, the “Swiss Army knife” MAPILab Toolbox for Outlook includes the “Attachments Forget” component, whose settings are shown in the image.

It works very simply: if a given substring occurs in the message, you get a warning. It performs no natural language analysis, and it is easier to “fool”.
Nevertheless, for all its simplicity, it works quite effectively. Moreover, it can be taught to suit your writing style and the languages you use. If you often send invoices by e-mail, it takes two clicks to teach MAPILab Toolbox to respond to the phrase “see invoice”. The cool natural language analyzer in Microsoft Outlook 2013, on the other hand, will not react to the phrase “see invoice” and will never learn your writing style. It has no self-learning.
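The substring approach described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not MAPILab Toolbox code: the function name, its parameters, and the starting keyword list are all hypothetical.

```python
def needs_attachment_warning(subject, body, keywords, has_attachment):
    """Return True when a keyword occurs in the text but nothing is attached."""
    if has_attachment:
        return False
    text = f"{subject}\n{body}".lower()
    # Plain case-insensitive substring search, no language analysis at all.
    return any(keyword.lower() in text for keyword in keywords)


# Hypothetical starting list; the point of the approach is that the user can edit it.
keywords = ["attached", "attachment", "enclosed"]

before = needs_attachment_warning("Invoice", "Please see invoice.", keywords, False)

# "Teaching" the reminder a new phrase is just adding a string to the list.
keywords.append("see invoice")
after = needs_attachment_warning("Invoice", "Please see invoice.", keywords, False)

print(before, after)
```

The same simplicity explains why such a reminder is easy to fool: any message that contains the substring triggers it, whatever the sentence actually means.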
A deeper look under the hood
Having been quite intrigued and impressed by the new Microsoft Outlook feature at first, we were left somewhat disappointed after practical tests.
There are “powerful” words that trigger a warning even when they are the only text in the body of the message. There are nine of them: ATTACHED, ATTACHMENT, ATTACHMENTS, FYI, ATTACHING, REATTACHING, ENCL, ENCLOSURE, ENCLOSURES. Some of these words work very well in short phrases. For example, the phrase “WHUSGD YODJHHW IS ATTACHED” triggers a warning. But this is not much different from the MAPILab Toolbox algorithm. It knows 10 words too, and it can be taught ten more phrases.
Let us turn to the natural language analysis. The phrase “HE WAS VERY ATTACHED TO THE OLD LADY” does not trigger a warning. But the phrase “THEY FOUND A FIRE IN THE ATTACHED GARAGE OF A SINGLE-FAMILY HOME” does. To an analyzer with a limited vocabulary, these phrases look like “HE WAS VERY ATTACHED TO THE * *” and “THEY FOUND A * IN THE ATTACHED * OF A * *” (asterisks stand for words unknown to the analyzer). The analyzer was apparently able to distinguish between “very attached” and “in the attached”. Here we see that the analyzer copes well with syntax, but semantics is beyond it. A 650-word dictionary is not enough.
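To see how a sentence looks to an analyzer with a limited vocabulary, here is a minimal Python sketch that replaces every unknown word with an asterisk. The vocabulary below is a hypothetical subset chosen to reproduce the two example phrases; it is not the real MSFAD.DLL dictionary, and the function name is ours.

```python
# Hypothetical subset of the vocabulary, assumed for illustration only.
KNOWN_WORDS = {
    "HE", "WAS", "VERY", "ATTACHED", "TO", "THE",
    "THEY", "FOUND", "A", "IN", "OF",
}


def mask_unknown(sentence, vocabulary=KNOWN_WORDS):
    """Replace every word that is missing from the vocabulary with '*'."""
    words = sentence.upper().split()
    return " ".join(word if word in vocabulary else "*" for word in words)


print(mask_unknown("HE WAS VERY ATTACHED TO THE OLD LADY"))
print(mask_unknown("THEY FOUND A FIRE IN THE ATTACHED GARAGE OF A SINGLE-FAMILY HOME"))
```

With everything else masked, only the words around ATTACHED remain to tell the two sentences apart, which is a syntactic cue rather than a semantic one.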
Now let's move away from the words related to ATTACHMENT and see how the analyzer copes. The not quite grammatical phrase “I SEND YOU THE FILE” does not trigger a warning, even if FILE is replaced with similar words. But the phrase “I AM SENDING YOU THE FILE” does. Note that the analyzer knows English very well: if you drop an article somewhere, even an obvious phrase often stops triggering.
Many words in the dictionary are assigned the same semantic code. For example, the code is the same for CONTRACT, DOCUMENT, EXCEL, FILE, FORM, PHOTO, RESUME, SPREADSHEET, WORKBOOK, and some others. Therefore, replacing the word FILE in the last phrase with any of them changes nothing. But the dictionary is limited, and it is easy to find a substitute that produces no response. The phrases “I AM SENDING YOU THE BILL” and “I AM SENDING YOU THE NON-DISCLOSURE AGREEMENT” do not trigger a warning.
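The effect of shared semantic codes can be illustrated with a small Python sketch. All numeric codes here are invented (the article only establishes that the listed words share one code and that its absolute value is irrelevant), and the presence of I, AM, SENDING, YOU, and THE in the dictionary is our assumption.

```python
ATTACHABLE = 100  # arbitrary value; only equality between codes matters

# Words that share one semantic code according to the dictionary.
SEMANTIC_CODES = {
    word: ATTACHABLE
    for word in (
        "CONTRACT", "DOCUMENT", "EXCEL", "FILE", "FORM",
        "PHOTO", "RESUME", "SPREADSHEET", "WORKBOOK",
    )
}
# Assumed entries for the function words of the example phrase.
SEMANTIC_CODES.update({"I": 1, "AM": 2, "SENDING": 3, "YOU": 4, "THE": 5})


def encode(sentence):
    """Map each word to its semantic code, or None if the word is unknown."""
    return [SEMANTIC_CODES.get(word) for word in sentence.upper().split()]


file_codes = encode("I AM SENDING YOU THE FILE")
photo_codes = encode("I AM SENDING YOU THE PHOTO")
bill_codes = encode("I AM SENDING YOU THE BILL")

print(file_codes == photo_codes)  # same code sequence
print(bill_codes)                 # last word is unknown
```

Any rule written in terms of codes cannot tell FILE from PHOTO, while BILL falls outside the dictionary and the rule has nothing to match.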
A look into the dictionary
The illustration below shows a little more than one fifth of the dictionary, sorted by semantic code (CODE; its absolute value does not affect anything). We took the beginning, the middle, and the end of the dictionary:

The dictionary, in our opinion, is too small for the problem being solved. Half of the vocabulary consists of words needed to parse a sentence. The other half is closely related to what can be sent by e-mail as an attachment, and only the most popular words related to electronic attachments made it into the dictionary. There are no words like HOME, GIRL, CAR, WORLD, or PEACE in it. Therefore, “ATTACHED GARAGE” and “ATTACHED STATEMENT” are exactly the same half-unknown phrase to the analyzer.
The analyzer makes quite a few errors of both kinds: false positives (reacting to innocuous phrases) and false negatives (not triggering on phrases like “THIS EMAIL CONTAINS AN IMPORTANT ATTACHMENT”).
If we compare this algorithm with a primitive keyword search, their results are quite comparable. Why did Microsoft choose such a difficult path and write a thousand times more code for a not very important task?
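The two kinds of error can be made concrete with a short Python sketch. The "did warn" values are the Outlook 2013 behavior reported for these phrases in this article; the "should warn" labels are our own judgment of each phrase, and the helper name is hypothetical.

```python
from collections import Counter


def classify(should_warn, did_warn):
    """Name the outcome of one reminder decision."""
    if did_warn:
        return "true positive" if should_warn else "false positive"
    return "false negative" if should_warn else "true negative"


# (phrase, should_warn [our judgment], did_warn [as reported in the article])
observations = [
    ("HE WAS VERY ATTACHED TO THE OLD LADY", False, False),
    ("THEY FOUND A FIRE IN THE ATTACHED GARAGE OF A SINGLE-FAMILY HOME", False, True),
    ("I AM SENDING YOU THE FILE", True, True),
    ("I AM SENDING YOU THE BILL", True, False),
    ("THIS EMAIL CONTAINS AN IMPORTANT ATTACHMENT", True, False),
]

counts = Counter(classify(should, did) for _, should, did in observations)
for outcome, count in sorted(counts.items()):
    print(f"{outcome}: {count}")
```

Five phrases are far too few for any statistics; the sketch only shows how each observed behavior maps to a false positive or a false negative.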
Is Google to blame?
Attachment Reminder appeared in Gmail in 2010 (before that, it had spent two years in Gmail Labs). A similar feature appeared in Hotmail (now Outlook.com) a year later. The rivalry between the two giants shows even in small things. And if Google did something “simply”, then Microsoft will do it in style, just to smile condescendingly.
In 2009, a German technical university published the article “Learning to Recognize Missing E-mail Attachments”, which presented data on the superiority of learning algorithms over the static keyword method. Perhaps it was this article that gave Microsoft the idea of creating a “smart” Attachment Reminder. Microsoft has a huge database of messages, and the resulting technology can be used in Outlook.com, in Microsoft Outlook, and probably even in mobile applications.
This is how Attachment Reminder behaved on some test phrases in Microsoft Outlook 2013 and in popular online services (“Yes” means a warning was shown; green means the system was not mistaken):

This mini-test does not allow confident conclusions about the algorithms used. But it is reasonable to assume that Gmail uses the primitive static keyword method. It clearly triggers on the phrases “I HAVE ATTACHED” and “IS ATTACHED”, regardless of semantics and syntax. Outlook.com also uses this method, but it triggers on more key phrases than Gmail. Apparently, the advanced technology used in Microsoft Outlook 2013 has not reached it yet.
Only Microsoft Outlook 2013 shows an attempt to analyze the text. But it does not always succeed, and in this mini-test it did not emerge as a clear leader. Enlarging the dictionary severalfold would probably improve the quality of the algorithm significantly.
In practical terms, however, the static keyword method with user-configurable keywords will most likely provide better protection: e-mails often use abbreviations and professional jargon, and correspondence takes place in some context, so full text analysis is difficult.
In any case, Microsoft made a cool and unusual thing that was very interesting to study. Let's see what becomes of it in a few years! We also studied the version of MSFAD.DLL dated July 16, 2014, which was released as part of update KB2883094 (the latest available at the time of writing). In the new version, neither the dictionary, nor the parsing data, nor the algorithms have changed. It was just a bugfix. So Microsoft has apparently not been actively working on Attachment Reminder recently, and a real update is unlikely to arrive soon.