Text at any cost: RTF

We continue our research into extracting text from various data formats. Not long ago we learned how to pull text out of zipped-XML files (odt and docx) and, at the start of this week, out of pdf. Today we continue with the promised rtf.

Rich Text Format (also known as rtf) is ~~a rather forgotten, you might think, but~~ not a very complex format for presenting text data. Getting the text out would be relatively simple if not for the format's history: between its first version and the current 1.9.1 it has grown to just under 300 pages of official documentation and a huge number of extensions, and those are what will get in our way most when extracting plain text. Let's try to get around them...

What's inside?

As usual, let's open an rtf file and see what's inside:
[image: an rtf file opened as plain text]

What do we see? ~~I see our favorite poem "Sail".~~ We see that the format is, at its core, 8-bit text. That is already good news: when the source data is text, it is much easier to understand what is going on. Now let's see how to read this data. For that, a little theory.

Let's assume that rtf consists of control words that can be grouped into nested sets. A control word begins with a backslash (\), and a group is wrapped in curly braces ({ and }).

A control word consists of a sequence of English letters (a to z) and may end with a numeric parameter (possibly negative). Alternatively, the backslash may be followed by a single non-alphanumeric ASCII character. Anything that does not fit these rules is not part of the control word. So a sequence like \rtf1\ansi\ansicpg1251 splits easily into three words: rtf with parameter 1 (the major version of the format), ansi (the current encoding) and ansicpg with parameter 1251 (the current code page, i.e. Windows-1251).
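To make the splitting rule concrete, here is a minimal sketch in PHP. The helper name rtf_split_control_words is made up for this illustration and is not part of the parser further down; it only handles letter-based control words with an optional numeric parameter and ignores single-character control symbols.

```php
<?php
// Hypothetical helper for illustration: split a run of RTF control words
// into array(word, parameter) pairs. The parameter is null when absent.
function rtf_split_control_words(string $rtf): array
{
    $result = array();
    // A backslash, then letters, then an optional (possibly negative) number.
    preg_match_all('/\\\\([a-zA-Z]+)(-?\d+)?/', $rtf, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // With PREG_SET_ORDER a trailing unmatched group is simply missing.
        $param = (isset($m[2]) && $m[2] !== '') ? (int)$m[2] : null;
        $result[] = array($m[1], $param);
    }
    return $result;
}

// Single quotes keep the backslashes literal.
var_export(rtf_split_control_words('\rtf1\ansi\ansicpg1251'));
```

For the sequence from the text this should give rtf with 1, ansi with no parameter and ansicpg with 1251.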

Groups define the scope of control words: a control word that appears inside curly braces applies only inside them and in all nested groups. To know which words are in effect at any moment, we have to maintain a stack of control words. On an opening brace, push a new element onto the stack and immediately copy the state of the previous layer into it; on a closing brace, remove the topmost layer.

Note also that some control words can be switched off by giving them a parameter of zero instead of opening a new subgroup. For example, the following two are equivalent: This is {\b bold} text and This is \b bold \b0 text. Both give "This is bold text" with the word "bold" in bold.
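Here is a small sketch of that stack in PHP, tracking only the \b flag. The function rtf_bold_runs is a hypothetical name invented for this example; it understands nothing but braces, \b and \b0, and treats everything else as text.

```php
<?php
// Hypothetical illustration of the scope stack: returns a list of
// array(text, isBold) runs for input that uses only braces, \b and \b0.
function rtf_bold_runs(string $rtf): array
{
    $stack = array(array('b' => false)); // bottom layer: document defaults
    $runs = array();
    $len = strlen($rtf);
    for ($i = 0; $i < $len; $i++) {
        $c = $rtf[$i];
        $top = count($stack) - 1;
        if ($c === '{') {
            $stack[] = $stack[$top];      // new layer inherits the parent state
        } elseif ($c === '}') {
            if ($top > 0) {
                array_pop($stack);        // leaving the group restores the parent
            }
        } elseif (preg_match('/\G\\\\b(0)?(?![a-zA-Z0-9]) ?/', $rtf, $m, 0, $i)) {
            // \b switches bold on, \b0 switches it off in the current layer;
            // one space after the control word is a delimiter, not text.
            $stack[$top]['b'] = !isset($m[1]);
            $i += strlen($m[0]) - 1;
        } else {
            $bold = $stack[$top]['b'];
            $last = count($runs) - 1;
            if ($last >= 0 && $runs[$last][1] === $bold) {
                $runs[$last][0] .= $c;    // same formatting: extend the run
            } else {
                $runs[] = array($c, $bold);
            }
        }
    }
    return $runs;
}

var_export(rtf_bold_runs('This is {\b bold} text'));
var_export(rtf_bold_runs('This is \b bold \b0 text'));
```

In both calls only the word "bold" ends up in a run flagged as bold; the two inputs differ only in which run the neighbouring space lands in.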

Where to get the text?

Now that we know how the format is built, let's ask where the text lives. It is not as hard as it may seem: text is whatever is not recognized as a control word. With a couple of exceptions, naturally.

First, the native encoding of an rtf file is ANSI, so without extra tricks only English text survives. We are interested at least in Russian text, and better still in Unicode, aren't we? And indeed: rtf may be an old format, but it is quite capable of storing both.

So, rtf lets us use the upper half of the 8-bit character table, that is, codes 128 and above, interpreted according to the current code page (see the \ansicpg control word above). For this the format has sequences of the form \'hh, where hh is the two-digit hexadecimal code of the character.
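As a sketch of what decoding \'hh with the code page could look like, here is a small PHP helper. The name rtf_decode_hex_escapes is hypothetical, the code page is passed in by hand rather than read from \ansicpg, and the example assumes the iconv extension is available and knows the CP1251 name.

```php
<?php
// Hypothetical helper: replace every \'hh escape with the UTF-8 character
// that byte hh stands for in the given single-byte code page.
function rtf_decode_hex_escapes(string $rtf, string $codePage = 'CP1251'): string
{
    return (string)preg_replace_callback(
        "/\\\\'([0-9a-fA-F]{2})/",
        function (array $m) use ($codePage): string {
            $byte = chr((int)hexdec($m[1]));
            $utf8 = iconv($codePage, 'UTF-8', $byte);
            // Bytes the code page does not define are replaced with "?".
            return ($utf8 === false) ? '?' : $utf8;
        },
        $rtf
    );
}

// Five Windows-1251 bytes; prints the title of the poem mentioned above.
echo rtf_decode_hex_escapes("\\'cf\\'e0\\'f0\\'f3\\'f1"), "\n";
```

The parser below does not do this: it decodes the byte without looking at the code page, which is one of the improvements listed in the conclusion.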

The second, more interesting option is Unicode data. For it the format has the short control word \uABCD with a numeric parameter ABCD, which is the Unicode code of the character written in decimal. Simple again, as you can see.
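A minimal sketch of turning that decimal parameter into a UTF-8 character follows. The function name rtf_u_to_utf8 is invented for the example. It also covers one detail the text above does not mention: the RTF specification treats the parameter as a signed 16-bit number, so codes above 32767 are written as negative values.

```php
<?php
// Hypothetical helper: convert the decimal parameter of \u to UTF-8.
function rtf_u_to_utf8(int $param): string
{
    // Negative values are the upper half of the 16-bit range wrapped around.
    $cp = ($param < 0) ? $param + 65536 : $param;
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    // A 16-bit value never needs more than three UTF-8 bytes.
    return chr(0xE0 | ($cp >> 12))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo rtf_u_to_utf8(915), "\n"; // Greek capital gamma, U+0393
```

Surrogate pairs (two \u words forming one character) are left out of this sketch.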

Simple, but not quite. rtf has another control word, \ucN, that is closely tied to Unicode. The format is very zealous about compatibility with old software that may have to open the file. Such a viewer (say, on a computer running Windows 3.11 :) cannot read Unicode, so what should it do? For this purpose, each Unicode character written with \u can be followed by zero or more fallback characters, to be shown if the viewer cannot display or parse the Unicode data (according to the documentation, a viewer that does not understand a control word should skip it, so it skips \u and shows the fallback). A viewer that does understand \u must skip the fallback characters instead.

Because of this, most modern editors put a question mark after the Unicode control word, as the character to show in place of the real one. Other variants are possible too, for example Lab\u915GValue. So how many characters do we have to skip after a \u that we were able to read? Again, nothing difficult: the parameter N of the \ucN control word gives exactly this number. That is, somewhere before the Unicode data there is something like \uc1, which tells us to skip one character after each \u.
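The interplay of \u and \ucN can be sketched like this. The function rtf_decode_unicode_runs is a hypothetical name; it keeps a single global fallback length instead of a per-group one, counts every fallback character as one byte, and treats everything except \u and \uc as plain text, so it is an illustration rather than a parser.

```php
<?php
// Hypothetical helper: decode \uN words and drop their fallback characters.
function rtf_decode_unicode_runs(string $rtf): string
{
    $out = '';
    $uc = 1; // fallback length; the RTF specification defaults it to 1
    $len = strlen($rtf);
    $i = 0;
    while ($i < $len) {
        // \ucN or \uN, plus the optional single space that ends the word.
        if (preg_match('/\G\\\\(uc|u)(-?\d+) ?/', $rtf, $m, 0, $i)) {
            $i += strlen($m[0]);
            if ($m[1] === 'uc') {
                $uc = (int)$m[2];
            } else {
                $code = (int)$m[2];
                if ($code < 0) {
                    $code += 65536; // signed 16-bit parameter
                }
                $out .= html_entity_decode('&#' . $code . ';', ENT_QUOTES, 'UTF-8');
                $i += $uc; // we understood \u, so skip the fallback
            }
        } else {
            $out .= $rtf[$i];
            $i++;
        }
    }
    return $out;
}

echo rtf_decode_unicode_runs('\uc1 Lab\u915GValue'), "\n";
```

With \uc1 in effect the letter G after \u915 is the fallback and disappears from the output, leaving Lab, the Greek gamma and Value.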

Let's read!

It seems we have gathered enough to read our first rtf files. Here we go:

// Returns true if text met in the current group should go to the output,
// i.e. the group is not a font table, color table, ignorable destination, etc.
function rtf_isPlainText($s) {
    $failAt = array("*", "fonttbl", "colortbl", "datastore", "themedata");
    for ($i = 0; $i < count($failAt); $i++)
        if (!empty($s[$failAt[$i]])) return false;
    return true;
}

function rtf2text($filename) {
    $text = file_get_contents($filename);
    if ($text === false || !strlen($text))
        return "";

    $document = "";
    // One layer of control words per open group, plus a base layer so that
    // $stack[$j] always exists, even in a malformed file.
    $stack = array(array());
    $j = 0;
    $len = strlen($text);

    for ($i = 0; $i < $len; $i++) {
        $c = $text[$i];
        switch ($c) {
            case "\\":
                $nc = ($i + 1 < $len) ? $text[$i + 1] : "";
                // Control symbols: escaped backslash, non-breaking space and hyphen.
                if ($nc == '\\' && rtf_isPlainText($stack[$j])) $document .= '\\';
                elseif ($nc == '~' && rtf_isPlainText($stack[$j])) $document .= ' ';
                elseif ($nc == '_' && rtf_isPlainText($stack[$j])) $document .= '-';
                // \* marks a group a reader may ignore if it does not know it.
                elseif ($nc == '*') $stack[$j]["*"] = true;
                // \'hh: an 8-bit character. The code page is not taken into
                // account here (see the conclusion).
                elseif ($nc == "'") {
                    $hex = substr($text, $i + 2, 2);
                    if (rtf_isPlainText($stack[$j]))
                        $document .= html_entity_decode("&#" . hexdec($hex) . ";");
                    $i += 2;
                } elseif ($nc >= 'a' && $nc <= 'z' || $nc >= 'A' && $nc <= 'Z') {
                    // A control word: letters, then an optional numeric parameter.
                    $word = "";
                    $param = null;
                    for ($k = $i + 1, $m = 0; $k < $len; $k++, $m++) {
                        $nc = $text[$k];
                        if ($nc >= 'a' && $nc <= 'z' || $nc >= 'A' && $nc <= 'Z') {
                            // A letter after the parameter starts plain text.
                            if ($param === null)
                                $word .= $nc;
                            else
                                break;
                        } elseif ($nc >= '0' && $nc <= '9')
                            $param .= $nc;
                        elseif ($nc == '-') {
                            // A minus sign may only open the parameter.
                            if ($param === null)
                                $param .= $nc;
                            else
                                break;
                        } else
                            break;
                    }
                    // A single space after a control word is a delimiter, not text.
                    if ($k < $len && $text[$k] == ' ')
                        $m++;
                    $i += $m - 1;

                    $toText = "";
                    switch (strtolower($word)) {
                        case "u":
                            // \uN is a signed 16-bit number: negative values wrap around.
                            $code = (int)$param;
                            if ($code < 0) $code += 65536;
                            $toText .= html_entity_decode("&#x" . dechex($code) . ";");
                            // Skip the fallback characters (\uc defaults to 1);
                            // a fallback written as \'hh takes four bytes.
                            $ucDelta = isset($stack[$j]["uc"]) ? (int)$stack[$j]["uc"] : 1;
                            for ($n = 0; $n < $ucDelta; $n++)
                                $i += (substr($text, $i + 2, 2) == "\\'") ? 4 : 1;
                            break;
                        case "par": case "page": case "column": case "line": case "lbr":
                            $toText .= "\n";
                            break;
                        case "emspace": case "enspace": case "qmspace":
                            $toText .= " ";
                            break;
                        case "tab": $toText .= "\t"; break;
                        case "chdate": $toText .= date("m.d.Y"); break;
                        case "chdpl": $toText .= date("l, j F Y"); break;
                        case "chdpa": $toText .= date("D, j M Y"); break;
                        case "chtime": $toText .= date("H:i:s"); break;
                        case "emdash": $toText .= html_entity_decode("&mdash;"); break;
                        case "endash": $toText .= html_entity_decode("&ndash;"); break;
                        case "bullet": $toText .= html_entity_decode("&bull;"); break;
                        case "lquote": $toText .= html_entity_decode("&lsquo;"); break;
                        case "rquote": $toText .= html_entity_decode("&rsquo;"); break;
                        case "ldblquote": $toText .= html_entity_decode("&laquo;"); break;
                        case "rdblquote": $toText .= html_entity_decode("&raquo;"); break;
                        default:
                            // Any other word is remembered in the current layer.
                            $stack[$j][strtolower($word)] = ($param === null) ? true : $param;
                            break;
                    }
                    if (rtf_isPlainText($stack[$j]))
                        $document .= $toText;
                }
                $i++;
                break;
            case "{":
                // A new group inherits the state of the enclosing one.
                $stack[] = $stack[$j];
                $j++;
                break;
            case "}":
                // Leaving the group: drop its layer, but never the base one.
                if ($j > 0) {
                    array_pop($stack);
                    $j--;
                }
                break;
            // Raw line breaks and NUL bytes in the file are not text.
            case "\0": case "\r": case "\f": case "\n": break;
            default:
                if (rtf_isPlainText($stack[$j]))
                    $document .= $c;
                break;
        }
    }
    return $document;
}

You can get the code with comments on GitHub.

Conclusion

What do we have in the end? This code copes with most rtf files, but there are several ways to improve it. First, more non-text data should be cut off: I only cut fonts, the color palette, the theme, binary data and everything marked "ignore me if you cannot read me" (\*). Second, the encoding and code page should be parsed so that sequences like \'hh are decoded more accurately.

What's next? Next I would like to look at e-book formats such as fb2, epub and the like. Here I would like to ask readers for help: first, which other e-book formats are worth a look, and second, where can I find more files in the formats you name? Thanks in advance :)

Source: https://habr.com/ru/post/70119/
