Control words begin with a backslash (\), and a group is wrapped in curly braces ({ and }).

A control word consists of a sequence of Latin letters (a to z) and can be followed by a numeric parameter (possibly negative). Alternatively, a word may consist of a single non-alphanumeric ASCII symbol. Anything that does not fall under these rules is not part of the control word. Thus, the sequence \rtf1\ansi\ansicpg1251 is easily divided into three words: rtf with parameter 1 (the major format version), ansi (the current encoding) and ansicpg with parameter 1251 (the current code page, i.e. Windows-1251).

A control word stays in effect until its group is closed or until it is switched off explicitly, so the two fragments This is {\b bold} text and This is \b bold \b0 text are equivalent: both give "This is bold text".
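The control-word rules above can be sketched as a small tokenizer. This is a minimal illustration, not part of the original code: the function name rtf_split_control_words is hypothetical, and the sketch only handles letter-based control words, ignoring control symbols such as \~ or \'hh.

```php
<?php
// Hypothetical helper (not from the article): split a run of RTF control
// words into [word, parameter] pairs using the rules described above.
function rtf_split_control_words(string $rtf): array
{
    $words = [];
    // Backslash, then letters, then an optional (possibly negative) number,
    // then at most one space that delimits the control word.
    preg_match_all('/\\\\([a-zA-Z]+)(-?\d+)? ?/', $rtf, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // Group 2 is absent when the word has no parameter.
        $param = (isset($m[2]) && $m[2] !== '') ? (int) $m[2] : null;
        $words[] = [$m[1], $param];
    }
    return $words;
}

print_r(rtf_split_control_words('\rtf1\ansi\ansicpg1251'));
```

For the sequence from the text this yields rtf with parameter 1, ansi without a parameter, and ansicpg with parameter 1251.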
Characters outside ASCII can be written in the current code page (\ansicpg), of course. For this, RTF has a sequence of the form \'hh, where hh is the two-digit hexadecimal code of the character, that is, a byte value in the current code page.

For Unicode characters, the keyword \uABCD with the numeric parameter ABCD is included in the format. ABCD here is the Unicode code of the character in decimal. Everything is simple again, as you can see.

There is one more keyword, \ucN, which is closely related to Unicode. The RTF format very zealously maintains compatibility with old devices on which the file may have to be opened. Such a device (well, for example, a computer with Windows 3.11 :) may not be able to read Unicode, so what should it do? For this case, after each Unicode character encoded with the \u keyword you can specify zero or more characters that should be displayed if the RTF viewer is unable to display or parse the Unicode data (according to the documentation, a viewer that cannot display data correctly should skip it). For example: Lab\u915GValue. This raises the question of how many fallback characters follow a Unicode character, and so how many a Unicode-aware reader has to skip. Again, nothing difficult: the parameter N of the \ucN keyword mentioned above provides exactly this value. That is, somewhere before the Unicode data there is something like \uc1, which tells us to skip one character after each Unicode character.

You can get the code with comments on GitHub.
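Before the full listing, the \u and \ucN handling described above can be tried in isolation. This is a minimal sketch, assuming a flat fragment without groups. The helper name rtf_decode_unicode is hypothetical. The default of one skipped character and the signed 16-bit range of the parameter are taken from the RTF specification rather than from this article, and the fallback text is counted in raw bytes, so a fallback written as \'hh is not handled.

```php
<?php
// Hypothetical helper (not from the article): decode \uN escapes in a flat
// RTF fragment and skip the fallback characters announced by \ucN.
function rtf_decode_unicode(string $rtf): string
{
    $out = '';
    $skip = 1; // assumed default when no \ucN has been seen
    $len = strlen($rtf);
    for ($i = 0; $i < $len; $i++) {
        if (preg_match('/\G\\\\uc(\d+) ?/', $rtf, $m, 0, $i)) {
            // \ucN: remember how many fallback characters follow each \u
            $skip = (int) $m[1];
            $i += strlen($m[0]) - 1;
        } elseif (preg_match('/\G\\\\u(-?\d+) ?/', $rtf, $m, 0, $i)) {
            $code = (int) $m[1];
            if ($code < 0) {
                $code += 65536; // parameters are signed 16-bit values
            }
            // Same trick as in the article's listing: let PHP turn a numeric
            // entity into a UTF-8 character.
            $out .= html_entity_decode('&#' . $code . ';', ENT_QUOTES | ENT_HTML5, 'UTF-8');
            $i += strlen($m[0]) - 1 + $skip; // step over the fallback text
        } else {
            $out .= $rtf[$i];
        }
    }
    return $out;
}

echo rtf_decode_unicode('\uc1 Lab\u915GValue'), "\n";
```

With \uc1 in effect, the G after \u915 is treated as fallback text and dropped, while the Greek capital gamma (code 915) is emitted in its place.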
    // True while text may be emitted, i.e. the current group is not a
    // service group (\* destinations, font table, color table and so on).
    function rtf_isPlainText($s) {
        $failAt = array("*", "fonttbl", "colortbl", "datastore", "themedata");
        for ($i = 0; $i < count($failAt); $i++)
            if (!empty($s[$failAt[$i]])) return false;
        return true;
    }

    function rtf2text($filename) {
        $text = file_get_contents($filename);
        if ($text === false || !strlen($text))
            return "";

        $document = "";
        $stack = array();   // one array of seen control words per open group
        $j = -1;            // index of the current group in $stack
        for ($i = 0; $i < strlen($text); $i++) {
            $c = $text[$i];
            switch ($c) {
                case "\\":
                    $nc = $text[$i + 1];
                    // Control symbols: \\ (backslash), \~ (non-breaking space),
                    // \_ (non-breaking hyphen), \* (ignorable destination).
                    if ($nc == '\\' && rtf_isPlainText($stack[$j])) $document .= '\\';
                    elseif ($nc == '~' && rtf_isPlainText($stack[$j])) $document .= ' ';
                    elseif ($nc == '_' && rtf_isPlainText($stack[$j])) $document .= '-';
                    elseif ($nc == '*') $stack[$j]["*"] = true;
                    elseif ($nc == "'") {
                        // \'hh: the byte is used as a character code as is, without
                        // taking the code page into account.
                        $hex = substr($text, $i + 2, 2);
                        if (rtf_isPlainText($stack[$j]))
                            $document .= html_entity_decode("&#" . hexdec($hex) . ";");
                        $i += 2;
                    } elseif ($nc >= 'a' && $nc <= 'z' || $nc >= 'A' && $nc <= 'Z') {
                        // Control word: letters, then an optional signed number.
                        $word = "";
                        $param = null;
                        for ($k = $i + 1, $m = 0; $k < strlen($text); $k++, $m++) {
                            $nc = $text[$k];
                            if ($nc >= 'a' && $nc <= 'z' || $nc >= 'A' && $nc <= 'Z') {
                                // "=== null" rather than empty(), so that a "0"
                                // parameter also ends the word.
                                if ($param === null)
                                    $word .= $nc;
                                else
                                    break;
                            } elseif ($nc >= '0' && $nc <= '9')
                                $param .= $nc;
                            elseif ($nc == '-') {
                                if ($param === null)
                                    $param .= $nc;
                                else
                                    break;
                            } else
                                break;
                        }
                        // A single space after a control word is its delimiter,
                        // not document text.
                        if ($k < strlen($text) && $text[$k] == ' ')
                            $m++;
                        $i += $m - 1;

                        $toText = "";
                        switch (strtolower($word)) {
                            case "u":
                                // \uN: N is a decimal, signed 16-bit character code.
                                $code = (int) $param;
                                if ($code < 0) $code += 65536;
                                $toText .= html_entity_decode("&#x" . dechex($code) . ";");
                                // Skip the fallback characters announced by \ucN
                                // (the specification's default is 1).
                                $ucDelta = (int) ($stack[$j]["uc"] ?? 1);
                                if ($ucDelta > 0)
                                    $i += $ucDelta;
                                break;
                            case "par": case "page": case "column": case "line": case "lbr":
                                $toText .= "\n";
                                break;
                            case "emspace": case "enspace": case "qmspace":
                                $toText .= " ";
                                break;
                            case "tab": $toText .= "\t"; break;
                            case "chdate": $toText .= date("mdY"); break;
                            case "chdpl": $toText .= date("l, j F Y"); break;
                            case "chdpa": $toText .= date("D, j M Y"); break;
                            case "chtime": $toText .= date("H:i:s"); break;
                            case "emdash": $toText .= html_entity_decode("&mdash;"); break;
                            case "endash": $toText .= html_entity_decode("&ndash;"); break;
                            // &bull; instead of &#149;, which current PHP leaves undecoded.
                            case "bullet": $toText .= html_entity_decode("&bull;"); break;
                            case "lquote": $toText .= html_entity_decode("&lsquo;"); break;
                            case "rquote": $toText .= html_entity_decode("&rsquo;"); break;
                            case "ldblquote": $toText .= html_entity_decode("&laquo;"); break;
                            case "rdblquote": $toText .= html_entity_decode("&raquo;"); break;
                            default:
                                // Any other word is remembered in the current group's state.
                                $stack[$j][strtolower($word)] = ($param === null) ? true : $param;
                                break;
                        }
                        if (rtf_isPlainText($stack[$j]))
                            $document .= $toText;
                    }
                    $i++;
                    break;
                case "{":
                    // A new group starts with a copy of its parent's state.
                    array_push($stack, $j >= 0 ? $stack[$j] : array());
                    $j++;
                    break;
                case "}":
                    array_pop($stack);
                    $j--;
                    break;
                // Double quotes are required here: '\n' in single quotes is two characters.
                case "\0": case "\r": case "\f": case "\n": break;
                default:
                    if (rtf_isPlainText($stack[$j]))
                        $document .= $c;
                    break;
            }
        }
        return $document;
    }
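The group handling in the listing (copy the parent's state on {, drop it on }) can also be shown on its own. The following is a simplified sketch rather than the article's code: the function rtf_visible_text and the skip flag are hypothetical names, only \*, fonttbl and colortbl are treated as service groups, and control symbols other than \* are simply dropped.

```php
<?php
// Hypothetical sketch (not the article's code): keep a stack of per-group
// state so that anything set inside { ... } disappears at the closing brace.
function rtf_visible_text(string $rtf): string
{
    $out = '';
    $stack = [['skip' => false]];   // state of the implicit outermost level
    $len = strlen($rtf);
    for ($i = 0; $i < $len; $i++) {
        $top = count($stack) - 1;
        $c = $rtf[$i];
        if ($c === '{') {
            $stack[] = $stack[$top];          // a new group inherits its parent's state
        } elseif ($c === '}') {
            if ($top > 0) {
                array_pop($stack);            // leaving the group restores the parent's state
            }
        } elseif ($c === '\\') {
            if (preg_match('/\G\\\\(\*|[a-zA-Z]+)(-?\d+)? ?/', $rtf, $m, 0, $i)) {
                if ($m[1] === '*' || $m[1] === 'fonttbl' || $m[1] === 'colortbl') {
                    $stack[$top]['skip'] = true;   // service group: hide its content
                }
                $i += strlen($m[0]) - 1;      // jump to the end of the control word
            }
        } elseif (!$stack[$top]['skip']) {
            $out .= $c;
        }
    }
    return $out;
}

echo rtf_visible_text('{\rtf1{\fonttbl{\f0 Arial;}}This is {\b bold} text}'), "\n";
// prints "This is bold text"
```

Because the skip flag lives in the group's own stack entry, the font table and everything nested in it are hidden, and output resumes as soon as that group is closed.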
(\*). Secondly, it is worth parsing the encoding and the code page in order to decode sequences like \'hh more accurately.
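One possible way to take the code page into account is sketched below. It is an illustration under assumptions, not the author's implementation: the helper rtf_decode_hex_escapes is a hypothetical name, Windows-1252 is assumed when \ansicpg is absent, only single-byte code pages are handled, and the sketch relies on the iconv extension recognising names such as CP1251.

```php
<?php
// Hypothetical helper (not from the article): decode \'hh escapes using the
// code page announced by \ansicpgN.
function rtf_decode_hex_escapes(string $rtf): string
{
    $codepage = 1252;                       // assumed default when \ansicpg is absent
    if (preg_match('/\\\\ansicpg(\d+)/', $rtf, $m)) {
        $codepage = (int) $m[1];
    }
    return preg_replace_callback(
        "/\\\\'([0-9a-fA-F]{2})/",
        function (array $m) use ($codepage): string {
            $byte = chr(hexdec($m[1]));     // hh is one byte in the document's code page
            $utf8 = @iconv('CP' . $codepage, 'UTF-8', $byte);
            return $utf8 === false ? '?' : $utf8;   // unknown code page or byte
        },
        $rtf
    );
}

// Only the \'hh escapes are replaced; control words are left untouched.
echo rtf_decode_hex_escapes("\\ansicpg1251 \\'cf\\'f0\\'e8\\'e2\\'e5\\'f2"), "\n";
```

In a full parser the code page would be read once from the header and kept in the parser state rather than searched for on every call.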
Source: https://habr.com/ru/post/70119/