
Do not forget about language and cultural features

Sooner or later, everyone who writes programs runs into problems caused by linguistic and cultural diversity. I was very surprised to learn that some of my friends who write in C++ solve these problems by reinventing the wheel. For those who do not yet know what std::locale is, I would like to show briefly, by example, how to work with it and what happens if you forget about it.

std::locale is an object that lets a program take the users' cultural and linguistic conventions into account. In essence, it is a container of special classes, called facets, which the program consults whenever it needs to perform an action that depends on natural language. The program delegates such actions to the locale's facets. Any custom facet can be added to a locale, but the most interesting are the standard ones, since they are present in every locale and can be replaced permanently or temporarily.
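
To make the idea of "a container of facets" concrete, here is a minimal sketch that looks up two standard facets with std::use_facet and asks them locale-dependent questions. The helper name describe_locale is an assumption made for illustration; it is not part of any library.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: queries two standard facets of the given locale.
std::string describe_locale(const std::locale& loc)
{
    // ctype knows how characters are classified and how case changes.
    const std::ctype<char>& ct = std::use_facet<std::ctype<char>>(loc);
    // numpunct knows how numbers are punctuated.
    const std::numpunct<char>& np = std::use_facet<std::numpunct<char>>(loc);

    std::string out = "toupper('a') = ";
    out += ct.toupper('a');
    out += ", decimal point = ";
    out += np.decimal_point();
    return out;
}
```

For std::locale::classic() the result is "toupper('a') = A, decimal point = ."; other locales may answer differently, which is exactly the point of routing such questions through facets.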

In reality we use facets all the time, often without knowing it. The standard library uses the locale for I/O, boost::regex uses it for case conversion, and so on. Locales are provided by the platform. Users of *nix systems are familiar with strings such as "ru_RU.UTF-8" and "en_US.UTF-8"; these are the platform's locale names. A program can work with the user's locale; if the user has not specified one, the “classic” locale is used.
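
The difference between the classic locale and the user's locale can be inspected with a few lines of standard C++. This is only a sketch: the function names are made up for the example, and the name reported for the user's locale depends entirely on the platform and its environment.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Name of the current global locale. At program start this is the
// classic locale, whose name is "C".
std::string global_locale_name()
{
    return std::locale().name();
}

// Name of the user's preferred locale, requested with the empty string.
// Construction can throw if the environment names an unsupported locale.
std::string user_locale_name()
{
    try {
        return std::locale("").name();
    } catch (const std::runtime_error&) {
        return "(unavailable)";
    }
}
```

Until the program calls std::locale::global, global_locale_name() returns "C", no matter what user_locale_name() reports.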

Example of using a locale and overriding a facet


Let's look at an example in which we try out the technique of replacing a standard facet of a locale. Stream I/O is the usual illustration, but I would rather focus on what can happen when you write locale-dependent code without knowing what a locale is. We will use locales with the widely used boost::xpressive library (boost::regex would work as well, but for those who are hearing about xpressive for the first time it is worth reading up on):

    #include <boost/xpressive/xpressive.hpp>
    #include <iostream>
    #include <string>

    using namespace std;
    using namespace boost::xpressive;

    int main()
    {
        // "мир" is Russian for "world". The source file is assumed to be
        // saved in windows-1251, so both literals contain bytes above 127.
        sregex xpr = sregex::compile("мир", regex_constants::icase);
        smatch match;
        string str("ПРИВЕТ МИР!");
        if (regex_search(str, match, xpr))
            cout << "icase ok" << endl;
        else
            cout << "icase fail" << endl;
        return 0;
    }

Some will be surprised that the program's output strongly depends on the platform. Moreover, the program may produce different results on one and the same platform. It is all about the locale. If we assume that the example file is encoded in windows-1251, then the “icase fail” result appears on a platform where the user's locale has an encoding other than cp1251. The most common example of such a platform is mingw (downloaded as a binary from sourceforge) on Windows. In this case the boost::xpressive algorithms simply do not know which characters in the extended part of the cp1251 code table are letters, and the ctype facet of the classic locale is to blame. Once xpressive is given a locale with the correct ctype facet, we get the desired result. In the simplest case, if the required locale is installed in the system, we just need to make it global:

    // set the global locale
    std::locale cp1251_locale("ru_RU.CP1251");
    std::locale::global(cp1251_locale);

or pass it to the regex compiler:

    std::locale cp1251_locale("ru_RU.CP1251");
    sregex_compiler compiler;
    // tell the regex compiler which locale to use
    compiler.imbue(cp1251_locale);
    sregex xpr = compiler.compile("мир", regex_constants::icase);
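
Why the classic locale is to blame can be checked directly, without xpressive. The sketch below asks the classic ctype facet about a byte from the extended half of CP1251; the two function names are invented for the example.

```cpp
#include <locale>

// Does the classic locale consider this char a letter?
bool classic_is_alpha(char c)
{
    return std::isalpha(c, std::locale::classic());
}

// Upper-case conversion according to the classic locale.
char classic_toupper(char c)
{
    return std::toupper(c, std::locale::classic());
}
```

In CP1251 the byte 0xEC is the Cyrillic letter 'м' and 0xCC is 'М'. The classic locale only knows ASCII, so classic_is_alpha(static_cast<char>(0xEC)) is false and classic_toupper leaves that byte unchanged, while 'm' is handled as expected. A case-insensitive match built on these answers cannot succeed for Cyrillic text.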


All would be well, but on a platform that does not support the ru_RU.CP1251 locale, constructing it throws an exception. At best the name is merely spelled differently there; at worst the required locale is not present in the system at all. We will solve this problem by implementing our own ctype facet (it is the facet that tells xpressive which characters are letters and how their case changes).
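
Before turning to the facet, here is a sketch of how that failure surfaces in code. std::locale's constructor reports an unknown name by throwing std::runtime_error, so a program can at least fall back to another locale. The helper named_locale_or is hypothetical and exists only for this illustration.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: try to construct a named locale and return the
// supplied fallback if the platform does not know that name.
std::locale named_locale_or(const char* name, const std::locale& fallback)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return fallback;
    }
}
```

A call such as named_locale_or("ru_RU.CP1251", std::locale::classic()) never throws, but on a platform without that locale it silently yields the classic one, so the case-insensitive search still fails. That is why a custom facet is the more robust fix.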

A very simple implementation of a ctype facet for the CP1251 encoding, together with an example of its use:

    #include <boost/xpressive/xpressive.hpp>
    #include <cstddef>
    #include <iostream>
    #include <locale>
    #include <string>

    using namespace std;
    using namespace boost::xpressive;

    /** @brief A deliberately simplified ctype facet for working correctly
     *  with the CP1251 encoding. */
    class ctype_cp1251 : public ctype<char>
    {
    public:
        /** @brief mask (from ctype_base) is a bitmask of character classes:
         *  alpha, digit, ... The constants alpha, lower, upper and punct
         *  used below are inherited from ctype_base. */
        typedef ctype_base::mask mask;

        /** @brief Main constructor. r controls the lifetime of the facet
         *  (with r == 0 the locale owns the facet and deletes it).
         *  For details, see Stroustrup's book. */
        explicit ctype_cp1251(size_t r = 0)
            : ctype<char>(0, false, r)
        {
            // Initialize the mask table. The index is the negated value of
            // a signed char, i.e. ext_tab[1] is the mask for char(-1), 'я'.
            ext_tab[0] = 0;
            for (size_t i = 1; i <= 32; ++i)
                ext_tab[i] = alpha | lower;   // bytes 0xE0..0xFF
            for (size_t i = 33; i <= 64; ++i)
                ext_tab[i] = alpha | upper;   // bytes 0xC0..0xDF
            // ... the rest of the characters are not interesting here
            for (size_t i = 65; i <= 128; ++i)
                ext_tab[i] = punct;
        }

        ~ctype_cp1251()
        { }

    protected:
        /** @brief Answers whether the character c matches the mask m.
         *  Note: ctype<char>::is is not virtual, so this function is used
         *  only inside this class; callers that classify characters through
         *  the locale still consult the base-class table. */
        bool is(mask m, char c) const
        {
            const unsigned char uc = static_cast<unsigned char>(c);
            if (uc < 128)
                return ctype<char>::is(m, c);
            // 256 - uc equals -c for a signed char: 0xFF -> 1, 0x80 -> 128
            return (ext_tab[256 - uc] & m) != 0;
        }

        /** @brief Converts the character c to upper case */
        virtual char do_toupper(char c) const
        {
            if (static_cast<unsigned char>(c) < 128)
                return ctype<char>::do_toupper(c);
            if (is(lower, c))
                return static_cast<char>(c - 32);
            return c;
        }

        /** @brief Converts the character c to lower case */
        virtual char do_tolower(char c) const
        {
            if (static_cast<unsigned char>(c) < 128)
                return ctype<char>::do_tolower(c);
            if (is(upper, c))
                return static_cast<char>(c + 32);
            return c;
        }

        // To keep the example simple, the remaining virtual functions
        // are not overridden.

    private:
        // copying is forbidden
        ctype_cp1251(const ctype_cp1251&);
        const ctype_cp1251& operator=(const ctype_cp1251&);

        mask ext_tab[129]; ///< masks for the extended part of the CP1251 table
    };

    int main()
    {
        // Create an instance of the facet.
        ctype<char>* ctype_cp1251_facet = new ctype_cp1251();
        // Create a new locale based on the current one, using the facet
        // defined above. You could also make a locale with this facet
        // global; then all classes and functions would use it.
        locale cp1251_locale(locale(""), ctype_cp1251_facet);
        // Create a regex compiler that uses this specific locale.
        sregex_compiler compiler;
        compiler.imbue(cp1251_locale);
        // Cyrillic literals ("мир" is Russian for "world"); the source
        // file must be saved in CP1251.
        sregex xpr = compiler.compile("мир", regex_constants::icase);
        smatch match;
        string str("ПРИВЕТ МИР!");
        if (regex_search(str, match, xpr))
            cout << "icase ok" << endl;
        else
            cout << "icase fail" << endl;
        return 0;
    }


Now the result of the program no longer depends on the specific platform. By overriding the standard facets or adding new ones, you can control how an algorithm or program behaves depending on the cultural and linguistic conventions of its users.
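
One detail about the facet above is worth spelling out. In std::ctype<char>, the case-conversion functions do_toupper and do_tolower are virtual, but classification through is() is not: it reads a table of masks handed to the base-class constructor. The sketch below is therefore a variant, not the article's original method, that also supplies such a table, so that functions like std::isalpha see the Cyrillic letters too. The class name cp1251_table_ctype is hypothetical, and only the two ranges 0xC0..0xDF and 0xE0..0xFF are handled.

```cpp
#include <cstddef>
#include <locale>
#include <vector>

// Hypothetical facet: classification via a custom mask table,
// case conversion via the virtual do_* functions.
class cp1251_table_ctype : public std::ctype<char>
{
public:
    explicit cp1251_table_ctype(std::size_t refs = 0)
        : std::ctype<char>(make_table(), false, refs) {}

protected:
    char do_toupper(char c) const override
    {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (uc >= 0xE0)                          // lower-case Cyrillic range
            return static_cast<char>(uc - 0x20);
        return uc < 0x80 ? std::ctype<char>::do_toupper(c) : c;
    }

    char do_tolower(char c) const override
    {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (uc >= 0xC0 && uc <= 0xDF)            // upper-case Cyrillic range
            return static_cast<char>(uc + 0x20);
        return uc < 0x80 ? std::ctype<char>::do_tolower(c) : c;
    }

private:
    // Built once: a copy of the classic table plus letter masks for the
    // two Cyrillic ranges of CP1251.
    static const mask* make_table()
    {
        static const std::vector<mask> table = [] {
            std::vector<mask> t(classic_table(), classic_table() + table_size);
            for (int b = 0xC0; b <= 0xDF; ++b)
                t[b] = static_cast<mask>(alpha | upper);
            for (int b = 0xE0; b <= 0xFF; ++b)
                t[b] = static_cast<mask>(alpha | lower);
            return t;
        }();
        return table.data();
    }
};
```

A locale built as std::locale(std::locale::classic(), new cp1251_table_ctype) then answers both kinds of question: std::isalpha reports the byte 0xEC as a letter, and std::toupper maps it to 0xCC.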

A complete description of the std::locale class and of the techniques for working with facets can be found in the appendix of the 3rd (special) edition of Bjarne Stroustrup's "The C++ Programming Language". For the structure of the individual facets, any STL reference will do.

Encoding conversion is handled by implementing the codecvt facet. If there is interest, I will cover it in the next article.

______________________
The text was prepared in the Blog Editor from © SoftCoder.ru

Source: https://habr.com/ru/post/104417/

