Do not forget about language and cultural features

Sooner or later, everyone who writes programs runs into problems caused by linguistic and cultural diversity. I was very surprised to learn that some of my friends who write in C++ solve these problems by reinventing the wheel. For those who do not yet know what std::locale is, I would like to show briefly, with an example, how to work with it and what happens if you forget about it.

std::locale is an object that lets a program take the cultural and linguistic conventions of its users into account. In essence, it is a container of special classes, called facets, to which the program turns whenever it needs to perform an action that depends on natural language. The program delegates such actions to the facets of the locale. Any custom facet can be added to a locale, but the most interesting are the standard ones, since they are present in every locale and can be replaced permanently or temporarily:

collate (string comparison)
numeric (number input/output)
monetary (money input/output)
time (time input/output)
ctype (character classification and case conversion)
messages (message selection)
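As a rough illustration of how a program turns to a facet, the sketch below asks a locale for two of the standard facets listed above through std::use_facet. The helper names upper_in and decimal_point_of are made up for this example and are not part of any library.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: upper-cases a string with the ctype facet of loc.
std::string upper_in(const std::locale& loc, std::string s)
{
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    if (!s.empty())
        ct.toupper(&s[0], &s[0] + s.size());  // converts the range in place
    return s;
}

// Hypothetical helper: returns the decimal separator that the numeric
// punctuation facet of loc prescribes.
char decimal_point_of(const std::locale& loc)
{
    return std::use_facet<std::numpunct<char> >(loc).decimal_point();
}
```

Called with std::locale::classic(), upper_in only knows the ASCII letters, which is exactly the limitation discussed later in the article.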

In reality, we use facets all the time, often without knowing it: the standard library uses the locale for stream I/O, boost::regex uses it for case conversion, and so on. Locales are provided by the platform. Users of *nix systems are familiar with strings such as "ru_RU.UTF-8" and "en_US.UTF-8"; these are the platform's names for its locales. A program works with the user's locale when it asks for it (std::locale("")). If no locale is set, the "classic" one (std::locale::classic(), the "C" locale) is used.
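To make the point about stream I/O concrete, here is a minimal sketch in which only the numeric punctuation facet of the classic locale is replaced, and only for one stream. The names comma_numpunct, comma_locale and format_with are assumptions made up for this example.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: numeric punctuation with ',' as the decimal separator.
class comma_numpunct : public std::numpunct<char>
{
protected:
    virtual char do_decimal_point() const { return ','; }
};

// A copy of the classic locale in which only numpunct is replaced.
// The locale takes ownership of the facet and deletes it when it is
// no longer referenced.
std::locale comma_locale()
{
    return std::locale(std::locale::classic(), new comma_numpunct);
}

// Formats x through a stream imbued with loc; the stream's number
// formatting consults the facets of that locale.
std::string format_with(const std::locale& loc, double x)
{
    std::ostringstream out;
    out.imbue(loc);
    out << x;
    return out.str();
}
```

The replacement is temporary in the sense that it affects only streams imbued with this locale; the global locale and all other streams stay as they were.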

Example of using a locale and overriding a facet

Consider an example in which we try out the technique of replacing a standard facet of a locale. Stream I/O is the usual illustration, but I would like to focus on what can happen when you write locale-dependent code without knowing that it is locale-dependent. Let's use locales with the popular boost::xpressive library (boost::regex would do as well, but those who are hearing about xpressive for the first time may find it useful to read about it):

#include <boost/xpressive/xpressive.hpp>
#include <iostream>
#include <string>
using namespace std;
using namespace boost::xpressive;

int main(int argc, char* argv[])
{
    // Cyrillic literals; the source file is saved in windows-1251
    sregex xpr = sregex::compile("мир", regex_constants::icase);
    smatch match;
    string str("ПРИВЕТ МИР!");
    if (regex_search(str, match, xpr))
        cout << "icase ok" << endl;
    else
        cout << "icase fail" << endl;
    return 0;
}

Some will be surprised that the output of the program depends heavily on the platform. Moreover, the program may produce different results on the same platform. It is all about the locale. If we assume that the example file is encoded in windows-1251, then the "icase fail" result is obtained on a platform where the encoding of the user locale differs from cp1251. The most common example of such a platform is MinGW (downloaded as a binary from SourceForge) on Windows. In this case the boost::xpressive algorithms simply do not know which characters in the extended part of the cp1251 code table are letters, and the ctype facet of the classic locale is to blame. Once xpressive is given a locale with the correct ctype facet, we get the desired result. In the simplest case, if the required locale is installed in the system, we just need to make it global:

// set the global locale
std::locale cp1251_locale("en_US.CP1251");
std::locale::global(cp1251_locale);

or pass it to the regex compiler:

std::locale cp1251_locale("en_US.CP1251");
sregex_compiler compiler;
// tell the regex compiler which locale to use
compiler.imbue(cp1251_locale);
// Cyrillic pattern; the source file is saved in windows-1251
sregex xpr = compiler.compile("мир", regex_constants::icase);

Everything would be fine, but on a platform where the requested locale (ru_RU.CP1251, for example) is not supported, our code will throw an exception. At best the name is merely specified incorrectly; at worst the required locale is not in the system at all. We will solve this problem by implementing our own ctype facet (it is the facet that tells xpressive which characters are letters and how their case changes).
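The exception in question is std::runtime_error, thrown by the std::locale constructor when the name is unknown to the platform. A minimal sketch of guarding against it, assuming a fallback to the classic locale is acceptable (the helper name locale_or_classic is made up for this example):

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the locale with the given platform name or,
// if the platform does not provide it, the classic "C" locale.
std::locale locale_or_classic(const std::string& name)
{
    try {
        return std::locale(name.c_str());
    } catch (const std::runtime_error&) {
        // unknown or uninstalled locale name
        return std::locale::classic();
    }
}
```

Such a fallback keeps the program running, but it does not make the case-insensitive search work: the classic locale still knows nothing about Cyrillic letters, which is why a custom facet is needed.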

The simplest example of implementing a ctype facet, here for the CP1251 encoding:

#include <boost/xpressive/xpressive.hpp>
#include <cstddef>
#include <iostream>
#include <locale>
#include <string>
using namespace std;
using namespace boost::xpressive;

/** @brief A very simplified example of a ctype facet for working correctly
 *  with the CP1251 encoding */
class ctype_cp1251 : public ctype<char>
{
public:
    /** @brief mask in ctype_base is a bitmask of all possible character
     *  types: alpha, digit, ... */
    typedef ctype_base::mask mask;
    // the constants alpha, lower, upper, punct and the other masks are
    // inherited from ctype_base, so they are used below without qualification

    /** @brief Main constructor. r controls the lifetime of the facet.
     *  For details, see Stroustrup's book. */
    explicit ctype_cp1251(size_t r = 0)
        : ctype<char>(0, false, r)
    {
        // initialize the table of masks. The index is the negated value of
        // a negative (signed) char.
        // That is, ext_tab[1] is the mask for the character char(-1), 'я'
        ext_tab[0] = 0;
        for (size_t i = 1; i <= 32; ++i)
            ext_tab[i] = static_cast<mask>(alpha | lower);  // 0xE0..0xFF
        for (size_t i = 33; i <= 64; ++i)
            ext_tab[i] = static_cast<mask>(alpha | upper);  // 0xC0..0xDF
        // ... the rest of the characters are not interesting in this example
        for (size_t i = 65; i <= 128; ++i)
            ext_tab[i] = punct;
    }
    ~ctype_cp1251()
    { }

protected:
    /** @brief Answers the question whether the character c matches the mask m.
     *  Note that ctype<char>::is is not virtual (it reads a lookup table),
     *  so this function does not replace it; it serves as a helper for
     *  the case conversions below. */
    bool is(mask m, char c) const
    {
        // go through unsigned char so that the code does not depend on
        // whether plain char is signed
        const unsigned char uc = static_cast<unsigned char>(c);
        if (uc <= 127)
            return ctype<char>::is(m, c);
        // 256 - uc equals -c for a signed char
        return (ext_tab[256 - uc] & m) != 0;
    }
    /** @brief Converts the character c to upper case */
    virtual char do_toupper(char c) const
    {
        if (static_cast<unsigned char>(c) <= 127)
            return ctype<char>::do_toupper(c);
        else if (is(lower, c))
            return static_cast<char>(c - 32);
        return c;
    }
    /** @brief Converts the character c to lower case */
    virtual char do_tolower(char c) const
    {
        if (static_cast<unsigned char>(c) <= 127)
            return ctype<char>::do_tolower(c);
        else if (is(upper, c))
            return static_cast<char>(c + 32);
        return c;
    }
    // to keep the example simple, we do not override the rest of
    // the virtual functions

private:
    // copying is prohibited
    ctype_cp1251(const ctype_cp1251&);
    const ctype_cp1251& operator=(const ctype_cp1251&);

    mask ext_tab[129]; ///< masks of the extended part of the CP1251 code table
};

int main(int argc, char* argv[])
{
    // create an instance of the facet
    ctype<char>* ctype_cp1251_facet = new ctype_cp1251();
    // Create a new locale based on the current one, using the facet
    // defined above. You can also make a locale with this facet
    // global, then all classes and functions will use it.
    locale cp1251_locale(locale(""), ctype_cp1251_facet);
    // create a regex compiler with a specific locale
    sregex_compiler compiler;
    compiler.imbue(cp1251_locale);
    // Cyrillic literals; the source file is saved in windows-1251
    sregex xpr = compiler.compile("мир", regex_constants::icase);
    smatch match;
    string str("ПРИВЕТ МИР!");
    if (regex_search(str, match, xpr))
        cout << "icase ok" << endl;
    else
        cout << "icase fail" << endl;
    return 0;
}

Now the result of the program does not depend on the specific platform. By overriding the standard facets or adding new ones, you can control how an algorithm or a whole program behaves depending on the cultural and linguistic conventions of its users.
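One detail worth knowing when overriding ctype for char: std::ctype<char> answers is() from a lookup table handed to its constructor rather than through a virtual function, while do_toupper and do_tolower are virtual. A variant that also supplies such a table might look like the sketch below. The class name ctype_cp1251_table and the helper make_table are assumptions made up for this example, and only the ranges А..Я and а..я are filled in.

```cpp
#include <cstddef>
#include <locale>

// Hypothetical table-based variant of the facet (simplified: only А..я).
class ctype_cp1251_table : public std::ctype<char>
{
public:
    explicit ctype_cp1251_table(std::size_t refs = 0)
        : std::ctype<char>(make_table(), false, refs) {}

protected:
    virtual char do_toupper(char c) const
    {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (uc >= 0xE0)                        // а..я -> А..Я
            return static_cast<char>(uc - 32);
        return std::ctype<char>::do_toupper(c);
    }
    virtual char do_tolower(char c) const
    {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (uc >= 0xC0 && uc <= 0xDF)          // А..Я -> а..я
            return static_cast<char>(uc + 32);
        return std::ctype<char>::do_tolower(c);
    }

private:
    // Builds the mask table: the ASCII half is copied from the classic
    // table, 0xC0..0xDF are upper-case letters, 0xE0..0xFF are lower-case
    // letters, everything else in the upper half is left unclassified.
    static const mask* make_table()
    {
        static mask table[table_size];
        const mask* classic = classic_table();
        for (std::size_t i = 0; i < 128; ++i)
            table[i] = classic[i];
        for (std::size_t i = 0xC0; i <= 0xDF; ++i)
            table[i] = static_cast<mask>(alpha | upper | print);
        for (std::size_t i = 0xE0; i <= 0xFF; ++i)
            table[i] = static_cast<mask>(alpha | lower | print);
        return table;
    }
};
```

A locale built as std::locale(std::locale::classic(), new ctype_cp1251_table) can then be queried through std::use_facet<std::ctype<char> >, and both classification and case conversion see the Cyrillic letters.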

A complete description of the std::locale class and of techniques for working with facets can be found in the appendix of the 3rd (special) edition of Bjarne Stroustrup's "The C++ Programming Language". For the structure of the individual facets, any standard library reference will do.

Encoding conversion is handled by implementing the codecvt facet. If there is interest, I will cover it in the next article.

______________________

Source: https://habr.com/ru/post/104417/

