
The Stage for the Tower of Babel, or On Custom Data Types for Multilingual Applications

Engraving by M. Escher "Tower of Babel", 1928



Introduction


Perhaps you know from the very start of the project that your application will be multilingual. More likely, though, the news that internationalization is needed will catch you, as it once caught humanity, in the middle of building your Tower of Babel. Either way, it pays to have at hand a standard toolkit that gives you a chance of finishing the construction project of the century.


Four thousand years after the Tower of Babel, technology offers us some wonderful tools. What do we have?


First, there is the locale abstraction, a catch-all concept. A locale covers not only the language but also the script, the calendar, and the formatting rules for numbers, currency, dates, and so on.


Second, there is Unicode. Unicode is not just a character encoding table. It also covers different forms of the same letter, accented characters, character sort order, rules for case-insensitive comparison, string normalization algorithms, the UTF family of encodings, and much more.
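To make the point about normalization tangible, here is a minimal sketch (the class and method names are ours, chosen only for illustration). It shows two different code point sequences for the same accented letter, which compare as different until both are brought to the same normalization form.

```csharp
using System;
using System.Text;

public static class NormalizationSketch
{
    // Compares two strings after bringing both to Unicode normalization form C.
    public static bool SameAfterNormalization(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }

    public static void Main()
    {
        var composed = "\u00E9";     // the letter as a single code point
        var decomposed = "e\u0301";  // 'e' followed by a combining acute accent

        // An ordinal comparison looks at code units, so the two spellings differ.
        Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal));
        // After normalization the two spellings are identical.
        Console.WriteLine(SameAfterNormalization(composed, decomposed));
    }
}
```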


All of this is a great help. Such capabilities are, as a rule, already built into operating systems and exposed through standard libraries. Programmers and users all over the planet successfully use the same operating systems, development tools, and databases. But, alas, nothing in this world is perfect. If your application must serve users in many languages at the same time, then whoever you are (analyst, architect, or programmer), you have new needs.


Below we describe some of the needs that come up most often in enterprise applications, based on our company's experience. Code examples are in C#. The library's source code is published on GitHub and includes the data types discussed here, working implementations of them, and more. Although the material is somewhat .NET-specific, the concepts outlined for working with multilingual data should be useful to specialists on other platforms as well.


To begin with, we recommend reading the previous article on application internationalization.



Conditions of the problem


Imagine that our application must work with several languages at once. Depending on the user's environment, any of them may be used not only to display the user interface but also to enter operational and reference data. Moreover, within a single session, out of the several localized variants of the data, the application may need either the variant for one specific language or the localizations for all languages at once.


For example, consider a domain entity, the product, which has among its attributes an article number (code) and a name in several languages. We need to be able to describe the domain entity, display and enter product records through the user interface, and print price tags.



Multilingual strings


The first idea for the product name in the domain entity is a dictionary keyed by locale code.


    public class Product
    {
        public string Code { get; set; }
        public IDictionary<string, string> Name { get; set; }
    }

This option is appealingly simple, but it immediately violates the principles of public contract design, because IDictionary<string, string> has no clear semantics. The situation can be improved a little by renaming the Name attribute to MultilingualName and following that convention wherever multilingual semantics are required.


If you think about it, you will surely find cases where the same operation must be applied to the strings in several languages, or in all of them at once (for example, converting every letter of the name to upper case). What could be easier, it would seem?


    static IDictionary<string, string> ToUpper(IDictionary<string, string> source)
    {
        IDictionary<string, string> destination = new Dictionary<string, string>();
        foreach (var pair in source)
        {
            destination[pair.Key] = pair.Value.ToUpper();
        }
        return destination;
    }

Or, more briefly:


    // Requires: using System.Linq;
    static IDictionary<string, string> ToUpper(IDictionary<string, string> source)
    {
        return source.ToDictionary(p => p.Key, p => p.Value.ToUpper());
    }

However, a bug has already crept into the code, and a world-famous one at that: we have failed The Turkey Test.


The point is that changing the case of characters requires applying the rules of a specific language. If we do not specify one, the current locale (the regional settings) is used.
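To see the trap concretely, the following sketch upper-cases the same letter under two different cultures. The class and method names are ours, not part of the library, and the sketch assumes the runtime has full globalization data available (that is, it is not running in invariant-globalization mode).

```csharp
using System;
using System.Globalization;

public static class TurkeyTestDemo
{
    // Upper-cases text using the casing rules of the named culture.
    public static string UpperFor(string cultureName, string text)
    {
        return text.ToUpper(CultureInfo.GetCultureInfo(cultureName));
    }

    public static void Main()
    {
        // English rules: 'i' becomes the ordinary capital 'I' (U+0049).
        Console.WriteLine(UpperFor("en-US", "i") == "I");
        // Turkish rules: 'i' becomes the dotted capital I (U+0130).
        Console.WriteLine(UpperFor("tr-TR", "i") == "\u0130");
    }
}
```

Booleans are printed instead of the characters themselves so that the result does not depend on the console encoding.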


A note here: in .NET a locale is called a culture. Each thread has two cultures: CurrentCulture and CurrentUICulture. The first is used for formatting numbers, dates, and other regional settings; the second is used when looking up suitable localized resources such as strings, images, and user interface layouts.


Since we intentionally change strings for different locales, the correct code might look like this:


    static IDictionary<string, string> ToUpper(IDictionary<string, string> source)
    {
        IDictionary<string, string> destination = new Dictionary<string, string>();
        foreach (var pair in source)
        {
            // Use the casing rules of the locale the string belongs to.
            var culture = CultureInfo.GetCultureInfo(pair.Key);
            destination[pair.Key] = pair.Value.ToUpper(culture);
        }
        return destination;
    }


Draft new data type


These two considerations, the desire to follow good design practice and the high likelihood of errors in routine work with multilingual data, can and should encourage us to introduce a new data type: the multilingual string.


What should a multilingual string be able to do? At a minimum:



At the same time, it seems intuitive that a multilingual string should behave very much like a regular string and have similar properties:



However, a multilingual string should not explicitly support string concatenation. Concatenation is practically forbidden in localized applications (at least within a single sentence), because word order can differ between languages.


So, let's see what we get:

    /// <summary>Multilingual string.</summary>
    [Serializable]
    public sealed class MultiCulturalString
    {
        #region Constructors

        /// <summary>Ctor.</summary>
        private MultiCulturalString() {...}

        /// <summary>Ctor.</summary>
        public MultiCulturalString(IEnumerable<KeyValuePair<CultureInfo, string>> localizedStrings) {...}

        /// <summary>Ctor. Creates a string with a value for a single culture.</summary>
        public MultiCulturalString(CultureInfo culture, string value) {...}

        #endregion

        #region Methods

        /// <summary>Is <paramref name="value"/> <c>null</c>
        /// or an empty <see cref="MultiCulturalString"/>?</summary>
        public static bool IsNullOrEmpty(MultiCulturalString value) {...}

        /// <summary>Is <paramref name="value"/> <see langword="null"/>
        /// or a <see cref="MultiCulturalString"/> that consists
        /// only of white space?</summary>
        public static bool IsNullOrWhiteSpace(MultiCulturalString value) {...}

        /// <summary>Joins the arguments using the separator.</summary>
        public static MultiCulturalString Join(MultiCulturalString separator, params object[] args) {...}

        /// <summary>Returns a multilingual string with
        /// <paramref name="localizedString"/> set for <paramref name="culture"/>.</summary>
        public MultiCulturalString SetLocalizedString(CultureInfo culture, string localizedString) {...}

        /// <summary>Merges this string with another multilingual string.</summary>
        public MultiCulturalString MergeWith(MultiCulturalString other) {...}

        /// <summary>Does the string contain a value for the given culture?</summary>
        public bool ContainsCulture(CultureInfo culture) {...}

        /// <summary>Returns a copy of the string converted to lower case.
        /// Each value is converted using the rules of its own culture.</summary>
        public MultiCulturalString ToLower() {...}

        /// <summary>Returns a copy of the string converted to upper case.
        /// Each value is converted using the rules of its own culture.</summary>
        public MultiCulturalString ToUpper() {...}

        /// <summary>Returns a new string in which every localized value
        /// is padded on the left up to the given total width.
        /// </summary>
        public MultiCulturalString PadLeft(int totalWidth, char paddingChar = ' ') {...}

        /// <summary>Returns a new string in which every localized value
        /// is padded on the right up to the given total width.
        /// </summary>
        public MultiCulturalString PadRight(int totalWidth, char paddingChar = ' ') {...}

        #endregion

        #region ToString()

        /// <summary>Returns the value for the current UI culture.</summary>
        public override string ToString() {...}

        /// <summary>Returns the value for the given culture.</summary>
        public string ToString(CultureInfo culture) {...}

        #endregion

        #region Properties

        /// <summary>An empty multilingual string.</summary>
        public static MultiCulturalString Empty {...}

        /// <summary>The cultures for which the string has values.</summary>
        public IEnumerable<CultureInfo> Cultures {...}

        /// <summary>Is the string empty?</summary>
        public bool IsEmpty {...}

        /// <summary>Does the string consist only of white space?</summary>
        public bool IsWhiteSpace {...}

        #endregion
    }

Inside, though, the class still hides the same dictionary and some simple manipulations of it.
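As a rough illustration of "a dictionary plus simple manipulations", here is a heavily simplified sketch. The type `MiniMultiCulturalString` is hypothetical and is not the library's class; it assumes an immutable design (suggested by the chained SetLocalizedString signature) and does not handle null values.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical, simplified stand-in for a multilingual string.
public sealed class MiniMultiCulturalString
{
    private readonly Dictionary<CultureInfo, string> _values;

    public MiniMultiCulturalString(CultureInfo culture, string value)
    {
        _values = new Dictionary<CultureInfo, string> { { culture, value } };
    }

    private MiniMultiCulturalString(Dictionary<CultureInfo, string> values)
    {
        _values = values;
    }

    // Returns a new instance; the original is left untouched.
    public MiniMultiCulturalString SetLocalizedString(CultureInfo culture, string value)
    {
        var copy = new Dictionary<CultureInfo, string>(_values);
        copy[culture] = value;
        return new MiniMultiCulturalString(copy);
    }

    // Each value is upper-cased with the rules of its own culture.
    public MiniMultiCulturalString ToUpper()
    {
        return new MiniMultiCulturalString(
            _values.ToDictionary(p => p.Key, p => p.Value.ToUpper(p.Key)));
    }

    public bool ContainsCulture(CultureInfo culture)
    {
        return _values.ContainsKey(culture);
    }

    // Returns the value for the culture, or null when there is none.
    public string GetString(CultureInfo culture)
    {
        string value;
        return _values.TryGetValue(culture, out value) ? value : null;
    }
}
```

Keeping the culture as the dictionary key means callers can no longer forget to pass it to ToUpper, which was exactly the mistake in the dictionary-based helper.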


And the product description looks quite decent:


    public class Product
    {
        public string Code { get; set; }
        public MultiCulturalString Name { get; set; }
    }


Enhancements


As soon as we start implementing or using the methods above, we run into several problems that were not obvious at first.



ToString() is not enough


Imagine that when entering a product, we filled in the name only for some of the required languages:


    var ru = CultureInfo.GetCultureInfo("ru");
    var en = CultureInfo.GetCultureInfo("en");
    var product = new Product
    {
        Code = "V0016887",
        Name = new MultiCulturalString(ru, "Шоколад Алина")
            .SetLocalizedString(en, "Chocolate Alina")
    };

And then we requested the name for a missing language:


    var zhHans = CultureInfo.GetCultureInfo("zh-Hans");
    Console.WriteLine(product.Name.ToString(zhHans)); // ?

What result would you expect to get?


Certainly not an exception! Perhaps null? Probably! But the documentation for Object.ToString() recommends against returning either null or an empty string, and Code Contracts outright prohibit returning null.


Nevertheless, we need to be able to tell an empty string for a given locale apart from a missing one. So our multilingual string class gains GetString(...) methods, which can return null and otherwise have the same signatures as the ToString(...) methods.
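The difference between the two lookup styles can be sketched over a plain dictionary. The helper names below are hypothetical, and returning an empty string from the ToString-style lookup is only an assumption made for this illustration; the article does not say what the library returns in that case.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class LookupSketch
{
    // GetString-style lookup: null means "no entry for this culture".
    public static string GetString(IDictionary<CultureInfo, string> values, CultureInfo culture)
    {
        string value;
        return values.TryGetValue(culture, out value) ? value : null;
    }

    // ToString-style lookup: never returns null, so a missing entry
    // and an empty entry become indistinguishable (assumed behavior).
    public static string ToDisplayString(IDictionary<CultureInfo, string> values, CultureInfo culture)
    {
        return GetString(values, culture) ?? string.Empty;
    }

    public static void Main()
    {
        var en = CultureInfo.GetCultureInfo("en");
        var ru = CultureInfo.GetCultureInfo("ru");
        var zhHans = CultureInfo.GetCultureInfo("zh-Hans");

        var name = new Dictionary<CultureInfo, string>
        {
            { en, "Chocolate Alina" },
            { ru, "" } // present, but empty
        };

        Console.WriteLine(GetString(name, ru) == null);     // entry exists
        Console.WriteLine(GetString(name, zhHans) == null); // entry is missing
        Console.WriteLine(ToDisplayString(name, zhHans).Length);
    }
}
```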



Formatting


As already mentioned, we cannot use string concatenation, so placeholders (substitutions) are everything to us. In most cases, the localized strings contain the same placeholders for all locales.
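A small sketch of why placeholders work where concatenation does not: each language's template uses the same placeholders but is free to put them in a different order. The templates and names below are invented for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TemplateSketch
{
    // Hypothetical per-language templates with the same placeholders {0} and {1}.
    private static readonly Dictionary<string, string> Templates =
        new Dictionary<string, string>
        {
            { "en", "{0} was deleted by {1}" },
            { "de", "{1} hat {0} gelöscht" }
        };

    public static string Describe(string language, string fileName, string userName)
    {
        // Only strings are substituted here, so the provider does not affect the result.
        return string.Format(CultureInfo.InvariantCulture, Templates[language], fileName, userName);
    }

    public static void Main()
    {
        Console.WriteLine(Describe("en", "report.txt", "Anna"));
        Console.WriteLine(Describe("de", "report.txt", "Anna"));
    }
}
```

With concatenation, the caller would have to know the word order of every language; with templates, that knowledge lives in the localized resource.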


It would therefore be good to be able to format a multilingual string. What would that mean? We already thought to provide the GetString(CultureInfo) / ToString(CultureInfo) overloads. But the standard .NET way of converting any object to a string representation with customizable (!) formatting is to implement the IFormattable interface. If an argument taking part in a substitution implements this interface, the interface is used to convert the argument to a string. So we have to implement IFormattable in the multilingual string.


The locale (culture) can serve as the format provider for the IFormattable.ToString(string format, IFormatProvider formatProvider) method. The first parameter sets formatting options that do not depend on the locale. For example, you can request the percent format for English as used in India:


    // Indian digit grouping: https://en.wikipedia.org/wiki/Indian_numbering_system
    12345.6789.ToString("P", CultureInfo.GetCultureInfo("en-IN")); // 12,34,567.89%

So, let's try to create a price tag for the same product:


    var ru = CultureInfo.GetCultureInfo("ru");
    var en = CultureInfo.GetCultureInfo("en");
    var product = new Product
    {
        Code = "V0016887",
        Name = new MultiCulturalString(ru, "Шоколад Алина")
            .SetLocalizedString(en, "Chocolate Alina")
    };
    IFormatProvider localizationFormatProvider = en;
    Console.WriteLine(string.Format(localizationFormatProvider,
        "Code: {0}\r\nName: {1}",
        product.Code,
        product.Name));
    // Code: V0016887
    // Name: Chocolate Alina

Great, we got the string "Code: V0016887\r\nName: Chocolate Alina", as expected! Now let's complicate the task slightly: add the creation date to the price tag and put the user in an English-language interface with Russian regional settings:


    Thread.CurrentThread.CurrentCulture = ru;
    IFormatProvider localizationFormatProvider = en;
    Console.WriteLine(string.Format(localizationFormatProvider,
        "Code: {0}\r\nName: {1}\r\nDate: {2:d}",
        product.Code,
        product.Name,
        DateTime.Now));
    // Code: V0016887
    // Name: Chocolate Alina
    // Date: 11/25/2016

And what did the reader expect to get? The author, for one, expected "Code: V0016887\r\nName: Chocolate Alina\r\nDate: 25.11.2016".


Yes, yes, we should not forget about the separation of regional and localization settings.


The .NET Framework has at least three standard implementations of IFormatProvider (CultureInfo, NumberFormatInfo, DateTimeFormatInfo), and none of them suits us. We need our own implementation, one that carries the locale required for localization, in particular for multilingual strings, but is not used to format numbers and dates. Let's call it LocalizationFormatInfo. Using it is no harder than the code above:


    Thread.CurrentThread.CurrentCulture = ru;
    IFormatProvider localizationFormatProvider = new LocalizationFormatInfo(en);
    Console.WriteLine(string.Format(localizationFormatProvider,
        "Code: {0}\r\nName: {1}\r\nDate: {2:d}",
        product.Code,
        product.Name,
        DateTime.Now));
    // Code: V0016887
    // Name: Chocolate Alina
    // Date: 25.11.2016
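Why does the date now follow the regional settings? When a format provider returns null for a requested format type, the built-in types fall back to the current culture. The following sketch isolates that behavior with a hypothetical `SilentProvider`, which is not part of the library and knows nothing about dates or numbers.

```csharp
using System;
using System.Globalization;
using System.Threading;

// Hypothetical provider that declines every request.
public sealed class SilentProvider : IFormatProvider
{
    public object GetFormat(Type formatType)
    {
        // Returning null tells the caller to use the current culture instead.
        return null;
    }
}

public static class ProviderFallbackDemo
{
    public static string ShortDate(IFormatProvider provider, DateTime date)
    {
        return string.Format(provider, "{0:d}", date);
    }

    public static void Main()
    {
        Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo("ru-RU");
        var date = new DateTime(2016, 11, 25);

        // A CultureInfo provider dictates the date format itself.
        Console.WriteLine(ShortDate(CultureInfo.GetCultureInfo("en-US"), date));
        // The silent provider leaves the date to CurrentCulture (Russian here).
        Console.WriteLine(ShortDate(new SilentProvider(), date));
    }
}
```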

And the implementation of IFormattable in MultiCulturalString looks like this:


    string IFormattable.ToString(string format, IFormatProvider formatProvider)
    {
        // The format argument is not used.
        var formatInfo = LocalizationFormatInfo.GetInstance(formatProvider);
        return ToString(formatInfo.Culture ?? CultureInfo.CurrentUICulture);
    }

It would also be very useful for LocalizationFormatInfo to be able to delegate the formatting of everything that is not a multilingual string (dates, numbers, and anything else) to other providers.


The first draft of LocalizationFormatInfo may look like this:

    /// <summary>Format provider that carries localization settings.</summary>
    [Serializable]
    public sealed class LocalizationFormatInfo : IFormatProvider
    {
        /// <summary>Creates the provider.</summary>
        /// <param name="culture">Localization culture.</param>
        /// <param name="provider">Inner format provider.</param>
        public LocalizationFormatInfo(CultureInfo culture, IFormatProvider provider = null)
        {
            _culture = culture;
            _provider = provider;
        }

        /// <summary>Returns this object when its own type is requested;
        /// otherwise delegates to the inner provider.</summary>
        public object GetFormat(Type formatType)
        {
            if (formatType == GetType())
            {
                return this;
            }
            if (Provider != null)
            {
                // Not our type: let the inner provider answer.
                return Provider.GetFormat(formatType);
            }
            return null;
        }

        /// <summary>Localization culture. May be null.</summary>
        public CultureInfo Culture
        {
            get { return _culture; }
        }
        private readonly CultureInfo _culture;

        /// <summary>Inner format provider. May be null.</summary>
        public IFormatProvider Provider
        {
            get { return _provider; }
        }
        private readonly IFormatProvider _provider;

        /// <summary>
        /// Gets the <see cref="LocalizationFormatInfo"/> for <paramref name="provider"/>.
        /// </summary>
        /// <param name="provider">Format provider. May be <see langword="null"/>.</param>
        /// <returns>A <see cref="LocalizationFormatInfo"/> instance.</returns>
        public static LocalizationFormatInfo GetInstance(IFormatProvider provider)
        {
            LocalizationFormatInfo lfi = null;
            if (provider != null)
            {
                lfi = provider.GetFormat(typeof(LocalizationFormatInfo)) as LocalizationFormatInfo;
            }
            return lfi ?? Default;
        }

        private static readonly LocalizationFormatInfo Default = new LocalizationFormatInfo(null);
    }

A natural next step for this example is to turn the price tag's format string itself into a multilingual string.



Finding a suitable localization


Let's imagine once more that we created a product instance with names in Russian and English and then requested the name in Chinese, which is missing. The question is the same: what result would you expect to get?


In the previous section, we settled on null being an acceptable answer.


Consider some common situations. New versions of an application come out, but the translation cycle does not keep up with all the changes. Many free products are translated by enthusiasts, and localizations from older versions are often reused in newer ones. As a result, some user interface elements may be untranslated, or the user's preferred language may not be supported by the application at all.


Obviously, empty space in the interface is unacceptable in such cases. The resource must be displayed in at least some language. Ideally the user can make sense of the displayed element: recognize it and read it, even if not necessarily understand or translate it.


An example of missing translations: resources for the pt locale replaced with en

This leads to a reasonable assumption: a localized application has a default locale whose set of resources is always complete and up to date.


However, a single default locale brings another problem. For a user in Kazakhstan, when there is no Kazakh localization, the most natural choice is to show the resource for the Russian locale, whereas for a user in China it is more logical to show the resource for the English locale, since in China a larger share of the population has at least some command of English than of Russian.


The .NET localization documentation describes the resource fallback process. Its essence is that if no resources are found for the current user interface locale, an attempt is made to find resources for the parent locale. For the en-IN locale, the parent is the neutral locale en (neutral meaning that it carries no region-specific information). Neutral locales are therefore a good place, in most cases, to store resources common to all regional variants of one language. The parent of the en locale, in turn, is the invariant locale; when the search reaches it, an attempt is made to find the default resources, which must always exist.
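The parent chain described here can be walked directly through the CultureInfo.Parent property. A minimal sketch follows (the class and method names are ours); it stops before the invariant culture, whose parent is itself.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ParentChainSketch
{
    // Yields the culture itself and then each parent, excluding the invariant culture.
    public static IEnumerable<CultureInfo> SelfAndParents(CultureInfo culture)
    {
        for (var c = culture; !c.Equals(CultureInfo.InvariantCulture); c = c.Parent)
        {
            yield return c;
        }
    }

    public static void Main()
    {
        // For en-IN this prints the specific culture and then the neutral one.
        foreach (var c in SelfAndParents(CultureInfo.GetCultureInfo("en-IN")))
        {
            Console.WriteLine(c.Name);
        }
    }
}
```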


Alas, in the .NET Framework the resource fallback logic is hard-wired deep inside the platform.


Our task is to learn how to customize the resource lookup process. To do this, let's introduce an abstraction, IResourceFallbackProcess. Its sole responsibility is to produce the sequence of locales in which to search for suitable resources. Searching for and loading the resources (from the file system, a database, and so on) remains the job of entirely different classes, for example ResourceManager.


Here is the new interface:


    public interface IResourceFallbackProcess
    {
        /// <summary>
        /// Returns the sequence of cultures in which
        /// to search for suitable resources.
        /// </summary>
        /// <param name="initial">Initial culture.</param>
        IEnumerable<CultureInfo> GetFallbackChain(CultureInfo initial);
    }

This interface lets us implement our plans for each user interface locale:



And, of course, IResourceFallbackProcess should be actively used in the multilingual string. Overloads of the GetString(...) / ToString(...) methods that take IResourceFallbackProcess resourceFallbackProcess and bool useFallback parameters look appropriate; overloads without useFallback assume true, and overloads without resourceFallbackProcess use your application's standard search order.
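One possible shape of such an implementation is sketched below. The class `MappedFallbackProcess`, its constructor parameters, and the substitute table are all hypothetical and are not the library's implementation; the interface is repeated only to keep the sketch self-contained. The chain consists of the culture and its parents, then any configured substitute language, then the application's default culture.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Repeated here only to keep the sketch self-contained.
public interface IResourceFallbackProcess
{
    IEnumerable<CultureInfo> GetFallbackChain(CultureInfo initial);
}

// Hypothetical implementation driven by a table of substitute languages.
public sealed class MappedFallbackProcess : IResourceFallbackProcess
{
    private readonly IDictionary<string, string> _substitutes;
    private readonly CultureInfo _default;

    public MappedFallbackProcess(IDictionary<string, string> substitutes, CultureInfo defaultCulture)
    {
        _substitutes = substitutes;
        _default = defaultCulture;
    }

    public IEnumerable<CultureInfo> GetFallbackChain(CultureInfo initial)
    {
        var seen = new HashSet<string>();
        var chain = new List<CultureInfo>();

        // 1. The culture itself and its parents, for example kk-KZ and then kk.
        for (var c = initial; !c.Equals(CultureInfo.InvariantCulture); c = c.Parent)
        {
            if (seen.Add(c.Name)) chain.Add(c);
        }

        // 2. A substitute configured for any culture already in the chain.
        foreach (var c in chain.ToArray())
        {
            string substituteName;
            if (_substitutes.TryGetValue(c.Name, out substituteName) && seen.Add(substituteName))
            {
                chain.Add(CultureInfo.GetCultureInfo(substituteName));
            }
        }

        // 3. The application default, assumed to be always complete.
        if (seen.Add(_default.Name)) chain.Add(_default);
        return chain;
    }
}
```

With the table `{ "kk" -> "ru" }` and `en` as the default, a request for kk-KZ would be searched in the order kk-KZ, kk, ru, en, which matches the Kazakhstan example discussed earlier.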



Conclusion


The source code of our library contains a working implementation of IResourceFallbackProcess. Readers may find it useful to make this implementation configurable and to create their own CustomizedResourceManager that uses IResourceFallbackProcess. You could also write a Visual Studio extension so that the classes generated automatically for resource files use your CustomizedResourceManager.


Obviously, we have not covered every possible “stage”, only the most universal ones and those most in demand. You might, for example, also consider a MultiCulturalStringBuilder and, for formatting, an IMultiCulturalFormattable.


In the next article, “Basements of the Tower of Babel, or About the Internationalization of the Database with Access via ORM”, we will look at storing localized data in a database and accessing it through an object-relational mapper.

Source: https://habr.com/ru/post/313284/

