📜 ⬆️ ⬇️

The cellars of the Tower of Babel, or the internationalization of databases with access through ORM

Engraving by M. Escher & quot;  Relativity & quot;
Engraving by M. Escher, "Relativity", 1953


Introduction


In the previous article on the example of the domain nature of the product, we looked at our own data types for multilingual applications. We learned to describe and use entity attributes that have meanings in various languages. But the issues of storage and processing in a relational DBMS, as well as the problems of effective work in the application code are still relevant.


The IT community uses various ways to store multilingual data. These methods are fundamentally different in the efficiency of requests, resistance to the addition of new localizations, data volume, convenience for the consumer application.


However, there is still no Database Internationalization for Dummies solution in the industry. Together with you, we will try to fill this gap a bit: we will describe possible ways, evaluate their advantages and disadvantages, and choose effective ones. We are not going to reinvent the silver bullet, but the scenario we’ll be looking at is pretty typical for enterprise applications. We hope it will be useful to many.


The following code snippets are in C #. On GitHub, you can find examples of the implementation of internationalization mechanisms using two different ORM and DBMS bundles: NHibernate + Oracle Database and Entity Framework Core + SQL Server . Developers using the mentioned ORM will be interested in finding out specific techniques and difficulties in working with multilingual data, as well as blocking framework defects and prospects for their elimination. The principles and examples of working with multilingual data outlined below can be easily transferred to other languages ​​and technologies.



Conditions of the problem


Our application should work with several languages ​​at once. On any of them, depending on the user's environment, operational and reference data will be displayed and entered. At the same time, there are scenarios that require the ability to work with data in all languages ​​at once in one session.


For example, consider the domain entity of a product that is already familiar to us and has a name in various languages.


public class Product {    public long Id {get; set; }    public String Code { get; set; }    public MultiCulturalString Name { get; set; } } 

We formulate requirements for the storage and processing of entity data with multilingual attributes:


  1. Storage, reading and writing of such entities must be through the ORM (mostly), and the means of the DBMS (at least for the purpose of maintenance).
  2. Multilingual attributes can be several.
  3. Localization requirement must be met.
  4. It should be possible to quickly search and sort by localized values ​​of a multilingual attribute:
    • only for a given locale;
    • taking into account the specified algorithm for processing alternative resources (if there is no value for the requested locale).
  5. It should be possible to search by localized values ​​of a multilingual attribute among all locales.
  6. It should be possible to expand to localize not only string attributes, but also attributes of other types ( MultiCultural<T> is useful).
  7. The amount of stored data and traffic between the database and the application must be valid.

Recall that the alternative resource processing algorithm IEnumerable<CultureInfo> IResourceFallbackProcess.GetFallbackChain(CultureInfo initial) can return different localization search orders for different initial locales:



Overview of existing features


DBMS


What features are in the largest modern DBMS for internationalization? It:



These features are supported by many large vendors. Here are links to some articles:



However, there are no “standard” storage schemes for multilingual data. There are none of them in the latest SQL standard: 2016.


Among the curious academic publications, one can note the master's thesis of the ex-head of the Multilingual Systems Research Group at Microsoft Multilingual Information Processing , 2005. The paper deals with the problems of cross-language queries, multilingual join operators, query algebra for new types of storage multilingual data and the mentioned operators.


ORM


Documentation on NHibernate pleased with the presence of the article Localization techniques . It discusses two storage methods (with variations):


  1. One column in which the user data type is stored is the locale data dictionary.
  2. Separate tables for storing localized attributes (1-2 pieces).

It is noteworthy that this article does not even mention the rather obvious and popular storage method - multicolumn (column per locale).


No relevant materials could be found on the Entity Framework.


Storage Comparison


To begin with, we will create comparison criteria. To do this, we reformulate the requirements described above into more technical ones. As we will see further, they are all pretty tough.


Localizability. Initially, the requirement means no changes in the code when a new localization is added to the application. With regard to applications with localized data, for example, we can say that there are no changes in the data schema and ORM mappings.


Search and sort for a given locale. In terms of a multilingual string, this means the ability to use indexes on the results of the String MultiCulturalString.GetString(IResourceFallbackProcess resourceFallbackProcess, CultureInfo culture, bool useFallback) function String MultiCulturalString.GetString(IResourceFallbackProcess resourceFallbackProcess, CultureInfo culture, bool useFallback) for all combinations of values ​​of its parameters that are meaningful in this application. The application IResourceFallbackProcess is probably the only useFallback = false , when useFallback = false resourceFallbackProcess does not matter, therefore the number of necessary indices for localized data is not more than 2N , where N is the number of different locales used in all entries in the multilingual attribute.


Search among all locales. Support is required for the function of the form MultiCulturalString.FindLocalizedStringCulture(String localizedString) with the return type CultureInfo .


Search for a suitable location. The user's locale is always specific, that is, it determines not only the language, but also regional parameters. And for storage in most cases it is enough to use neutral locales (which do not specify the specificity of a region), “closest” to the selected specific ones.


Therefore, from the above ten locales it is enough for us to provide storage for only four neutral ones: ru , en , kz , zh-Hans . And in ORM, it is possible to support the functionality of casting a specific locale to neutral by assigning an attribute and (or) saving an entity.


Suitability for ORM. The selected ORM should have enough extension points so that we can put all our multilingual storage fantasies into practice. Otherwise you have to write a new framework.


Multicolumn storage


This option requires you to know in advance the list of necessary localizations. And for each new locale in use, we need to add a new column and a couple of indices. And so for each multilingual attribute in all tables of localizable entities. You may have to change the ORM mapping in the code and (or) database client configuration, especially for static models.


But do not immediately get upset. Changes in the code and configuration, most likely, will not become a big problem for an application in which the list of used locales and / or dynamic entities are rarely known and rarely changes. In addition, as we shall see later, the requirement of localizability one way or another will have to be relaxed for all the considered options.


Multicolumn storage chart

With the creation of indexes for each of the name_* -columns for the case when we do not use the search for a suitable localization, everything seems to be obvious. But what will the query look like in which the IResourceFallbackProcess should work?

Consider, for example, such a query:


 var enUS = CultureInfo.GetCultureInfo("en-US"); var productName = "..."; var result = GetRepository<Product>()   .Where(p => p.Name.ToString(enUS) == productName)   .SingleOrDefault(); 

I think you will agree that it may well be implemented by the following SQL query:


 SELECT pr.id_product, pr.code, pr.name_ru, pr.name_en, pr.name_kz, pr.name_zh_hans FROM t_product pr WHERE isnull(pr.name_en, isnull(pr.name_ru, '')) LIKE @p1 ORDER BY isnull(pr.name_en, isnull(pr.name_ru, '')), pr.id_product 

To get an “honest” null instead of an empty string when the requested value is missing, from the point of view of a multilingual string, use one of the overloads MultiCulturalString.GetString . Then the filtering expression in SQL will be slightly easier:


 SELECT pr.id_product, pr.code, pr.name_ru, pr.name_en, pr.name_kz, pr.name_zh_hans FROM t_product pr WHERE isnull(pr.name_en, pr.name_ru) LIKE @p1 ORDER BY isnull(pr.name_en, pr.name_ru), pr.id_product 

We get that even at least one index per locale we need to build on an expression with isnull , which is a SQL mapping of the search chain for alternative resources. Such indices are usually called functional, in SQL Server these are indices above computable columns. In the above description of the t_product table t_product these indexes have the suffix _stdfallback .


About the type of request in the case of searching for a string among all localizations, we will provide the reader with an opportunity to dream up.


This method of storage and access is implemented for one of our company's clients.


Single-column serialized storage


In this embodiment, the multilingual attribute occupies only one column. A multilingual string can be serialized in both binary and human-readable form. Or, for example, in Oracle, a column may have an object type.


Given the current trends in the development of the SQL standard, you should pay attention to XML or JSON storage. Oracle Database, SQL Server, DB2, PostgreSQL, MySQL have pretty strong support for XML and (or) JSON.


Making such a decision, we need to take care of the possibility of obtaining (and someone - and recording) localized values ​​by means of the DBMS. A universal and well encapsulating specific variant of serialization will be the creation of a custom scalar function that accepts a serialized value, a locale and a string for searching for a suitable localization, and returns a localized string.




The SQL query for the search will look something like this:
 SELECT pr.id_product, pr.code, pr.name FROM t_product pr WHERE McsGetString(pr.name, 'en', 'en,ru') LIKE @p1 ORDER BY McsGetString(pr.name, 'en', 'en,ru'), pr.id_product 

When a new locale appears, we still have to add new functional indexes based on the results of the McsGetString function. Adding only indexes is a safer action than adding new columns. Unlike multi-column storage, ORM mappings will probably not have to be changed.


Relational storage


Here we have a separate table for storing localized values. Variations of such storage are numerous. Consider only one of them.



')
SQL query for searching only one locale looks quite simple:
 SELECT pr.id_product, pr.code, pl.name FROM t_product pr LEFT JOIN t_product_localizable pl ON pr.id_product = pl.id_product AND pl.locale = 'en' WHERE pl.name LIKE @p1 ORDER BY pl.name, pr.id_product 

The SQL query, taking into account the search for suitable localizations, is already more cumbersome:


 SELECT pr.id_product, pr.code, pl_en.name, pl_ru.name FROM t_product pr LEFT JOIN t_product_localizable pl_ru ON pr.id_product = pl_ru.id_product AND pl_ru.locale = 'ru' LEFT JOIN t_product_localizable pl_en ON pr.id_product = pl_en.id_product AND pl_en.locale = 'en' WHERE isnull(pl_en.name, isnull(pl_ru.name, '')) LIKE @p1 ORDER BY isnull(pl_en.name, isnull(pl_ru.name, '')), pr.id_product 

Such a request presents us with a whole bunch of surprises.


First, the cost of the request is already noticeably higher than in the previous versions.


Secondly, using such a query, it is problematic to instantiate an entity with a fully initialized multilingual attribute. Either we have to add a LEFT JOIN for all locales whose list you need to know in advance, or stop using the MultiCulturalString attribute, replacing it with a String and reducing the number of covered scenarios. Another option is to support lazy loading of values ​​for different locales within a multilingual string. Each of the alternatives has the right to life according to the specific requirements for the product.


Thirdly, we are going to support the work with multilingual attributes in existing ORMs. But we can hardly find such an extension point to add a JOIN to the SQL query, while remaining within the framework of a simple LINQ query:


 var result =  GetRepository<Product>()   .Where(p => p.Name.ToString(enUS) == productName)   .SingleOrDefault(); 

Connections with the t_product_localizable table can be replaced with subqueries, but this does not fundamentally change anything.


A very interesting result, isn't it? But the requirement of localization is satisfied.


findings


As we expected, it was not possible to find a silver bullet among the options considered. In each of these, one or more requirements are more or less violated. However, this is not a reason to do nothing. We need to make a compromise decision, choose a middle ground.


We considered the single-column serialized storage variant to be the most balanced and promising. Therefore, in the next section, we propose to consider the features of the implementations of this option for NHibernate + Oracle Database bundles and the Entity Framework Core + SQL Server . Serialization of multilingual attributes will be done in XML, and for access to localized values ​​from the database - use the means of the DBMS.


We expand ORM


So, we will make an attempt to support work with entities that contain an attribute of the MultiCulturalString type, as well as a query to search for such entities by the localized attribute value.


We will not consider the question of searching for a multilingual attribute in all locales, giving it to the interested reader.


In the implementation we proceed from the following principles:


  1. Entities are described by POCO classes (in Java - POJO).
  2. Entities are separated from mapping to database objects, including from conversions to database format and back.
  3. Whenever possible, we use the native API of a multilingual string in queries. Thus we will receive transparent use of one API both on the client, and on the server.
  4. The division into assemblies should be fine for better separation of responsibilities, dependency control, and reuse.

We give a diagram of dependencies with small division into assemblies.

Dependency diagram


We will serialize the values ​​of the multilingual string in XML. In .NET, responsibility for the content of the serialized data ( ISerializable , ISerializationSurrogate ) and the final serialized view ( IFormatter ) are IFormatter . And if it is quite logical to place the first responsibility on the object itself being serializable, the second one strongly depends on the use. Therefore, for formatting, apply the XmlFormatter: IFormatter found on the Internet, using ISerializable objects.


NHibernate + Oracle Database


NHibernate, perhaps, has long been the most functional ORM under .NET. At the same time, it contains contradictory stratifications that have arisen at different stages of its development. Now versions are extremely rare, there are few contributors left. Some long-awaited corrections of defects, apparently, will never come out.


To load and save the values ​​of a multilingual string, we need to implement an IUserType in which to use the mentioned XmlFormatter .


Multilingual string in XML looks pretty obvious.
 <?xml version="1.0" encoding="UTF-8"?> <MultiCulturalString xsi:type="MultiCulturalString" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://custis.ru/i18n"> <ru> </ru> <en>Chocolate Alina</en> </MultiCulturalString> 

The “heart” of the function for accessing localized values ​​McsGetString uses XQuery.
 SELECT XMLCast( XMLQuery('declare namespace i18n="http://custis.ru/i18n"; for $mcs_locale in $mcs/i18n:MultiCulturalString/* where $mcs_locale/name() = $locale return $mcs_locale/text()'   PASSING a_mcs AS "mcs", a_locale AS "locale" RETURNING CONTENT) AS VARCHAR2(4000 CHAR)) INTO l_value FROM dual; 

To use this function in HQL , it is enough to add the implementation of ISQLFunction to the NHibernate configuration. But besides, we want the LINQ-to-Database call to MultiCulturalString.ToString() overloads to turn into a call to the McsGetString function. However, for this, NHibernate has an extension point: it is enough to implement IHqlGeneratorForMethod and also register the implementation in the configuration. The IHqlGeneratorForMethod implementation IHqlGeneratorForMethod expected to convert one expression tree to another.


Here, strictly speaking, that's all. Are we really great, but is everything so seamless with NHibernate? Unfortunately no!


Such a simple test will not work.
 [Test] public void TestNh2500() {   using (var session = SessionFactory.OpenSession())   {       var product = new Product       {           Code = ProductCode,           Name = new MultiCulturalString(ru, ProductNameRu)               .SetLocalizedString(en, ProductNameEn)       };       session.Save(product);   }   using (var session = SessionFactory.OpenSession())   {       var product = session.AsQueryable<Product>()           .SingleOrDefault(p => p.Name.ToString(zhCHS)) == ProductNameEn);       Assert.IsNotNull(product);       Assert.AreEqual(ProductCode, product.Code);   }   using (var session = SessionFactory.OpenSession())   {       var product = session.AsQueryable<Product>()           .SingleOrDefault(p => p.Name.ToString(ruRU)) == ProductNameEn);       // The next line throws AssertionException       Assert.IsNull(product);   } } 

The same SQL will be generated for both queries:


 select       product0_.id_product as id1_0_,       product0_.code as code0_,       product0_.name as name0_   from       t_product product0_   where       McsGetString(product0_.name, 'zh-CHS', 'zh-CHS,zh-Hans,zh,en')=:p0; 

The whole thing is in the NH-2500 defect: LINQ requests even of different sessions are cached at the session factory level and reused, despite the different values ​​of the request parameters. Although the bug has been critical for six years, the fix will only be included in the future version 5.1. In the meantime, you can release fork NHibernate. If you need to correct any other defects of NHibernate, the low dynamics of the product makes the risks of "swelling" of your edits low.


Entity Framework Core + SQL Server


Not so long ago, EF Core 2.0 was released , and further development is behind this cross-platform branch. For our examples, we chose it, not EF 6 , since we hope for a speedy solution to the problems described below.


Unfortunately, in EF Core we cannot implement support for a custom type, including MultiCulturalString . But already in version 2.1 of the framework, this possibility will appear (see issue # 242 ).


For now, let's create a proxy class for our product with the RawName string attribute.
 public class Product {   public virtual long Id { get; protected set; }   public virtual String Code { get; set; }   public virtual MultiCulturalString Name { get; set; } } public class ProductProxy : Product {   public override MultiCulturalString Name   {       get => base.Name;       set       {           _rawName = ConvertToStoredValue(value);           base.Name = value;       }   }   public virtual String RawName   {       get => _rawName;       set       {           base.Name = ParseStoredValue(value);           _rawName = value;       }   }   private String _rawName;   ... } 

It is the proxy class that will participate in our DbContext . As you guessed, the Name attribute will not participate in the mapping at all. And of course, the reader would like to use the RawName attribute used in mapping to be protected , since ORM doesn’t interfere. But do not rush!


Already familiar to us request


 var result = GetRepository<ProductProxy>()   .Where(p => p.Name.ToString(enUS) == productName)   .SingleOrDefault(); 

will work in EF without additional efforts from our side. But only on the client! After all, the Name attribute does not map to the database.


It is a great help that EF allows you to receive warnings about client execution of a part of a request or to ban it altogether using DbContextOptionsBuilder.ConfigureWarnings .


And even after issue # 242 is made, we can only use static functions for mapping on the database function. Using the instance function ToString fails, see issue # 9213 .


A possible (albeit temporary) output for us seems to be the declaration of extension methods DbUserDefinedMethods.McsGetString(this String mcs, ...) , with signatures that coincide with multiple MultiCulturalString.ToString overloads, without taking into account the first parameter this . Requests will have a somewhat artificial look:


 var result = GetRepository<ProductProxy>()   .Where(p => p.RawName.McsGetString(enUS) == productName)   .SingleOrDefault(); 

Such extension methods are registered in the model builder, and the conversion from one expression tree to another is specified, the implementation of which practically does not differ from the implementation of the IHqlGeneratorForMethod for NHibernate.


In addition, the EF computability of the expression is analyzed on the server side without taking into account the transformation of the registered function. Therefore, DbUserDefinedMethods.McsGetString overloads that contain the unknown ORM types CultureInfo and IResourceFallbackProcess will always be calculated on the client.


There is an alternative to all the listed “crutches” in EF - this is writing your own provider. Then you can support any types and necessary instance functions. But from the point of view of support, synchronization with the original provider, which is actively developing, this architectural solution looks weak. Therefore, we will hope for a good dynamic of the new branch of the EF development.


Conclusion


We considered a possible use case and implementation of support for multilingual attributes. At the same time, for some simple scenarios of using multilingual attributes, it is enough to essentially declare only a string attribute, and to switch between localizations, it is enough to introduce special methods of the form static String InLocale(this String mcs, CultureInfo culture, ...) .


— , .


ORM , . , .

Source: https://habr.com/ru/post/313690/


All Articles