Engraving by M. Escher, "Relativity", 1953
In the previous article on the example of the domain nature of the product, we looked at our own data types for multilingual applications. We learned to describe and use entity attributes that have meanings in various languages. But the issues of storage and processing in a relational DBMS, as well as the problems of effective work in the application code are still relevant.
The IT community uses various ways to store multilingual data. These methods are fundamentally different in the efficiency of requests, resistance to the addition of new localizations, data volume, convenience for the consumer application.
However, there is still no Database Internationalization for Dummies solution in the industry. Together with you, we will try to fill this gap a bit: we will describe possible ways, evaluate their advantages and disadvantages, and choose effective ones. We are not going to reinvent the silver bullet, but the scenario we’ll be looking at is pretty typical for enterprise applications. We hope it will be useful to many.
The following code snippets are in C #. On GitHub, you can find examples of the implementation of internationalization mechanisms using two different ORM and DBMS bundles: NHibernate + Oracle Database and Entity Framework Core + SQL Server . Developers using the mentioned ORM will be interested in finding out specific techniques and difficulties in working with multilingual data, as well as blocking framework defects and prospects for their elimination. The principles and examples of working with multilingual data outlined below can be easily transferred to other languages and technologies.
Our application should work with several languages at once. On any of them, depending on the user's environment, operational and reference data will be displayed and entered. At the same time, there are scenarios that require the ability to work with data in all languages at once in one session.
For example, consider the domain entity of a product that is already familiar to us and has a name in various languages.
public class Product { public long Id {get; set; } public String Code { get; set; } public MultiCulturalString Name { get; set; } }
We formulate requirements for the storage and processing of entity data with multilingual attributes:
MultiCultural<T>
is useful).Recall that the alternative resource processing algorithm IEnumerable<CultureInfo> IResourceFallbackProcess.GetFallbackChain(CultureInfo initial)
can return different localization search orders for different initial locales:
initial
locale ru-RU
: ru-RU -> ru
;initial
en-US
locales: en-US -> en -> ru
;initial
kz-KZ
locale: kz-KZ -> kz -> ru
;initial
locales for zh-CH
: zh-CH -> zh-CHS -> zh-Hans -> zh -> en
.What features are in the largest modern DBMS for internationalization? It:
These features are supported by many large vendors. Here are links to some articles:
However, there are no “standard” storage schemes for multilingual data. There are none of them in the latest SQL standard: 2016.
Among the curious academic publications, one can note the master's thesis of the ex-head of the Multilingual Systems Research Group at Microsoft Multilingual Information Processing , 2005. The paper deals with the problems of cross-language queries, multilingual join operators, query algebra for new types of storage multilingual data and the mentioned operators.
Documentation on NHibernate pleased with the presence of the article Localization techniques . It discusses two storage methods (with variations):
It is noteworthy that this article does not even mention the rather obvious and popular storage method - multicolumn (column per locale).
No relevant materials could be found on the Entity Framework.
To begin with, we will create comparison criteria. To do this, we reformulate the requirements described above into more technical ones. As we will see further, they are all pretty tough.
Localizability. Initially, the requirement means no changes in the code when a new localization is added to the application. With regard to applications with localized data, for example, we can say that there are no changes in the data schema and ORM mappings.
Search and sort for a given locale. In terms of a multilingual string, this means the ability to use indexes on the results of the String MultiCulturalString.GetString(IResourceFallbackProcess resourceFallbackProcess, CultureInfo culture, bool useFallback)
function String MultiCulturalString.GetString(IResourceFallbackProcess resourceFallbackProcess, CultureInfo culture, bool useFallback)
for all combinations of values of its parameters that are meaningful in this application. The application IResourceFallbackProcess
is probably the only useFallback = false
, when useFallback = false
resourceFallbackProcess
does not matter, therefore the number of necessary indices for localized data is not more than 2N
, where N
is the number of different locales used in all entries in the multilingual attribute.
Search among all locales. Support is required for the function of the form MultiCulturalString.FindLocalizedStringCulture(String localizedString)
with the return type CultureInfo
.
Search for a suitable location. The user's locale is always specific, that is, it determines not only the language, but also regional parameters. And for storage in most cases it is enough to use neutral locales (which do not specify the specificity of a region), “closest” to the selected specific ones.
Therefore, from the above ten locales it is enough for us to provide storage for only four neutral ones: ru
, en
, kz
, zh-Hans
. And in ORM, it is possible to support the functionality of casting a specific locale to neutral by assigning an attribute and (or) saving an entity.
Suitability for ORM. The selected ORM should have enough extension points so that we can put all our multilingual storage fantasies into practice. Otherwise you have to write a new framework.
This option requires you to know in advance the list of necessary localizations. And for each new locale in use, we need to add a new column and a couple of indices. And so for each multilingual attribute in all tables of localizable entities. You may have to change the ORM mapping in the code and (or) database client configuration, especially for static models.
But do not immediately get upset. Changes in the code and configuration, most likely, will not become a big problem for an application in which the list of used locales and / or dynamic entities are rarely known and rarely changes. In addition, as we shall see later, the requirement of localizability one way or another will have to be relaxed for all the considered options.
name_*
-columns for the case when we do not use the search for a suitable localization, everything seems to be obvious. But what will the query look like in which the IResourceFallbackProcess
should work?Consider, for example, such a query:
var enUS = CultureInfo.GetCultureInfo("en-US"); var productName = "..."; var result = GetRepository<Product>() .Where(p => p.Name.ToString(enUS) == productName) .SingleOrDefault();
I think you will agree that it may well be implemented by the following SQL query:
SELECT pr.id_product, pr.code, pr.name_ru, pr.name_en, pr.name_kz, pr.name_zh_hans FROM t_product pr WHERE isnull(pr.name_en, isnull(pr.name_ru, '')) LIKE @p1 ORDER BY isnull(pr.name_en, isnull(pr.name_ru, '')), pr.id_product
To get an “honest” null
instead of an empty string when the requested value is missing, from the point of view of a multilingual string, use one of the overloads MultiCulturalString.GetString
. Then the filtering expression in SQL will be slightly easier:
SELECT pr.id_product, pr.code, pr.name_ru, pr.name_en, pr.name_kz, pr.name_zh_hans FROM t_product pr WHERE isnull(pr.name_en, pr.name_ru) LIKE @p1 ORDER BY isnull(pr.name_en, pr.name_ru), pr.id_product
We get that even at least one index per locale we need to build on an expression with isnull
, which is a SQL mapping of the search chain for alternative resources. Such indices are usually called functional, in SQL Server these are indices above computable columns. In the above description of the t_product
table t_product
these indexes have the suffix _stdfallback
.
About the type of request in the case of searching for a string among all localizations, we will provide the reader with an opportunity to dream up.
This method of storage and access is implemented for one of our company's clients.
In this embodiment, the multilingual attribute occupies only one column. A multilingual string can be serialized in both binary and human-readable form. Or, for example, in Oracle, a column may have an object type.
Given the current trends in the development of the SQL standard, you should pay attention to XML or JSON storage. Oracle Database, SQL Server, DB2, PostgreSQL, MySQL have pretty strong support for XML and (or) JSON.
Making such a decision, we need to take care of the possibility of obtaining (and someone - and recording) localized values by means of the DBMS. A universal and well encapsulating specific variant of serialization will be the creation of a custom scalar function that accepts a serialized value, a locale and a string for searching for a suitable localization, and returns a localized string.
SELECT pr.id_product, pr.code, pr.name FROM t_product pr WHERE McsGetString(pr.name, 'en', 'en,ru') LIKE @p1 ORDER BY McsGetString(pr.name, 'en', 'en,ru'), pr.id_product
When a new locale appears, we still have to add new functional indexes based on the results of the McsGetString
function. Adding only indexes is a safer action than adding new columns. Unlike multi-column storage, ORM mappings will probably not have to be changed.
Here we have a separate table for storing localized values. Variations of such storage are numerous. Consider only one of them.
SELECT pr.id_product, pr.code, pl.name FROM t_product pr LEFT JOIN t_product_localizable pl ON pr.id_product = pl.id_product AND pl.locale = 'en' WHERE pl.name LIKE @p1 ORDER BY pl.name, pr.id_product
The SQL query, taking into account the search for suitable localizations, is already more cumbersome:
SELECT pr.id_product, pr.code, pl_en.name, pl_ru.name FROM t_product pr LEFT JOIN t_product_localizable pl_ru ON pr.id_product = pl_ru.id_product AND pl_ru.locale = 'ru' LEFT JOIN t_product_localizable pl_en ON pr.id_product = pl_en.id_product AND pl_en.locale = 'en' WHERE isnull(pl_en.name, isnull(pl_ru.name, '')) LIKE @p1 ORDER BY isnull(pl_en.name, isnull(pl_ru.name, '')), pr.id_product
Such a request presents us with a whole bunch of surprises.
First, the cost of the request is already noticeably higher than in the previous versions.
Secondly, using such a query, it is problematic to instantiate an entity with a fully initialized multilingual attribute. Either we have to add a LEFT JOIN
for all locales whose list you need to know in advance, or stop using the MultiCulturalString
attribute, replacing it with a String
and reducing the number of covered scenarios. Another option is to support lazy loading of values for different locales within a multilingual string. Each of the alternatives has the right to life according to the specific requirements for the product.
Thirdly, we are going to support the work with multilingual attributes in existing ORMs. But we can hardly find such an extension point to add a JOIN
to the SQL query, while remaining within the framework of a simple LINQ query:
var result = GetRepository<Product>() .Where(p => p.Name.ToString(enUS) == productName) .SingleOrDefault();
Connections with the t_product_localizable
table can be replaced with subqueries, but this does not fundamentally change anything.
A very interesting result, isn't it? But the requirement of localization is satisfied.
As we expected, it was not possible to find a silver bullet among the options considered. In each of these, one or more requirements are more or less violated. However, this is not a reason to do nothing. We need to make a compromise decision, choose a middle ground.
We considered the single-column serialized storage variant to be the most balanced and promising. Therefore, in the next section, we propose to consider the features of the implementations of this option for NHibernate + Oracle Database bundles and the Entity Framework Core + SQL Server . Serialization of multilingual attributes will be done in XML, and for access to localized values from the database - use the means of the DBMS.
So, we will make an attempt to support work with entities that contain an attribute of the MultiCulturalString
type, as well as a query to search for such entities by the localized attribute value.
We will not consider the question of searching for a multilingual attribute in all locales, giving it to the interested reader.
In the implementation we proceed from the following principles:
We give a diagram of dependencies with small division into assemblies.
We will serialize the values of the multilingual string in XML. In .NET, responsibility for the content of the serialized data ( ISerializable
, ISerializationSurrogate
) and the final serialized view ( IFormatter
) are IFormatter
. And if it is quite logical to place the first responsibility on the object itself being serializable, the second one strongly depends on the use. Therefore, for formatting, apply the XmlFormatter: IFormatter
found on the Internet, using ISerializable
objects.
NHibernate, perhaps, has long been the most functional ORM under .NET. At the same time, it contains contradictory stratifications that have arisen at different stages of its development. Now versions are extremely rare, there are few contributors left. Some long-awaited corrections of defects, apparently, will never come out.
To load and save the values of a multilingual string, we need to implement an IUserType
in which to use the mentioned XmlFormatter
.
<?xml version="1.0" encoding="UTF-8"?> <MultiCulturalString xsi:type="MultiCulturalString" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://custis.ru/i18n"> <ru> </ru> <en>Chocolate Alina</en> </MultiCulturalString>
SELECT XMLCast( XMLQuery('declare namespace i18n="http://custis.ru/i18n"; for $mcs_locale in $mcs/i18n:MultiCulturalString/* where $mcs_locale/name() = $locale return $mcs_locale/text()' PASSING a_mcs AS "mcs", a_locale AS "locale" RETURNING CONTENT) AS VARCHAR2(4000 CHAR)) INTO l_value FROM dual;
To use this function in HQL , it is enough to add the implementation of ISQLFunction
to the NHibernate configuration. But besides, we want the LINQ-to-Database call to MultiCulturalString.ToString()
overloads to turn into a call to the McsGetString
function. However, for this, NHibernate has an extension point: it is enough to implement IHqlGeneratorForMethod
and also register the implementation in the configuration. The IHqlGeneratorForMethod
implementation IHqlGeneratorForMethod
expected to convert one expression tree to another.
Here, strictly speaking, that's all. Are we really great, but is everything so seamless with NHibernate? Unfortunately no!
[Test] public void TestNh2500() { using (var session = SessionFactory.OpenSession()) { var product = new Product { Code = ProductCode, Name = new MultiCulturalString(ru, ProductNameRu) .SetLocalizedString(en, ProductNameEn) }; session.Save(product); } using (var session = SessionFactory.OpenSession()) { var product = session.AsQueryable<Product>() .SingleOrDefault(p => p.Name.ToString(zhCHS)) == ProductNameEn); Assert.IsNotNull(product); Assert.AreEqual(ProductCode, product.Code); } using (var session = SessionFactory.OpenSession()) { var product = session.AsQueryable<Product>() .SingleOrDefault(p => p.Name.ToString(ruRU)) == ProductNameEn); // The next line throws AssertionException Assert.IsNull(product); } }
The same SQL will be generated for both queries:
select product0_.id_product as id1_0_, product0_.code as code0_, product0_.name as name0_ from t_product product0_ where McsGetString(product0_.name, 'zh-CHS', 'zh-CHS,zh-Hans,zh,en')=:p0;
The whole thing is in the NH-2500 defect: LINQ requests even of different sessions are cached at the session factory level and reused, despite the different values of the request parameters. Although the bug has been critical for six years, the fix will only be included in the future version 5.1. In the meantime, you can release fork NHibernate. If you need to correct any other defects of NHibernate, the low dynamics of the product makes the risks of "swelling" of your edits low.
Not so long ago, EF Core 2.0 was released , and further development is behind this cross-platform branch. For our examples, we chose it, not EF 6 , since we hope for a speedy solution to the problems described below.
Unfortunately, in EF Core we cannot implement support for a custom type, including MultiCulturalString
. But already in version 2.1 of the framework, this possibility will appear (see issue # 242 ).
public class Product { public virtual long Id { get; protected set; } public virtual String Code { get; set; } public virtual MultiCulturalString Name { get; set; } } public class ProductProxy : Product { public override MultiCulturalString Name { get => base.Name; set { _rawName = ConvertToStoredValue(value); base.Name = value; } } public virtual String RawName { get => _rawName; set { base.Name = ParseStoredValue(value); _rawName = value; } } private String _rawName; ... }
It is the proxy class that will participate in our DbContext
. As you guessed, the Name
attribute will not participate in the mapping at all. And of course, the reader would like to use the RawName
attribute used in mapping to be protected
, since ORM doesn’t interfere. But do not rush!
Already familiar to us request
var result = GetRepository<ProductProxy>() .Where(p => p.Name.ToString(enUS) == productName) .SingleOrDefault();
will work in EF without additional efforts from our side. But only on the client! After all, the Name
attribute does not map to the database.
It is a great help that EF allows you to receive warnings about client execution of a part of a request or to ban it altogether using DbContextOptionsBuilder.ConfigureWarnings
.
And even after issue # 242 is made, we can only use static functions for mapping on the database function. Using the instance function ToString
fails, see issue # 9213 .
A possible (albeit temporary) output for us seems to be the declaration of extension methods DbUserDefinedMethods.McsGetString(this String mcs, ...)
, with signatures that coincide with multiple MultiCulturalString.ToString
overloads, without taking into account the first parameter this
. Requests will have a somewhat artificial look:
var result = GetRepository<ProductProxy>() .Where(p => p.RawName.McsGetString(enUS) == productName) .SingleOrDefault();
Such extension methods are registered in the model builder, and the conversion from one expression tree to another is specified, the implementation of which practically does not differ from the implementation of the IHqlGeneratorForMethod
for NHibernate.
In addition, the EF computability of the expression is analyzed on the server side without taking into account the transformation of the registered function. Therefore, DbUserDefinedMethods.McsGetString
overloads that contain the unknown ORM types CultureInfo
and IResourceFallbackProcess
will always be calculated on the client.
There is an alternative to all the listed “crutches” in EF - this is writing your own provider. Then you can support any types and necessary instance functions. But from the point of view of support, synchronization with the original provider, which is actively developing, this architectural solution looks weak. Therefore, we will hope for a good dynamic of the new branch of the EF development.
We considered a possible use case and implementation of support for multilingual attributes. At the same time, for some simple scenarios of using multilingual attributes, it is enough to essentially declare only a string attribute, and to switch between localizations, it is enough to introduce special methods of the form static String InLocale(this String mcs, CultureInfo culture, ...)
.
— , .
ORM , . , .
Source: https://habr.com/ru/post/313690/
All Articles