“Internet in Russian” (Computerra, March 1997)

Throughout the first half of the 1990s, the Internet in Russia remained a pastime for geeks: “a higher education is not enough to get the protocols in use running”. Later, when the Internet became more accessible and more popular, another misfortune awaited it: the leapfrog with Cyrillic support. There were enough different encodings to get lost in.

In March 1997, Computerra made this problem its “theme of the issue”, starting with where so many encodings came from and why they are all in use, and ending with recommendations on how to live with it all. I reprint the text from the magazine as is, without abridging it. I specifically rechecked that ~~Google~~ had not ~~known this text to this day~~; in my opinion, such a “monument of Slavic literature” from the pre-Unicode era deserves to be preserved in electronic form.

(Taking the opportunity, I will mention my two-year-old post about Katya Lazhintseva, the official creator of CP-1251.)

Oleg Tatarnikov

Crusaders

You're so chasing shadow that you lose presence.
From the book of Job

The concepts of the Internet, originally developed within the unitary, centralized military system of the US Department of Defense, quickly left the “dictatorial path” and are perceived today as the ideas of a worldwide public information highway. Any attempt at restricting access, at censorship or at outside influence on the Network is met by the world community with unambiguous hostility. No organization stands behind the spread of the Internet into our lives: it is a self-organizing system, and its main engine is humanity as a whole. This is the main difference between the worldwide network and commercial networks; this is its appeal to millions and its strength. In this light, the supporters of the mass “KOI-ization” of information exchange in Russia, who seek to drive all Russian users into the Procrustean bed of a single encoding, look ugly. Moreover, the bearers of this ideology are an absolute minority of Russian Internet users, even if the most active one. And the good intentions that guide them in no way justify the forcible restriction of freedoms and the inconvenience to a huge army of users whose number continues to grow rapidly. For we know where roads paved with such intentions usually lead.

The reason for writing this article was the numerous messages that reached the author and many other “Russian-speaking” subscribers of Internet resources by e-mail in an unreadable form, that is, completely unreadable and impossible to decode (simply put, irretrievably ruined, consisting of nothing but “crosses”). Moreover, the main “culprits” were the Internet service providers, who are precisely the ones obliged to protect their customers from such incidents.

Trying to understand all the existing problems and find possible solutions, I turned directly to those who are most interested in overcoming existing difficulties, that is, to software developers, Internet service providers and, naturally, their clients.

For all the obvious disagreements between these groups (each of which pursues slightly different goals with one common denominator: to spare itself unnecessary work), the need to bring them together to understand their common problems is clear.

So, there are three main Internet resources (the problem with the Russian language is the same for all of them, but the solutions may differ):

e-mail (Mail);
newsgroups (News);
WWW resources.

In the short term, apparently, the use of the KOI8-R encoding on the Network cannot be avoided (the inertia of the largest Internet service providers is at work). All that remains is to minimize the damage from such “sectarian” devotion. But even with a general striving for unity, the future, by common opinion, belongs to the universal Unicode encoding, which is gaining momentum; beyond that, it makes sense to move all Internet resources to HTML (or whatever format replaces it), which carries the text together with all the fonts and layout used in it and would solve several problems at once, including those of national alphabets.

Email

The children came running into the hut,
calling their father in a hurry:
“Daddy, daddy! Our nets
have dragged in a dead man.”
A.S. Pushkin

E-mail is a private matter between two people (the sender and the recipient). Ideally, the author of the client program can give the user the ability to convert a message from the encoding in which it arrived (one of several possible; in practice, two or three) to the local one adopted on the given computer platform and/or operating system, and back (guided, for example, by the names in the MIME charset). But only as an option! The user must always be able to decline it and send the message as he pleases. In any case, the server should not do this. In general, any manipulation of correspondence (other than forwarding it) should be strictly forbidden, because it amounts to reading other people's letters. It is none of the server's business: it is only a postman, a means of transport. As long as common rules have not been worked out and the programs work “crookedly”, all sorts of workarounds are used, and they only make matters worse.

The only way out is to place all responsibility on the sender (by default). By separate agreement you can, of course, offer him other options, but the right to choose must remain his. And what do we see today? Server owners switch on “unnatural intelligence” and forcibly recode every message into KOI8-R. For pity's sake, gentlemen! How do you know how, and in which language, I write messages in our multinational country? Why do you think that I do not encrypt them or encode them as I please? Why do you consider yourselves omnipotent and omniscient, and take it upon yourselves to destroy private information (which is often what happens)? Look at the code tables. What is transcoding? It is adding to (or subtracting from) the character code a certain number, determined by the difference between the character's positions in the two code tables. If, as a result of double or incorrect transcoding, the resulting code “jumps out” of the range 0-255, the information about the character is lost! As an exercise, try re-encoding into KOI8-R a text that is actually in the alternative DOS encoding while treating it as Windows 1251, and then try to recover anything ...
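The exercise suggested above can be tried directly. The following is a minimal Python sketch, assuming Python's built-in codecs `cp866`, `cp1251` and `koi8_r` as stand-ins for the tables of the time; the helper name `server_recode` and the sample word are hypothetical. Python's codecs work by table lookup rather than by a fixed offset, but the loss shows up the same way: characters with no place in the target table are replaced.

```python
def server_recode(raw: bytes, assumed: str, target: str = "koi8_r") -> bytes:
    """Recode raw bytes the way an over-eager server might (illustrative only).

    Characters that have no place in the target table become '?',
    which is exactly where the information is lost.
    """
    text = raw.decode(assumed, errors="replace")
    return text.encode(target, errors="replace")


original = "Привет"
sent = original.encode("cp866")                  # the sender really used DOS 866
mangled = server_recode(sent, assumed="cp1251")  # the server wrongly assumed 1251

# Try to undo the damage: KOI8-R -> 1251 bytes -> read them as 866 again.
recovered = (
    mangled.decode("koi8_r")
    .encode("cp1251", errors="replace")
    .decode("cp866", errors="replace")
)

print(mangled)
print(recovered, recovered == original)
```

Only the letters that happened to land on Russian letters of the wrongly assumed table survive the round trip; the rest come back as question marks.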

The fanatics have “brainwashed” even Netscape, and now the Netscape Navigator browser cannot be used at all: it forcibly recodes, for example, from Windows 1251 to KOI8-R, with no way of turning this off, and only in one direction, when sending.

After all, there are Uuencode, MIME, quoted-printable and other ways of encoding 8-bit characters, invented precisely to guarantee the delivery of messages even when the 8th bit is cut off (which will be discussed below), so that everyone is comfortable and everyone can use their “native” encoding; and yet we are trying to drive everyone into one stall. Of course, not every mail program can automatically restore the text from the meaningless set of letters, digits and punctuation marks it then turns into. So what is to be done? If you do not intend to make mass mailings to unknown subscribers, you can always come to an agreement with your addressees. In that case it is even a good idea to use some special code (the supporters of server transcoding reject this possibility entirely).
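To illustrate why these transfer encodings survive a 7-bit channel, here is a small sketch using Python's standard `quopri` and `base64` modules. The sample word and the helper `is_seven_bit` are assumptions made for the example only.

```python
import base64
import quopri

raw = "Привет".encode("koi8_r")   # 8-bit text: every byte has the high bit set

qp = quopri.encodestring(raw)     # quoted-printable form
b64 = base64.b64encode(raw)       # Base64 form


def is_seven_bit(data: bytes) -> bool:
    """True if no byte would be damaged by clearing the eighth bit."""
    return all(byte < 0x80 for byte in data)


print(qp, b64)
print(is_seven_bit(raw), is_seven_bit(qp), is_seven_bit(b64))

# Both forms decode back to the original bytes in the sender's own encoding.
print(quopri.decodestring(qp).decode("koi8_r"))
print(base64.b64decode(b64).decode("koi8_r"))
```

The price is the one the text mentions: without a program that decodes it, the recipient sees only letters, digits and punctuation.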

Yes, by sending letters to an unknown addressee (with whom you have not had time to agree on a format acceptable to both of you), you risk getting into trouble. But let that be on your conscience! It is easier to send the text in all encodings at once or to write in English. Or in Russian, but in Latin letters (which many do, even in spite of the efforts of “too smart” providers). The current state of affairs is unacceptable! From version to version, programs change their formatting rules, data formats and encodings. Servers acquire “unnatural” intelligence and “impenetrable” persistence, and users, instead of getting on with their work, look for ways to “deceive” them. The situation is so fragile that no one can give any guarantees, and all these tricks only make it worse. Judge for yourself: if a letter was sent to you in one specific encoding (of the two or three realistically possible), even one inconvenient for you, you can recode it yourself. If along the way it went through two or three “forcible” transformations, the information is hopelessly lost.

Newsgroups by interest

>>>> But to whom the hare?
>>> to me!
>> I CHOE !!!
> i mne !!!
and me too !!!!

Newsgroups are the only place where the question of which encoding to use ceases to be everyone's personal business. If you post a message for public reading, then obviously the encoding must be unified.

Of course, supporting several different encodings in one newsgroup is rather silly; it is reasonable to settle on one. Historically this has been KOI8-R, so let it stay until better times. In the end, taking part in discussions or not is also everyone's personal business and personal problem. If you want to take part in existing newsgroups, “KOI-ize” yourself; if you do not, create your own (on your own server and in your “own” encoding). Where there are participants there are newsgroups; when the participants leave, the newsgroups will disappear too.

In my personal opinion, the informational value of public newsgroups is rather low: serious discussions move to specialized mailing lists to avoid informational “clogging”, up-to-date information can always be found online, and chatting can be done in chat. “Aggressive” FidoNet users, who defend their network as an alternative to the Internet, sometimes wait months for replies to their messages, and the mismatch between the questions and the answers that reach them creates additional confusion.

Newsgroup participants are not the most numerous group on the Network, but, as experience shows, they are the most influential and ... conservative: “Well, KOI8 still has to be used. News, for example, has to be read in it. For now. And since I have to read something in KOI8, all the other encodings annoy me” (Mikhail Isaev, from the discussion of Russian encodings in the relcom.talk newsgroup).

However, there are more questions than answers. The painful problem of Russian-language support comes up in the existing newsgroups from time to time, making passions flare up with new force. When taking part in these discussions, one should not rely on the opinions of individual participants: some of them, despite their young age, are die-hard conservatives, others come across as tiresome talkers, and all of them seem to be united by a sense of belonging to a certain clan whose main attribute is a commitment to the KOI8-R encoding. Maybe that is the only thing it is needed for? How good it is to feel “cool” after installing KOI8-R support on your machine, and to look down on those who have not yet done so, calling them suckers or, at the very least, lamers.

Nevertheless, I did find out the points of view of quite serious and well-known people in this field: Andrey Chernov, developer of the mail program UUPC/@ and initiator of RFC 1489 (Internet Request For Comments), which registers the Russian encoding KOI8-R as a recommendation for representing any information on the network that contains Cyrillic (and, as follows from the proposal, Cyrillic in general, without regard for the opinions of other peoples who use Cyrillic); Dmitry Martynov, coordinator of the Relcom newsgroups; and other people known to me who are involved in developing and establishing Internet technologies in Russia. Their opinions do not coincide in everything, and, of course, not all of them are supporters of “draconian” measures.

Statements in newsgroups do not commit anyone to anything, so the information had to be checked and rechecked, as a rule right there, without leaving the Internet. In general, most of the information was obtained through virtual exchange. And the undoubted advantage of the latter is that it can be forcibly cut off at any minute. Since the newsgroups are open and public, I reserve the right to quote some opinions freely, and the ideas expressed in personal correspondence with the consent of their authors.

So, the key opinion: “If there were no “servers” in 1251, users would install KOI8-R drivers on their machines,” says Dmitry Martynov. In the meantime, there are more and more Windows “servers” ...

Russian coding

For a true understanding of the spirit of the subject matter, it is especially important to master the definitions.
S. C. Kleene, "Mathematical Logic"

What is an encoding? It is a method of representing a set of different characters in a computer, including letters of the alphabet, punctuation marks, digits and special characters.

When discussing the burdensome coexistence of several different encodings for Russian letters and the problems it causes, people go back almost to Cyril and Methodius. Showed off, you see ... And they forget that even the Americans, who by common opinion “tailor” everything to themselves, until recently had at least two encodings: EBCDIC (Extended Binary Coded Decimal Interchange Code) and ASCII (American Standard Code for Information Interchange).

The EBCDIC scheme was long used by IBM on mainframes and used 8 bits to represent a character. But the Americans decided to economize (the American character set is the smallest; even the British need additional symbols, for example for the pound sign) and adopted the ASCII scheme, where only 7 bits of a byte are used to encode characters. The Americans decided that 128 positions would be enough to represent the printable characters (indeed: 26 lowercase letters, 26 uppercase letters, 10 digits, a dozen or so punctuation marks, and that is all), with room to spare.

There were other encodings as well, and for a long time they all coexisted on equal terms. But then the US government “took it into its head” to support the ASCII encoding at the state level, and everyone fell into line with the main customer. The centralized and planned Soviet economy did not manage even such a small thing; on the contrary, apparently in mockery, it adopted several GOST standards for encodings and then let matters take their course. Soon all the “stillborn” GOSTs were forgotten, and today we use what the “Russifiers” gave us (the nationality of the specific localizers does not matter here, since all the “living” localizations were made at the request of Western firms or through their “fault”).

	00	01	02	03	04	05	06	07	08	09	0A	0B	0C	0D	0E	0F
80
90
A0		Ё
B0	А	Б	В	Г	Д	Е	Ж	З	И	Й	К	Л	М	Н	О	П
C0	Р	С	Т	У	Ф	Х	Ц	Ч	Ш	Щ	Ъ	Ы	Ь	Э	Ю	Я
D0	а	б	в	г	д	е	ж	з	и	й	к	л	м	н	о	п
E0	р	с	т	у	ф	х	ц	ч	ш	щ	ъ	ы	ь	э	ю	я
F0		ё
ISO 8859-5 Character Table

	00	01	02	03	04	05	06	07	08	09	0A	0B	0C	0D	0E	0F
80	А	Б	В	Г	Д	Е	Ж	З	И	Й	К	Л	М	Н	О	П
90	Р	С	Т	У	Ф	Х	Ц	Ч	Ш	Щ	Ъ	Ы	Ь	Э	Ю	Я
A0	а	б	в	г	д	е	ж	з	и	й	к	л	м	н	о	п
B0
C0
D0
E0	р	с	т	у	ф	х	ц	ч	ш	щ	ъ	ы	ь	э	ю	я
F0	Ё	ё
CP 866 Character Table (DOS)

	00	01	02	03	04	05	06	07	08	09	0A	0B	0C	0D	0E	0F
80
90
A0									Ё
B0									ё
C0	А	Б	В	Г	Д	Е	Ж	З	И	Й	К	Л	М	Н	О	П
D0	Р	С	Т	У	Ф	Х	Ц	Ч	Ш	Щ	Ъ	Ы	Ь	Э	Ю	Я
E0	а	б	в	г	д	е	ж	з	и	й	к	л	м	н	о	п
F0	р	с	т	у	ф	х	ц	ч	ш	щ	ъ	ы	ь	э	ю	я
CP 1251 Character Table (Windows)

	00	01	02	03	04	05	06	07	08	09	0A	0B	0C	0D	0E	0F
80	А	Б	В	Г	Д	Е	Ж	З	И	Й	К	Л	М	Н	О	П
90	Р	С	Т	У	Ф	Х	Ц	Ч	Ш	Щ	Ъ	Ы	Ь	Э	Ю	Я
A0
B0
C0
D0														Ё	ё	я
E0	а	б	в	г	д	е	ж	з	и	й	к	л	м	н	о	п
F0	р	с	т	у	ф	х	ц	ч	ш	щ	ъ	ы	ь	э	ю
Macintosh Character Table

Extended ASCII Character Tables (ISO 8859-5, DOS 866, Windows 1251, Macintosh)

Endlessly stumbling under the moon
Three foreheads considered ribs to me.
Got down in the third ...
Got down for the fifth time ...
How bad things are doing with us !!!
S. Statin

The ASCII table was extended at the request of European countries (and, again, only after government intervention). To represent the printable characters of most European alphabets, it was enough to “give back” the eighth bit of the byte. The extra bit means another 128 characters, and all the diacritical characters of the national alphabets could fit there. This is how the Extended ASCII character table came about. However, the American software already written by that time was designed for 7-bit encoding, and many problems arose whose description is beyond the scope of this article.

Our alphabet is completely different from the Latin one, and it has more characters (66), so it, like some other European alphabets, did not fit into the Extended ASCII table (ISO 8859-1, or Latin-1; ISO stands for International Standards Organization, that is, the International Organization for Standardization). Separate tables had to be invented for each language, and ours turned out to be the fifth in a row: ISO 8859-5. The resulting pure product did not take root. ISO 8859-5 was undone both by the fragmentation of the Russian computer world and by the rise of the IBM PC, in whose operating system Bill Gates and Microsoft used pseudographics (vertical and horizontal lines, various corners, rectangles, etc.), which took the places that Russian letters occupy in the ISO 8859-5 table. The Russian letters had to be urgently “shoved” into the places not occupied by pseudographics. It is funny that nobody adopted any standards (or rather, there were several of them, so nobody paid attention to them), but the problem was solved, and as a result of the Russification of MS-DOS the “alternative” encoding 866 appeared. Why the 866th did not suit the aforementioned Microsoft for Windows is clear: there was no need for pseudographics. But ISO 8859-5 did not suit it either, apparently on the principle of doing everything its own way (although A. Chernov claims that both CP 1251 and CP 878 (KOI8-R) are based on IBM standards, and Microsoft has nothing to do with it; CP stands for Code Page). One way or another, the Windows 1251 (CP 1251) encoding was adopted for “Russian” Windows, and it is the most widespread today.

Let us not forget the users of Macintosh computers either, who answer to no one! A special encoding was made for them too, as a treat, so that life would not seem like honey to them and they would use only native software. And ISO 8859-5, so that it would not feel offended, came to be styled the main one, but almost nobody uses it anyway.
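The practical meaning of the tables can be shown in a few lines: the same letter has a different code in almost every table. This is a sketch that assumes Python's built-in codecs (`iso8859_5`, `cp866`, `cp1251`, `mac_cyrillic`, `koi8_r`) match the tables discussed here; the `CODECS` mapping and the function name are made up for the example.

```python
# Display names from the text mapped to Python codec names (an assumption).
CODECS = {
    "ISO 8859-5": "iso8859_5",
    "CP 866 (DOS)": "cp866",
    "CP 1251 (Windows)": "cp1251",
    "Macintosh": "mac_cyrillic",
    "KOI8-R": "koi8_r",
}


def code_positions(char: str) -> dict[str, int]:
    """Return the single-byte code of `char` in each of the encodings."""
    return {name: char.encode(codec)[0] for name, codec in CODECS.items()}


for name, code in code_positions("Я").items():
    print(f"{name:18} 0x{code:02X}")
```

A byte received without a reliable label can therefore stand for any of several letters, which is the root of everything that follows.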

Now let us imagine a Russian programmer creating an information system for his enterprise. He needs a suitable code table and a program that supports it. If the program is “thoroughly” American (does not handle the eighth bit), he modifies it or writes his own. In the end, the problem is solved.

With communication programs the situation is more complicated. E-mail travels by different, sometimes unpredictable routes. And an “evil” American machine can “cut off” (reset) the 8th bit. By now almost no such machines are left (although I was immediately shown a “monster” of this kind, CompuServe); in any case, they do not act as intermediate hops in forwarding.

However, when the Internet had only just appeared in Russia (in 1990-91), this possibility was far from hypothetical; it was very real. So the next encoding, KOI8-R, in use since the mid-80s, when 7-bit terminals were still alive, proved itself quite well in its new, networked role. A certain, rather conservative Russian network culture took shape, whose representatives hold the key posts, so that even a massive influx of Microsoft Windows users (more than 90 percent of clients) cannot yet shake it. Who will win, and whether anyone will win at all, we shall see. I think the problem will wither away by itself (and perhaps together with the structure in its present form), and there is every reason to expect that. In the meantime, “forced KOI-ization” is flourishing.

Character table KOI8-R

Imagine a letter written in Russian, sent by e-mail and running on its way into an “evil” American server (and sometimes letters between neighboring houses travel via America, because the providers cannot come to an agreement) that cuts the eighth bit off every character. After such a “circumcision”, the letters that were Russian become Latin: the ones whose number in the table is lower by exactly 128. A significant number of letters of our alphabet have phonetic counterparts in the Latin one, for example Р and R, П and P. In addition, several coincide in shape. So it makes sense to arrange the Russian letters so that they differ from the corresponding Latin letters by exactly 128! Then the loss of the eighth bit turns the message into a kind of transliteration (consisting of Latin letters only), whose meaning can still be recovered and read as Russian. Unpleasant, but understandable.

KOI8-R (code for information interchange, 8 bits) is exactly such a table. And it is precisely the one that has been used from the very beginning to exchange mail and news in Russia. The first letters went out in “Ruglish”, that same phonetic equivalent in which “здравствуй” (“hello”) looks like “zdrawstwuj”. But all the letters could be read! And what can you do: mnogie do sih por tak pishut (many still write this way). But today they do so for other reasons.
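The property described above is easy to check. A minimal sketch, assuming Python's `koi8_r` and `cp1251` codecs; the function name `strip_eighth_bit` is hypothetical and simply imitates a 7-bit relay.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Imitate a 7-bit relay: clear the high bit of every byte."""
    return bytes(byte & 0x7F for byte in data)


word = "здравствуй"

# KOI8-R: what survives is a readable transliteration.
koi_survivor = strip_eighth_bit(word.encode("koi8_r")).decode("ascii")
print(koi_survivor)             # ZDRAWSTWUJ
print(koi_survivor.swapcase())  # zdrawstwuj

# Windows 1251 for contrast: the letters land on unrelated ASCII characters.
win_survivor = strip_eighth_bit(word.encode("cp1251")).decode("ascii")
print(win_survivor)
```

Note that the case comes out inverted: lowercase Cyrillic in KOI8-R sits 128 positions above the uppercase Latin letters, and vice versa.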

So, KOI8-R is today the de facto network encoding. Why? The considerations above could by now be disregarded. Today some supporters point to the description of the encoding in RFC 1489, proposed in July 1993 by Andrey Chernov, and conclude that it is a “standard”. So what? Other RFCs also propose ISO 8859-5 (for example, RFC 1700). And in any case, RFCs are recommendations, like everything else on the Internet. And if the recommendations are at odds with life and inconvenience the absolute majority of users, should not other, more convenient recommendations be adopted?

From the point of view of most users, the undisputed leader is Windows 1251. The only “serious” argument against it (apart from tradition) was voiced by D. Martynov: from a programmer's perspective the 1251st is bad because the letter “я” falls at position 0xFF, because of which some C programs that treat (char)(getchar()) == -1 as a sign of an error behave incorrectly with this letter. But then KOI8-R is no good either, since it has “Ъ” at that position, and only the “dead” ISO 8859-5 is suitable.
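The 0xFF argument can be illustrated without a C compiler. The sketch below uses Python's `struct` module to reinterpret a byte as a C `signed char`; it assumes a platform where plain `char` is signed, and the helper name `as_signed_char` is invented for the example.

```python
import struct

EOF = -1  # the value C's getchar() returns at end of input


def as_signed_char(byte: int) -> int:
    """Reinterpret one byte the way a C `signed char` would."""
    return struct.unpack("b", bytes([byte]))[0]


for codec, letter in (("cp1251", "я"), ("koi8_r", "Ъ")):
    byte = letter.encode(codec)[0]
    looks_like_eof = as_signed_char(byte) == EOF
    print(codec, letter, hex(byte), looks_like_eof)

# In ISO 8859-5 position 0xFF holds a letter that Russian does not use.
print(repr(bytes([0xFF]).decode("iso8859_5")))
```

Keeping the result of `getchar()` in an `int` rather than a `char` avoids the confusion, which suggests the problem lies with such programs rather than with either table.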

In addition, the “ownerless” encoding KOI8-R exists in such ugly implementations that its use seriously calls professionalism into question. Sometimes the letter “Ё” is there, sometimes it is not; sometimes the pseudographics are there, sometimes they are not.


Is there a candidate at all for a single encoding? Yes, and it seems we shall not have to wait long for it to come into its own, although for completely different reasons ...

	00	01	02	03	04	05	06	07	08	09	0A	0B	0C	0D	0E	0F
80
90
A0				ё
B0				Ё
C0	ю	а	б	ц	д	е	ф	г	х	и	й	к	л	м	н	о
D0	п	я	р	с	т	у	ж	в	ь	ы	з	ш	э	щ	ч	ъ
E0	Ю	А	Б	Ц	Д	Е	Ф	Г	Х	И	Й	К	Л	М	Н	О
F0	П	Я	Р	С	Т	У	Ж	В	Ь	Ы	З	Ш	Э	Щ	Ч	Ъ
Character table KOI8-R

And the Chinese are even harder ...

(«Popular Science» №8, 1996. « » №1, 1997.)


Unicode


Data from BrowserWatch.
Platform	Count	Share (%)
Windows	19959	52.3
Macintosh	9460	24.7
OS/2	3567	9.34
Unknown	2004	5.25
Unix	1708	4.47
Amiga	1265	3.31
Sega Saturn	131	0.34
WebTV	52	0.13
NeXT	10	0.02
VM/CMS	6	0.01


- rtychef. lbl demb?
- Da normal'no, au tebja?
From the conversation

Hello, dear user! We present to you the global Internet. Here you will find everything you need. Using the Internet is easier than ever: install the programs you received from your Internet service provider, launch the browser (also known as the viewer, also known as the wanderer), dial up the provider and land on its server. After studying the contents of the server, you find a hyperlink to a page of jokes (for example, www.kulichki.com/anekdot), which is located on another server. “Oh! Jokes!” you think and, anticipating the pleasure, click to move to the other server. A page of jokes slowly appears on your monitor. Hold on ... Where are the jokes? These are not jokes. This is a set of Greek or some other symbols. Probably some kind of error on their server. You return to the provider and find a link to a political news page. “Well, how is Chubais doing?” you think, downloading the news. When it turns out that Chubais now speaks a mixture of English and some unidentifiable language consisting of capital letters of the Russian alphabet, you freeze in bewilderment. Why is everything on the provider's server in Russian, and everything else not? ..

Roma Voronezh

For lovers of difficulties

Any change in the body, be it a disease or health, is reduced to the movement of substances in space ... but the demons cannot produce this movement, since it is available only to God. It is clear from this that demons cannot produce any, at least actual, bodily change, and therefore, such transformations must be attributed to some secret reason.
Witch Hammer

There are several ways of processing e-mail in transit (similar reasoning applies to news) and, accordingly, several ways of hopelessly ruining a message. These ways follow naturally from how mail delivery is organized. The user interacts with a mail program, which lets him send and receive letters, organize their storage, keep address books, and so on. One can say that the main function of the mail program is to provide a convenient user interface. To send or receive a letter, the mail program contacts a mail server, which has no user interface but knows the subtleties of mail routing. Then, directly or through intermediate machines, the letter is passed to another mail server, the one that serves the addressee. That server puts the letter into a personal mailbox, from which the recipient's mail program retrieves it.

Suppose that letters need to be recoded. By “established tradition”, either the mail program or the server can do this. I have already stated my preference for the first option, and given existing practice, the damage from that choice is minimal.

Recoding by the delivery man

If I can't eat it all, I'll at least take a bite out of each ...

Over years of work, Internet service providers have “formed the opinion” that it is much better if the software on the mail machine recodes the mail itself. Then the only thing the user has to do is turn off the encoding of eight-bit characters (otherwise, for obvious reasons, nobody will be able to restore anything afterwards). Let me stress once again that I am an active opponent of this approach, and even many providers regard it as a temporary measure.

Two programs on the server handle the sending and delivery of mail. The first is used for sending mail and is called the “SMTP server”. SMTP (Simple Mail Transfer Protocol) is the Internet's mail transfer protocol. The SMTP service accepts letters and stores them on the mail machine. Another program, the POP3 server (Post Office Protocol), hands a letter directly to the user at his request.

The rest is obvious: the SMTP server is “made” to convert letters from the sender's encoding to the network encoding (today, KOI8-R). The POP3 server has to convert letters from KOI8-R to the encoding the recipient wants to receive. From which one? .. And into which one? ..

From what and into what to convert

Where can information about the encoding of a letter come from:

from the service fields of the letter header;
from configuration files;
from a dedicated virtual server for each of the encodings;
from the contents of the letter itself, by detecting the encoding.

The first approach puts the responsibility on the user's mail program, which must put the correct charset in the letter header. This would be the most correct approach for some Germany, but not for Russia. Here even newsgroup coordinators use the wrong charset and then blame overly clever software that supposedly does everything backwards (read: follows the standards), while for everyone else (read: those who observe no agreements and simply “do as I do”) everything is fine. Given universal irresponsibility and the universal “deceiving” of programs and standards, this solution leads to dire consequences.

The second approach is primitive. The administrator of the mail machine determines once and for all that the client machine with a given network name or address works in, say, Windows 1251. From then on, all correspondence sent from this machine is assumed to be in that encoding only.

But if the machine has no fixed address, as is usually the case, and the user receives a dynamic IP address for each session when connecting to the provider, yet still contacts the same mail server and mailbox, this solution does not work. Not to mention that nothing in any other encoding can then be sent from this machine.

Dedicated virtual server. Let us run three or four mail servers on the same mail machine at once, one per encoding, and give each server a separate name, for example:
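What relying on the header means in practice can be sketched with Python's standard `email` package. The sample message, the address and the function name `decode_body` are invented for the illustration; the fallback to KOI8-R is likewise only an assumption.

```python
from email import message_from_bytes


def decode_body(raw_message: bytes, fallback: str = "koi8_r") -> str:
    """Decode a single-part text message using the charset from its header."""
    msg = message_from_bytes(raw_message)
    charset = msg.get_content_charset() or fallback
    payload = msg.get_payload(decode=True)  # raw body bytes
    return payload.decode(charset, errors="replace")


def build(charset_label: str, body: bytes) -> bytes:
    """Assemble a tiny 8-bit message with the given charset label."""
    header = (
        "From: sender@example.org\r\n"
        f"Content-Type: text/plain; charset={charset_label}\r\n"
        "Content-Transfer-Encoding: 8bit\r\n"
        "\r\n"
    ).encode("ascii")
    return header + body


body = "Привет".encode("cp1251")
print(decode_body(build("windows-1251", body)))  # correctly labeled
print(decode_body(build("koi8-r", body)))        # mislabeled by the sender
```

With an honest label the text is read correctly; with a wrong one the reader gets gibberish, and the receiving software has no way to notice.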

win.mail.access.ru;
alt.mail.access.ru;
koi.mail.access.ru;
iso.mail.access.ru.

We will ask Windows users to fetch their mail from the win.mail.access.ru server, DOS users from alt.mail.access.ru, and so on. The client only needs to specify the right address of the mail machine when configuring the mail program. Moreover, by splitting the outgoing SMTP stream and the incoming POP3 stream across different encodings, one gains additional flexibility. Suppose that Netscape, as mentioned above, forcibly converts Windows 1251 into KOI8-R when sending but does not perform the reverse conversion. Then we set SMTP to KOI and POP3 to Win in it, so that outgoing correspondence is recoded by Netscape and incoming correspondence by the server. You can try other chains with exotic destinations; there are many options. This, in my opinion, is the most correct solution (if anything needs to be done at all). The user still has to enter a server name anyway, and this approach neither treats him as stupid nor takes away all his freedom, as the previous one does. And one can always keep a default option that does nothing and simply passes the mail on as is. I have exactly such a provider. And I praise it not because it is mine; it is mine because it does this.
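The scheme can be modelled in a few lines. This is only a sketch: the host names come from the example above, while the mapping to Python codecs, the constant `NETWORK_CODEC` and both function names are assumptions made for the illustration, not a description of any real mail server.

```python
# Hypothetical mapping of the virtual host names to Python codecs.
HOST_CODECS = {
    "win.mail.access.ru": "cp1251",
    "alt.mail.access.ru": "cp866",
    "koi.mail.access.ru": "koi8_r",
    "iso.mail.access.ru": "iso8859_5",
}
NETWORK_CODEC = "koi8_r"  # the encoding letters are stored and relayed in


def smtp_accept(raw: bytes, host: str) -> bytes:
    """Outgoing mail: client encoding (implied by the host) -> network encoding."""
    text = raw.decode(HOST_CODECS[host], errors="replace")
    return text.encode(NETWORK_CODEC, errors="replace")


def pop3_deliver(stored: bytes, host: str) -> bytes:
    """Incoming mail: network encoding -> client encoding implied by the host."""
    text = stored.decode(NETWORK_CODEC, errors="replace")
    return text.encode(HOST_CODECS[host], errors="replace")


# The Netscape case: the client has already recoded to KOI8-R when sending,
# so it submits through the "koi" name and collects mail through the "win" name.
outgoing = "Привет".encode("koi8_r")
stored = smtp_accept(outgoing, "koi.mail.access.ru")    # effectively a no-op
delivered = pop3_deliver(stored, "win.mail.access.ru")
print(delivered.decode("cp1251"))
```

The choice stays with the user: picking the "koi" name on both sides turns the server back into a plain postman.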

I didn’t like the idea of automatic detection of encoding for all apparent temptation. It is reliable to determine in which encoding the text is presented, and even with our literacy, by recognition technology. Of course, such a program is enough to learn how to distinguish Russian from abracadabra. And if I want to write in Tatar! Moreover, some providers express the opinion that unrecognized letters should not be sent to the addressee! Well, how to call them after that? Mailing any abracadabra is the constitutional right of users.
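For illustration, here is about the simplest detector one could write: decode the bytes under each candidate encoding and keep the one that yields the most of the commonest lowercase Russian letters. The letter set, the candidate list and the function name are assumptions, and the sketch shares the weakness described above: it knows nothing about Tatar or any other language.

```python
CANDIDATES = ("koi8_r", "cp1251", "cp866", "iso8859_5")
COMMON_LETTERS = set("оеаинтсрвл")  # frequent lowercase Russian letters (assumed)


def guess_encoding(raw: bytes) -> str:
    """Pick the candidate under which the text looks most like Russian."""

    def score(codec: str) -> int:
        text = raw.decode(codec, errors="replace")
        return sum(ch in COMMON_LETTERS for ch in text)

    return max(CANDIDATES, key=score)


sample = "Привет, как дела? Сегодня хорошая погода."
for codec in CANDIDATES:
    print(codec, "->", guess_encoding(sample.encode(codec)))
```

On a full Russian sentence the guess tends to be right; on short, mixed or non-Russian text it can easily be wrong, which is one more reason not to let a server discard what it fails to recognize.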

There is no universal solution today, even when you try to foresee everything. Here is an example given by Gregory Naumovets in the conference: “Recently I needed to send one message in Cyrillic via a mailing list to several dozen people with different servers (some recode KOI8 <-> 1251, others do not) and mailers (from Eudora to Bmail). I wanted everyone to see the Cyrillic immediately, without switching fonts or encodings. So I included four pieces in the letter: (1) English, (2) KOI8, (3) 1251 and (4) one that should turn into KOI8 when the recipient's POP server or software recodes it by the KOI8->1251 table. And what happened? One of the addressees, a Dmail user, still replied: “I can’t read the letter, because it doesn’t have the begin line” (???). It turns out that some server along the way was even smarter than me and for some reason wrapped my letter in Base64.”
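The fourth piece in that letter relies on a small trick that can be sketched in Python. This is an illustration, not the author's actual procedure: the function names are hypothetical, and the recoding server is assumed to apply a plain KOI8-R to Windows 1251 table.

```python
def server_recode(data: bytes) -> bytes:
    """What a recoding server is assumed to do: read KOI8-R, emit Windows 1251."""
    return data.decode("koi8_r").encode("cp1251")


def precompensate(text: str) -> bytes:
    """Build bytes that turn into correct KOI8-R only after server_recode.

    The KOI8-R bytes are reinterpreted as Windows 1251 text and that text is
    encoded as KOI8-R, which undoes the server's conversion in advance.
    A character with no counterpart in both tables raises UnicodeEncodeError.
    """
    return text.encode("koi8_r").decode("cp1251").encode("koi8_r")
```

A recipient behind a recoding server gets readable KOI8-R from this piece, while everyone else sees it as gibberish and reads one of the other three pieces instead.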

World Wide Web

The World Wide Web lives by its own laws. HTML's layout limitations have long annoyed professionals; for them it is two steps back. Among the attempts to overcome this problem are some that, in passing, also solve our “Russian” problem.

What do we have today? The absurdity of the encoding situation is obvious here: the classic Russian stone at the crossroads on every site, “choose your encoding”. This is completely abnormal. Content has to be duplicated (tripled, quadrupled), which increases the likelihood of errors and blunders, and this solution raises more questions than it answers. What is to be done, for example, with form filling (interactive questionnaires), server-parsed HTML (with macro substitutions), and so on?

Two modern alternative solutions are possible:

transcoding the base page on the fly with a selected CGI script (that is, a program running on the server);
automatic detection of the required encoding from the request.

In general, neither is satisfactory, although both are more convenient than simple duplication. The first is somewhat inconvenient to use and increases access time. The second is not always feasible, since it relies on the browser sending the server some information about itself, sometimes including the platform or operating system; that information is not always accurate and does not guarantee that the user has fonts in the right encoding.
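Both approaches can be combined in a short Python sketch: guess the encoding from what the browser reports, then transcode a single master page on the fly. Everything here is an assumption made for illustration: the master page is taken to be stored in KOI8-R, the platform markers are simple substrings, and the names PLATFORM_CHARSETS, choose_charset and serve are hypothetical.

```python
from typing import Tuple

# (substring of the browser's self-description, charset label, Python codec)
PLATFORM_CHARSETS = [
    ("Win", "windows-1251", "cp1251"),
    ("Mac", "x-mac-cyrillic", "mac_cyrillic"),
]
DEFAULT_CHARSET = ("koi8-r", "koi8_r")  # assumption: everything else gets KOI8-R


def choose_charset(user_agent: str) -> Tuple[str, str]:
    """Pick a charset label and codec from the browser's self-description."""
    for marker, label, codec in PLATFORM_CHARSETS:
        if marker in user_agent:
            return label, codec
    return DEFAULT_CHARSET


def serve(page_koi8: bytes, user_agent: str) -> Tuple[str, bytes]:
    """Transcode the KOI8-R master page for this client; return header and body."""
    label, codec = choose_charset(user_agent)
    body = page_koi8.decode("koi8_r").encode(codec)
    return "Content-Type: text/html; charset=" + label, body
```

The sketch shares the weakness named above: the platform string says nothing about which fonts are actually installed, so a correct guess about the operating system can still produce an unreadable page.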

It should be borne in mind that the number of users working exclusively with e-mail will steadily decline, while the number of on-line users will keep growing. Such users will need additional services and, of course, better display quality and more varied service. Today's user of the Russian network is mostly a programmer (about three-quarters, according to some polls), and programmers, as you know, are unpretentious, dirty, lazy and sloppy. Until people from the humanities (including print professionals, artists and designers) get into the “web”, nothing will change for the better.

Therefore we will look, as always, to the West (and mostly to the “wild” one, that is, America), where this has already happened (and, judging by some trends, we are not far behind). You can hardly explain the “stripping” of the eighth bit to such a user, but he will immediately notice sloppy layout, errors and “crooked” fonts. Where is the way out?

There are two:

laying out and saving Web pages in PDF format (Adobe Acrobat);
using Web fonts sent along with the document, together with the CSS extensions to HTML.

I will not dwell on the PDF format, although from my point of view it is the best way out (independent of encoding and language, and of platform and operating system). Unfortunately, Adobe “slept through” the Internet wave, and although Acrobat files are supported by many Web browsers (usually through a plug-in), they have not become the standard.

CSS extensions resemble the ancient stages in the development of programming languages and tools (just as the Internet as a whole, for all the novelty of its technologies, keeps reminding one of something archaic, now in one part, now in another). Those interested can read about them at W3, but to us they seemed a temporary measure (and perhaps a “stillborn” one).

Web fonts guarding the Russian language

So, people from the humanities have come to the Internet. HTML documents are evolving from simple <HR> rules and 3D buttons into professional, well-designed pages whose purpose is to attract and retain non-programmer visitors. From the print professional's point of view, one of the most serious shortcomings of Web page development so far has been the inability to fix the font, so the designer cannot be sure that the client will see on screen exactly what was intended.

The best solution today, one that guarantees correct, high-quality fonts, is considered to be fonts embedded directly in Web documents and transferred to the user (usually for viewing only). Both main font types are proposed for use: TrueType and PostScript. TrueType fonts, with proper hinting, give a better image on screen, while PostScript fonts, the production standard in publishing, provide high-quality printing.

The use of embedded fonts is held back only by the increase in the size of the HTML file and, as a result, in transmission time if decent rendering quality is to be maintained.

Until recently, specifying particular fonts in HTML was almost impossible. The designer chose a font and included it in the definition of the Web page, and a user who did not have that font saw whatever the browser substituted by default (in our Russian case, the browser could substitute a font with no Russian letters at all). Even users themselves could not always change the defaults built into browsers, and the result drifted even further from the design (if there was one). Many, of course, rendered text as graphics, but this led to rather bulky files that, moreover, could not be read by text browsers (such as Lynx). Embedding the font in the document eliminates these problems. The user sees exactly the fonts chosen in the design, even if they are not installed on his computer. Leading browser makers have announced support for embedded fonts.

Another advantage of embedded fonts is the quality that can be achieved in display and printing (this matters especially for large documents that are hard to read on screen). When a font is sent with the document, it carries all the information about the characters, including hinting and the other means its creators used to improve display and print quality. Moreover, the plan is to transmit only the characters actually used in the document (a subset of the font), which saves even more.
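The first step of subsetting, finding which characters a document actually uses, can be shown in a few lines of Python. This is only a sketch of the idea: real subsetting operates on the font's glyph tables, and the function name used_characters is hypothetical.

```python
from typing import List


def used_characters(text: str) -> List[str]:
    """Return the distinct non-whitespace characters in a document, sorted."""
    return sorted({ch for ch in text if not ch.isspace()})


# Only these characters would need glyphs in the embedded subset.
subset = used_characters("Интернет по-русски")
```

Even a long Russian page uses a small share of the characters in a full font, which is where the saving comes from.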

Further reduction in the size of the transferred files can be provided by additional compression (lossless, if possible). As more and more creative people take part in preparing Web publications, and they understand nothing of communication problems, file sizes will keep growing.

To use a font subset, the applications used to create the document must support this method. Microsoft Office 97 has this feature: it can embed fonts in a document, check which fonts occur in the document and attach an encrypted version of the font to it. Some fonts do not permit embedding, and this capability does not extend to them. When a document with legally embedded fonts reaches its destination, the fonts are decrypted.

Not long ago I created a home page on the Internet, at www.online.ru/people/aship . Right after that, to my considerable surprise, I began receiving quite a lot of letters from abroad with impressions of the page. What is very nice, all the letters were written in Russian (though sometimes clumsily), in transliteration. And every one of them, without exception, contained the same question: “What kind of leapfrog have you set up with your encodings?!” Here is one such letter. It is more eloquent than a dozen articles.
  Ya v AustraliY i ottuda tebe privet!
 prejnyaya stranichka tchitalasy cherez KOI8.
 A chto mne s novoy sdelaty?  okolo 40 statey
 po russkim fontam prochel ........ Pochemu ne
 dogovoritesy pabotaty s odnoy kodirovkoy.  Drug,
 my vse jelaem chitaty smotrety vashy sayty. 
Andrey Shipilov

Font compression software is based on the Agfa MicroType Express and Adobe CFF technologies, which compress without loss of quality. MicroType Express works with TrueType and PostScript fonts, CFF with PostScript fonts only. Agfa's compression fully preserves the original information and, combined with font subsetting, achieves 90 percent compression for Latin-based fonts and up to 99 percent for Chinese, Korean and “ours”. Unlike the lossy compression methods used for storing images and films, MicroType Express keeps the letters and their appearance fully faithful to the original when sent over the Internet and supports decompression on the fly. All hints are preserved, which ensures high-quality display and printing. Microsoft has pledged to support these technologies in Internet Explorer.

OpenType

Microsoft and Adobe are working together on a new universal font format called OpenType, combining TrueType and PostScript. Part of the technology being created is font compression. OpenType will improve the management of existing fonts and create a format that will work with the new generation of computer fonts designed for Web use.

Statistics from www.medlux.ru, by platform

Count  Platform
39030  Win95
27094  Windows (.)
10321  Win 16
 2903  WinNT
 1440  Win32
 1254  FreeBSD
 1181  SunOS
  863  Unknown (Lynx)
  595  Macintosh
  583  OS/2
  540  Linux
  192  Mac PowerPC
  156  HP-UX
  132  AIX
   97  BSD/386
   80  OSF1
   34  AlphaServer
   28  BSD/OS
   16  OpenVMS
   14  IRIX64
    7  Alpha
    6  WebTV

Shares by platform family: MS 93.3%, Unix 5.1%, Mac 0.9%, OS/2 0.7%
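The summary percentages can be checked against the table with a few lines of Python. The grouping below is an assumption: the five Windows rows are counted as MS and the two Macintosh rows as Mac. The article does not say which rows it counted as Unix, so that share is left out. The names COUNTS, FAMILIES and share are hypothetical.

```python
# Counts copied from the table above.
COUNTS = {
    "Win95": 39030, "Windows (.)": 27094, "Win 16": 10321, "WinNT": 2903,
    "Win32": 1440, "FreeBSD": 1254, "SunOS": 1181, "Unknown (Lynx)": 863,
    "Macintosh": 595, "OS/2": 583, "Linux": 540, "Mac PowerPC": 192,
    "HP-UX": 156, "AIX": 132, "BSD/386": 97, "OSF1": 80, "AlphaServer": 34,
    "BSD/OS": 28, "OpenVMS": 16, "IRIX64": 14, "Alpha": 7, "WebTV": 6,
}

# Assumed grouping of table rows into families.
FAMILIES = {
    "MS": ["Win95", "Windows (.)", "Win 16", "WinNT", "Win32"],
    "Mac": ["Macintosh", "Mac PowerPC"],
    "OS/2": ["OS/2"],
}


def share(family: str) -> float:
    """Percentage of all counted requests that belong to one family."""
    total = sum(COUNTS.values())
    return 100 * sum(COUNTS[name] for name in FAMILIES[family]) / total
```

Under this grouping the rounded shares for MS, Mac and OS/2 match the figures in the summary line.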

Source: https://habr.com/ru/post/178275/
