The planet Mars has been inhabited by robots for years now. Unmanned electric vehicles and flying drones appear here and there, while in programs written in Java, encoding problems keep surfacing with enviable regularity. I want to share my thoughts on why this happens.
Suppose we have a file that stores the text we need. To work with this text in Java, we need to load the data into a String. How do we do that?
readFile:

```java
String readFile(String fileName, String encoding) {
    StringBuilder out = new StringBuilder();
    char buf[] = new char[1024];
    InputStream inputStream = null;
    Reader reader = null;
    try {
        inputStream = new FileInputStream(fileName);
        reader = new InputStreamReader(inputStream, encoding);
        for (int i = reader.read(buf); i >= 0; i = reader.read(buf)) {
            out.append(buf, 0, i);
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        if (inputStream != null) {
            try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    String result = out.toString();
    return result;
}
```
Note that to read a file it is not enough to know its name: you also need to know the encoding of the data it contains. The binary representation of characters in the memory of a Java machine and in a file on disk almost never coincide, so you cannot simply copy data from a file into a string. First you need to get a sequence of bytes, and only then convert it into a sequence of characters. In the example above, the InputStreamReader class does this.
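The difference that the charset makes in that byte-to-character step can be shown in miniature. A minimal sketch (the class name and byte values are my own illustration): decoding the same two bytes as UTF-8 yields one Cyrillic letter, while decoding them as Latin-1 yields two unrelated characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Convert a byte sequence into a character sequence using an explicit charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xD0, (byte) 0x96}; // UTF-8 encoding of the Cyrillic letter "Ж"
        System.out.println(decode(bytes, StandardCharsets.UTF_8));      // one character: Ж
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1)); // two unrelated characters
    }
}
```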
The code is quite cumbersome, even though the need to convert from bytes to characters and back arises very often. It would therefore be logical to provide the developer with helper functions and classes that make recoding easier. What did the Java developers do? They introduced functions that do not require an encoding at all. For example, the InputStreamReader class has a constructor that takes a single parameter of type InputStream.
readFile:

```java
String readFile(String fileName) {
    StringBuilder out = new StringBuilder();
    char buf[] = new char[1024];
    try (InputStream inputStream = new FileInputStream(fileName);
         Reader reader = new InputStreamReader(inputStream)) {
        for (int i = reader.read(buf); i >= 0; i = reader.read(buf)) {
            out.append(buf, 0, i);
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    String result = out.toString();
    return result;
}
```
It became a little simpler. But here the Java developers buried a serious rake: they used the so-called "default character encoding" for the data conversion.

The default charset is chosen by the Java machine once at startup, based on data taken from the operating system, and is exposed for informational purposes in the file.encoding system property. This leads to the following problems.
- The default encoding is a global parameter. It is impossible to set one encoding for some classes or functions, and another encoding for others.
- The default encoding cannot be changed during program execution.
- The default encoding depends on the environment, so you cannot know in advance what it will be.
- The behavior of methods that depend on the default encoding cannot be reliably covered with tests, because there are a great many encodings and the set can be extended. Some new OS with a UTF-48-style encoding may come out, and all your tests will be useless on it.
- When errors occur, you have to analyze more code to figure out which encoding a particular function actually used.
- If the environment changes after startup, the behavior of the JVM becomes unpredictable.
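The first points are easy to see for yourself. A small sketch (the class name is mine) showing that the default charset is fixed at startup: even overwriting the file.encoding property afterwards does not change what Charset.defaultCharset() returns, because the value was cached when the JVM started.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // The default charset is chosen once, at JVM startup, from the OS environment.
        Charset initial = Charset.defaultCharset();
        System.out.println(initial);
        System.out.println(System.getProperty("file.encoding"));

        // Overwriting the property at runtime has no effect on the cached value.
        System.setProperty("file.encoding", "KOI8-R");
        System.out.println(Charset.defaultCharset().equals(initial)); // true
    }
}
```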
But the main thing is that an important aspect of the program's behavior is hidden from the developer, who may simply not notice that he has used a function that works differently in different environments. The FileReader class offers no way to specify an encoding at all, yet the class itself looks logical and convenient, so it actively encourages writing platform-dependent code.
Because of this, amazing things happen. For example, a program may fail to open a file that it itself created earlier.
Or, say, we have an XML file with encoding="UTF-8" declared in its header, but in a Java program the file is opened with the FileReader class, and good luck: on some machines it opens fine, on others it does not.
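With XML the robust approach is to hand the parser raw bytes and let it read the encoding from the document's own declaration, instead of pre-decoding the file with FileReader. A sketch, assuming a standard JAXP parser (the document content is my own example):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlEncodingDemo {
    public static void main(String[] args) throws Exception {
        // The parser receives bytes and honors the encoding="UTF-8" declaration.
        // A FileReader would have decoded with the platform default charset
        // before the parser ever saw the declaration.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><greeting>Привет</greeting>"
                .getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(xml);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        System.out.println(doc.getDocumentElement().getTextContent()); // Привет
    }
}
```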
The file.encoding problem shows up especially vividly on Windows, where Java uses the ANSI code page as the default encoding (for Russia that is Cp1251). Windows itself describes this setting as "the language for displaying text in programs that do not support Unicode." What Java, originally designed for full Unicode support, has to do with that is unclear: the native encoding of Windows has been UTF-16LE since roughly Windows 95, three years before the first release of Java.
So if you save a file on your computer with your Java program and send it to a colleague in Europe, the recipient may be unable to open it with that very same program, even on the same version of the operating system. And after moving from Windows to Mac or Linux, you may be unable to read your own files.
And then there is the Windows console, which works in the OEM code page. Up to Java 1.7 we all watched any Russian text printed to the black window via System.out turn into mojibake. That, too, is the result of functions that rely on the default character encoding.
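One way to take the guesswork out of console output is to wrap the stream in a PrintStream with an explicit encoding. A sketch using an in-memory buffer as a stand-in for the console (class name and sample text are mine):

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class ConsoleEncodingDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the console stream, so the byte count is easy to inspect.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        // Unlike plain System.out, this PrintStream encodes with the charset we name.
        PrintStream out = new PrintStream(sink, true, "UTF-8");
        out.print("Привет");

        System.out.println(sink.size()); // 12: each of the six Cyrillic letters takes 2 bytes in UTF-8
    }
}
```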
I solve the problem of encodings in Java as follows:
- I always run Java with the -Dfile.encoding=UTF-8 option. This removes the dependency on the environment and makes program behavior deterministic and compatible with most operating systems.
- When testing my programs, I always include runs with a non-standard, ASCII-incompatible default encoding. This catches libraries that use classes like FileReader. When I find such a library, I try not to use it: first, there will be encoding problems, and second, the quality of the code in such a library is in serious doubt. To be sure, I usually run java with the -Dfile.encoding=UTF-32BE option.
This does not give an absolute guarantee, because there are also launchers that start Java in a separate process with whatever parameters they consider necessary. Many plugins for Ant did this, for example: Ant itself ran with file.encoding=UTF-8, but some code generator called by a plugin ran with the default encoding, and the result was the usual mess of mixed encodings.
In theory, over time code should get better, programs more reliable, formats more standardized. In practice this does not happen; instead, there is a surge of encoding errors in Java programs. Apparently, people who are not used to dealing with encoding problems have migrated into the Java world. In C#, for example, UTF-8 is the default encoding, so a developer coming from C# quite reasonably assumes that InputStreamReader uses the same default, and does not dig into the details of its implementation.
I recently stumbled upon exactly this kind of error in the maven-scr-plugin.
But the real surprise came when moving to Java 8: tests showed that an encoding problem had crept into the JDK itself.
Bug in JDK 8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.crypto.Cipher;

public class PemEncodingBugDemo {
    public static void main(String[] args) {
        try {
            String str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345467890\r\n /=+-";
            byte ascii[] = str.getBytes(StandardCharsets.US_ASCII);
            byte current[] = str.getBytes(Charset.defaultCharset());
            if (Arrays.equals(ascii, current)) {
                System.err.printf("Run this test with non-ascii native encoding,%n");
                System.err.printf("for example java -Dfile.encoding=UTF-16%n");
            }
            Cipher.getInstance("RC4");
        } catch (Throwable e) {
            e.printStackTrace();
        }
    }
}
```
It does not reproduce on Java 9; apparently it has already been fixed.
Searching the bug database, I found another recently closed bug related to the same functions. Characteristically, even the fix is not quite right: colleagues forget that for standard encodings, starting with Java 7, the constants from the StandardCharsets class should be used. So, unfortunately, plenty of surprises still await us.
Running grep over the JDK sources, I found dozens of places where platform-dependent functions are used. All of them will work incorrectly in an environment whose native encoding is incompatible with ASCII. The Currency class, for example, even though that class, of all things, ought to take every aspect of localization into account.
When certain functions begin to cause problems and an adequate alternative exists, it has long been known what to do: mark those functions as deprecated and point to their replacements. This is the well-proven deprecation mechanism, which there are even plans to develop further.
I believe that the functions that depend on the default encoding should be deprecated, especially since there are not that many of them:
| Function | Replacement |
|---|---|
| Charset.defaultCharset() | remove |
| FileReader(String) | FileReader(String, Charset) |
| FileReader(File) | FileReader(File, Charset) |
| FileReader(FileDescriptor) | FileReader(FileDescriptor, Charset) |
| InputStreamReader(InputStream) | InputStreamReader(InputStream, Charset) |
| FileWriter(String) | FileWriter(String, Charset) |
| FileWriter(String, boolean) | FileWriter(String, boolean, Charset) |
| FileWriter(File) | FileWriter(File, Charset) |
| FileWriter(File, boolean) | FileWriter(File, boolean, Charset) |
| FileWriter(FileDescriptor) | FileWriter(FileDescriptor, Charset) |
| OutputStreamWriter(OutputStream) | OutputStreamWriter(OutputStream, Charset) |
| String(byte[]) | String(byte[], Charset) |
| String(byte[], int, int) | String(byte[], int, int, Charset) |
| String.getBytes() | String.getBytes(Charset) |
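Until such deprecation happens, nothing stops us from writing in the explicit-charset style today. A sketch using the String methods from the last two rows of the table (class name and sample text are mine):

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // With an explicit charset, behavior no longer depends on the JVM's environment.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);        // instead of s.getBytes()
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8); // instead of new String(bytes)
    }

    public static void main(String[] args) {
        String original = "Привет, Mars!";
        System.out.println(decode(encode(original)).equals(original)); // true
    }
}
```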
I almost forgot. So what about the spacecraft on Mars?
Part of the software for the Martian probe Schiaparelli was written in Java, on version 1.7, current at the time. The probe was launched in the spring, and the journey to its destination took six months. While it was in flight, the JDK at the European Space Agency was updated.

So what? Software development for the current mission was finished, software for the next one had to be written, and we were still sitting on Java 7. NASA and Roskosmos had long since moved to Java 8, with its lambdas, streams, default interface methods, a new garbage collector, and so on.

So they updated, and before the landing they sent a control command to the spacecraft in an encoding it did not expect.