Pseudo-identification of the cello

When Alexei TheShade Shipilev talked about the behavior of Java strings with a zero hashcode value, he cited as an example the string " " . When FindBugs warns you about problems with calculating the absolute value of the hashcode equal to Integer.MIN_VALUE, it gives examples of lines that have such a hashcode - "polygenelubricants" or "DESIGNING WORKHOUSES" . Where did these examples come from? How to make a beautiful string with the specified hashcode?

There are 2 ³² different hashcodes - just over four billion, and there are about a hundred thousand words in human language. It is almost impossible to find one word with the necessary hashcode, but a combination of two words is quite possible. If you add more variations like prepositions, a choice will appear.

Enumerate all possible combinations for a long time, but you can optimize the process by performing simple transformations over the string hashcode formula. Let's write a phrase generator with a given hashcode. We will write in pure Java 8, in the now fashionable functional style.

So, the hashcode formula h is from the string s , where l (s) is its length, and s [i] is the i -th symbol:

')
Calculations are done modulo 2 ³² , since integer overflow is common here. Note that if we have two strings s ₁ and s ₂ with known hash codes, then the hash code of concatenation of these strings will be equal to

Here the algorithm is already pecking. If we want to make a string with a given hashcode of two parts, the second part can be selected so that it complements the hashcode to the desired value. We'll have to run through all the possible lengths of the second part, but it's still much faster than going through all the pairs.

We will generate phrases in this form:
"<___> [</>] <___>"
A preposition with spaces around it or simply a space is called infix. The first word will be s ₁ , and the infix and the second will be s ₂ . Base words take, for example, here (file litf-win.txt). The file looks like this:

  21715  2  1  1  2  1  7 ...

The number does not interest us, and we will cut it off. In addition, we throw out words shorter than three characters:

 List<String> words = Files.readAllLines(Paths.get("litf-win.txt"), Charset.forName("cp1251")).stream() .map(s -> s.substring(0, s.indexOf(' '))) .filter(s -> s.length() > 2) .collect(Collectors.toList());

Prepositions we type manually:

 String[] preps = { "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "" };

Create a list of infixes: add spaces around prepositions and a separate space:

 List<String> infixes = Stream.concat(Stream.of(" "), Arrays.stream(preps).map(p -> ' '+p+' ')) .collect(Collectors.toList());

Create a stream of all possible combinations of infixes and words (lines s ₂ ):

 words.stream().flatMap(s -> infixes.stream().map((String infix) -> infix+s))

Now we will form an associative array from this stream (string length -> (hashcode -> string)). In Java 8, this is much easier than in previous versions:

 Map<Integer, Map<Integer, List<String>>> lenHashSuffix = words.stream() .flatMap(s -> infixes.stream().map((String infix) -> infix+s)) .collect(Collectors.groupingBy(String::length, Collectors.groupingBy(String::hashCode)));

Let's make a stream for s ₁ - words with a capital letter (alas, there is still no ready method for this):

 words.stream().map(s -> Character.toTitleCase(s.charAt(0)) + s.substring(1))

To match s _{1 with} all sorts of options for s ₂ , you can use flatMap . It remains to lenHashSuffix over all lengths from lenHashSuffix , calculate the appropriate hashcode for s _2, and extract the lines with this hashcode. Here the question arises: how for a given len length to calculate h ( s ₁ ) · 31 ^len ? The Math.pow method does not work: it works with fractional numbers. It would be possible to write a for loop, but this is inconsistent! Fortunately, we have reduce!

 int hash = IntStream.range(0, len).reduce(s.hashCode(), (a, i) -> a*31);

Denote the target hashcode by the target. Then for each entry from lenHashSuffix stream of the s ₂ rows that satisfy us can be obtained as follows:

 entry.getValue().getOrDefault( target - IntStream.range(0, entry.getKey()).reduce(s.hashCode(), (a, i) -> a*31), Collections.emptyList()).stream()

It remains to glue s ₁ and s ₂ , sort alphabetically and output to the console:

 words.stream() .map(s -> Character.toTitleCase(s.charAt(0)) + s.substring(1)) .flatMap(s -> lenHashSuffix.entrySet().stream() .flatMap(entry -> entry.getValue().getOrDefault( target - IntStream.range(0, entry.getKey()).reduce(s.hashCode(), (a, i) -> a*31), Collections.emptyList()).stream().map(suffix -> s+suffix))) .sorted().forEach(System.out::println);

That's the whole program.

Full source code

 import java.nio.charset.Charset; import java.nio.file.*; import java.util.*; import java.util.stream.*; public class PhraseHashCode { public static void main(String[] args) throws Exception { int target = Integer.MIN_VALUE; String[] preps = { "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "" }; List<String> infixes = Stream.concat(Stream.of(" "), Arrays.stream(preps).map(p -> ' '+p+' ')) .collect(Collectors.toList()); List<String> words = Files.readAllLines(Paths.get("litf-win.txt"), Charset.forName("cp1251")).stream() .map(s -> s.substring(0, s.indexOf(' '))) .filter(s -> s.length() > 2) .collect(Collectors.toList()); Map<Integer, Map<Integer, List<String>>> lenHashSuffix = words.stream() .flatMap(s -> infixes.stream().map((String infix) -> infix+s)) .collect(Collectors.groupingBy(String::length, Collectors.groupingBy(String::hashCode))); words.stream() .map(s -> Character.toTitleCase(s.charAt(0)) + s.substring(1)) .flatMap(s -> lenHashSuffix.entrySet().stream() .flatMap(entry -> entry.getValue().getOrDefault( target - IntStream.range(0, entry.getKey()).reduce(s.hashCode(), (a, i) -> a*31), Collections.emptyList()).stream().map(suffix -> s+suffix))) .sorted().forEach(System.out::println); } }

results

The program fulfills seconds in ten and gives more than a hundred options. Of course, most of them are grammatically inconsistent or just boring. But quite good too come across.

So, if you want to make a report that the hashcode for certain lines is recalculated each time, it can be illustrated with the following examples (target = 0):

 "  " " " "  " "  " "  " "  "

If you are going to tell your colleagues why it is harmful to calculate Math.abs from the hashcode, you will need the following lines (target = Integer.MIN_VALUE):

 "  " "  " "  " "  " "  "

By changing the target value, you can generate phrases for any other hashcode.

Source: https://habr.com/ru/post/252671/

All Articles

Pseudo-identification of the cello

results

More articles: