📜 ⬆️ ⬇️

Use a hash search, not an array search

A rather frequent task is to check whether the string is the same as the other rows in the set. For example, you need to check every word in a forum message to see if it is in the list of prohibited ones. A common solution: create an array with a list of forbidden words, and then use the in_array() function to do a check. There are ways to improve the performance of such an algorithm.

Array search


Usually the check happens like this:

 <?php $words = get_all_words_in_text($text); $badWords = ['$$$$', '@#$%', 'crud' /** ... */ ]; foreach ($words as $word) { if (in_array(strtolower($word), $badWords)) { echo 'Found bad word: ' . $word . "\n"; } } 


This method solves the problem, but it is not the most effective. Pass through the array of words in the message and check each for being in the list of prohibited by the function in_array() . In PHP, the algorithm by which the function in_array() is implemented has linear complexity - O (n). This means that with an increase in the list of bad words, the work time will increase proportionally. We can come up with something better.
')

Search by hash


Since the list of bad words is known in advance, you can rework the way of comparison so that it will have a constant complexity that does not depend on the number of prohibited words in the list. For this you can use associative arrays. As in the case of hash tables, the speed of searching for a key in an array is O (1), except for some cases that will not arise in our situation.

If we change the structure of an array of bad words so that its values ​​become keys, and the values ​​for these keys become just true , we can use the isset() function and get a significant speed increase.

 <?php $words = get_all_words_in_text($text); $badWords = [ '$$$$' => true, '@#$%' => true, 'crud' => true // ... ]; foreach ($words as $word) { if (isset($badWords[strtolower($word)])) { echo 'Found bad word: ' . $word . "\n"; } } 


Performance test


Let's try to test a new way. I wrote a simple test that will show the elapsed time on the same data set for the methods: “array search” and “search by hash”.

 <?php $total = 10000; $paragraph = 'this is a sentence. Crud! $$$$!'; $words = explode(' ', $paragraph); $badWordList = ['$$$$', '@#$%', 'crud', 'fud', 'fudd', 'dud']; $s = microtime(true); for ($j = 0; $j < $total; $j++) { foreach ($words as $word) { in_array(strtolower($word), $badWordList); } } echo "in_array: " . (microtime(true) - $s) . "\n"; $badWordHash = [ '$$$$' => true, '@#$%' => true, 'crud' => true, 'fud' => true, 'fudd' => true, 'dud' => true ]; $s = microtime(true); for ($j = 0; $j < $total; $j++) { foreach ($words as $word) { isset($badWordHash[strtolower($word)]); } } echo "hash: " . (microtime(true) - $s) . "\n"; 


As you can see from the listing, 10,000 repetitions are used in the test. The result is:

 in_array: 0.033491134643555 hash: 0.0069370269775391 


As you can see, the search by hash gave an increase of 480% compared with the search of the array.

It is important to understand that as the number of forbidden words increases, the time required for searching an array with the in_array() function in_array() . But isset() does not depend on the number of elements, and its operation time will remain constant. Let me show you what I mean. In the following example, the list of forbidden words will consist of 10,000 elements.

 <?php $total = 10000; $paragraph = 'this is a sentence. Crud! $$$$!'; $words = explode(' ', $paragraph); //     $sequence = []; for ($j = 0; $j < 10000; $j++) { $sequence[] = 'a' . $j; } $s = microtime(true); for ($j = 0; $j < $total; $j++) { foreach ($words as $word) { in_array(strtolower($word), $sequence); } } echo "in_array: " . (microtime(true) - $s) . "\n"; //     $hash = array_fill_keys($sequence, true); $s = microtime(true); for ($j = 0; $j < $total; $j++) { foreach ($words as $word) { isset($hash[strtolower($word)]); } } echo "hash: " . (microtime(true) - $s) . "\n"; 


The difference in speed is much better visible. A hash search is 3 162 percent faster than an array search.

 in_array: 20.464313983917 hash: 0.0064699649810791 


Generally, there is nothing new


This is not a new idea. This is a fairly common approach in many languages. I recently realized all of a sudden that I constantly use hash searching for such tasks while reading the book “ Programming on Lua ”.

The next time you use in_array() to check, think about whether the work will speed up if you use the isset() function on the keys of the associative array instead.

Source: https://habr.com/ru/post/216865/


All Articles