In this article, I will show how to solve a classic problem in the SPL programming language: getting a list of the most common words in a text. As the sample text we will use Shakespeare's Hamlet.
I will first give the complete program and its output, and then we will go through it step by step.
Program text:
text = #.readtext("hamlet.txt")
words = #.split(text, " ", ".", ",", ";", "'", "!", "?", "-", "(", ")", "[", "]", #.crlf, #.quot)
> i, 1..#.size(words)
>> words[i] = ""
key = #.lower(words[i])
dict[key] += 1
total += 1
<
#.sortval(dict)
#.reverse(dict)
#.output(total, " ; ", #.size(dict), " ")
> i, 1..10
key = dict[i]
#.output(i, " : ", key, " = ", dict[key])
<
The result of the program:
32885 ; 4634
1 : the = 1091
2 : and = 969
3 : to = 767
4 : of = 675
5 : i = 633
6 : a = 571
7 : you = 558
8 : my = 520
9 : in = 451
10 : it = 421
Now we will discuss in more detail how it works.
First line:
text = #.readtext("hamlet.txt")
reads the text of the file “hamlet.txt” into the variable “text”.
Then in the line:
words = #.split(text, " ", ".", ",", ";", "'", "!", "?", "-", "(", ")", "[", "]", #.crlf, #.quot)
the "#.split" function divides the text in "text" into separate words using the specified delimiters and saves the result in the array "words". The list of delimiters also contains the system constants "#.crlf" and "#.quot", which denote the CRLF (newline) sequence and the double quotation mark (").
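For readers more familiar with Python, the splitting step can be sketched roughly as follows. This is a Python analogue, not SPL: the helper name split_words is hypothetical, and it assumes that "#.crlf" corresponds to "\r\n" and that empty pieces are kept when two delimiters are adjacent (the article notes that empty words do appear after splitting).

```python
import re

# Delimiters mirroring the SPL call; "\r\n" and '"' stand in for #.crlf and
# #.quot (an assumption about what those constants contain).
DELIMITERS = [" ", ".", ",", ";", "'", "!", "?", "-", "(", ")", "[", "]", "\r\n", '"']


def split_words(text):
    """Split text on any of the delimiters, keeping empty pieces."""
    pattern = "|".join(re.escape(d) for d in DELIMITERS)
    return re.split(pattern, text)


pieces = split_words("To be, or not to be!")
# The comma followed by a space, and the trailing "!", each leave an empty string.
print(pieces)
```

The empty strings in the result are the reason the counting loop has to skip empty words.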
Next comes a loop, which starts with the ">" command. The first line of the loop:
> i, 1..#.size(words)
specifies that the loop variable “i” runs from 1 to the number of words in the array “words”, which is returned by the "#.size" function.
In the next line:
>> words[i] = ""
the ">>" command jumps back to the beginning of the loop, that is, to the next iteration, if the current word "words[i]" is empty. This keeps the empty words produced by splitting the text from being counted.
Then the "#.lower" function converts the current word to lower case, and the result is stored in the text variable “key”:
key = #.lower(words[i])
and the following line:
dict[key] += 1
does the main work: it adds 1 to the entry of the dictionary “dict” with the key “key”, thereby counting the occurrences of each word.
In line:
total += 1
the total number of words that were taken into account is calculated, and the result is stored in the variable “total”.
Next line:
<
marks the end of the loop.
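The counting loop described above can be sketched in Python as follows. This is only an analogue under stated assumptions: count_words is a hypothetical helper, and "counts.get(key, 0)" stands in for SPL creating a missing dictionary entry automatically.

```python
def count_words(words):
    """Count case-insensitive occurrences of each word, skipping empty strings."""
    counts = {}
    total = 0
    for word in words:
        if word == "":
            continue  # plays the role of the ">>" jump in the SPL loop
        key = word.lower()
        counts[key] = counts.get(key, 0) + 1  # a missing key starts from 0
        total += 1
    return counts, total


counts, total = count_words(["To", "be", "", "or", "not", "to", "be", ""])
print(total, counts)
```

Here "To" and "to" land in the same entry because the key is lowercased before counting, and the two empty strings do not contribute to the total.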
Now sort the dict dictionary by value:
#.sortval(dict)
Sorting is performed in ascending order, so in the following line:
#.reverse(dict)
the dictionary is reversed, which puts it in descending order.
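The same two-step idea (sort ascending by value, then reverse) might look like this in Python. It is a sketch rather than a description of how SPL implements "#.sortval" and "#.reverse": the function name is hypothetical, and the relative order of words with equal counts may differ from SPL's.

```python
def sort_by_value_descending(counts):
    """Sort entries by value in ascending order, then reverse the result."""
    ascending = sorted(counts.items(), key=lambda item: item[1])
    ascending.reverse()
    return dict(ascending)  # Python dicts preserve insertion order


ordered = sort_by_value_descending({"or": 1, "to": 3, "be": 2})
print(list(ordered.items()))
```

In Python the two steps could be merged by passing reverse=True to sorted; they are kept separate here only to mirror the two SPL lines.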
At this point the work is essentially done, and all that remains is to print the result. The following line prints some statistics:
#.output(total, " ; ", #.size(dict), " ")
where the size of the dictionary “dict”, returned by the "#.size" function, gives us the number of unique words.
The next loop:
> i, 1..10
displays the 10 most frequently used words.
The line:
key = dict[i]
gets the next key of the dictionary, which is the word,
and the following line:
#.output(i, " : ", key, " = ", dict[key])
prints this word and how many times it appears in the text. Thus, indexing the dictionary with a number returns the key of the entry at that position, while indexing it with a text key returns the value of the entry, which here is the number of times the word occurs in the text.
Last command:
<
closes the loop.
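The two kinds of dictionary access used in this loop, by position and by key, can be imitated in Python as sketched below. Python dictionaries have no positional indexing, so the sketch first builds a list of keys; top_words is a hypothetical helper, and the sample counts are taken from the output shown earlier in the article.

```python
def top_words(ordered_counts, n=10):
    """Return report lines for the first n entries of an already sorted dict."""
    keys = list(ordered_counts)  # keys in dictionary order
    lines = []
    for i in range(1, min(n, len(keys)) + 1):
        key = keys[i - 1]  # like dict[i] in SPL: the key at 1-based position i
        lines.append(f"{i} : {key} = {ordered_counts[key]}")
    return lines


for line in top_words({"the": 1091, "and": 969, "to": 767}, n=2):
    print(line)
```

Unlike the SPL loop, which always runs from 1 to 10, this sketch stops early if the dictionary has fewer than n entries.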
As you can see from this example, SPL determines the types of all objects automatically: numeric and text variables, arrays, and other objects. When working with a dictionary, new entries are also added automatically.
Thank you for your attention, and good luck with your programming!