Today we will create our first full-fledged application in Ruby, learning along the way some new methods of the String and File classes as well as regular expressions.
Our application: Text Analyzer
The program itself is simple: it reads a text file, analyzes it using a few patterns, computes the statistics and displays the result. Ruby is well suited to analyzing documents and texts thanks to regular expressions and the scan and split methods. In this application we will focus on simple, quick programming and will not build an object-oriented structure.
Main features
Here is a list of features that we need to implement:
- character count
- counting characters without spaces
- line counting
- word counting
- paragraph counting
- average words in a sentence
- average number of sentences per paragraph
Implementation
When starting a new program, it is useful to outline the key steps first. Here are the main ones:
- Load the file containing the text we want to analyze.
- Since we load the file line by line, count the lines right away for the statistics.
- Join all the text into one string and measure its length to count the characters.
- Temporarily remove all whitespace and measure the length of the resulting string.
- Split the string on whitespace to find the number of words.
- Split the string on punctuation marks to count the sentences.
- Split on double line breaks to find the number of paragraphs.
- Compute the averages for the statistics.
Create a new, empty source file and save it as analyzer.rb.
Looking for some text
Before you start writing code, you need a piece of text for testing. In our example we will use the first part of Oliver Twist, which you can download here. Save the file as text.txt in the same directory as analyzer.rb.
Loading the file and counting lines
It's time to code! The first step is to load the file. Ruby provides a rich set of methods for working with files in the File class. Here is the code that opens our text.txt:
File.open("text.txt").each { |line| puts line }
Enter the code in analyzer.rb and run the program. You will see the lines of text scroll past on the screen.
You ask the File class to open text.txt and then, just as with arrays, call the each method directly on the file. Each line is passed in turn to the block, where puts prints it to the screen. Edit the code to look like this:
line_count = 0
File.open("text.txt").each { |line| line_count += 1 }
puts line_count
You define the line_count variable to hold the line count, then open the file and iterate over its lines, increasing line_count by 1 each time. At the end, you print the result (about 127 in our example). We have our first statistic!
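One detail worth knowing: File.open without a block leaves the file open until Ruby cleans it up. As a possible alternative, File.foreach iterates over the lines and closes the file by itself. The sketch below is only an illustration; it creates a temporary sample file (an assumption made here so the example runs on its own) instead of reading text.txt:

```ruby
require "tempfile"

# Hypothetical sample file, created only so the sketch is self-contained.
sample = Tempfile.new("sample")
sample.write("first line\nsecond line\nthird line\n")
sample.close

line_count = 0
# File.foreach yields every line in turn and closes the file afterwards.
File.foreach(sample.path) { |_line| line_count += 1 }
puts "#{line_count} lines"

sample.unlink
```

In analyzer.rb you would pass "text.txt" instead of sample.path.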
You have counted the lines, but we still cannot count words, sentences or paragraphs. That is easy to fix. Let's change the code a bit and add a text variable that collects all the lines as we go:
text = ''
line_count = 0
File.open("text.txt").each do |line|
  line_count += 1
  text << line
end
puts "#{line_count} lines"
Unlike the previous code, this introduces the text variable and appends each line to it. When the iteration is over, the whole text is in text.
The code may seem as concise and simple as it can get, but File has other methods that suit our case. For example, we can rewrite the code like this:
lines = File.readlines("text.txt")
line_count = lines.size
text = lines.join
puts "#{line_count} lines"
Much simpler! The readlines method reads the entire file into an array, line by line.
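To see what readlines actually returns, here is a small sketch. It again uses a temporary sample file (a stand-in assumed for this example, not the article's text.txt). Each array element keeps its trailing line break, which is why a plain join restores the original text:

```ruby
require "tempfile"

# Hypothetical two-line sample file used instead of text.txt.
sample = Tempfile.new("sample")
sample.write("one\ntwo\n")
sample.close

lines = File.readlines(sample.path)
p lines                          # every element still ends with "\n"

text = lines.join                # joining restores the file contents
whole = File.read(sample.path)   # File.read returns the file as one string
puts text == whole

sample.unlink
```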
Counting characters
Now that the entire file is collected in text, we can use the length method of strings, which returns the number of characters in the string and therefore in our whole text. Add the code:
total_characters = text.length
puts "#{total_characters} characters"
Another statistic we need is the character count without spaces. To get it, we use a substitution method. Here is an example:
puts "foobar".sub('bar', 'foo') #foofoo
The sub method found the substring passed as the first argument and replaced it with the second. However, sub replaces only the first occurrence, whereas the gsub method performs all possible replacements at once.
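The difference is easiest to see on a string with two occurrences. A minimal sketch, where the sample phrase is made up for illustration:

```ruby
phrase = "foobar bar"   # hypothetical sample with "bar" appearing twice

first_only = phrase.sub("bar", "foo")    # replaces the first "bar" only
every_one  = phrase.gsub("bar", "foo")   # replaces every "bar"

puts first_only
puts every_one
puts phrase   # the original string is left unchanged by both calls
```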
Regular expressions
What about replacing more complex patterns? This is what regular expressions are for. Entire books have been written on the topic, so we will limit ourselves to a brief overview of this powerful text tool. In Ruby, a regular expression is created by enclosing a pattern between slashes (/pattern/), and, this being Ruby, regular expressions are objects too. For example, the pattern /Perl|Python/ matches strings containing the text Perl or the text Python. Between the slashes is our pattern: the two words we need, separated by a vertical bar (pipe, |). This symbol means "either what is on the left or what is on the right". You can also use parentheses, as in arithmetic expressions: /P(erl|ython)/.
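Both spellings of the pattern can be tried on a few strings. The sample sentences below are invented for this sketch:

```ruby
pattern = /Perl|Python/
grouped = /P(erl|ython)/

# Hypothetical sample strings.
samples = ["I like Perl", "Python is fun", "Ruby rocks"]

samples.each do |s|
  # match? returns true or false without building a MatchData object.
  puts "#{s.inspect}: #{pattern.match?(s)} / #{grouped.match?(s)}"
end

matches = samples.select { |s| pattern.match?(s) }
p matches
```

Both patterns accept and reject the same strings here; the grouped form simply factors out the common first letter.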
Patterns can also express repetition: /ab+c/ (this is not addition; read it as a, then b+, then c). Such a pattern matches a string containing a, then one or more b, and finally c. Replace the plus with an asterisk, and /ab*c/ matches a string containing a, zero or more b, and c. + and * are so-called quantifiers: they specify how many times the preceding element may repeat.
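A quick way to check the two quantifiers is to run them against strings with zero, one and several b characters. A small sketch:

```ruby
one_or_more  = /ab+c/
zero_or_more = /ab*c/

# "ac" has no b at all, "abc" has one, "abbbc" has three.
%w[ac abc abbbc].each do |s|
  puts "#{s}: + #{one_or_more.match?(s)}, * #{zero_or_more.match?(s)}"
end
```

Only "ac" tells the two apart: the plus requires at least one b, the asterisk does not.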
We can also match particular kinds of characters. The simplest example is character classes: \s stands for whitespace (spaces, tabs, line breaks and so on), \d for any digit, and there are others. A summary table can be found in the Ruby textbook on Wikibooks.
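As a small illustration of the two classes mentioned above, \s and \d, here is a sketch on a made-up string:

```ruby
sample = "Room 101,\tfloor 3\n"   # hypothetical string with a space, a tab and a line break

digits = sample.scan(/\d/)        # collect every single digit
no_whitespace = sample.gsub(/\s/, "")   # drop every whitespace character

p digits
puts no_whitespace
```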

Back to counting characters
So, now we know how to remove all whitespace from a string:
total_characters_nospaces = text.gsub(/\s+/, '').length
puts "#{total_characters_nospaces} characters excluding spaces"
Add this code to the end of our file and move on to counting words.
Count the words
There are two approaches to counting words:
- Count the groups of consecutive characters using the scan method.
- Split the text on whitespace and count the resulting fragments using split and size.
Let's take the second path. By default (without arguments), split splits the string on whitespace and puts the fragments into an array. We only need the length of that array. Add the code:
word_count = text.split.length
puts "#{word_count} words"
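For comparison, the first approach with scan could look roughly like this. The pattern /\w+/ is an assumption of this sketch (the article does not specify one), and the two approaches do not always agree, as the made-up sample shows:

```ruby
sample = "Hello, world! It's Ruby."   # hypothetical sample sentence

by_split = sample.split          # fragments separated by whitespace
by_scan  = sample.scan(/\w+/)    # runs of letters, digits and underscores

p by_split
p by_scan
puts "split: #{by_split.length} words, scan: #{by_scan.length} words"
```

split keeps punctuation attached to the words, while /\w+/ cuts "It's" in two at the apostrophe, so the counts differ. Which behavior is preferable depends on the text.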
Counting sentences and paragraphs
If you understood how word counting was implemented, sentences and paragraphs will pose no difficulty. The only difference is the pattern on which we split the text. For sentences it is a period, question mark or exclamation mark; for paragraphs it is a double line break. Here is the code:
paragraph_count = text.split(/\n\n/).length
puts "#{paragraph_count} paragraphs"
sentence_count = text.split(/\.|\?|!/).length
puts "#{sentence_count} sentences"
I think the code is clear. The only tricky part may be the pattern for sentences, but it only looks scary. We cannot simply write . and ? because they have a special meaning in regular expressions, so we escape them with a backslash.
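As an aside, the same set of characters can be written as a character class, inside which . and ? need no escaping. This is an alternative spelling, not the article's code, and the sample strings are invented. Adding + after the class also treats a run of marks such as "..." or "?!" as a single sentence boundary:

```ruby
simple = "Hi. How are you? Fine!"   # hypothetical sample
tricky = "Wait... what?!"           # hypothetical sample with repeated marks

escaped = simple.split(/\.|\?|!/).length   # the pattern from the article
classed = simple.split(/[.?!]/).length     # same meaning as a character class

single_marks = tricky.split(/[.?!]/).length    # empty pieces between the dots are counted
grouped_runs = tricky.split(/[.?!]+/).length   # a run of marks counts as one boundary

puts "#{escaped} / #{classed} sentences"
puts "#{single_marks} vs #{grouped_runs} pieces"
```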
Counting Other Values
We already have the number of words, paragraphs and sentences in the word_count, paragraph_count and sentence_count variables respectively, so all that remains is arithmetic:
puts "#{sentence_count / paragraph_count} sentences per paragraph (average)"
puts "#{word_count / sentence_count} words per sentence (average)"
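Keep in mind that dividing one integer by another in Ruby discards the fractional part. If fractional averages are wanted, one option is to convert to a float first. A sketch with made-up counts:

```ruby
# Hypothetical counts, chosen only to show the difference.
word_count = 10
sentence_count = 4

truncated = word_count / sentence_count                 # integer division
average   = (word_count.to_f / sentence_count).round(1) # float division, one decimal

puts "#{truncated} words per sentence (integer division)"
puts "#{average} words per sentence (average)"
```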
Source
We built up the source code step by step, so the logic and the screen output ended up mixed together. Let's put everything in its place. From here on, the changes are purely cosmetic:
lines = File.readlines("text.txt")
line_count = lines.size
text = lines.join
word_count = text.split.length
character_count = text.length
character_count_nospaces = text.gsub(/\s+/, '').length
paragraph_count = text.split(/\n\n/).length
sentence_count = text.split(/\.|\?|!/).length
puts "#{line_count} lines"
puts "#{character_count} characters"
puts "#{character_count_nospaces} characters excluding spaces"
puts "#{word_count} words"
puts "#{paragraph_count} paragraphs"
puts "#{sentence_count} sentences"
puts "#{sentence_count / paragraph_count} sentences per paragraph (average)"
puts "#{word_count / sentence_count} words per sentence (average)"
If you understood everything coded above, congratulations! Then these drops have not been in vain ;)
Epilogue
Another big drop. It is written mostly for complete beginners; for those who know other programming languages, it is a good opportunity to compare what Ruby can do. I propose to keep publishing similar installments from time to time, walking through ready-made solutions. Thanks to Peter Cooper for the example! Feedback and comments are welcome. Stay tuned for the next drop ;)