Let's conquer Ruby together! Drop six

Today we will build our first full-fledged application in Ruby, learning new methods of the String and File classes and regular expressions along the way.

Our application: Text Analyzer

The program itself is simple: it reads a text file, analyzes it against a few patterns, computes statistics and displays the result. Ruby is well suited to analyzing documents and texts thanks to regular expressions and the scan and split methods. In this application we will focus on simple, quick programming and will not build an object-oriented structure.

Main features

Here is a list of features that we need to implement:

character count
counting characters without spaces
line counting
word counting
paragraph counting
average number of words per sentence
average number of sentences per paragraph

Implementation

When starting a new program, it helps to outline the key steps first. Here are the main ones:

Load the file containing the text we want to analyze.
Since we load the file line by line, count the lines right away for the statistics.
Join all the text into one string and measure its length to count the characters.
Temporarily remove all whitespace and measure the length of the resulting string.
Split the string on whitespace to find the number of words.
Split the string on punctuation marks to count the sentences.
Split on double line breaks to find the number of paragraphs.
Do the arithmetic to get the averages for the statistics.

Create a new, empty source file and save it as analyzer.rb.

Looking for some text

Before you start writing code, you need a piece of text to test with. In our example we will use the first part of Oliver Twist, which you can download here. Save the file as text.txt in the same folder as analyzer.rb.

File loading and line counting

Time to code! The first step is to load the file. Ruby provides a good set of methods for working with files in the File class. Here is the code that opens our text.txt:

File.open("text.txt").each { |line| puts line }

Enter the code in analyzer.rb and run the program. You will see lines of text scrolling up the screen.

You ask the File class to open text.txt and then, just as with arrays, call the each method directly on the file. Each line is passed in turn to the block, where puts prints it to the screen. Edit the code to look like this:

line_count = 0
File.open("text.txt").each { |line| line_count += 1 }
puts line_count

You define the line_count variable to hold the line count, then open the file and iterate over its lines, increasing line_count by 1 each time. At the end you print the result (about 127 in our example). We have our first statistic!

You have counted the lines, but we still cannot count words, sentences or paragraphs. That is easy to fix. Let's change the code a bit and add a text variable that collects all the lines on the fly:
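One detail worth knowing: File.open called without a block leaves the file handle open until Ruby garbage-collects it. Below is a minimal sketch of the same count using File.foreach, which closes the file once iteration ends. The count_lines helper and the temporary sample file are assumptions made only for this illustration; they are not part of the original program.

```ruby
require 'tempfile'

# Hypothetical helper: counts lines and does not leave the file open.
def count_lines(path)
  line_count = 0
  File.foreach(path) { |_line| line_count += 1 }
  line_count
end

# A throwaway sample file stands in for text.txt here.
Tempfile.create('sample') do |file|
  file.write("first line\nsecond line\nthird line\n")
  file.flush
  puts count_lines(file.path) # prints 3
end
```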

text = ''
line_count = 0
File.open("text.txt").each do |line|
  line_count += 1
  text << line
end
puts "#{line_count} lines"

Unlike the previous code, this version introduces the text variable and appends each line to it. When the iteration is over, the whole text is in text.

The code may seem as concise and simple as it can get, but File has other methods that suit our case. For example, we can rewrite the code like this:

lines = File.readlines("text.txt")
line_count = lines.size
text = lines.join

puts "#{line_count} lines"

Much easier! The readlines method reads the entire file into an array, line by line.
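To see what readlines actually returns, here is a small sketch run against a temporary sample file instead of text.txt. The load_text helper is a hypothetical wrapper introduced only for this example. Note that each array element keeps its trailing newline, which is why join reproduces the file contents exactly.

```ruby
require 'tempfile'

# Hypothetical helper wrapping the readlines approach shown above.
def load_text(path)
  lines = File.readlines(path)
  [lines.size, lines.join]
end

Tempfile.create('sample') do |file|
  file.write("first line\nsecond line\n")
  file.flush
  line_count, text = load_text(file.path)
  puts "#{line_count} lines"  # 2 lines
  puts text.inspect           # "first line\nsecond line\n"
end
```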

Counting characters

Since the entire file is now collected in text, we can use the string method length, which returns the number of characters in the string and therefore in our whole text. Add this code:

total_characters = text.length
puts "#{total_characters} characters"

The next statistic we need is the character count without spaces. For this we use a method that replaces characters. Here is an example:

puts "foobar".sub('bar', 'foo') #foofoo

The sub method finds the characters passed as the first parameter and replaces them with the characters from the second. However, sub finds and changes only the first occurrence, whereas the gsub method performs all possible replacements at once.
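A short sketch of the difference, using a sample string chosen only for this illustration:

```ruby
sample = "foobar foobar"

puts sample.sub('bar', 'foo')   # foofoo foobar (first occurrence only)
puts sample.gsub('bar', 'foo')  # foofoo foofoo (every occurrence)
puts sample                     # foobar foobar (the original string is unchanged)
```

Both methods return a new string, so the result has to be used or assigned.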

Regular expressions

What about replacing more complex patterns? This is what regular expressions are for. Entire books have been written on the topic, so we will limit ourselves to a brief overview of this powerful text tool. In Ruby, a regular expression is created by enclosing a pattern between slashes (/pattern/). And in Ruby, of course, these are objects too. For example, to match strings containing the text Perl or the text Python, you can write the pattern /Perl|Python/. Between the slashes is our pattern: the two words we need, separated by a vertical bar (pipe, |). This symbol means "either what is on the left or what is on the right." You can also use parentheses, as in arithmetic expressions: /P(erl|ython)/.
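A minimal sketch of alternation in practice. The sample strings are made up for this illustration, and match? is used here simply as one way to test a pattern against a string:

```ruby
pattern = /P(erl|ython)/

puts pattern.class  # Regexp, so a pattern really is an object

["Perl is terse", "Python is tidy", "Ruby is fun"].each do |line|
  puts "#{line.inspect} matches: #{pattern.match?(line)}"
end
```

The first two strings match and the third does not.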

Patterns can also express repetition: /ab+c/ (this is not addition; read it as a, then b+, then c). This pattern matches a string containing a, then one or more b, and finally c. Replace the plus with an asterisk, and /ab*c/ matches a string containing a, zero or more b, and c. The + and * symbols are called quantifiers, and their purpose should now be clear.
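The two quantifiers can be compared side by side on a few sample strings (chosen only for this sketch):

```ruby
%w[ac abc abbbc].each do |sample|
  plus = /ab+c/.match?(sample)  # needs at least one b
  star = /ab*c/.match?(sample)  # accepts zero or more b
  puts "#{sample}: ab+c=#{plus} ab*c=#{star}"
end
```

Only "ac" tells them apart: it matches /ab*c/ but not /ab+c/.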

We can also match strings containing particular kinds of characters. The simplest example is character classes: \s means a whitespace character (space, tab, line break and so on), \d matches any digit, and there are others. A fuller summary can be found in the Ruby textbook on Wikibooks.
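A brief sketch of the two classes mentioned above, applied to a made-up sample string:

```ruby
sample = "Room 101,\tfloor 3\n"

puts sample.scan(/\d+/).inspect  # ["101", "3"], each run of digits
puts sample.gsub(/\s/, '')       # Room101,floor3, with space, tab and newline removed
```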

We continue counting characters

So now we know how to remove all whitespace characters from a string:

total_characters_nospaces = text.gsub(/\s+/, '').length
puts "#{total_characters_nospaces} characters excluding spaces"

Add this code to the end of our file and proceed to the counting of words.
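To see why the pattern /\s+/ is used rather than a plain space, here is a sketch on a short sample string (an assumption for illustration only). Replacing ' ' leaves the line breaks in place, so they would still be counted:

```ruby
sample = "a b\nc d\n"

puts sample.length                  # 8
puts sample.gsub(' ', '').length    # 6, the two newlines are still counted
puts sample.gsub(/\s+/, '').length  # 4, only a, b, c and d remain
```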

Count the words

To count the number of words, there are two approaches:

Count the groups of consecutive characters using the scan method.
Split the text on whitespace characters and count the resulting fragments using split and size.

Let's take the second path. By default (without parameters) split breaks the string on whitespace and puts the fragments into an array. We only need the length of that array. Add this code:

word_count = text.split.length
puts "#{word_count} words"
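For comparison, the first approach (scan) might look like the sketch below. The sample string and the choice of patterns are assumptions for illustration. Scanning for runs of non-whitespace (/\S+/) gives the same count as split, while /\w+/ treats the apostrophe as a separator and so counts differently:

```ruby
sample = "It's a  fine day.\nIndeed!"

puts sample.split.length        # 5
puts sample.scan(/\S+/).length  # 5, runs of non-whitespace, same as split
puts sample.scan(/\w+/).length  # 6, "It's" is counted as "It" and "s"
```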

Counting sentences and paragraphs

If you understood how word counting was implemented, sentences and paragraphs will cause no difficulty. The only difference is the pattern we use to split the text. For sentences it is a period, a question mark or an exclamation mark; for paragraphs it is a double line break. Here is the code:

paragraph_count = text.split(/\n\n/).length
puts "#{paragraph_count} paragraphs"
sentence_count = text.split(/\.|\?|!/).length
puts "#{sentence_count} sentences"

I think the code is clear. The only tricky part may be the pattern for sentences, but it only looks scary. We cannot simply write the characters . and ? because they have a special meaning in regular expressions, so we escape them with a backslash.
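A small sketch of how this pattern behaves on sample sentences (the strings are made up for illustration). Inside a character class the same characters need no escaping, and adding a + quantifier is one possible way to keep an ellipsis from being counted as several sentences. This is a variation, not the approach used in the article's program:

```ruby
sample = "Hello there. How are you? Fine!"

puts sample.split(/\.|\?|!/).length  # 3
puts sample.split(/[.?!]/).length    # 3, same result with a character class

tricky = "Wait... what?"
puts tricky.split(/\.|\?|!/).length  # 4, each dot of the ellipsis splits
puts tricky.split(/[.?!]+/).length   # 2, a run of marks counts once
```

Either way this stays a rough estimate: abbreviations such as "Mr." still end a "sentence" as far as the pattern is concerned.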

Counting Other Values

We already have the numbers of words, paragraphs and sentences in the word_count, paragraph_count and sentence_count variables respectively, so only arithmetic remains:

puts "#{sentence_count / paragraph_count} sentences per paragraph (average)"
puts "#{word_count / sentence_count} words per sentence (average)"
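Keep in mind that dividing one integer by another in Ruby discards the fractional part, so the averages above are whole numbers. If you want a decimal place, one option is to convert to a float first. The average helper below is a hypothetical addition for illustration, not part of the original program:

```ruby
# Hypothetical helper: average with one decimal place, 0.0 when there is nothing to divide by.
def average(total, count)
  return 0.0 if count.zero?
  (total.to_f / count).round(1)
end

puts 7 / 2          # 3, integer division drops the remainder
puts average(7, 2)  # 3.5
puts average(7, 0)  # 0.0
```

It could then be called as average(word_count, sentence_count) in place of the plain division.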

Source

We built up the source code step by step, so the logic and the screen output ended up mixed together. Let's put everything in its place. From here on, the changes are purely cosmetic:

lines = File.readlines("text.txt")
line_count = lines.size
text = lines.join
word_count = text.split.length
character_count = text.length
character_count_nospaces = text.gsub(/\s+/, '').length
paragraph_count = text.split(/\n\n/).length
sentence_count = text.split(/\.|\?|!/).length

puts "#{line_count} lines"
puts "#{character_count} characters"
puts "#{character_count_nospaces} characters excluding spaces"
puts "#{word_count} words"
puts "#{paragraph_count} paragraphs"
puts "#{sentence_count} sentences"
puts "#{sentence_count / paragraph_count} sentences per paragraph (average)"
puts "#{word_count / sentence_count} words per sentence (average)"

If you understand all the code above, congratulations! Our "drop" was not in vain ;)

Epilogue

Another big drop. It is written mostly for complete beginners; for those who know other programming languages, it is a good opportunity to compare Ruby's capabilities. I would like to keep publishing similar issues that walk through ready-made solutions from time to time. Thanks to Peter Cooper for the example! I look forward to your feedback and comments. See you in the next drop ;)

Source: https://habr.com/ru/post/48961/
