Peace and love to fellow IT people from the humanities!
I decided to talk about how IT is being used (to good effect!) in the humanities.

Million Books is the name of Google’s book digitization project, the results of which anyone can see in Google Books. A million books had been successfully converted to electronic form by 2007. Google’s new challenge is to digitize 30 million books.
And the humanities now face a new question: what to do with this whole sea of literature? What to do with the millions of books being published in our time?
First, it is clear that nobody can read a million books.
Second, it is just as clear that humanities scholars are expected to have read them.
After all, the fundamental difference between a humanities scholar and a natural scientist is the obligation to be familiar with the entire body of literature. You may never have read the Kalevala, but you are expected to have an idea of what it is and what it is like.
So what can be done?
Call on new technologies for help, of course. First and foremost, data mining. To that end, the MONK project was launched at Northwestern University and the University of Illinois.
MONK consists of a database and a set of programs that detect repeating patterns in texts. The MorphAdorner program tracks the relationships between individual words and sentences, parts of speech, and lexemes. It also takes a variety of dialects into account. The software is capable of supervised and unsupervised learning, text classification, and probability calculations (for example, using how often a word occurs across several texts to estimate the probability that it will occur in the next one). With this tool you can obtain a kind of DNA for any text.
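The probability calculation mentioned above can be illustrated with a minimal sketch. This is not MONK's or MorphAdorner's code: the function names, the toy corpus, and the choice of add-one (Laplace) smoothing are all assumptions made for the example. It counts how many texts a word occurs in and turns that into an estimate of how likely the word is to turn up in the next text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_frequency(texts):
    """For each word, count the number of texts it occurs in at least once."""
    df = Counter()
    for text in texts:
        df.update(set(tokenize(text)))
    return df


def probability_in_next_text(word, texts):
    """Estimate P(word occurs in the next text).

    Uses add-one (Laplace) smoothing, an assumption of this sketch,
    so that unseen words do not get probability zero.
    """
    df = document_frequency(texts)
    return (df[word.lower()] + 1) / (len(texts) + 2)


# Toy corpus invented for the example; it is not real MONK data.
corpus = [
    "The sea was calm and the ship sailed on.",
    "She looked at the sea and thought of home.",
    "A letter arrived from home in the morning.",
]

p_sea = probability_in_next_text("sea", corpus)      # occurs in 2 of 3 texts
p_ship = probability_in_next_text("ship", corpus)    # occurs in 1 of 3 texts
p_whale = probability_in_next_text("whale", corpus)  # occurs in none
print(p_sea, p_ship, p_whale)
```

A real system would work on lemmas and parts of speech rather than raw word forms, which is where a tool like MorphAdorner comes in.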
It is also possible to detect the basic linguistic pattern of a group of texts that share some feature. For example, the DNA of texts written by women between 1790 and 1900 looks like this:

And the DNA of texts written by men in the same period looks like this:

High hopes are now pinned on MONK. For example, researchers hope to use it to determine the authorship of disputed texts, to date a text, and even to identify the gender of its author. And, of course, it simply removes the need to read a million books in order to know what is written in them.
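To give a feel for how attributing a disputed text might work, here is a small sketch of a multinomial naive Bayes classifier written from scratch. It illustrates the general technique only and is not a description of how MONK does it. The labels `author_a` and `author_b`, the training snippets, and the function names are all invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_texts):
    """Build per-label word counts from (label, text) pairs."""
    counts = {}
    docs = Counter()
    for label, text in labeled_texts:
        counts.setdefault(label, Counter()).update(tokenize(text))
        docs[label] += 1
    vocab = set()
    for word_counts in counts.values():
        vocab.update(word_counts)
    return counts, docs, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    scores = {}
    for label, word_counts in counts.items():
        total_words = sum(word_counts.values())
        score = math.log(docs[label] / total_docs)  # prior for the label
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen label/word pairs above zero.
            score += math.log((word_counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: two made-up "authors" with different vocabularies.
training = [
    ("author_a", "the sea the ship the storm at sea"),
    ("author_a", "a ship upon the stormy sea"),
    ("author_b", "the garden the roses the quiet garden"),
    ("author_b", "roses in a quiet garden at dusk"),
]

model = train(training)
guess = classify("a storm drove the ship out to sea", model)
print(guess)
```

With real corpora the training sets would be whole books rather than one-line snippets, and the features would usually include function words and grammatical patterns, not just topical vocabulary.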
This post draws freely on the following source: Tanya Clement et al.