Learning English with Scala, Futures and Actors

I decided to brush up my English, and in particular to expand my vocabulary significantly. I know there are plenty of programs that help with this in game form. The catch is that I don't like games. I prefer the old-fashioned way: a sheet of paper with a table of words, transcriptions and translations. You study it, then check yourself, for example by covering the translation column. In short, the way I learned at university.

I had heard that there is a list of the 3000 most frequently used words, compiled on the Oxford Dictionaries site. Here is the list: www.oxfordlearnersdictionaries.com/wordlist/english/oxford3000/Oxford3000_A-B I decided to take the Russian translations from here: www.translate.ru/dictionary/en-ru The only problem is that neither site offers the data in a format you can print and study. So the idea was born to program it all, and not as a sequential algorithm but in parallel, so that downloading and parsing all the words would not take (3000 words * 2 sites * 1 second) / 60 = 100 minutes. That assumes 1 second to download a page and parse out the translation and transcription (in reality I think it takes about 3 times longer: opening the connection, closing it, and so on).
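The estimate above can be written out as a tiny calculation. This is only a sketch: the object and function names are made up for illustration, and the 1-second and 3-second page times are the rough guesses from the text, not measurements.

```scala
object DownloadTimeEstimate {
  // Minutes needed to fetch every word from every site one page at a time.
  def sequentialMinutes(words: Int, sites: Int, secondsPerPage: Int): Int =
    words * sites * secondsPerPage / 60

  def main(args: Array[String]): Unit = {
    // Optimistic guess from the text: 1 second per page.
    println(sequentialMinutes(words = 3000, sites = 2, secondsPerPage = 1)) // 100
    // Pessimistic guess from the text: about 3 times longer.
    println(sequentialMinutes(words = 3000, sites = 2, secondsPerPage = 3)) // 300
  }
}
```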

I split the task into two large blocks right away. The first is blocking I/O: downloading a page from a site. The second is computation that does not block but loads the CPU: parsing the page to extract the translation and transcription, and adding the results to the dictionary.

I decided to run the blocking operations in a thread pool using Scala's Future, and to spread the computational work across 3 Akka actors. Following TDD, I first wrote tests for the building blocks of the future application.

    import org.scalatest.{FlatSpec, Matchers}
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.language.postfixOps
    import scala.util.Try

    class Test extends FlatSpec with Matchers {

      "Table Of Content extractor" should "download and extract content from Oxford Site" in {
        val content: List[String] = OxfordSite.getTableOfContent
        content.size should be (10)
        content.find(_ == "AB") should be (Some("AB"))
        content.find(_ == "UZ") should be (Some("UZ"))
      }

      "Words list extractor" should "download words from page" in {
        val future: Future[Try[Option[List[String]]]] = OxfordSite.getWordsFromPage("AB", 1)
        val wordsTry: Try[Option[List[String]]] = Await.result(future, 60 seconds)
        wordsTry should be a 'success
        val words = wordsTry.get
        words.get.find(_ == "abandon") should be (Some("abandon"))
      }

      "Words list extractor" should "return None from empty page" in {
        val future: Future[Try[Option[List[String]]]] = OxfordSite.getWordsFromPage("AB", 999)
        val wordsTry: Try[Option[List[String]]] = Await.result(future, 60 seconds)
        wordsTry should be a 'success
        val words = wordsTry.get
        words should be (None)
      }

      "Russian Translation" should "download translation and parse" in {
        val page: Future[Try[String]] = LingvoSite.getPage("test")
        val pageResultTry: Try[String] = Await.result(page, 60 seconds)
        pageResultTry should be a 'success
        val pageResult = pageResultTry.get
        pageResult.contains("") should be (true)
        LingvoSite.parseTranslation(pageResult).get should be ("")
      }

      "English Translation" should "download translation and parse" in {
        val page: Future[Try[String]] = OxfordSite.getPage("test")
        val pageResultTry: Try[String] = Await.result(page, 60 seconds)
        pageResultTry should be a 'success
        val pageResult = pageResultTry.get
        pageResult.contains("examination") should be (true)
        OxfordSite.parseTranslation(pageResult).get should be (("test", "an examination of somebody's knowledge or ability, consisting of questions for them to answer or activities for them to perform"))
      }
    }


Note. Functions whose computation can fail return Try[...]: either Success with the result or Failure with the exception. Functions that are called often and perform blocking I/O return Future[Try[...]]. That is, the call immediately returns a Future inside which the long I/O operation runs. The operation itself runs inside a Try, because it can end with an error (for example, a dropped connection).
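The Future[Try[...]] convention from the note can be shown in isolation. The sketch below uses only the standard library; fakeDownload is a hypothetical stand-in for a real blocking download, and the pool size of 4 is an arbitrary choice.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object FutureTrySketch {
  // Dedicated pool for blocking calls (size chosen arbitrarily for the sketch).
  implicit val blockingEc: ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  // Hypothetical stand-in for a blocking download: fails on an empty word.
  def fakeDownload(word: String): String =
    if (word.isEmpty) throw new IllegalArgumentException("empty word")
    else s"<html>$word</html>"

  // Returns at once; the Try inside captures any exception thrown by the blocking call,
  // so the Future itself completes successfully either way.
  def getPage(word: String): Future[Try[String]] =
    Future(Try(fakeDownload(word)))

  def main(args: Array[String]): Unit = {
    List("test", "").foreach { word =>
      Await.result(getPage(word), 10.seconds) match {
        case Success(page) => println(s"downloaded: $page")
        case Failure(ex)   => println(s"failed: ${ex.getMessage}")
      }
    }
    blockingEc.shutdown() // release the pool threads so the JVM can exit
  }
}
```

The caller always gets a completed Future and decides what to do with Success or Failure, which is the same shape the pattern matches in the application code rely on.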

The application itself is initialized in Top3000WordsApp.scala. The actor system starts and the actors are created. Then parsing of the word list begins, which launches, in parallel, the download of the English and Russian pages with the transcription and translation. When a page downloads successfully, its contents are sent to the actors, which parse it and extract the translation and transcription. The actors pass the results to the final dictionary actor, which accumulates everything in one place. Pressing Enter shuts the actor system down, and DictionaryActor, on receiving the stop signal, saves the collected dictionary to the file dictionary.txt.

    import java.util.concurrent.Executors
    import akka.actor.{ActorSystem, Props}
    import scala.concurrent.{ExecutionContext, Future}
    import scala.util.{Failure, Success}

    object Top3000WordsApp extends App {

      val system = ActorSystem("Top3000Words")
      val dictionatyActor = system.actorOf(Props[DictionaryActor], "dictionatyActor")
      val englishTranslationActor = system.actorOf(Props(classOf[EnglishTranslationActor], dictionatyActor), "englishTranslationActor")
      val russianTranslationActor = system.actorOf(Props(classOf[RussianTranslationActor], dictionatyActor), "russianTranslationActor")

      // A separate thread pool for each phase of the algorithm
      val mapGetPageThreadExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))
      val mapGetWordsThreadExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

      start()
      // Wait for Enter, then shut the actor system down
      scala.io.StdIn.readLine()
      system.terminate()

      // Phase 1: walk the letter groups of the word list in parallel
      def start() = {
        import concurrent.ExecutionContext.Implicits.global
        Future {
          OxfordSite.getTableOfContent.par.foreach(letterGroup => {
            getWords(letterGroup, 1)
          })
        }
      }

      // Phase 2: download one page of the word list and process every word on it
      def getWords(letterGroup: String, pageNum: Int): Unit = {
        implicit val executor = mapGetWordsThreadExecutionContext
        OxfordSite.getWordsFromPage(letterGroup, pageNum).map(tryWords => {
          tryWords match {
            case Success(Some(words)) => words.par.foreach(word => {
              parse(word, letterGroup, pageNum)
            })
            case Success(None) => () // empty page: nothing more in this letter group
            case Failure(ex) => println(ex.getMessage)
          }
        })
      }

      // Phase 3: download the English and Russian pages for a word and hand them to the actors
      def parse(word: String, letterGroup: String, pageNum: Int) = {
        implicit val executor = mapGetPageThreadExecutionContext
        OxfordSite.getPage(word).map(tryEnglishPage => {
          tryEnglishPage match {
            case Success(englishPage) => {
              englishTranslationActor ! (word, englishPage)
              // Request the next page of the word list (runs for every successfully downloaded word)
              getWords(letterGroup, pageNum + 1)
            }
            case Failure(ex) => println(ex.getMessage)
          }
        })
        LingvoSite.getPage(word).map(_ match {
          case Success(russianPage) => {
            russianTranslationActor ! (word, russianPage)
          }
          case Failure(ex) => println(ex.getMessage)
        })
      }
    }


Please note that the algorithm is split into the functions start, getWords and parse. This is because each phase of the task needs its own thread pool, which is passed implicitly as an ExecutionContext. At first I had a single getWords function that called itself recursively. But everything worked very slowly, because the top level of the algorithm used up the entire thread pool, while at the very bottom tasks waited endlessly for a free thread. And the bottom level is exactly where most of the operations are.
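The idea of one pool per phase can be sketched without any network access. In this illustration the names namedPool, wordListEc, pageEc, fetchGroup and fetchPage are all hypothetical, as are the pool sizes; the point is only that the upper phase and the lower phase never compete for the same threads.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object PoolsPerPhaseSketch {
  // Builds a fixed pool of daemon threads with a recognisable name prefix.
  def namedPool(prefix: String, size: Int): ExecutionContext = {
    val counter = new AtomicInteger(0)
    val factory = new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"$prefix-${counter.incrementAndGet()}")
        t.setDaemon(true)
        t
      }
    }
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(size, factory))
  }

  val wordListEc: ExecutionContext = namedPool("word-list", 2) // upper phase
  val pageEc: ExecutionContext = namedPool("page", 4)          // lower phase

  // Lower phase: always runs on pageEc, whoever calls it.
  def fetchPage(word: String): Future[String] =
    Future(s"$word handled on ${Thread.currentThread().getName}")(pageEc)

  // Upper phase: runs on wordListEc and only schedules work for the lower phase.
  def fetchGroup(words: List[String]): Future[List[String]] = {
    implicit val ec: ExecutionContext = wordListEc
    Future(words).flatMap(ws => Future.sequence(ws.map(fetchPage)))
  }

  def main(args: Array[String]): Unit =
    Await.result(fetchGroup(List("abandon", "ability")), 10.seconds).foreach(println)
}
```

Because the lower phase has its own threads, a burst of upper-level tasks cannot starve the page downloads, which is the situation described above for the single recursive function.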

Here is the implementation of downloading and parsing from sites.

    import java.util.concurrent.Executors
    import org.jsoup.Jsoup
    import scala.concurrent.{ExecutionContext, Future}
    import scala.io.Source
    import scala.util.Try
    // The scala-scraper imports (Browser, Element, >>, element, elementList, text) are also required

    object OxfordSite {

      val getPageThreadExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

      // Extracts (transcription, English definition) from a downloaded word page
      def parseTranslation(content: String): Try[(String, String)] = {
        Try {
          val browser = new Browser
          val doc = browser.parseString(content)
          val spanElement: Element = doc >> element(".phon")
          val str = Jsoup.parse(spanElement.toString).text()
          val transcription = str.stripPrefix("BrE//").stripSuffix("//").trim
          val translation = doc >> text(".def")
          (transcription, translation)
        }
      }

      // Blocking download of a word page, run on a dedicated thread pool
      def getPage(word: String): Future[Try[String]] = {
        implicit val executor = getPageThreadExecutionContext
        Future {
          Try {
            val html = Source.fromURL("http://www.oxfordlearnersdictionaries.com/definition/english/" + (word.replace(' ', '-')) + "_1")
            html.mkString
          }
        }
      }

      // Downloads one page of the word list; returns None when the page has no words
      def getWordsFromPage(letterGroup: String, pageNum: Int): Future[Try[Option[List[String]]]] = {
        import ExecutionContext.Implicits.global
        Future {
          Try {
            val html = Source.fromURL("http://www.oxfordlearnersdictionaries.com" + "/wordlist/english/oxford3000/Oxford3000_" + letterGroup + "/?page=" + pageNum)
            val page = html.mkString
            val browser = new Browser
            val doc = browser.parseString(page)
            val ulElement: Element = doc >> element(".wordlist-oxford3000")
            val liElements: List[Element] = ulElement >> elementList("li")
            if (liElements.size > 0) Some(liElements.map(_ >> text("a")))
            else None
          }
        }
      }

      // Downloads the list of letter groups from the first page of the word list
      def getTableOfContent: List[String] = {
        val html = Source.fromURL("http://www.oxfordlearnersdictionaries.com/wordlist/english/oxford3000/Oxford3000_A-B/")
        val page = html.mkString
        val browser = new Browser
        val doc = browser.parseString(page)
        val ulElement: Element = doc >> element(".hide_phone")
        val liElements: List[Element] = ulElement >> elementList("li")
        List(liElements.head >> text("span")) ++ liElements.tail.map(_ >> text("a"))
      }
    }

    object LingvoSite {

      val getPageThreadExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

      // Extracts the Russian translation from a downloaded dictionary page
      def parseTranslation(content: String): Try[String] = {
        Try {
          val browser = new Browser
          val doc = browser.parseString(content)
          val spanElement: Element = doc >> element(".r_rs")
          spanElement >> text("a")
        }
      }

      // Blocking download of a translation page, run on a dedicated thread pool
      def getPage(word: String): Future[Try[String]] = {
        implicit val executor = getPageThreadExecutionContext
        Future {
          Try {
            val html = Source.fromURL("http://www.translate.ru/dictionary/en-ru/" + java.net.URLEncoder.encode(word, "UTF-8"))
            html.mkString
          }
        }
      }
    }


The data structures that the actors work with.

    case class Word(word: String,
                    transcription: Option[String] = None,
                    russianTranslation: Option[String] = None,
                    englishTranslation: Option[String] = None)

    case class RussianTranslation(word: String, translation: String)

    case class EnglishTranslation(word: String, translation: String)

    case class Transcription(word: String, transcription: String)


The actors that receive the downloaded pages, parse them, and forward the translation and transcription to DictionaryActor:

    import akka.actor.{Actor, ActorRef}
    import scala.util.{Failure, Success}

    // Parses an English page and sends the definition and transcription to the dictionary
    class EnglishTranslationActor(dictionaryActor: ActorRef) extends Actor {
      println("EnglishTranslationActor")

      def receive = {
        case (word: String, englishPage: String) => {
          OxfordSite.parseTranslation(englishPage) match {
            case Success((transcription, translation)) => {
              dictionaryActor ! EnglishTranslation(word, translation)
              dictionaryActor ! Transcription(word, transcription)
            }
            case Failure(ex) => {
              println(ex.getMessage)
            }
          }
        }
      }
    }

    // Parses a Russian page and sends the translation to the dictionary
    class RussianTranslationActor(dictionaryActor: ActorRef) extends Actor {
      println("RussianTranslationActor")

      def receive = {
        case (word: String, russianPage: String) => {
          LingvoSite.parseTranslation(russianPage) match {
            case Success(translation) => {
              dictionaryActor ! RussianTranslation(word, translation)
            }
            case Failure(ex) => {
              println(ex.getMessage)
            }
          }
        }
      }
    }


The actor that accumulates the dictionary with translations and transcriptions and, when the actor system shuts down, writes the entire dictionary to dictionary.txt:

    import akka.actor.Actor

    class DictionaryActor extends Actor {
      println("DictionaryActor")

      // Called when the actor system shuts down: dump the dictionary as "|"-separated lines
      override def postStop(): Unit = {
        println("DictionaryActor postStop")
        val fileText = DictionaryActor.words.map { case (_, someWord) => {
          val transcription = someWord.transcription.getOrElse(" ")
          val russianTranslation = someWord.russianTranslation.getOrElse(" ")
          val englishTranslation = someWord.englishTranslation.getOrElse(" ")
          List(someWord.word, transcription, russianTranslation, englishTranslation).mkString("|")
        }}.mkString("\n")
        scala.tools.nsc.io.File("dictionary.txt").writeAll(fileText)
        println("dictionary.txt saved")
        System.exit(0)
      }

      // Each message fills in one field of the word's entry, creating the entry if needed
      def receive = {
        case Transcription(wordName, transcription) => {
          val newElement = DictionaryActor.words.get(wordName) match {
            case Some(word) => word.copy(transcription = Some(transcription))
            case None => Word(wordName, transcription = Some(transcription))
          }
          DictionaryActor.words += wordName -> newElement
          println(newElement)
        }
        case RussianTranslation(wordName, translation) => {
          val newElement = DictionaryActor.words.get(wordName) match {
            case Some(word) => word.copy(russianTranslation = Some(translation))
            case None => Word(wordName, russianTranslation = Some(translation))
          }
          DictionaryActor.words += wordName -> newElement
          println(newElement)
        }
        case EnglishTranslation(wordName, translation) => {
          val newElement = DictionaryActor.words.get(wordName) match {
            case Some(word) => word.copy(englishTranslation = Some(translation))
            case None => Word(wordName, englishTranslation = Some(translation))
          }
          DictionaryActor.words += wordName -> newElement
          println(newElement)
        }
      }
    }

    object DictionaryActor {
      var words = scala.collection.mutable.Map[String, Word]()
    }
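The three branches of receive differ only in which field of Word they fill in. As a possible simplification (not what the project does), the shared "get or create, update one field, store back" step could be a single helper over an immutable map. The names upsert and toLine are hypothetical; Word is repeated here so that the sketch compiles on its own.

```scala
object DictionaryMergeSketch {
  case class Word(word: String,
                  transcription: Option[String] = None,
                  russianTranslation: Option[String] = None,
                  englishTranslation: Option[String] = None)

  // Hypothetical helper: take the existing entry (or a blank one), apply one update, store it back.
  def upsert(words: Map[String, Word], name: String)(update: Word => Word): Map[String, Word] =
    words.updated(name, update(words.getOrElse(name, Word(name))))

  // Same "|"-separated line format that postStop writes to dictionary.txt.
  def toLine(w: Word): String =
    List(w.word,
         w.transcription.getOrElse(" "),
         w.russianTranslation.getOrElse(" "),
         w.englishTranslation.getOrElse(" ")).mkString("|")

  def main(args: Array[String]): Unit = {
    val empty = Map.empty[String, Word]
    // Partial results may arrive in any order; each one fills in a single field.
    val step1 = upsert(empty, "test")(_.copy(englishTranslation = Some("an examination")))
    val step2 = upsert(step1, "test")(_.copy(transcription = Some("test")))
    println(toLine(step2("test"))) // test|test| |an examination
  }
}
```

Inside an actor, the map could then live in a private var of the actor itself rather than in the companion object, since only the actor's own receive touches it.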


So what are the conclusions? On my MacBook Pro the script ran for about 1 hour while I was writing this article. I interrupted it by pressing Enter, and here is the result:

    bash-3.2$ cat ./dictionary.txt | wc -l
    1809


Then I ran the script again and left it for several hours. When I came back, the CPU was loaded at 100%, the console was full of garbage collector errors, and when I pressed Enter the program could not save its results to the file. My diagnosis: writing with Future and par.map or par.foreach is nice and convenient, of course, but it is really hard to understand how it all works at the thread level and where the bottleneck is. In the end, I plan to rewrite everything with actors, and to use actor pools: for example, 4 actors downloading and parsing the pages with word lists, 18 actors downloading the translation pages, 4 actors parsing those pages to extract translations and transcriptions, and 1 actor adding everything to the dictionary.

The current implementation is in branch v0.1: github.com/evgenyigumnov/top3000words/tree/v0.1 The version where everything is rewritten with actors and pools will be in branch v0.2 and, a little later, in master. Does anyone have thoughts on what I did wrong in the current version? Or perhaps some tips for the new one?

The github project is available: github.com/evgenyigumnov/top3000words

Run project tests: sbt test
Run application: sbt run
When you get tired of waiting, press Enter and read the contents of dictionary.txt in the current folder.

P.S.
As a result, I made the final version, v0.2, which parses everything in 10 minutes using 30 threads: github.com/evgenyigumnov/top3000words/tree/v0.2
There is no need to press Enter at the end. Everything is done with actors. Only the heavy blocking I/O is wrapped in Future.

Source: https://habr.com/ru/post/273431/

