
Introduction to Go: writing a web page grabber with multithreading and harlots

Probably everyone has heard of the Go language from the Google team. Not everyone has tried it, though, and that is a pity: hanging out with the Go gophers is great fun, as I recently found out first-hand.
The most fun way to get to know a new language is on a real-life example, so without hesitation I took the first real-life task that came along:

There is a website, http://vpustotu.ru, where anyone can anonymously vent about whatever is troubling them. All statements (I will call them "quotes" from here on) first go to moderation (similar to the "abyss" on bashorg), where visitors can judge the flight of thought and vote on a quote with "Wow!" or "Nonsense!". The moderation page ( http://vpustotu.ru/moderation/ ) shows a random quote, the voting links, and a "More" link that leads back to the same page. In short, it's all very simple.

And then a task came up: urgently, under cover of darkness, download a full dump of all quotes in moderation for further secret research. We will not judge the practical value or the degree of idiocy of the task, and will look at it from a purely technical point of view:

The moderation section has no direct links to a specific quote; the only way to get a new one is to refresh the page (or follow the "More" link, which is the same thing). Repeats are quite possible, as a couple of minutes of aggressive clicking will readily show.

Thus, you need a program that:



Naturally, we have no way of knowing whether all the quotes have been downloaded, but a long run of repeated quotes in a row is a good indirect hint. So we add:



Well, everything seems clear. The program will keep two files, one with quotes and one with hashes of those quotes (so as not to repeat itself), and will reread the hash file at every start. Then, in a loop, it parses the page, pulling out more and more revelations, until it gets hit with Ctrl-C or runs into a set number of repeats. The task is clear and there is a plan, so let's go!

Open your favorite editor to create a new project. Initially, the code of an empty program looks like this:

    package main

    import (
        "fmt"
    )

    func main() {
        fmt.Println("Hello World!")
    }

Let's go through the list in the order the program executes:

We have several parameters: the number of "threads", file names, and so on. The built-in "flag" package will help us get all of this from the command line. Add it to the import list and, while we are at it, declare the parameter variables:

    import (
        "flag"
        "fmt"
    )

    var (
        WORKERS       int    = 2            // number of "threads"
        REPORT_PERIOD int    = 10           // report period (seconds)
        DUP_TO_STOP   int    = 500          // number of duplicates in a row at which to stop
        HASH_FILE     string = "hash.bin"   // file with hashes
        QUOTES_FILE   string = "quotes.txt" // file with quotes
    )

In Go, the type of a variable is written after its name, which may look a little unusual. We also initialize the default values right away, so that everything is convenient and in one place.

Things like argument parsing are usually done in the init() function, and that is what we will do:

    func init() {
        // Declare the flags:
        flag.IntVar(&WORKERS, "w", WORKERS, "number of threads")
        flag.IntVar(&REPORT_PERIOD, "r", REPORT_PERIOD, "report period (seconds)")
        flag.IntVar(&DUP_TO_STOP, "d", DUP_TO_STOP, "number of duplicates to stop at")
        flag.StringVar(&HASH_FILE, "hf", HASH_FILE, "hash file")
        flag.StringVar(&QUOTES_FILE, "qf", QUOTES_FILE, "quotes file")
        // And parse the command line
        flag.Parse()
    }

UPD: Habr user Forked explains in a comment why calling flag.Parse() in init is a bad habit. Thanks to him for that; from now on I will do it in main(), and I advise you to do the same. This example stays here as a lesson in "don't do that!"

We use the IntVar (for numbers) and StringVar (for strings) functions from the flag package: they read the given key from the command-line arguments and store its value in a variable. If the key is not given, the default value is used. Both functions share the same syntax:

    flag.StringVar(&variable, "key name", default value, "key description")

Note that we pass a pointer to the variable (the & symbol) so that the function can modify it. The "key description" parameter is also interesting: the flag package automatically builds a help text for the arguments, available through the "-h" key. You can run the program right now and see it:

    C:\Go\src\habratest>habratest.exe -h
    Usage of habratest.exe:
      -d=500: number of duplicates to stop at
      -hf="hash.bin": hash file
      -qf="quotes.txt": quotes file
      -r=10: report period (seconds)
      -w=2: number of threads
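
Following the advice from the update above, here is a minimal sketch of the same flags parsed from main() rather than init(). The `config` type, the `parseArgs` helper and the description strings are assumptions made for this illustration; `flag.NewFlagSet` is used so that parsing works on an explicit argument slice and does not touch global state.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// config groups the grabber's parameters (a hypothetical type for this sketch).
type config struct {
	workers      int
	reportPeriod int
	dupToStop    int
	hashFile     string
	quotesFile   string
}

// parseArgs parses args (without the program name) into a config.
// The defaults mirror the ones declared in the article.
func parseArgs(args []string) (config, error) {
	cfg := config{}
	fs := flag.NewFlagSet("habratest", flag.ContinueOnError)
	fs.IntVar(&cfg.workers, "w", 2, "number of threads")
	fs.IntVar(&cfg.reportPeriod, "r", 10, "report period (seconds)")
	fs.IntVar(&cfg.dupToStop, "d", 500, "number of duplicates to stop at")
	fs.StringVar(&cfg.hashFile, "hf", "hash.bin", "hash file")
	fs.StringVar(&cfg.quotesFile, "qf", "quotes.txt", "quotes file")
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	// Parsing happens here, in main, once all flags are declared.
	cfg, err := parseArgs(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```

A side benefit of this layout is that the parameters travel in one value instead of several package-level variables, which makes the parsing easy to exercise with any argument list.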

Next, according to the plan, the hashes are loaded when the program starts, but I suggest returning to that later, when we have something to load. For now, let's pretend that every launch is the first.

Now let's think about the program itself: it has to read the page, parse it, record the results, and track its progress. And do it in several "threads" at that (let's agree on terminology right away: by "thread" I mean a thread of execution, not a stream. Well, just in case). Ha!

Multitasking in Go is implemented with goroutines: we can run any function "in the background" by putting the keyword "go" in front of the call. The function starts and program execution continues immediately, much like running a command with & at the end in Linux, so we can get on with something else right away:

    // Suppose we have some function...
    func GadgetUmbrella() {
        // ...that does something slow...
    }

    // ...then somewhere in the program we can start it in the background and carry on at once:
    go GadgetUmbrella()
    fmt.Println("Carrying on!") // this line runs right away, without waiting for GadgetUmbrella() to finish

Strictly speaking, goroutines are not exactly threads, and everything is much more interesting there, but such things are clearly beyond the scope of our task.
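
One practical consequence of "start and carry on" is that nobody waits for a goroutine unless asked to: when main returns, the program exits even if background work is unfinished. The sketch below is not part of the article's program; it assumes the standard sync.WaitGroup as one common way to wait, and `runWorkers` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sync"
)

// runWorkers starts n goroutines, waits for all of them to finish,
// and returns how many of them actually ran.
func runWorkers(n int) int {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		finished int
	)
	for i := 0; i < n; i++ {
		wg.Add(1) // one more goroutine to wait for
		go func() {
			mu.Lock() // the counter is shared, so guard it
			finished++
			mu.Unlock()
			wg.Done() // tell the WaitGroup this goroutine is done
		}()
	}
	wg.Wait() // block until every goroutine has called Done
	return finished
}

func main() {
	fmt.Println(runWorkers(5)) // prints 5
}
```

The mutex here is exactly the kind of manual synchronization that channels, introduced next, let us avoid.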

I suggest putting into a separate "thread" a function that, in an infinite loop, loads and parses the page and hands over the finished quote; let's call it the "grabber". Let it spin in the background in several copies while the main program catches these quotes in a loop, does what it needs with them and, under certain conditions, shuts the work down. It sounds suspiciously simple, especially to those who already have experience with multithreaded programming and are ready to cry out in horror about the many subtleties of sharing and synchronization.

But in reality it really is simple, because another great feature of Go is channels. Put simply, a channel is easy to picture as a pipe: a value is thrown in at one end and caught at the other. Channels are typed and require initialization; in short, working with them looks like this:

    func send(c chan int) {
        c <- 15 // send a value into the channel
    }

    func main() {
        ch := make(chan int) // create a channel for passing integers (int)
        go send(ch)
        b := <-ch // receive a value from the channel
        fmt.Println(b)
    }

Note that I wrote go send(ch) and not just send(ch). By default, sending to and receiving from a channel blocks the calling goroutine until the other end is ready to handle the data. This makes for a great synchronization tool.
It follows that if you remove the "go", send() will run in the main thread and block it, waiting until someone is ready to pick up our number 15. We do pick it up on the next line, but control will never reach it. Deadlock, and everyone is sad.
With "go send()" everything goes smoothly: send() does block, but in its own thread, and it waits until we read in another one. We do so very soon, and the data exchange succeeds.
Likewise, if you remove the write to the channel in send(), the main function will hang for good on the line b := <-ch, because this time there will be nothing for it to receive.
Channels are first-class objects, so they can be created, passed around, returned and assigned as convenient.
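
To illustrate channels as first-class values, here is a small sketch in which one function returns a channel and another accepts it. The names `countTo` and `sum` are made up for the example; it also uses `close` and `range`, which the article's program does not need because its grabber never finishes.

```go
package main

import "fmt"

// countTo returns a receive-only channel that yields 1..n and is then closed.
func countTo(n int) <-chan int {
	ch := make(chan int)
	go func() {
		for i := 1; i <= n; i++ {
			ch <- i // blocks until the receiver takes the value
		}
		close(ch) // signals that no more values will come
	}()
	return ch
}

// sum drains the channel and adds up everything it receives.
func sum(in <-chan int) int {
	total := 0
	for v := range in { // the loop ends when the channel is closed
		total += v
	}
	return total
}

func main() {
	fmt.Println(sum(countTo(4))) // prints 10
}
```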

The "grabber" will send the data, and we will receive it in the main loop.

In the page source, the quote simply sits in a div with the class "fi_text". To get it in the "grabber" we will use the goquery package, which lets you parse an HTML page and access its contents through jQuery-style selectors. This package is not part of the standard library, so it has to be installed first:

    # Install the package:
    go get github.com/opesun/goquery

In the import section of the code we add the packages "github.com/opesun/goquery", "strings" and "time". We need the last one for a delay, so that we do not hammer the server with requests nonstop (well, we will anyway, but you get the idea):

    import (
        "flag"
        "fmt"
        "github.com/opesun/goquery"
        "strings"
        "time"
    )


Down to business: let's write the "grabber" code:

    func grab() <-chan string { // the function returns a channel from which strings can be read
        c := make(chan string)
        for i := 0; i < WORKERS; i++ { // start as many workers as were requested
            go func() {
                for { // an endless loop of fetching and parsing
                    x, err := goquery.ParseUrl("http://vpustotu.ru/moderation/")
                    if err == nil {
                        if s := strings.TrimSpace(x.Find(".fi_text").Text()); s != "" {
                            c <- s // send the quote into the channel
                        }
                    }
                    time.Sleep(100 * time.Millisecond)
                }
            }()
        }
        fmt.Println("Threads started: ", WORKERS)
        return c
    }

The code is very simple and needs almost no comment.
We use a closure to launch the required number of goroutines, each running an anonymous function that keeps sending data into the channel returned by grab(). This pattern is called a generator.
For some reason, x.Find(".fi_text").Text() gave me the element's contents with whitespace at the beginning and end, so without further ado we strip it with the TrimSpace function from the standard strings package.
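
The same generator can be sketched without any network access by passing the fetching step in as a function. Everything below is an assumption made for illustration: `grabWith`, the `fetch` parameter, the `done` channel for stopping the workers, and the fake fetcher in `firstQuotes`. The article's grab() has no stop mechanism and simply dies with the program; the sketch also omits the delay between requests and uses select, which the article introduces a bit later.

```go
package main

import (
	"fmt"
	"strings"
)

// grabWith starts `workers` goroutines that call fetch in a loop and send
// every non-empty, trimmed result into the returned channel.
// Closing done makes the workers exit.
func grabWith(workers int, fetch func() (string, error), done <-chan struct{}) <-chan string {
	c := make(chan string)
	for i := 0; i < workers; i++ {
		go func() {
			for {
				s, err := fetch()
				s = strings.TrimSpace(s)
				if err != nil || s == "" {
					// Nothing usable: stop if asked to, otherwise try again.
					select {
					case <-done:
						return
					default:
						continue
					}
				}
				select {
				case c <- s: // hand the quote over to whoever is reading
				case <-done: // or give up if the consumer has finished
					return
				}
			}
		}()
	}
	return c
}

// firstQuotes collects n quotes from a fake fetcher and then stops the workers.
func firstQuotes(n int) []string {
	done := make(chan struct{})
	defer close(done)
	fake := func() (string, error) { return "  some quote \n", nil }
	quotes := grabWith(2, fake, done)
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, <-quotes)
	}
	return out
}

func main() {
	fmt.Printf("%q\n", firstQuotes(3)) // prints ["some quote" "some quote" "some quote"]
}
```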

So, the generator is ready. You can check that it works by modifying main():

    func main() {
        quote_chan := grab()
        for i := 0; i < 5; i++ { // 5 quotes are enough for a test
            fmt.Println(<-quote_chan, "\n")
        }
    }

We run it and see that everything is going to plan: revelations pour into our channel in a broad stream!


Now let's think through the main loop, in which, according to the plan, the values are collected. According to the requirements and wishes, we need a loop that:



Collecting is no longer a problem, so let's settle the hashes.
For simplicity, I suggest taking the md5 of a quote. We will store the hashes in a map (the built-in key-value structure), so lookups will be easy and fast. Checking uniqueness and counting statistics and repeats is then a matter of technique. Working with files in Go is no different from other languages, so no problems there either.
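
As a rough illustration of the "hash plus map" idea, the sketch below keeps the raw 16-byte digest as the map key. This is a variation rather than the article's code, which stores hex strings in a map[string]bool; the `seenSet` type and its `add` method are names invented for the example.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// seenSet remembers the MD5 digests of the strings it has been shown.
// A fixed-size array is comparable, so it can serve as a map key directly.
type seenSet map[[md5.Size]byte]struct{}

// add reports whether s is new, and remembers it if so.
func (set seenSet) add(s string) bool {
	digest := md5.Sum([]byte(s))
	if _, dup := set[digest]; dup {
		return false
	}
	set[digest] = struct{}{}
	return true
}

func main() {
	set := seenSet{}
	fmt.Println(set.add("first"), set.add("second"), set.add("first")) // prints true true false
}
```

MD5 is fine here because the goal is only to spot repeats, not to resist deliberate collisions.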

The periodic report can be implemented with a Ticker from the standard "time" package. It is a very simple timer that fires after each specified interval and sends a value into its channel; all you have to do is watch the channel and print the statistics when data arrives.

And we will catch the termination command with the os/signal package, which lets you subscribe to certain signals and delivers event notifications into a channel.

The plan is ready, but there is one catch: we have three different channels we want to receive from, yet, as said earlier, a thread blocks while waiting to read, so we can wait on at most one channel at a time.
But Go does not eat its CPU time for nothing: another great tool is the select construct.
select lets you wait for data on any number of channels at once; execution blocks only until data arrives on one of them, and then the matching case is processed. Just what we need!

Let's get down to the code! First we add the necessary packages to the import section, which now looks like this:

    import (
        "flag"
        "fmt"
        "github.com/opesun/goquery"
        "strings"
        "time"
        // packages for working with files and signals:
        "io"
        "os"
        "os/signal"
        // and for computing hashes:
        "crypto/md5"
        "encoding/hex"
    )

Yes, quite a lot... Given Go's static linking, the size of the executable should be impressive!
But there is no time to be sad; let's declare the storage for hashes:

    var (
        ...
        used map[string]bool = make(map[string]bool) // a map whose keys are hash strings and whose values are booleans
    )

And finally, our main function turns into:
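
    func main() {
        // Open the quotes file...
        // (os.O_WRONLY is needed: without it the file is opened read-only and writes fail)
        quotes_file, err := os.OpenFile(QUOTES_FILE, os.O_WRONLY|os.O_APPEND|os.O_CREATE, 0666)
        check(err)
        defer quotes_file.Close()

        // ...and the hash file
        hash_file, err := os.OpenFile(HASH_FILE, os.O_WRONLY|os.O_APPEND|os.O_CREATE, 0666)
        check(err)
        defer hash_file.Close()

        // Create a Ticker that will tell us when it is time to report
        ticker := time.NewTicker(time.Duration(REPORT_PERIOD) * time.Second)
        defer ticker.Stop()

        // Create a channel that will catch the program being interrupted, and subscribe it to the signal...
        key_chan := make(chan os.Signal, 1)
        signal.Notify(key_chan, os.Interrupt)

        // ...and everything we need for hashing
        hasher := md5.New()

        // Counters for quotes and duplicates
        quotes_count, dup_count := 0, 0

        // That's it, let's go!
        quotes_chan := grab()
        for {
            select {
            case quote := <-quotes_chan: // a new quote has "arrived":
                quotes_count++
                // compute its hash and convert it to a string:
                hasher.Reset()
                io.WriteString(hasher, quote)
                hash := hasher.Sum(nil)
                hash_string := hex.EncodeToString(hash)
                // check whether it is unique
                if !used[hash_string] {
                    // all good: remember the hash and write it and the quote to the files
                    used[hash_string] = true
                    hash_file.Write(hash)
                    quotes_file.WriteString(quote + "\n\n\n")
                    dup_count = 0
                } else {
                    // a repeat: bump the counter. Have there been enough of them?
                    if dup_count++; dup_count == DUP_TO_STOP {
                        fmt.Println("Duplicate limit reached, stopping. Total records: ", len(used))
                        return
                    }
                }
            case <-key_chan: // an interrupt signal has arrived:
                fmt.Println("CTRL-C: stopping. Total records: ", len(used))
                return
            case <-ticker.C: // looks like it is time to report
                fmt.Printf("Total %d / Duplicates %d (%d records/sec) \n", len(used), dup_count, quotes_count/REPORT_PERIOD)
                quotes_count = 0
            }
        }
    }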

The code is completely transparent; we will dwell only on opening the files.
The check() function is not a standard one: it is here to make it easy to check the result of opening a file for errors. Here is its code; put it somewhere before main():

    func check(e error) {
        if e != nil {
            panic(e)
        }
    }

Another interesting point: we call for the file to be closed "right away", although we are going to work with it later:

    defer quotes_file.Close()
    ...
    defer hash_file.Close()
    ...
    // the same story here:
    defer ticker.Stop()

The point is that by calling a function with defer we postpone its execution: it runs only when the function containing the defer statement is about to finish (in other words, right before the "parent" returns).
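
A tiny sketch makes the timing visible. The `run` and `order` functions are invented for this example; they record the moment each piece of code actually executes. It also shows a detail the article does not mention: when there are several deferred calls, they run in reverse order of registration.

```go
package main

import (
	"fmt"
	"strings"
)

// run appends a note to log at the moment each piece of code actually executes.
func run(log *[]string) {
	defer func() { *log = append(*log, "first defer") }()
	defer func() { *log = append(*log, "second defer") }()
	*log = append(*log, "body")
}

// order returns the sequence in which the parts of run were executed.
func order() []string {
	var log []string
	run(&log)
	return log
}

func main() {
	fmt.Println(strings.Join(order(), ", ")) // prints body, second defer, first defer
}
```

For main() in our program this means the ticker is stopped first, then the hash file is closed, then the quotes file.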

You can already run it and rejoice in carrying out the "terribly necessary mission", but one small detail remains: we need a function that reads the hashes from the file, in case we restart the program and do not want to see duplicates in the resulting file. Here is one way to do it:

    func readHashes() {
        // check whether the hash file exists
        if _, err := os.Stat(HASH_FILE); err != nil {
            if os.IsNotExist(err) {
                fmt.Println("Hash file not found, a new one will be created.")
                return
            }
        }

        fmt.Println("Reading hashes...")
        hash_file, err := os.OpenFile(HASH_FILE, os.O_RDONLY, 0666)
        check(err)
        defer hash_file.Close()
        // read in blocks of 16 bytes, exactly one hash each:
        data := make([]byte, 16)
        for {
            n, err := hash_file.Read(data) // n is the number of bytes read, err is the error, if any
            if err != nil {
                if err == io.EOF {
                    break
                }
                panic(err)
            }
            if n == 16 {
                used[hex.EncodeToString(data)] = true
            }
        }

        fmt.Println("Done. Hashes read: ", len(used))
    }

That's all; it only remains not to forget to put the readHashes() call at the beginning of main():

    func main() {
        readHashes()
        ...
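
In general, Read is allowed to return fewer bytes than the buffer holds, so a slightly more defensive variant of the reading loop can be sketched with io.ReadFull. This is an alternative, not the article's code: `loadHashes` and `demoLoad` are hypothetical names, the function takes any io.Reader instead of opening HASH_FILE itself, and silently ignoring a truncated final record is an assumption about the desired behavior.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"io"
)

const hashSize = 16 // an MD5 digest is 16 bytes long

// loadHashes reads consecutive 16-byte records from r into a set of hex strings.
func loadHashes(r io.Reader) (map[string]bool, error) {
	used := make(map[string]bool)
	buf := make([]byte, hashSize)
	for {
		_, err := io.ReadFull(r, buf) // fills buf completely or returns an error
		if err == io.EOF {
			return used, nil // clean end of data
		}
		if err == io.ErrUnexpectedEOF {
			return used, nil // a truncated final record is ignored
		}
		if err != nil {
			return used, err
		}
		used[hex.EncodeToString(buf)] = true
	}
}

// demoLoad feeds two full records and three stray bytes to loadHashes
// and returns the number of hashes it found (-1 on error).
func demoLoad() int {
	data := append(bytes.Repeat([]byte{0xAA}, hashSize), bytes.Repeat([]byte{0xBB}, hashSize)...)
	data = append(data, 0xCC, 0xCC, 0xCC) // incomplete trailing record
	used, err := loadHashes(bytes.NewReader(data))
	if err != nil {
		return -1
	}
	return len(used)
}

func main() {
	fmt.Println(demoLoad()) // prints 2
}
```

Taking an io.Reader also means the function can be tried on an in-memory buffer, as demoLoad does, without touching the disk.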


Done! Combat launches:


Results of work:


Creating the files works, and so does rereading the hashes on restart. It stops on command, and it knows when to stop on its own too. The executable is, of course, rather big, but, well, Go with it.
Our program for downloading all sorts of nonsense (that is, very important data) in the dead of night is ready and does what we wanted from it.
Of course, there is still a huge list of possible improvements, but for simply carrying out the combat mission this code is more than enough!

The complete program code is on Pastebin.com

If you are interested in Go and its capabilities, here are a few places worth visiting:

tour.golang.org - Online language tour, editing the code directly in the browser is very exciting.

gobyexample.com - Examples of solving typical tasks

golang.org/ref/spec - Language Specification

golang.org/doc/effective_go.html - An article on the theme of "and now let's try to take off with all this stuff": how to take off and fly with Go

Google Groups dedicated to Go: " https://groups.google.com/forum/#!forum/golang-nuts " and " https://groups.google.com/forum/#!forum/golang-ru "

godoc.org - Search for Go packages and their documentation, both built-in and third-party; an excellent cure for reinventing the wheel!

Good luck!

Source: https://habr.com/ru/post/197598/

