
Pirozhki in Go





Continuing the epic of distributional semantics and pirozhki (short comic poems), and chasing fashion trends, I decided to rewrite the web service from laconic Python into trendy Go. That meant porting the whole "intelligent" part along with it (thankfully, it's not rocket science). It turned out much easier and more pleasant than I expected at the start. Still, a spoonful of tar found its way into this barrel of syntactic honey: the fastest matrix "cruncher" I could find for Go (mat from gonum) was still slower than the Python combination of numba + numpy.



To pull this off, I needed to: load the word2vec model, read the "poetic" model, hook up a morphological analyzer, port the "intelligent" part, and wrap it all in a web service.





Loading word2vec models



Everything is simple here: we read the vocabulary and its vectors from the binary file, normalizing each vector along the way and building a map from word to vector index. The map gives fast lookup of a vector by word. Normalization saves time when computing cosine similarity: comparing two words reduces to a dot product, and comparing two bags of words to a matrix multiplication.


Code
type W2VModel struct {
	Words   int
	Size    int
	Vocab   []string
	WordIdx map[string]int
	Vec     [][]float32
}

func (m *W2VModel) Load(fn string) {
	file, err := os.Open(fn)
	if err != nil {
		log.Fatal(err)
	}
	// header: vocabulary size and vector dimension
	fmt.Fscanf(file, "%d", &m.Words)
	fmt.Fscanf(file, "%d", &m.Size)
	var ch string
	m.Vocab = make([]string, m.Words)
	m.Vec = make([][]float32, m.Words)
	m.WordIdx = make(map[string]int)
	for b := 0; b < m.Words; b++ {
		m.Vec[b] = make([]float32, m.Size)
		// the word, its separator, then the raw little-endian float32 vector
		fmt.Fscanf(file, "%s%c", &m.Vocab[b], &ch)
		m.WordIdx[m.Vocab[b]] = b
		binary.Read(file, binary.LittleEndian, m.Vec[b])
		// normalize to unit length so cosine similarity becomes a plain dot product
		length := 0.0
		for _, v := range m.Vec[b] {
			length += float64(v * v)
		}
		length = math.Sqrt(length)
		for i := range m.Vec[b] {
			m.Vec[b][i] /= float32(length)
		}
	}
	file.Close()
}
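Since Load leaves every vector at unit length, the cosine similarity of two words is just a dot product over their vectors. A minimal sketch of such a comparison; WordSim is a hypothetical helper for illustration, not part of the original project:

// WordSim is a hypothetical helper: cosine similarity of two words
// from a loaded W2VModel. Correct only because Load() has already
// scaled every vector to unit length.
func (m *W2VModel) WordSim(w1, w2 string) float32 {
	i1, ok1 := m.WordIdx[w1]
	i2, ok2 := m.WordIdx[w2]
	if !ok1 || !ok2 {
		return 0 // out-of-vocabulary words match nothing
	}
	var dot float32
	for k := 0; k < m.Size; k++ {
		dot += m.Vec[i1][k] * m.Vec[i2][k]
	}
	return dot
}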


Reading "poetic" model



Reading a JSON file (prepared in advance in Python) into Go structs and slices is even simpler; the main thing is not to forget to capitalize field names so they are exported. And to keep queries fast, we build a matrix for each poem's bag of words right away, while the model is loading.



Code
type PoemModel struct {
	Poems    []string   `json:"poems"`
	Bags     [][]string `json:"bags"`
	W2V      W2VModel
	Vectors  [][][]float32
	Matrices []mat.Matrix
}

func (pm *PoemModel) LoadJsonModel(fileName string) error {
	file, err := ioutil.ReadFile(fileName)
	if err != nil {
		return err
	}
	err = json.Unmarshal(file, pm)
	if err != nil {
		return err
	}
	return nil
}

func (pm *PoemModel) Matricize() {
	pm.Matrices = make([]mat.Matrix, len(pm.Bags))
	for idx, bag := range pm.Bags {
		data, rows := pm.TokenVectorsData(bag)
		// store transposed (Size x rows), ready to be right-multiplied by a query matrix
		pm.Matrices[idx] = mat.NewDense(rows, pm.W2V.Size, data).T()
	}
}
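Put together, startup could look roughly like this. A sketch only: the file names are placeholders, and TokenVectorsData (which, judging by its use above, flattens a bag's word vectors into a float64 slice plus a row count) lives in the project's source:

// loadModels is an illustrative helper, not the project's actual code.
func loadModels() (*PoemModel, error) {
	pm := &PoemModel{}
	pm.W2V.Load("w2v.bin") // binary word2vec model (placeholder name)
	if err := pm.LoadJsonModel("poems.json"); err != nil { // poems + bags of words
		return nil, err
	}
	pm.Matricize() // precompute one matrix per poem, once, at startup
	return pm, nil
}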


Morphological analyzer



The world is not without good people: someone had already ported pymorphy2 to Go. I did have to add a couple of lines to its source, though, because installing the morphological dictionaries with the Python package manager and then hunting for them through Python is, to put it mildly, not comme il faut. To be safe, I dropped the dictionaries (along with the patched analyzer) straight into my project.
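As used in the tokenizer below, the analyzer exposes a Parse function returning parallel slices of word forms, normal forms, and tag strings. A hypothetical probe (the word is an arbitrary Russian example, and the exact output depends on the dictionaries):

words, norms, tags := morph.Parse("стали")
for i := range norms {
	// each parse variant: surface form, lemma, comma-separated grammeme string
	fmt.Printf("%s -> %s (%s)\n", words[i], norms[i], tags[i])
}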



"Intellectual" part



The tokenizer converts words to their normal form (lemmatization), appends the grammatical suffixes the word2vec model expects (_NOUN, _VERB, _ADJ, and so on), and drops stop words (pronouns, prepositions, particles).



Code
func (pm *PoemModel) TokenizeWords(words []string) []string {
	// part-of-speech suffixes in the form the word2vec model expects
	POS_TAGS := map[string]string{
		"NOUN": "_NOUN",
		"VERB": "_VERB", "INFN": "_VERB", "GRND": "_VERB", "PRTF": "_VERB", "PRTS": "_VERB",
		"ADJF": "_ADJ", "ADJS": "_ADJ",
		"ADVB": "_ADV",
		"PRED": "_ADP",
	}
	// prepositions, conjunctions, particles, pronouns, numerals are stop words
	STOP_TAGS := map[string]bool{"PREP": true, "CONJ": true, "PRCL": true, "NPRO": true, "NUMR": true}

	result := make([]string, 0, len(words))
	for _, w := range words {
		_, morphNorms, morphTags := morph.Parse(w)
		if len(morphNorms) == 0 {
			continue
		}
		suffixes := make(map[string]bool) // suffixes already added for this word
		for i, tags := range morphTags {
			norm := morphNorms[i]
			tag := strings.Split(tags, ",")[0]
			if _, hasStopTag := STOP_TAGS[tag]; hasStopTag {
				break
			}
			suffix, hasPosTag := POS_TAGS[tag]
			_, hasSuffix := suffixes[suffix]
			if hasPosTag && !hasSuffix {
				result = append(result, norm+suffix)
				suffixes[suffix] = true
			}
		}
	}
	return result
}
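To give a feel for the output format (the input is an arbitrary example; the actual lemmas depend on the dictionaries):

tokens := pm.TokenizeWords([]string{"мама", "мыла", "раму"})
fmt.Println(tokens) // something like ["мама_NOUN", "мыть_VERB", "рама_NOUN"]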


The search for semantically "resonating" pirozhki multiplies the matrix of vectors built from the query words by each of the poem matrices prepared at load time. Each product matrix is summed into a single number and normalized by the total count of word vectors in the two multiplied matrices, i.e. resonance = sum(Q x P) / (n_query + n_poem). The resulting "resonance" scores, tied to poem indexes, are sorted in descending order, and the top entries are returned.



Code
func (pm *PoemModel) SimilarPoemsMx(queryWords []string, topN int) []string {
	simPoems := make([]string, 0, topN)
	tokens := pm.TokenizeWords(queryWords)
	queryData, queryVecsN := pm.TokenVectorsData(tokens)
	if len(tokens) == 0 || topN <= 0 || queryVecsN == 0 {
		return simPoems
	}
	queryMx := mat.NewDense(queryVecsN, pm.W2V.Size, queryData)

	type PoemSimilarity struct {
		Idx int
		Sim float64
	}
	sims := make([]PoemSimilarity, len(pm.Bags))
	for idx := range pm.Bags {
		var resMx mat.Dense
		bagMx := pm.Matrices[idx]
		_, poemVecsN := bagMx.Dims()
		// all pairwise word similarities: (queryVecsN x Size) * (Size x poemVecsN)
		resMx.Mul(queryMx, bagMx)
		sim := mat.Sum(&resMx)
		// normalize by the number of vectors involved
		if poemVecsN > 0 {
			sim /= float64(poemVecsN + queryVecsN)
		}
		sims[idx].Idx = idx
		sims[idx].Sim = sim
	}
	sort.Slice(sims, func(i, j int) bool {
		return sims[i].Sim > sims[j].Sim
	})
	// guard against asking for more poems than the model holds
	if topN > len(sims) {
		topN = len(sims)
	}
	for i := 0; i < topN; i++ {
		simPoems = append(simPoems, pm.Poems[sims[i].Idx])
	}
	return simPoems
}
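Calling the search is then a one-liner (the query words here are placeholders):

top := pm.SimilarPoemsMx([]string{"осень", "дождь", "грусть"}, 5)
for _, poem := range top {
	fmt.Println(poem)
	fmt.Println("---")
}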


Web service



For the web part I used the gin-gonic package: a router, static files, CORS, it has everything.
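The real handlers live in the repository; below is only a minimal sketch of how such a service could be wired up with gin-gonic plus the gin-contrib/cors middleware. The route, query parameter, port, and the loadModels helper (from the loading sketch above) are illustrative assumptions, not the project's actual API:

package main

import (
	"net/http"
	"strings"

	"github.com/gin-contrib/cors"
	"github.com/gin-gonic/gin"
)

func main() {
	// loadModels is the hypothetical startup helper sketched earlier
	pm, err := loadModels()
	if err != nil {
		panic(err)
	}

	r := gin.Default()
	r.Use(cors.Default()) // allow cross-origin requests from the frontend

	// e.g. GET /similar?q=осень дождь returns the top 5 "resonating" poems
	r.GET("/similar", func(c *gin.Context) {
		words := strings.Fields(c.Query("q"))
		c.JSON(http.StatusOK, gin.H{"poems": pm.SimilarPoemsMx(words, 5)})
	})

	r.Static("/static", "./static") // frontend assets
	r.Run(":8080")
}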



Project on GitHub



Service to try

Source: https://habr.com/ru/post/343788/


