
Building client routing / semantic search and clustering arbitrary external corpuses at Profi.ru


TL;DR


Work done by 2 people from the DS department over a bit more than a couple of months (a lot of one-off tasks had to be done at first).


Projected goals


  1. Understand what the client is looking for and route them to the proper service, even when the client cannot phrase the request precisely (client routing / intent classification);
  2. Find totally new services and synonyms for the existing services;
  3. As a sub-goal of (2) - learn to build proper clusters on arbitrary external corpuses;

Achieved goals


It was clear early on that we would not be able to collect large external corpuses ourselves (that would require enough proxies + probably some experience with Selenium).


Business goals:


  1. ~ 88+% (vs ~ 60% with elastic search) accuracy for client routing / intent classification (~ 5k classes);
  2. Search is agnostic to input quality (misprints / partial input);
  3. The classifier generalizes well, since the morphologic structure of the language is exploited;
  4. Classifier severely beats elastic on various benchmarks (see below);
  5. New services were found + at least 15,000 new synonyms (vs. the current state of ~5,000 services + ~30,000 synonyms). I expect this figure to double;

The last bullet is a ballpark estimate, but a conservative one.
Also, A/B tests will follow, but I am confident in these results.


"Scientific" goals:


  1. We thoroughly benchmarked our models against the database of service synonyms;
  2. We are able to beat the weakly supervised (essentially a large bag-of-n-grams) elastic search baseline on this benchmark (see details below) using UNSUPERVISED methods;
  3. We developed our own approach to building applied NLP models (essentially an RNN combined with an n-gram embedding bag);
  4. We demonstrated that our final embedding technique, combined with state-of-the-art unsupervised algorithms (UMAP + HDBSCAN), can produce stellar clusters;
  5. We demonstrated the practical feasibility and usability of:
    • Knowledge distillation;
    • Augmentations for text data (sic!);
  6. Training text-based classifiers with dynamic (on-the-fly) augmentations reduced convergence time drastically (~10x) compared to pre-generating large static augmented datasets;
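To illustrate the last point, here is a minimal sketch (hypothetical names, not our production code) of dynamic augmentation: each epoch sees a freshly corrupted copy of every sample, so no large static augmented corpus needs to be generated or stored:

```python
import random

def augment(text, p=0.3, seed=None):
    """Randomly corrupt ~p of the words (a toy stand-in for our
    typo / misprint augmentations); word count is preserved."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        if len(w) > 3 and rng.random() < p:
            i = rng.randrange(len(w) - 1)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]  # swap adjacent chars
        out.append(w)
    return " ".join(out)

class DynamicAugmentDataset:
    """Each access returns a *different* corrupted version of a sample,
    so augmentation happens on the fly during training."""
    def __init__(self, texts, labels):
        self.texts, self.labels = texts, labels
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        return augment(self.texts[idx]), self.labels[idx]
```

Plugged into a PyTorch `DataLoader`, the corruption cost is paid per batch by CPU workers instead of up front on disk.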

Overall project structure


This does not include the final classifier.
We also later resolved the classifier bottleneck.



What works in NLP now?


A bird's-eye view:


Also you may know that NLP may be experiencing its ImageNet moment now.


Large scale UMAP hack


You can scale UMAP to 100m+ point (or maybe even 1bn) sized datasets. Essentially: build a KNN graph with FAISS, then just rewrite the main UMAP loop in PyTorch using your GPU. We abandoned this (we only have 10-15m points after all), but please follow this thread for details.
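A sketch of the first step, the KNN graph, using brute-force NumPy so the snippet is self-contained. At 100m+ points this exact step is what FAISS (e.g. an approximate IVF index) would replace; the (indices, distances) output is all the rewritten UMAP loop consumes:

```python
import numpy as np

def knn_graph(X, k=5):
    """Exact KNN graph via brute force. FAISS replaces exactly this step
    at scale; downstream UMAP only needs the (indices, distances) pairs."""
    sq = (X ** 2).sum(1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # pairwise squared L2
    np.fill_diagonal(d2, np.inf)                    # exclude self-matches
    idx = np.argpartition(d2, k, axis=1)[:, :k]     # k nearest (unordered)
    dist = np.sqrt(np.maximum(np.take_along_axis(d2, idx, axis=1), 0.0))
    return idx, dist

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32)).astype("float32")
idx, dist = knn_graph(X, k=5)
```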


What works best



Best classifier benchmarks


Manually annotated dev set


Left to right:
(Top1 accuracy)



Manually annotated dev set + 1-3 errors per query


Left to right:
(Top1 accuracy)



Manually annotated dev set + partial input


Left to right:
(Top1 accuracy)



Large scale corpuses / n-gram selection



Stress test of our 1M n-grams on 100M vocabulary:


Text augmentations


In a nutshell:



In short, with the right approach you can generate high-quality augmentations essentially for free. Misprints affected 30-50% of the words we had on some corpuses.


Our approach is far superior if you have access to a large domain vocabulary.
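As a toy illustration of what such a vocabulary-driven misprint generator might look like (the names and the tiny adjacency map are hypothetical; the article does not publish the actual implementation): corrupt a word with a keyboard-neighbor substitution, and use the domain vocabulary to reject corruptions that collide with another real word:

```python
import random

# tiny keyboard-adjacency map (extend for a full ru/en layout)
ADJACENT = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx",
    "e": "wsdr", "r": "edft", "t": "rfgy",
}

def misprint(word, vocab, rng):
    """Replace one char with a keyboard neighbor; reject corruptions
    that accidentally form another valid vocabulary word."""
    for _ in range(10):  # a few attempts
        i = rng.randrange(len(word))
        c = word[i]
        if c not in ADJACENT:
            continue
        cand = word[:i] + rng.choice(ADJACENT[c]) + word[i + 1:]
        if cand not in vocab:
            return cand
    return word  # give up, keep the word intact

rng = random.Random(42)
vocab = {"tree", "free", "trees"}
out = misprint("tree", vocab, rng)
```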


Best unsupervised / semi-supervised results


KNN is used to compare different embedding methods.
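Roughly, the comparison protocol looks like this (a sketch with toy data; the real evaluation uses our annotated services / synonyms sets): embed everything, label each query with its nearest neighbor's class, and score top-1 accuracy; the embedder with the highest score wins:

```python
import numpy as np

def knn_top1_accuracy(train_emb, train_labels, test_emb, test_labels):
    """Nearest-neighbor (Euclidean) classification in embedding space."""
    d2 = ((test_emb[:, None, :] - train_emb[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(axis=1)                 # index of nearest train vector
    return float((train_labels[nn] == test_labels).mean())

# toy sanity check: two well-separated "service" clusters
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 0.1, (50, 8)),   # class 0
                 rng.normal(5.0, 0.1, (50, 8))])  # class 1
labels = np.array([0] * 50 + [1] * 50)
queries = emb + rng.normal(0.0, 0.01, emb.shape)  # slightly perturbed copies
acc = knn_top1_accuracy(emb, labels, queries, labels)
```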


List of models tested (with vector size):





To avoid leaks, everything was randomly sampled. The comparison was done against the services / synonyms database. Wikipedia was used to obtain vocabularies (the Wikipedia sentences themselves were not embedded).


Cluster visualization


3D


2D


Cluster exploration "interface"


Green - new word / synonym.
Gray background - likely new word.
Gray text - existing synonym.



What did not work / honorable mentions


  1. See the above charts;
  2. Plain average / tf-idf average of fast-text embeddings is a VERY formidable baseline;
  3. Fast-text > Word2Vec for Russian;
  4. Sentence embedding via fake sentence detection did not really help;
  5. BPE (sentencepiece) showed no improvement on our domain;
  6. Char level models struggled to generalize, despite the recent paper from Google;
  7. We tried a multi-head transformer on top of our LSTM-based models; once we migrated to the embedding bag it gave no further boost;
  8. BERT seems to be overkill for our task;
  9. ELMO seems not worth the effort;

Deploy


Done using:




Source: https://habr.com/ru/post/428674/

