
Search InterSystems documentation using iKnow and iFind technologies


InterSystems Caché has built-in iKnow technology for working with unstructured data, as well as iFind technology for full-text search. We decided to get to grips with both and do something useful at the same time. The result is DocSearch, a web application for searching the InterSystems documentation using iKnow and iFind.

How the documentation is organized in Caché


Caché documentation is built on DocBook technology. It ships with a web interface that includes a search, which uses neither iFind nor iKnow. The content of the documentation articles is stored in Caché classes, so you can query this data yourself and, consequently, write your own search utility.

What are iKnow and iFind


InterSystems iKnow is a tool for analyzing unstructured data that provides access to the data by indexing the sentences and entities contained in the text. To begin the analysis, you need to create a domain, which is a store for unstructured data, and load text into it. The process of creating a domain is well described here and here. The main ways of using iKnow are described here; I would also recommend this article.


The iFind technology is a Caché DBMS module for performing full-text search operations using data from Caché classes. iFind uses many of the features of iKnow to provide intelligent text search. To use iFind in queries, you need to define a special iFind index in the Caché class.


There are three types of iFind indexes. Each type provides all the functions of the previous type plus additional ones.


Since the documentation classes are stored in a separate namespace, the installer maps the packages and globals to make those classes available in our namespace.

Code for mapping in installer
XData Install [ XMLNamespace = INSTALLER ]
{
<Manifest>
  <!-- Default the name of the target namespace -->
  <IfNotDef Var="Namespace">
    <Var Name="Namespace" Value="DOCSEARCH"/>
    <Log Text="Set namespace to ${Namespace}" Level="0"/>
  </IfNotDef>

  <!-- Nothing to create if the namespace already exists -->
  <If Condition='(##class(Config.Namespaces).Exists("${Namespace}")=1)'>
    <Log Text="Namespace ${Namespace} already exists" Level="0"/>
  </If>

  <!-- Otherwise create the namespace -->
  <If Condition='(##class(Config.Namespaces).Exists("${Namespace}")=0)'>
    <Log Text="Creating namespace ${Namespace}" Level="0"/>
    <Namespace Name="${Namespace}" Create="yes" Code="${Namespace}" Ensemble="" Data="${Namespace}">
      <Log Text="Creating database ${Namespace}" Level="0"/>
      <!-- Create the database and map the DOCBOOK globals and packages into it -->
      <Configuration>
        <Database Name="${Namespace}" Dir="${MGRDIR}/${Namespace}" Create="yes" MountRequired="false" Resource="%DB_${Namespace}" PublicPermissions="RW" MountAtStartup="false"/>
        <Log Text="Mapping DOCBOOK to ${Namespace}" Level="0"/>
        <GlobalMapping Global="Cache*" From="DOCBOOK" Collation="5"/>
        <GlobalMapping Global="D*" From="DOCBOOK" Collation="5"/>
        <GlobalMapping Global="XML*" From="DOCBOOK" Collation="5"/>
        <ClassMapping Package="DocBook" From="DOCBOOK"/>
        <ClassMapping Package="DocBook.UI" From="DOCBOOK"/>
        <ClassMapping Package="csp" From="DOCBOOK"/>
      </Configuration>
      <Log Text="End creating database ${Namespace}" Level="0"/>
    </Namespace>
    <Log Text="End creating namespace ${Namespace}" Level="0"/>
  </If>
</Manifest>
}


The domain that iKnow needs is built on the table that contains the documentation. Since the data source is a table, we use SQL.Lister. The content field holds the text of the documentation, so we specify it as the data field. The remaining fields are passed as metadata.


Installer Domain Creation Code
 ClassMethod Domain(ByRef pVars, pLogLevel As %String, tInstaller As %Installer.Installer) As %Status
 {
     #Include %IKInclude
     #Include %IKPublic
     set ns = $Namespace
     znspace "DOCSEARCH"

     // Create the domain, or stop if it already exists
     set dname = "DocSearch"
     if (##class(%iKnow.Domain).Exists(dname) = 1) {
         write "The ", dname, " domain already exists", !
         zn ns
         quit $$$OK
     }
     write "The ", dname, " domain does not exist", !
     set domoref = ##class(%iKnow.Domain).%New(dname)
     do domoref.%Save()
     set domId = domoref.Id

     // The Lister finds the sources, the Loader processes them
     set flister = ##class(%iKnow.Source.SQL.Lister).%New(domId)
     set myloader = ##class(%iKnow.Source.Loader).%New(domId)

     // Query that selects the documentation rows
     set myquery = "SELECT id, docKey, title, bookKey, bookTitle, content, textKey FROM SQLUser.DocBook"
     set idfld = "id"
     set grpfld = "id"

     // Data field and metadata fields
     set dataflds = $LB("content")
     set metaflds = $LB("docKey", "title", "bookKey", "bookTitle", "textKey")

     // Queue the query results with the Lister
     set stat = flister.AddListToBatch(myquery, idfld, grpfld, dataflds, metaflds)
     if stat '= 1 {
         write "The lister failed: "
         do $System.Status.DisplayError(stat)
         zn ns
         quit stat
     }

     // Load the queued batch into the domain
     set stat = myloader.ProcessBatch()
     if stat '= 1 {
         zn ns
         quit stat
     }

     set numSrcD = ##class(%iKnow.Queries.SourceQAPI).GetCountByDomain(domId)
     write "Done", !
     write "Domain contains ", numSrcD, " source(s)", !
     zn ns
     quit $$$OK
 }


To search the documentation we use the %iFind.Index.Analytic index:


 Index contentInd On (content) As %iFind.Index.Analytic(LANGUAGE = "en", LOWER = 1, RANKERCLASS = "%iFind.Rank.Analytic"); 

Here contentInd is the index name and content is the name of the field the index is built on.
The LANGUAGE = "en" parameter specifies the language of the text.
The LOWER = 1 parameter makes the search case-insensitive.
The RANKERCLASS = "%iFind.Rank.Analytic" parameter enables ranking of the results with the TF-IDF algorithm.

After adding and building such an index, it can be used, for example, in SQL queries. The general syntax for using iFind in SQL is:


 SELECT * FROM TABLE WHERE %ID %FIND search_index(indexname,'search_items',search_option) 
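The general syntax above can be assembled programmatically. The following is a minimal Python sketch, not part of DocSearch: the helper name ifind_query is hypothetical, and it assumes that search_items and search_option are passed as quoted SQL string literals, as in the other queries in this article.

```python
def ifind_query(table: str, index: str, search_items: str, search_option: str = "0") -> str:
    """Compose an iFind SQL query that follows the general syntax shown above.

    `table` and `index` are trusted identifiers; `search_items` and
    `search_option` are embedded as SQL string literals.
    """
    # A single quote inside an SQL string literal is escaped by doubling it.
    items = search_items.replace("'", "''")
    option = search_option.replace("'", "''")
    return (
        f"SELECT * FROM {table} "
        f"WHERE %ID %FIND search_index({index},'{items}','{option}')"
    )


print(ifind_query("DocBook", "contentInd", "ifind"))
```

In a real application it is safer to pass user input through query parameters than to build SQL strings by hand; the string form is used here only to make the syntax visible.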

After an %iFind.Index.Analytic index with these parameters is created, several SQL procedures are generated, named [TableName]_[IndexName][ProcedureName].


In our project we use two of them, DocBook_contentIndHighlight and DocBook_contentIndRank. Their use is described below.

What we ended up with


  1. Autocomplete in the search bar


    As you type in the search bar, possible queries are suggested to help you find the information you need quickly. The suggestions are based on the word you entered (or on its initial part, if you have not finished typing it): the ten most similar words or phrases are shown.

    This is done with iKnow, using the %iKnow.Queries.Entity.GetSimilar method.


    image
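The idea behind the suggestions can be illustrated with a small Python sketch. It is a simplified stand-in and not how iKnow implements GetSimilar: the function suggest, the sample entities, and their frequencies are all hypothetical, and "similar" is assumed here to mean "contains a word that starts with the typed text".

```python
from collections import Counter


def suggest(prefix: str, entity_counts: Counter, limit: int = 10) -> list[str]:
    """Return up to `limit` entities that contain a word starting with `prefix`,
    most frequent first."""
    needle = prefix.lower()
    if not needle:
        return []
    matches = [
        (entity, count)
        for entity, count in entity_counts.items()
        if any(word.startswith(needle) for word in entity.lower().split())
    ]
    # Sort by descending frequency, then alphabetically for a stable order.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [entity for entity, _ in matches[:limit]]


# Hypothetical entity frequencies, standing in for what a domain would hold.
entities = Counter({"domain": 40, "domain name": 12, "default domain": 7, "dictionary": 5})
print(suggest("dom", entities))
```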

  2. Fuzzy search


    iFind supports fuzzy search, which finds words that almost match the search string. It is implemented by comparing the Levenshtein distance between two words. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to change one word into the other. Fuzzy search can be used to handle typos, small spelling variations, and different grammatical forms (such as singular and plural).


    In iFind SQL queries, the search_option parameter controls fuzzy search.
    The value search_option = 3 enables fuzzy search with a Levenshtein distance of two.

    To set the Levenshtein distance to n, specify search_option = '3:n'.
    The documentation search uses a Levenshtein distance of one. Here is how it works.

    Type the word ifind into the search:


    image

    Now let's try a fuzzy search with a misspelled word, ifindd. As we can see, the search corrected the typo and found the relevant articles.


    image
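To make the Levenshtein distance concrete, here is a minimal Python sketch of the standard dynamic-programming computation. It illustrates the definition only and is not iFind's internal implementation; the helper fuzzy_option is a hypothetical convenience that formats the '3:n' value described above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_option(max_distance: int) -> str:
    """Format the search_option value for fuzzy search with the given distance."""
    return f"3:{max_distance}"


print(levenshtein("ifind", "ifindd"))  # the typo from the example is one edit away
print(fuzzy_option(1))
```

Because ifindd is a single insertion away from ifind, a distance limit of one is enough to match it.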

  3. Complex queries


    Because iFind supports complex queries with parentheses and the AND, OR, and NOT operators, we implemented an advanced search. In it you can specify a word, a phrase, any of several words, or words that must not appear. You can fill in one field, several, or all of them at once.


    For example, let's find articles that contain the word iknow, the phrase rest api, and any of the words domain or UI.


    image

    We see that there are two such articles:


    image

    Note that the second article mentions Swagger UI, so we can extend the query to exclude articles that contain the word Swagger:


    image

    As a result, only one article was found:


    image
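The way the advanced-search fields could be combined into one search string can be sketched in Python. This is an illustration under assumptions, not the DocSearch code: the function build_search_items and its parameter names are hypothetical, and the exact operator syntax (quoted phrases, explicit AND, NOT applied to a single word) should be checked against the iFind documentation.

```python
def build_search_items(words=(), phrase="", any_of=(), none_of=()) -> str:
    """Combine advanced-search fields into a single boolean search string."""
    parts = list(words)
    if phrase:
        # A phrase is kept together by wrapping it in double quotes.
        parts.append(f'"{phrase}"')
    if any_of:
        # Alternatives are grouped in parentheses and joined with OR.
        parts.append("(" + " OR ".join(any_of) + ")")
    # Each excluded word becomes its own NOT term.
    parts.extend(f"NOT {word}" for word in none_of)
    return " AND ".join(parts)


# The example from the text: iknow, the phrase "rest api", domain or UI, without Swagger.
print(build_search_items(
    words=["iknow"],
    phrase="rest api",
    any_of=["domain", "UI"],
    none_of=["Swagger"],
))
```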

  4. Highlight Search Results


    As mentioned above, creating the iFind index generates the DocBook_contentIndHighlight procedure. We call it like this:


     SELECT DocBook_contentIndHighlight(%ID, 'search_items', '0', '<span class=""Illumination"">', 0) Text FROM DocBook 

    This returns the text with the matched search terms wrapped in the tag


     <span class="Illumination"> 

    This allows you to visually highlight search results on the frontend.


    image
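The shape of the highlighted output can be imitated with a short Python sketch. It is a simplified stand-in for the generated Highlight procedure, not a reimplementation of it: the function highlight is hypothetical, it matches whole words case-insensitively, and it assumes the closing tag is </span>.

```python
import re


def highlight(text: str, terms: list[str],
              open_tag: str = '<span class="Illumination">',
              close_tag: str = "</span>") -> str:
    """Wrap every whole-word, case-insensitive occurrence of `terms` in a tag."""
    if not terms:
        return text
    # Escape the terms so that characters such as "." are matched literally.
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return pattern.sub(lambda match: open_tag + match.group(0) + close_tag, text)


print(highlight("To use iFind in queries, define an iFind index.", ["ifind"]))
```

The frontend then only needs a CSS rule for the Illumination class to make the matches stand out.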

  5. Results Ranking Algorithm


    iFind can rank results using the TF-IDF algorithm. The TF-IDF measure is often used in text analysis and information retrieval, for example as one of the criteria for the relevance of a document to a search query.


    As a result of the SQL query, the Rank field contains the weight of the search word, which is proportional to the number of times the word occurs in the article and inversely proportional to how often the word occurs in other articles.


     SELECT DocBook_contentIndRank(%ID, 'SearchString', 'SearchOption') Rank FROM DocBook WHERE %ID %FIND search_index(contentInd,'SearchString', 'SearchOption') 
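The weighting described above can be sketched in Python using the textbook form of TF-IDF. This is an assumption for illustration: the exact formula used by %iFind.Rank.Analytic may differ, and the function tf_idf and the tiny corpus are hypothetical.

```python
import math


def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    """Textbook TF-IDF: term frequency in the document multiplied by the
    logarithm of the inverse share of documents that contain the term."""
    if not document:
        return 0.0
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0
    tf = document.count(term) / len(document)
    idf = math.log(len(corpus) / containing)
    return tf * idf


# A hypothetical corpus of three tokenized articles.
corpus = [
    ["ifind", "index", "ifind"],
    ["iknow", "domain"],
    ["index", "domain"],
]
article = corpus[0]
# "ifind" occurs twice and only in this article, so it outweighs "index",
# which occurs once here and also appears in another article.
print(tf_idf("ifind", article, corpus) > tf_idf("index", article, corpus))
```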

  6. Integration with official documentation search


    After installation, the button “Search using iFind” is added to the official documentation search.


    image

    If the Search words field is filled in, clicking “Search using iFind” takes you to the search results page for the entered query.


    If the field is empty, you are taken to the start page of the new search.

Installation


  1. Download the Installer.xml file from the latest release on the releases page.
  2. Import the downloaded Installer.xml file into the %SYS namespace and compile it.
  3. In the terminal, in the %SYS namespace, enter the following command:

     do ##class(Docsearch.Installer).setup(.pVars) 

    Installation takes about 15-30 minutes because the domain has to be built.

After that, the search is available at localhost:[port]/csp/docsearch/index.html

Demo


An online demo of the search is available here.

Conclusion


This project demonstrates some interesting and useful features of the iFind and iKnow technologies that make search results more relevant.
Criticism, comments, and suggestions are welcome.
All the source code, together with the installer and installation instructions, is available on GitHub.

Source: https://habr.com/ru/post/333582/

