📜 ⬆️ ⬇️

Applicative regular expressions as a free alternative functor


I bring to your attention the translation of a wonderful fresh article by Justin Le. In his blog in Code, this author speaks in a fairly easy way about the mathematical essence of beautiful and elegant functional solutions for practical problems. This article examines in detail an example of how the transfer of the mathematical structure, which the data in the domain forms to the program type system, can immediately, as Gerald and Sassman wrote "automagically", lead to a working solution.


The code shown in the picture is a complete, self-contained, extensible implementation of the regular expression parser, written from scratch. The highest class, the real magic of types!


Today we implement applicative regular expressions and parsers (in the spirit of the regex-applicative library) using free algebraic structures! Free structures are one of my favorite tools in Haskell, and I already wrote earlier about free groups , variations on the theme of free monads, and about the "free" applicative functor on monoids .


Regular expressions (and parsers for them) are ubiquitous in programming and computer science, so I hope, with my demonstration of how simple they are implemented using free structures, I will help the reader to appreciate the merits of this approach without fear of getting bogged down in unnecessary details.


All the code in this article is available online in the form of “stack executable”. If you start it ( ./regexp.hs ), the GHCi session will be launched with all the definitions, so you will have the opportunity to play with the functions and their types.


This article will be quite understandable to the “advanced novice” or “novice specialist” in Haskell. It requires knowledge of the basic concepts of a language: pattern matching, algebraic data types, and abstractions such as monoids, functors, and do-notations.


Regular languages


A regular expression is a way to define some regular language. Formally, such an expression is built from three basic elements:


  1. An empty set is an element that does not match anything.
  2. An empty string is a neutral element that is trivially matched with an empty string.
  3. A literal is a symbol associated with itself. A lot of one element.

And also from three operations:


  1. Concatenation: RS , sequence of expressions. The product of sets (Cartesian).
  2. Alternative: R|S , choice between expressions. Union of sets.
  3. Star Kleene: R* , the expression is repeated an arbitrary number of times (including zero).

And this is all that makes up regular expressions, no more, no less. From these basic components, you can construct all other known operations on regular expressions — for example, a+ can be expressed as aa* , and categories like \w can be represented as an alternative to suitable characters.


Translator's Note

The given minimal definition of a regular language is sufficiently complete for a mathematician, but impractical. For example, the operation of negation or addition (“any character except the specified one”) can be written within the framework of the basic definition, but its direct application will lead to an exponential growth of the resources used.


Alternative functor


When looking at the structure of regular expressions, does it seem familiar? It reminds me very much of the Alternative type class. If the functor belongs to this class, then this means that for it is defined:


  1. Empty empty element, corresponding to failure, or an error in the calculations.
  2. pure x is a single element (from the Applicative class).
  3. The <*> operation organizing sequential calculations.
  4. Operation <|> , organizing alternative calculations.
  5. The many function is an operation of repeating the calculations zero or more times.

All this is very similar to the construction of a regular language, isn't it? Perhaps the alternative functor is almost what we need, the only thing missing is the primitive corresponding to the literal symbol.


Anyone not familiar with the Alternative class can find a good introduction to Typeclassopedia . But within our article, this class is simply a "double monoid" with two ways of combining <*> and <|> , which, in a sense, can be compared with the operations * and + for numbers. In general, for the definition of an alternative functor, the five points listed above and some additional laws of commutativity and distributivity are sufficient.


Translator's Note (Nerds)

To be precise, the author got a little excited with the "double monoid". The class Alternative extends the applicative functor, which is (under certain restrictions) a semigroup, to a semiring, where the addition operation <|> with the neutral element empty plays the role of a commutative monoid. Application operator


 (<*>) :: Applicative f => f (a -> b) -> fa -> fb 

cannot act as an analogue of the multiplication operation in a semiring, because it does not even form magma . However, along with the <*> operator, the "one-sided" *> and <* operators are defined in the Control.Applicative package. Each of them ignores the result of the work of the operand, which does not show the "corner":


 (<*) :: Applicative f => fa -> fb -> fa (*>) :: Applicative f => fa -> fb -> fb 

If the types a and b coincide, then with these operations we get a semigroup (associativity follows from the properties of the composition). It can be verified that for an alternative functor, multiplication is distributive with respect to addition, both on the right and on the left, and, moreover, the neutral element for addition (analogue of zero) is an absorbing element for the multiplication operation.


Semi-rings also form numbers, sets, matrices of semirings, algebraic types, and ... regular expressions, so, in fact, we are talking about the same algebraic structure.


Thus, we can consider regular expressions as an alternative functor, plus a primitive for a literal character. But there is another way to look at them, and it leads us straight to free structures. Instead of "an alternative functor with literals", we can turn the literal into an instance of the class Alternative .


freedom


Let's write this. Type for primitive literal:


 data Prim a = Prim Char a deriving Functor 

Note that since we work with functors (applicative, alternative), then a certain “result” will be associated with all our regular expressions. This is due to the fact that when defining an instance for the classes Functor , Applicative and Alternative we must have type-parameter.


On the one hand, you can ignore this type, but on the other hand, you should use this value as a result of matching with a regular expression, as is done in industrial applications that work with regular games.


In our case, Prim 'a' 1 :: Prim Int will represent the primitive that is mapped to the character 'a' , and is immediately interpreted, resulting in a one.


Well, now ... let's give our primitive the necessary mathematical structure using the free alternative functor from the free library:


 import Control.Alternative.Free type RegExp = Alt Prim 

That's all! This is our full-fledged type for regular expressions! By declaring the Alt type as an instance of the Functor class, we obtained all the operations from the Applicative and Alternative classes, since in this case there are instances of Applicative (Alt f) and Alternative (Alt f) . Now we have:



And we got all this completely "free", that is, "for free"!


Essentially, a free structure automatically provides us with only an abstraction for the base type and nothing more. But regular expressions, by themselves, also represent only a structure: the basic elements and a set of operations, no more, no less, so a free alternative functor provides us with exactly what we need. Not more, but not less.


After adding some handy wrapper functions ... the work on the type is complete!


 -- | charAs:   ,    charAs :: Char -> a -> RegExp a charAs cx = liftAlt (Prim cx) -- liftAlt :: fa -> Alt fa   --   Prim   RegExp -- | char:         char :: Char -> RegExp Char char c = charAs cc -- | string:         string :: String -> RegExp String string = traverse char -- , ? 

Examples


Well, what, try? Let's construct the expression (a|b)(cd)*e , returning, in case of successful matching, the unit type () :


 testRegExp_ :: RegExp () testRegExp_ = void $ (char 'a' <|> char 'b') *> many (string "cd") *> char 'e' 

The function void :: Functor f => fa -> f () from the Data.Functor package discards the result, we use it, because here we are only interested in the success of the comparison. But the operators <|> , *> and many are used by us exactly as it is assumed during concatenation or selection of one of the options.


Here is an interesting example more difficult, let's define the same regular expression, but now, as a result of matching, we count the number of repetitions of the substring cd .


 testRegExp :: RegExp Int testRegExp = (char 'a' <|> char 'b') *> (length <$> many (string "cd")) <* char 'e' 

There is a subtlety in the work of the operators *> and <* : arrows show the result that should be kept. And since many (string "cd") :: RegExp [String] returns a list of duplicate elements, we can, staying inside the functor, calculate the length of this list, getting the number of repetitions.


Moreover, the GHC compiler -XApplicativeDo allows -XApplicativeDo to write our expression using do-notation, which is probably easier to understand:


 testRegExpDo :: RegExp Int testRegExpDo = do char 'a' <|> char 'b' cds <- many (string "cd") char 'e' pure (length cds) 

It's all in something like how we “capture” the result of parsing a string using a regular expression, accessing it. Here is an example in Ruby:


 irb> /(a|b)((cd)*)e/.match("acdcdcdcde")[2] => "cdcdcdcd" 

with the only difference that we added some post-processing to calculate the number of repetitions.


Here is another convenient regular \d , corresponding to a digit from 0 to 9 and returning a number:


 digit :: RegExp Int digit = asum [ charAs (intToDigit i) i | i <- [0..9] ] 

Here, the asum function from the Control.Applicative.Alternative package represents a choice from the elements of the asum [x,y,z] = x <|> y <|> z intToDigit , and the intToDigit function intToDigit defined in the Data.Char package. And, again, we can create quite elegant things, for example, the expression \[\d\] corresponding to a number in square brackets:


 bracketDigit :: RegExp Int bracketDigit = char '[' *> digit <* char ']' 

Parsing


Well, well, all we did was describe the data type for the literal with concatenation, choices and repetitions. Fine! But what we really need is a matching string with a regular expression, right? How will the free alternative functor help us in this? In fact, significantly help. Let's look at two ways to do it!


Unload an alternative functor


What is "freedom"?


The canonical way to use a free structure is to wrap it into a specific structure using a suitable algebra. For example, the foldMap transformation turns a free monoid (list) into the value of any instance of the Monoid class:


 foldMap :: Monoid m => (a -> m) -> ([a] -> m) 

The foldMap function turns the transformation a -> m into the transformation [a] -> m (or, FreeMonoid a -> m ), with a specific monoid m . The general idea is that using a free structure allows you to postpone its specific use “for later”, separating the creation time and the time it takes to use the structure.


For example, we can construct a free monoid of numbers:


 -- |  "" `Int`     `Int`,  `liftAlt`. liftFM :: Int -> [Int] liftFM x = [x] myMon :: [Int] myMon = liftFM 1 <> liftFM 2 <> liftFM 3 <> liftFM 4 

And now we can decide how we want to interpret the <> operation:
Maybe this addition?


 ghci> foldMap Sum myMon Sum 10 -- 1 + 2 + 3 + 4 

Or multiplication?


 ghci> foldMap Product myMon Product 24 -- 1 * 2 * 3 * 4 

Or maybe the calculation of the maximum number?


 ghci> foldMap Max myMon Max 4 -- 1 `max` 2 `max` 3 `max` 4 

The idea is to postpone the choice of a particular monoid, first constructing a free collection of numbers 1, 2, 3, and 4. A free monoid on numbers defines such a structure over them as needed, no more, no less. To use foldMap we specify "how to perceive the base type" by passing the <> operator to a particular monoid.


Interpretation in the State functor


In practice, obtaining a result from a free structure consists in finding (or creating) a suitable functor that will provide us with the desired behavior. In our case, we are lucky, there is a concrete implementation of the class Alternative , which works exactly as we need: StateT String Maybe .


The product <*> for this functor consists in organizing a sequence of state changes. In our case, under the state, we will consider the remainder of the line being parsed, so that the consistent parsing is the best possible match for the operation <*> .


And its sum <|> works as backtracking, search with return to alternative in case of failure. It saves the state since the last successful parsing and returns to it when an alternative is unsuccessful. This is exactly the behavior that we expect from the expression R|S


Finally, the natural transformation for a free alternative functor is called runAlt :


 runAlt :: Alternative f => (forall b. pb -> fb) -> Alt pa -> fa 

Or, for the RegExp type:


 runAlt :: Alternative f => (forall b. Prim b -> fb) -> RegExp a -> fa 

If you are not familiar with RankN types (with the forall b. Construction), then you can see a good introduction here . The point here is that you need to provide a runAlt function that works with Prim b for absolutely any b , and not for any particular type, like Int and Bool , for example. That is, as with working with foldMap we only need to specify what to do with the base type. In our case, answer the question: "What should be done with the Prim type?"


 processPrim :: Prim a -> StateT String Maybe a processPrim (Prim cx) = do d:ds <- get guard (c == d) put ds pure x 

This is the interpretation of Prim as an action in the context of a StateT String Maybe , where the state is an unassembled string. Recall that Prim contains information about the matching character c and its interpretation in the form of some value x . Prim processing consists of the following steps:



This function can already be used to match RegEx with a string prefix. To do this, you need to run calculations using runAlt and runStateT , passing the last string to the function to be parsed as an argument:


 matchPrefix :: RegExp a -> String -> Maybe a matchPrefix re = evalStateT (runAlt processPrim re) 

That's all! Let's see how our first solution works:


 ghci> matchPrefix testRegexp_ "acdcdcde" Just () ghci> matchPrefix testRegexp_ "acdcdcdx" Nothing ghci> matchPrefix testRegexp "acdcdcde" Just 3 ghci> matchPrefix testRegexp "bcdcdcdcdcdcdcde" Just 7 ghci> matchPrefix digit "9" Just 9 ghci> matchPrefix bracketDigit "[2]" Just 2 ghci> matchPrefix (many bracketDigit) "[2][3][4][5]" Just [2,3,4,5] ghci> matchPrefix (sum <$> many bracketDigit) "[2][3][4][5]" Just 14 

Wait, what was that?


It seems that everything happened a little faster than you expected. A minute ago we wrote our primitive, and then again! and a working parser is ready. Here, in fact, all the key code, a few lines in Haskell:


 import Control.Monad.Trans.State (evalStateT, put, get) import Control.Alternative.Free (runAlt, Alt) import Control.Applicative import Control.Monad (guard) data Prim a = Prim Char a deriving Functor type RegExp = Alt Prim matchPrefix :: RegExp a -> String -> Maybe a matchPrefix re = evalStateT (runAlt processPrim re) where processPrim (Prim cx) = do d:ds <- get guard (c == d) put ds pure x 

And do we have a full-featured regular expression parser? What happened?


Recall that at a high level of abstraction, Alt Prim already contains pure , empty , Prim , <*> , <|> , and many in its structure (with this operator there is one unpleasant subtlety, but more about it later). What essentially does runAlt - uses the behavior of a particular alternative functor (in our case, StateT String Maybe ) to control the behavior of the operators pure , empty , <*> , <|> , and many . However, StateT does not have a built-in operator, similar to Prim , and for this we need to write a processPrim .


So, for the Prim type, the runAlt function uses the runAlt , and for pure , empty , <*> , <|> , and many , the appropriate instance of the Alternative class is used. Thus, the StateT functor performs 83% of the work for us, and the remaining 17% is the processPrim . In truth, this is somewhat disappointing. One may ask: why was it at all to start with the Alt wrapper? Why not immediately determine the type RegExp = StateT String Maybe and the appropriate primitive char :: Char -> StateT String Maybe Char ? If everything is done in the StateT StateT , then why bother with Alt , a free alternative functor?


The main advantage of Alt over StateT is that StateT is ... quite a powerful tool. But in fact, it is powerful, to the point of absurdity. With it, you can represent a huge number of various calculations and structures, and, which is unpleasant, it is easy to imagine something that is not a regular expression. Let's say something elementary like put "hello" :: StateT String Maybe () does not match any regular regular expression, but it is of the same type as RegExp () . Thus, while we say that Alt Prim matches a regular expression, not more, but not less, we cannot say the same about StateT String Maybe . The Alt Prim type is the perfect representation of a regular expression. Everything that can be expressed with its help is a regular expression, but something that is not such an expression with its help will not work. Here, however, there are also some unpleasant subtleties associated with Haskell's laziness, more on this later.


Here we can view StateT only as a context used for one
Regular expression interpretations - in the form of a parser. But you can imagine other ways to use the type RegExp . For example, we may need its textual representation; this is what StateT will not allow.


We cannot say that StateT String Maybe is a regular expression, only that this functor can represent a parser based on regular grammars. But about Alt Prim we can say for sure that this is a regular expression ( as mathematicians say, they are equal to within isomorphism, approx. Lane ).


Direct use of free structure


All this, of course, is very good, but what if we don’t want to shift 83% of the work to code for a type that was written by someone for us. Is it possible to use the free Alt structure directly to write a parser? This question is similar to this: how to write a function that processes lists (by matching constructors (:) and [] ) instead of using only foldMap ? How to directly operate this structure instead of delegating work to a specific monoid?


Glad you asked! Let's take a look at the definition of a free alternative functor:


 newtype Alt fa = Alt { alternatives :: [AltF fa] } data AltF fa = forall r. Ap (fr) (Alt f (r -> a)) | Pure a 

This is an unusual type, defined through mutual recursion, so that it can look very confusing. One of the ways to understand it is to imagine that Alt xs contains a chain of alternatives formed by the operator <|> . And each such alternative is represented by the type AltF , which is a sequence of functors f , formed by the operator <*> (as a sequence of nested functions).


It can be considered AltF faas a simply linked list [fr], with different rfor each element. Apcorresponds to the constructor (:)that contains fr, and Pureto the end of the list []. The construction forall r.here denotes a quantifier of existence from an extension, -XExistentialQuantificationand it is this one that allows connecting various types into a chain.


In the end, Alt flike a list of alternatives, each of which represents a chain of applications. Or you can look at it as a normalized form of sequential (or nested) operations <*>and <|>, just as the type [a]is a normalized form of sequential operations <>.


Translator's Note

, :


  • () — , <> :
     [a,b,c,d] = [a]<>[b]<>[c]<>[d] 
  • () — , + , — , * :
     a*(b+c)+d*(x+y+z)*h 
  • (Alt f) — , <|> , — , <*> :
     fa <*> (fb <|> fc) <|> fd <*> (fx <|> fy <|> fz) <*> fh 

, RegExp a -> String -> Maybe a , , . : .


, Alt . , , .


 matchAlts :: RegExp a -> String -> Maybe a matchAlts (Alt res) xs = asum [ matchChain re xs | re <- res ] 

asum :: [Maybe a] -> Maybe a , Just . ( , Maybe a Alternative — Nothing , <|> . . . )


. AltF , Ap Pure :


 matchChain :: AltF Prim a -> String -> Maybe a matchChain (Ap (Prim cx) next) cs = _ matchChain (Pure x) cs = _ 

" ": GHC "", , , . ( Haskell "" (holes) , _ , . . . ) :


 matchChain :: AltF Prim a -> String -> Maybe a matchChain (Ap (Prim cx) next) cs = case cs of [] -> Nothing d:ds | c == d -> matchAlts (($ x) <$> next) ds | otherwise -> Nothing matchChain (Pure x) _ = Just x 

Ap ( (:) ), , - . , . Prim r , , next :: RegExp (r -> a) . , next . , "" , Nothing . , Pure x ( [] ), , , .


, , . , , " " Ap , Pure , AltF .., .


:


 ghci> matchAlts testRegexp_ "acdcdcde" Just () ghci> matchAlts testRegexp_ "acdcdcdx" Nothing ghci> matchAlts testRegexp "acdcdcde" Just 3 ghci> matchAlts testRegexp "bcdcdcdcdcdcdcde" Just 7 ghci> matchAlts digit "9" Just 9 ghci> matchAlts bracketDigit "[2]" Just 2 ghci> matchAlts (many bracketDigit) "[2][3][4][5]" Just [2,3,4,5] ghci> matchAlts (sum <$> many bracketDigit) "[2][3][4][5]" Just 14 

Translator's Note

. , () , . , , .


?


foldMap . , , foldMap, , , , ! , — , — (:) [] .


, , : , , , (:) , [] . , . , [1,2,3] <> [4] , [1] <> [2,3] <> [4] . , , .


. , :


 data RegExp a = Empty | Pure a | Prim Char a | forall r. Seq (RegExp r) (RegExp (r -> a)) | Union (RegExp a) (RegExp a) | Many (RegExp a) 

RegExp , . . , RegExp :


 -- | a|(b|c) abc1 :: RegExp Int abc1 = Prim 'a' 1 `Union` (Prim 'b' 2 `Union` Prim 'c' 3) -- | (a|b)|c abc2 :: RegExp Int abc2 = (Prim 'a' 1 `Union` Prim 'b' 2) `Union` Prim 'c' 3 

, .


Alt Prim , , , . , matchAlts , . (a|b)|c a|(b|c) . Alt . , , .


, , (a|b)|c , (a|b)|c , , , RegExp . Alt , .


, , Alt , Alt Prim . , Alt Prim a|a a . , Alt f f . , : . , , , .



, . , , . RegExp , , — .


, Haskell. , - [a] . ( , - , "" ⊥, - "" . approx. . )


: a|aa|aaa|aaaa|aaaaa|aaaaaa|... , ( , , , . . . ). , Haskell . , Alt many . , a* a|aa|aaa|aaaa|aaaaa|aaaaaa|... , . - many (char 'a') , . Haskell Alt , .


, , , , (), . , many .


, ! "" Alt , Control.Alternative.Free.Final , many (, , ).


, , runAlt . , Alternative , many ( RE regex-applicative ) . , Haskell , , , many .


, . ( ), ( , , . . ). matchPrefix , , , . , , , , . , GHC .



, , tails ( ) mapMaybe ( ). , , listToMaybe .


 matches :: RegExp a -> String -> [a] matches re = mapMaybe (matchPrefix re) . tails firstMatch :: RegExp a -> String -> Maybe a firstMatch re = listToMaybe . matches re 

, , matchPrefix Nothing , listToMaybe , Nothing ( , . . . ).


, . , , — , . , . , , , .


Alt Prim , : , , , .


? . :


 data Prim a = Only Char a --    | Letter a --     | Digit (Int -> a) --    , | Wildcard (Char -> a) --    , | Satisfy (Char -> Maybe a) --    , --   

, . .


, , . runAlt Alt .


(). , , , , . | . ( , . . . ). , - . , MonadPlus - , , . , .


, , . , !


')

Source: https://habr.com/ru/post/448644/


All Articles