I bring to your attention the translation of a wonderful fresh article by Justin Le. In his blog in Code, this author speaks in a fairly easy way about the mathematical essence of beautiful and elegant functional solutions for practical problems. This article examines in detail an example of how the transfer of the mathematical structure, which the data in the domain forms to the program type system, can immediately, as Gerald and Sassman wrote "automagically", lead to a working solution.
The code shown in the picture is a complete, self-contained, extensible implementation of the regular expression parser, written from scratch. The highest class, the real magic of types!
Today we implement applicative regular expressions and parsers (in the spirit of the regex-applicative library) using free algebraic structures! Free structures are one of my favorite tools in Haskell, and I already wrote earlier about free groups , variations on the theme of free monads, and about the "free" applicative functor on monoids .
Regular expressions (and parsers for them) are ubiquitous in programming and computer science, so I hope, with my demonstration of how simple they are implemented using free structures, I will help the reader to appreciate the merits of this approach without fear of getting bogged down in unnecessary details.
All the code in this article is available online in the form of “stack executable”. If you start it ( ./regexp.hs
), the GHCi session will be launched with all the definitions, so you will have the opportunity to play with the functions and their types.
This article will be quite understandable to the “advanced novice” or “novice specialist” in Haskell. It requires knowledge of the basic concepts of a language: pattern matching, algebraic data types, and abstractions such as monoids, functors, and do-notations.
A regular expression is a way to define some regular language. Formally, such an expression is built from three basic elements:
And also from three operations:
RS
, sequence of expressions. The product of sets (Cartesian).R|S
, choice between expressions. Union of sets.R*
, the expression is repeated an arbitrary number of times (including zero).And this is all that makes up regular expressions, no more, no less. From these basic components, you can construct all other known operations on regular expressions — for example, a+
can be expressed as aa*
, and categories like \w
can be represented as an alternative to suitable characters.
The given minimal definition of a regular language is sufficiently complete for a mathematician, but impractical. For example, the operation of negation or addition (“any character except the specified one”) can be written within the framework of the basic definition, but its direct application will lead to an exponential growth of the resources used.
When looking at the structure of regular expressions, does it seem familiar? It reminds me very much of the Alternative
type class. If the functor belongs to this class, then this means that for it is defined:
pure x
is a single element (from the Applicative
class).<*>
operation organizing sequential calculations.<|>
, organizing alternative calculations.many
function is an operation of repeating the calculations zero or more times.All this is very similar to the construction of a regular language, isn't it? Perhaps the alternative functor is almost what we need, the only thing missing is the primitive corresponding to the literal symbol.
Anyone not familiar with the Alternative
class can find a good introduction to Typeclassopedia . But within our article, this class is simply a "double monoid" with two ways of combining <*>
and <|>
, which, in a sense, can be compared with the operations *
and +
for numbers. In general, for the definition of an alternative functor, the five points listed above and some additional laws of commutativity and distributivity are sufficient.
To be precise, the author got a little excited with the "double monoid". The class Alternative
extends the applicative functor, which is (under certain restrictions) a semigroup, to a semiring, where the addition operation <|>
with the neutral element empty
plays the role of a commutative monoid. Application operator
(<*>) :: Applicative f => f (a -> b) -> fa -> fb
cannot act as an analogue of the multiplication operation in a semiring, because it does not even form magma . However, along with the <*>
operator, the "one-sided" *>
and <*
operators are defined in the Control.Applicative
package. Each of them ignores the result of the work of the operand, which does not show the "corner":
(<*) :: Applicative f => fa -> fb -> fa (*>) :: Applicative f => fa -> fb -> fb
If the types a
and b
coincide, then with these operations we get a semigroup (associativity follows from the properties of the composition). It can be verified that for an alternative functor, multiplication is distributive with respect to addition, both on the right and on the left, and, moreover, the neutral element for addition (analogue of zero) is an absorbing element for the multiplication operation.
Semi-rings also form numbers, sets, matrices of semirings, algebraic types, and ... regular expressions, so, in fact, we are talking about the same algebraic structure.
Thus, we can consider regular expressions as an alternative functor, plus a primitive for a literal character. But there is another way to look at them, and it leads us straight to free structures. Instead of "an alternative functor with literals", we can turn the literal into an instance of the class Alternative
.
Let's write this. Type for primitive literal:
data Prim a = Prim Char a deriving Functor
Note that since we work with functors (applicative, alternative), then a certain “result” will be associated with all our regular expressions. This is due to the fact that when defining an instance for the classes Functor
, Applicative
and Alternative
we must have type-parameter.
On the one hand, you can ignore this type, but on the other hand, you should use this value as a result of matching with a regular expression, as is done in industrial applications that work with regular games.
In our case, Prim 'a' 1 :: Prim Int
will represent the primitive that is mapped to the character 'a'
, and is immediately interpreted, resulting in a one.
Well, now ... let's give our primitive the necessary mathematical structure using the free alternative functor from the free
library:
import Control.Alternative.Free type RegExp = Alt Prim
That's all! This is our full-fledged type for regular expressions! By declaring the Alt
type as an instance of the Functor
class, we obtained all the operations from the Applicative
and Alternative
classes, since in this case there are instances of Applicative (Alt f)
and Alternative (Alt f)
. Now we have:
empty
from class Alternative
pure
from the Applicative
class.Prim
Base Prim
<*>
from the Applicative
class<|>
from class Alternative
many
from class Alternative
And we got all this completely "free", that is, "for free"!
Essentially, a free structure automatically provides us with only an abstraction for the base type and nothing more. But regular expressions, by themselves, also represent only a structure: the basic elements and a set of operations, no more, no less, so a free alternative functor provides us with exactly what we need. Not more, but not less.
After adding some handy wrapper functions ... the work on the type is complete!
-- | charAs: , charAs :: Char -> a -> RegExp a charAs cx = liftAlt (Prim cx) -- liftAlt :: fa -> Alt fa -- Prim RegExp -- | char: char :: Char -> RegExp Char char c = charAs cc -- | string: string :: String -> RegExp String string = traverse char -- , ?
Well, what, try? Let's construct the expression (a|b)(cd)*e
, returning, in case of successful matching, the unit type ()
:
testRegExp_ :: RegExp () testRegExp_ = void $ (char 'a' <|> char 'b') *> many (string "cd") *> char 'e'
The function void :: Functor f => fa -> f ()
from the Data.Functor
package discards the result, we use it, because here we are only interested in the success of the comparison. But the operators <|>
, *>
and many
are used by us exactly as it is assumed during concatenation or selection of one of the options.
Here is an interesting example more difficult, let's define the same regular expression, but now, as a result of matching, we count the number of repetitions of the substring cd
.
testRegExp :: RegExp Int testRegExp = (char 'a' <|> char 'b') *> (length <$> many (string "cd")) <* char 'e'
There is a subtlety in the work of the operators *>
and <*
: arrows show the result that should be kept. And since many (string "cd") :: RegExp [String]
returns a list of duplicate elements, we can, staying inside the functor, calculate the length of this list, getting the number of repetitions.
Moreover, the GHC compiler -XApplicativeDo
allows -XApplicativeDo
to write our expression using do-notation, which is probably easier to understand:
testRegExpDo :: RegExp Int testRegExpDo = do char 'a' <|> char 'b' cds <- many (string "cd") char 'e' pure (length cds)
It's all in something like how we “capture” the result of parsing a string using a regular expression, accessing it. Here is an example in Ruby:
irb> /(a|b)((cd)*)e/.match("acdcdcdcde")[2] => "cdcdcdcd"
with the only difference that we added some post-processing to calculate the number of repetitions.
Here is another convenient regular \d
, corresponding to a digit from 0 to 9 and returning a number:
digit :: RegExp Int digit = asum [ charAs (intToDigit i) i | i <- [0..9] ]
Here, the asum
function from the Control.Applicative.Alternative
package represents a choice from the elements of the asum [x,y,z] = x <|> y <|> z
intToDigit
, and the intToDigit
function intToDigit
defined in the Data.Char
package. And, again, we can create quite elegant things, for example, the expression \[\d\]
corresponding to a number in square brackets:
bracketDigit :: RegExp Int bracketDigit = char '[' *> digit <* char ']'
Well, well, all we did was describe the data type for the literal with concatenation, choices and repetitions. Fine! But what we really need is a matching string with a regular expression, right? How will the free alternative functor help us in this? In fact, significantly help. Let's look at two ways to do it!
What is "freedom"?
The canonical way to use a free structure is to wrap it into a specific structure using a suitable algebra. For example, the foldMap
transformation turns a free monoid (list) into the value of any instance of the Monoid
class:
foldMap :: Monoid m => (a -> m) -> ([a] -> m)
The foldMap
function turns the transformation a -> m
into the transformation [a] -> m
(or, FreeMonoid a -> m
), with a specific monoid m
. The general idea is that using a free structure allows you to postpone its specific use “for later”, separating the creation time and the time it takes to use the structure.
For example, we can construct a free monoid of numbers:
-- | "" `Int` `Int`, `liftAlt`. liftFM :: Int -> [Int] liftFM x = [x] myMon :: [Int] myMon = liftFM 1 <> liftFM 2 <> liftFM 3 <> liftFM 4
And now we can decide how we want to interpret the <>
operation:
Maybe this addition?
ghci> foldMap Sum myMon Sum 10 -- 1 + 2 + 3 + 4
Or multiplication?
ghci> foldMap Product myMon Product 24 -- 1 * 2 * 3 * 4
Or maybe the calculation of the maximum number?
ghci> foldMap Max myMon Max 4 -- 1 `max` 2 `max` 3 `max` 4
The idea is to postpone the choice of a particular monoid, first constructing a free collection of numbers 1, 2, 3, and 4. A free monoid on numbers defines such a structure over them as needed, no more, no less. To use foldMap
we specify "how to perceive the base type" by passing the <>
operator to a particular monoid.
Interpretation in the State
functor
In practice, obtaining a result from a free structure consists in finding (or creating) a suitable functor that will provide us with the desired behavior. In our case, we are lucky, there is a concrete implementation of the class Alternative
, which works exactly as we need: StateT String Maybe
.
The product <*>
for this functor consists in organizing a sequence of state changes. In our case, under the state, we will consider the remainder of the line being parsed, so that the consistent parsing is the best possible match for the operation <*>
.
And its sum <|>
works as backtracking, search with return to alternative in case of failure. It saves the state since the last successful parsing and returns to it when an alternative is unsuccessful. This is exactly the behavior that we expect from the expression R|S
Finally, the natural transformation for a free alternative functor is called runAlt
:
runAlt :: Alternative f => (forall b. pb -> fb) -> Alt pa -> fa
Or, for the RegExp type:
runAlt :: Alternative f => (forall b. Prim b -> fb) -> RegExp a -> fa
If you are not familiar with RankN
types (with the forall b.
Construction), then you can see a good introduction here . The point here is that you need to provide a runAlt
function that works with Prim b
for absolutely any b
, and not for any particular type, like Int
and Bool
, for example. That is, as with working with foldMap
we only need to specify what to do with the base type. In our case, answer the question: "What should be done with the Prim
type?"
processPrim :: Prim a -> StateT String Maybe a processPrim (Prim cx) = do d:ds <- get guard (c == d) put ds pure x
This is the interpretation of Prim
as an action in the context of a StateT String Maybe
, where the state is an unassembled string. Recall that Prim
contains information about the matching character c
and its interpretation in the form of some value x
. Prim
processing consists of the following steps:
get
state (the part of the line that was not parsed yet), and right there we pop out its first character and its remainder. If the string is empty, it will return with an alternative. ( The StateT
transformer acts inside the Maybe functor and if it is impossible to match the sample on the right side of the <-
operator inside the do block, the calculation will end with the result empty
, that is, Nothing
. Note. Lane ).empty
, and we proceed to an alternative option.Prim
should return.This function can already be used to match RegEx with a string prefix. To do this, you need to run calculations using runAlt
and runStateT
, passing the last string to the function to be parsed as an argument:
matchPrefix :: RegExp a -> String -> Maybe a matchPrefix re = evalStateT (runAlt processPrim re)
That's all! Let's see how our first solution works:
ghci> matchPrefix testRegexp_ "acdcdcde" Just () ghci> matchPrefix testRegexp_ "acdcdcdx" Nothing ghci> matchPrefix testRegexp "acdcdcde" Just 3 ghci> matchPrefix testRegexp "bcdcdcdcdcdcdcde" Just 7 ghci> matchPrefix digit "9" Just 9 ghci> matchPrefix bracketDigit "[2]" Just 2 ghci> matchPrefix (many bracketDigit) "[2][3][4][5]" Just [2,3,4,5] ghci> matchPrefix (sum <$> many bracketDigit) "[2][3][4][5]" Just 14
Wait, what was that?
It seems that everything happened a little faster than you expected. A minute ago we wrote our primitive, and then again! and a working parser is ready. Here, in fact, all the key code, a few lines in Haskell:
import Control.Monad.Trans.State (evalStateT, put, get) import Control.Alternative.Free (runAlt, Alt) import Control.Applicative import Control.Monad (guard) data Prim a = Prim Char a deriving Functor type RegExp = Alt Prim matchPrefix :: RegExp a -> String -> Maybe a matchPrefix re = evalStateT (runAlt processPrim re) where processPrim (Prim cx) = do d:ds <- get guard (c == d) put ds pure x
And do we have a full-featured regular expression parser? What happened?
Recall that at a high level of abstraction, Alt Prim
already contains pure
, empty
, Prim
, <*>
, <|>
, and many
in its structure (with this operator there is one unpleasant subtlety, but more about it later). What essentially does runAlt
- uses the behavior of a particular alternative functor (in our case, StateT String Maybe
) to control the behavior of the operators pure
, empty
, <*>
, <|>
, and many
. However, StateT
does not have a built-in operator, similar to Prim
, and for this we need to write a processPrim
.
So, for the Prim
type, the runAlt
function uses the runAlt
, and for pure
, empty
, <*>
, <|>
, and many
, the appropriate instance of the Alternative
class is used. Thus, the StateT
functor performs 83% of the work for us, and the remaining 17% is the processPrim
. In truth, this is somewhat disappointing. One may ask: why was it at all to start with the Alt
wrapper? Why not immediately determine the type RegExp = StateT String Maybe
and the appropriate primitive char :: Char -> StateT String Maybe Char
? If everything is done in the StateT StateT
, then why bother with Alt
, a free alternative functor?
The main advantage of Alt
over StateT
is that StateT
is ... quite a powerful tool. But in fact, it is powerful, to the point of absurdity. With it, you can represent a huge number of various calculations and structures, and, which is unpleasant, it is easy to imagine something that is not a regular expression. Let's say something elementary like put "hello" :: StateT String Maybe ()
does not match any regular regular expression, but it is of the same type as RegExp ()
. Thus, while we say that Alt Prim
matches a regular expression, not more, but not less, we cannot say the same about StateT String Maybe
. The Alt Prim
type is the perfect representation of a regular expression. Everything that can be expressed with its help is a regular expression, but something that is not such an expression with its help will not work. Here, however, there are also some unpleasant subtleties associated with Haskell's laziness, more on this later.
Here we can view StateT
only as a context used for one
Regular expression interpretations - in the form of a parser. But you can imagine other ways to use the type RegExp
. For example, we may need its textual representation; this is what StateT
will not allow.
We cannot say that StateT String Maybe
is a regular expression, only that this functor can represent a parser based on regular grammars. But about Alt Prim
we can say for sure that this is a regular expression ( as mathematicians say, they are equal to within isomorphism, approx. Lane ).
All this, of course, is very good, but what if we don’t want to shift 83% of the work to code for a type that was written by someone for us. Is it possible to use the free Alt
structure directly to write a parser? This question is similar to this: how to write a function that processes lists (by matching constructors (:)
and []
) instead of using only foldMap
? How to directly operate this structure instead of delegating work to a specific monoid?
Glad you asked! Let's take a look at the definition of a free alternative functor:
newtype Alt fa = Alt { alternatives :: [AltF fa] } data AltF fa = forall r. Ap (fr) (Alt f (r -> a)) | Pure a
This is an unusual type, defined through mutual recursion, so that it can look very confusing. One of the ways to understand it is to imagine that Alt xs
contains a chain of alternatives formed by the operator <|>
. And each such alternative is represented by the type AltF
, which is a sequence of functors f
, formed by the operator <*>
(as a sequence of nested functions).
It can be considered AltF fa
as a simply linked list [fr]
, with different r
for each element. Ap
corresponds to the constructor (:)
that contains fr
, and Pure
to the end of the list []
. The construction forall r.
here denotes a quantifier of existence from an extension, -XExistentialQuantification
and it is this one that allows connecting various types into a chain.
In the end, Alt f
like a list of alternatives, each of which represents a chain of applications. Or you can look at it as a normalized form of sequential (or nested) operations <*>
and <|>
, just as the type [a]
is a normalized form of sequential operations <>
.
, :
<>
: [a,b,c,d] = [a]<>[b]<>[c]<>[d]
+
, — , *
: a*(b+c)+d*(x+y+z)*h
<|>
, — , <*>
: fa <*> (fb <|> fc) <|> fd <*> (fx <|> fy <|> fz) <*> fh
, RegExp a -> String -> Maybe a
, , . : .
, Alt
. , , .
matchAlts :: RegExp a -> String -> Maybe a matchAlts (Alt res) xs = asum [ matchChain re xs | re <- res ]
asum :: [Maybe a] -> Maybe a
, Just
. ( , Maybe a
Alternative
— Nothing
, <|>
. . . )
. AltF
, Ap
Pure
:
matchChain :: AltF Prim a -> String -> Maybe a matchChain (Ap (Prim cx) next) cs = _ matchChain (Pure x) cs = _
" ": GHC "", , , . ( Haskell "" (holes) , _
, . . . ) :
matchChain :: AltF Prim a -> String -> Maybe a matchChain (Ap (Prim cx) next) cs = case cs of [] -> Nothing d:ds | c == d -> matchAlts (($ x) <$> next) ds | otherwise -> Nothing matchChain (Pure x) _ = Just x
Ap
( (:)
), , - . , . Prim r
, , next :: RegExp (r -> a)
. , next
. , "" , Nothing
. , Pure x
( []
), , , .
, , . , , " " Ap
, Pure
, AltF
.., .
:
ghci> matchAlts testRegexp_ "acdcdcde" Just () ghci> matchAlts testRegexp_ "acdcdcdx" Nothing ghci> matchAlts testRegexp "acdcdcde" Just 3 ghci> matchAlts testRegexp "bcdcdcdcdcdcdcde" Just 7 ghci> matchAlts digit "9" Just 9 ghci> matchAlts bracketDigit "[2]" Just 2 ghci> matchAlts (many bracketDigit) "[2][3][4][5]" Just [2,3,4,5] ghci> matchAlts (sum <$> many bracketDigit) "[2][3][4][5]" Just 14
. , () , . , , .
foldMap
. , , foldMap, , , , ! , — , — (:)
[]
.
, , : , , , (:)
, []
. , . , [1,2,3] <> [4]
, [1] <> [2,3] <> [4]
. , , .
. , :
data RegExp a = Empty | Pure a | Prim Char a | forall r. Seq (RegExp r) (RegExp (r -> a)) | Union (RegExp a) (RegExp a) | Many (RegExp a)
RegExp
, . . , RegExp
:
-- | a|(b|c) abc1 :: RegExp Int abc1 = Prim 'a' 1 `Union` (Prim 'b' 2 `Union` Prim 'c' 3) -- | (a|b)|c abc2 :: RegExp Int abc2 = (Prim 'a' 1 `Union` Prim 'b' 2) `Union` Prim 'c' 3
, .
Alt Prim
, , , . , matchAlts
, . (a|b)|c
a|(b|c)
. Alt
. , , .
, , (a|b)|c
, (a|b)|c
, , , RegExp
. Alt
, .
, , Alt
, Alt Prim
. , Alt Prim
a|a
a
. , Alt f
f
. , : . , , , .
, . , , . RegExp
, , — .
, Haskell. , - [a]
. ( , - , "" , - "" . approx. . )
: a|aa|aaa|aaaa|aaaaa|aaaaaa|...
, ( , , , . . . ). , Haskell . , Alt
many
. , a*
a|aa|aaa|aaaa|aaaaa|aaaaaa|...
, . - many (char 'a')
, . Haskell Alt
, .
, , , , (), . , many
.
, ! "" Alt
, Control.Alternative.Free.Final
, many
(, , ).
, , runAlt
. , Alternative
, many
( RE
regex-applicative ) . , Haskell , , , many
.
, . ( ), ( , , . . ). matchPrefix
, , , . , , , , . , GHC .
, , tails
( ) mapMaybe
( ). , , listToMaybe
.
matches :: RegExp a -> String -> [a] matches re = mapMaybe (matchPrefix re) . tails firstMatch :: RegExp a -> String -> Maybe a firstMatch re = listToMaybe . matches re
, , matchPrefix
Nothing
, listToMaybe
, Nothing
( , . . . ).
, . , , — , . , . , , , .
Alt Prim
, : , , , .
? . :
data Prim a = Only Char a -- | Letter a -- | Digit (Int -> a) -- , | Wildcard (Char -> a) -- , | Satisfy (Char -> Maybe a) -- , --
, . .
, , . runAlt
Alt
.
(). , , , , . |
. ( , . . . ). , - . , MonadPlus
- , , . , .
, , . , !
Source: https://habr.com/ru/post/448644/
All Articles