Clojure is a Lisp dialect, so it is not at all surprising that working with lists plays a key role in the language. However, unlike traditional dialects (CL or Scheme), instead of classical two-slot cons lists Clojure uses the sequence abstraction - a "logical list". In essence it is an interface that provides methods for working with immutable, lazy, and possibly infinite sequences. This article describes the internal structure of these entities.
The essence of sequences
Let's start with the Java interfaces. All data structures in Clojure implement clojure.lang.Seqable:

public interface Seqable {
    ISeq seq();
}
In its turn, clojure.lang.ISeq is defined as:

public interface ISeq extends IPersistentCollection {
    Object first();
    ISeq next();
    ISeq more();
    ISeq cons(Object o);
}
Here one can draw an analogy with Iterable: ISeq, accordingly, is a rough analogue of Iterator. Their main difference is that ISeq objects are immutable: their methods return a new instance (possibly even of a different type) instead of changing the internal state of the object. At the same time, all sequences in Clojure are collections - they implement the parent interface clojure.lang.IPersistentCollection, and through it Seqable:

public interface IPersistentCollection extends Seqable {
    int count();
    IPersistentCollection cons(Object o);
    IPersistentCollection empty();
    boolean equiv(Object o);
}
Here we see count, which counts the number of elements; for sequences its complexity is obviously O(n). The empty method returns an empty collection of the same type; for sequences it returns (). Finally, the cons method adds a new element.
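For illustration, here is how these operations look from the Clojure side (a small REPL sketch; the count and empty functions are backed by the methods above):

(count (seq [1 2 3])) ;=> 3, linear time for a generic sequence
(empty (seq [1 2 3])) ;=> ()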
To obtain ISeq objects, the language provides the seq function. If the collection is empty, nil is returned; if we have a Seqable object, its seq() method is called; otherwise a special wrapper (adapter) is created. Wrappers are provided for arrays, iterators, Java collections (including Map) and CharSequence objects (strings). Adding your own type to this list won't work - it is hard-wired in the Java code (even protocols won't help). Naturally, you should not modify arrays (or Java collections) once a sequence has been created over them; a ConcurrentModificationException will simply propagate upwards.
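As a quick sketch of what seq returns for different sources (results from a recent Clojure REPL; the exact printed form may vary):

(seq [1 2 3])                      ;=> (1 2 3)
(seq [])                           ;=> nil
(seq "abc")                        ;=> (\a \b \c)
(seq {:a 1})                       ;=> ([:a 1])
(seq (java.util.ArrayList. [1 2])) ;=> (1 2)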
Clojure has only three basic functions for working with sequences:
- first - take the first element of the sequence, or nil if it is empty;
- rest - get the "tail", i.e. the sequence without its first element;
- cons - create a new immutable cons cell (see the short example after this list).
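A short sketch of the three primitives in action (note that they happily accept any collection, not just lists):

(first [1 2 3])  ;=> 1
(rest [1 2 3])   ;=> (2 3)
(cons 0 [1 2 3]) ;=> (0 1 2 3)
(first [])       ;=> nil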
For adding new elements, the conj function is usually used instead of cons. The difference between them is that the first always creates a new cell of a singly linked list, while the result of the second depends on the concrete collection type - the element is added in the most appropriate place. For cons cells that is the beginning of the list, for vectors the end, and for sets the order is not defined at all. It is important to note that once we have applied seq to a collection, we lose the information about its type:

(conj [1 2 3] 4)       ;=> [1 2 3 4]
(conj (seq [1 2 3]) 4) ;=> '(4 1 2 3)
Unfortunately, the names of the methods in the Java interfaces and the names of the Clojure functions do not always coincide: the conj function corresponds to the cons method, and rest to the more method. On the other hand, you only need to know the method names when writing your own collections, and that can hardly be called an everyday task.
All standard functions call seq automatically. For example, if we write (cons 1 [2 3]), the created cons cell will actually refer not to the vector [2 3] itself, but to the result of (seq [2 3]). The same trick is performed by the other functions (map, filter, etc.) - they all work with sequences, even though they accept collections as parameters.
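This is easy to observe with class (a sketch; the concrete class names are implementation details and may differ between Clojure versions):

(class (cons 1 [2 3]))        ;=> clojure.lang.Cons
(class (rest (cons 1 [2 3]))) ;=> clojure.lang.PersistentVector$ChunkedSeq ; the vector's seq, not the vector itself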
Actually, only lists and cons cells implement ISeq directly:

(seq? '(1 2))         ;=> true
(seq? ())             ;=> true
(seq? (cons 1 nil))   ;=> true
(seq? (cons 1 [2 3])) ;=> true
(seq? [1 2])          ;=> false
(seq? {:a :b})        ;=> false
(seq? #{1 2})         ;=> false
(seq? nil)            ;=> false
Therefore, when iterating over a list, no temporary objects are created. But when we walk over a vector, one temporary ISeq instance is created per element. Actually, since version 1.1 things are a little different, but more on that below.
There is also a little-known interface, clojure.lang.IndexedSeq, implemented by the sequences over arrays and strings. It defines the index() method, which returns the position of the current element (i.e. where the head of the sequence is within the original collection).

(.index (rest (to-array [1 2 3 4])))        ;=> 1
(.index (rest (rest (to-array [1 2 3 4])))) ;=> 2
In practice this interface is not used anywhere (an undocumented feature), and it is genuinely hard to think of a use for it. By the way, clojure.lang.APersistentVector$Seq objects implement it as well. The trouble is that seq on vectors now returns instances of clojure.lang.PersistentVector$ChunkedSeq, which no longer offer this dubious capability.
Lazy sequences
An important feature of Clojure sequences is their potential laziness. This means that at a given moment not the whole sequence has been built - its "tail" has not been created yet. Clojure often uses this when working with I/O streams.
Implementing laziness is quite simple. There is a class clojure.lang.LazySeq (which, of course, implements ISeq). Objects of this class hold a reference to a function (an IFn instance) which, when called, must return the actual sequence (most likely also lazy, though not necessarily). On the first call to any LazySeq method, the generating function is invoked (under synchronization), its result is cached, and the reference to the function itself is cleared (so that the garbage collector can reclaim it). Obviously, there is no way to get an element of a lazy sequence without computing all the previous ones; if you need exactly that kind of behavior, you can use the delay macro. For example, fetching the fifth element below forces (and prints) every element up to it:
(nth (map #(print % "") (iterate inc 0)) 5)
Generating lazy sequences in your own code is ridiculously easy: it is enough to wrap the code that calls cons in the lazy-seq macro. In fact, even this is rarely necessary - most standard functions (with very few exceptions) return lazy sequences, doing all the dirty work for us. There are, however, a few small nuances.
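For instance, here is a minimal sketch of a hand-written lazy sequence (the nats function is just an illustration, not part of the standard library):

(defn nats
  ([] (nats 0))
  ([n] (lazy-seq (cons n (nats (inc n))))))

(take 5 (nats)) ;=> (0 1 2 3 4)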
As it is computed, a lazy sequence effectively turns into an ordinary singly linked list, and if you hold on to a reference to its first element, you can run into memory problems. A classic example:

(def all-odds (filter odd? (range)))
(println (nth all-odds 20000000))
Here we are trying to find the 20,000,000th odd number. We do find it, but all the intermediate ISeq instances remain in memory. The correct solution:

(defn make-all-odds [] (filter odd? (range)))
(println (nth (make-all-odds) 20000000))
The rest function never returns nil; instead it may return an empty sequence. If we want to get nil, we should use next.

(first '(1 2 3)) ;=> 1
(rest '(1 2 3))  ;=> '(2 3)
(rest '(1))      ;=> ()
(rest ())        ;=> ()
(next '(1))      ;=> nil
(next ())        ;=> nil
This is done for the sake of greater laziness. Indeed, if the sequence is lazy, then in general we do not know whether it has any more elements. To check this, next has to compute at least the first two elements: the first is discarded, and the presence of the second tells us whether to return nil. Computing two elements is clearly not always what we want.
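The difference in eagerness is easy to see with a sequence that prints its elements as they are realized (a sketch; trace-seq is a made-up helper for this illustration):

(defn trace-seq [n]
  (lazy-seq
    (when (pos? n)
      (print n "")
      (cons n (trace-seq (dec n))))))

(do (rest (trace-seq 3)) nil) ; prints: 3
(do (next (trace-seq 3)) nil) ; prints: 3 2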
If lazy behavior is not required, the entire sequence can be forced with the dorun and doall functions. The first returns nil, the second returns the (now fully realized) sequence itself.
(dorun (map #(println % "") (range 5)))
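And doall, by contrast, hands the realized sequence back (a trivial sketch):

(doall (map inc (range 5))) ;=> (1 2 3 4 5)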
Block sequences (chunked sequences)
Lazy sequences are a very powerful tool, but in practice we often process a collection as a whole, and lazy-seq introduces noticeable overhead. Therefore, version 1.1 added an important optimization - chunked sequences. The idea is simple: when processing a lazy sequence we deal not with individual elements but with groups of them, and a certain number of elements is computed eagerly, "ahead of time".
Guess what this code prints?

(first (map #(print % "") (range 100)))

Answer:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
So, Clojure defines a special interface, clojure.lang.IChunkedSeq. Describing the Java side of this mechanism in detail does not make much sense, so I will omit it. The interface is implemented only by the sequences over vectors and, as we have just seen, by the result of the range function.
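A quick way to see which sequences are chunked (a sketch; note that a LazySeq wrapper reports false even when its realized contents are chunked):

(chunked-seq? (seq [1 2 3]))     ;=> true   (vector seqs are chunked)
(chunked-seq? (seq '(1 2 3)))    ;=> false  (plain lists are not)
(chunked-seq? (map inc [1 2 3])) ;=> false  (a LazySeq, even over a chunked source)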
The algorithm for processing such sequences:
- check whether the source is an IChunkedSeq (the chunked-seq? function);
- if not, run the normal version of the algorithm (element-wise processing);
- if yes, take the first chunk of the sequence (the chunk-first function);
- look at its size (count);
- create a new chunk buffer where the generated elements will be placed (the chunk-buffer function);
- do the actual useful work, adding elements to the buffer (the chunk-append function);
- wrap the new chunk in a ChunkedCons (the chunk-cons function).
For example, let's write our own implementation of #(filter odd? %). The simple version:

(defn filter-odd-1 [a]
  (lazy-seq
    (when-let [s (seq a)]
      (let [f (first s)
            r (rest s)]
        (if (odd? f)
          (cons f (filter-odd-1 r))
          (filter-odd-1 r))))))
The extended version (with chunk support):
(defn filter-odd-2 [a]
  (lazy-seq
    (when-let [s (seq a)]
      (if (chunked-seq? s)
        (let [f (chunk-first s)
              c (count f)
              b (chunk-buffer c)]
          (dotimes [i c]
            (let [x (nth f i)]
              (when (odd? x)
                (chunk-append b x))))
          (chunk-cons (chunk b) (filter-odd-2 (chunk-rest s))))
        (let [f (first s)
              r (rest s)]
          (if (odd? f)
            (cons f (filter-odd-2 r))
            (filter-odd-2 r)))))))
Of course, the code has become somewhat longer, and the duplication is plain to see: we had to write two separate implementations. But this is the price of speed. Fortunately, most of the standard functions already support chunks in full. Note that a sequence may consist of chunks of different lengths (chunks are never glued together, but are sometimes trimmed, as in the example above) and may even be a mixture of plain elements and chunks.
What if we want to turn an ordinary sequence into a chunked one? Simply convert it into a vector with the vec function. The reverse is a little more complicated:

(defn seq1 [s]
  (lazy-seq
    (when-let [x (seq s)]
      (cons (first x) (seq1 (rest x))))))

(first (map println (seq1 (range 1000))))
An alternative solution:

(defn seq1 [^clojure.lang.ISeq s]
  (reify clojure.lang.ISeq
    (first [_] (.first s))
    (more [_] (seq1 (.more s)))
    (next [_] (let [sn (.next s)] (and sn (seq1 sn))))
    (seq [_] (let [ss (.seq s)] (and ss (seq1 ss))))
    (count [_] (.count s))
    (cons [_ o] (.cons s o))
    (empty [_] (.empty s))
    (equiv [_ o] (.equiv s o))))
As one might expect, the reduce function takes advantage of chunked sequences: each chunk has a special method that performs the reduction in a plain Java loop without creating temporary objects. In general, reduce has other optimizations as well; anyone interested can look into clojure.core.protocols.
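For the curious, that chunk-level reduce can even be poked at directly via interop (a rough sketch; ArrayChunk and its reduce method are internal details that may change between versions, and the call may trigger a reflection warning):

;; Reduce the first chunk of a small vector's seq in a Java loop.
(.reduce (chunk-first (seq (vec (range 10)))) + 0) ;=> 45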
Instead of a conclusion
Clojure sequences are powerful and convenient. Their undoubted advantage is that much of the time you don't even have to think about such "trifles" as laziness or chunked processing - Clojure tries to do that for us.