Clojure is a Lisp dialect, so it is not at all surprising that working with lists plays a key role in the language. However, unlike traditional dialects (CL or Scheme), instead of classical two-slot cons lists Clojure uses the sequence abstraction - a "logical list". In essence it is an interface that provides methods for working with immutable, lazy, and possibly infinite sequences. This article describes the internal structure of these entities.
The essence of sequences
Let's start with the Java interfaces. All data structures in Clojure implement clojure.lang.Seqable:

public interface Seqable {
    ISeq seq();
}
In its turn, clojure.lang.ISeq is defined as:

public interface ISeq extends IPersistentCollection {
    Object first();
    ISeq next();
    ISeq more();
    ISeq cons(Object o);
}
Here one can draw an analogy with Iterable: ISeq, accordingly, is a rough analogue of Iterator. Their main difference is that ISeq objects are immutable: their methods return a new instance (possibly even of a different type) instead of changing the internal state of the object. At the same time, all sequences in Clojure are collections - they implement the parent interface clojure.lang.IPersistentCollection, and through it Seqable:

public interface IPersistentCollection extends Seqable {
    int count();
    IPersistentCollection cons(Object o);
    IPersistentCollection empty();
    boolean equiv(Object o);
}
Here we see count, which counts the number of elements; for sequences its complexity is obviously O(n). The empty method returns an empty collection of the same type; for sequences it returns (). Finally, the cons method adds a new element.
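For illustration, here is how these operations look from the Clojure side (a small REPL sketch; the count and empty functions are backed by the methods above):

(count (seq [1 2 3])) ;=> 3, linear time for a generic sequence
(empty (seq [1 2 3])) ;=> ()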
To obtain ISeq objects, the language provides the seq function. If the collection is empty, nil is returned; if we have a Seqable object, its seq() method is called; otherwise a special wrapper (adapter) is created. Wrappers are provided for arrays, iterators, Java collections (including Map) and CharSequence objects (strings). Adding your own type to this list won't work - it is hard-wired in the Java code (even protocols won't help). Naturally, you should not modify arrays (or Java collections) once a sequence has been created over them; a ConcurrentModificationException will simply propagate upwards.
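As a quick sketch of what seq returns for different sources (results from a recent Clojure REPL; the exact printed form may vary):

(seq [1 2 3])                      ;=> (1 2 3)
(seq [])                           ;=> nil
(seq "abc")                        ;=> (\a \b \c)
(seq {:a 1})                       ;=> ([:a 1])
(seq (java.util.ArrayList. [1 2])) ;=> (1 2)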
Clojure has only three basic functions for working with sequences:
- first - take the first element of the sequence, or nil if it is empty;
- rest - get the "tail", i.e. the sequence without its first element;
- cons - create a new immutable cons cell (see the short example after this list).
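A short sketch of the three primitives in action (note that they happily accept any collection, not just lists):

(first [1 2 3])  ;=> 1
(rest [1 2 3])   ;=> (2 3)
(cons 0 [1 2 3]) ;=> (0 1 2 3)
(first [])       ;=> nil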
For adding new elements, the conj function is usually used instead of cons. The difference between them is that the first always creates a new cell of a singly linked list, while the result of the second depends on the concrete collection type - the element is added in the most appropriate place. For cons cells that is the beginning of the list, for vectors the end, and for sets the order is not defined at all. It is important to note that once we have applied seq to a collection, we lose the information about its type:

(conj [1 2 3] 4)       ;=> [1 2 3 4]
(conj (seq [1 2 3]) 4) ;=> '(4 1 2 3)
Unfortunately, the names of the methods in the Java interfaces and the names of the Clojure functions do not always coincide: the conj function corresponds to the cons method, and rest to the more method. On the other hand, you only need to know the method names when writing your own collections, and that can hardly be called an everyday task.
All standard functions call seq automatically. For example, if we write (cons 1 [2 3]), the created cons cell will actually refer not to the vector [2 3] itself, but to the result of (seq [2 3]). The same trick is performed by the other functions (map, filter, etc.) - they all work with sequences, even though they accept collections as parameters.
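This is easy to observe with class (a sketch; the concrete class names are implementation details and may differ between Clojure versions):

(class (cons 1 [2 3]))        ;=> clojure.lang.Cons
(class (rest (cons 1 [2 3]))) ;=> clojure.lang.PersistentVector$ChunkedSeq ; the vector's seq, not the vector itself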
Actually, only lists and cons cells implement ISeq directly:

(seq? '(1 2))         ;=> true
(seq? ())             ;=> true
(seq? (cons 1 nil))   ;=> true
(seq? (cons 1 [2 3])) ;=> true
(seq? [1 2])          ;=> false
(seq? {:a :b})        ;=> false
(seq? #{1 2})         ;=> false
(seq? nil)            ;=> false
Therefore, when iterating over a list, no temporary objects are created. But when we walk over a vector, one temporary ISeq instance is created per element. Actually, since version 1.1 things are a little different, but more on that below.
There is also a little-known interface, clojure.lang.IndexedSeq, implemented by the sequences over arrays and strings. It defines the index() method, which returns the position of the current element (i.e. where the head of the sequence is within the original collection).

(.index (rest (to-array [1 2 3 4])))        ;=> 1
(.index (rest (rest (to-array [1 2 3 4])))) ;=> 2
In practice this interface is not used anywhere (an undocumented feature), and it is genuinely hard to think of a use for it. By the way, clojure.lang.APersistentVector$Seq objects implement it as well. The trouble is that seq on vectors now returns instances of clojure.lang.PersistentVector$ChunkedSeq, which no longer offer this dubious capability.
Lazy sequences
An important feature of Clojure sequences is their potential laziness. This means that at a given moment not the whole sequence has been built - its "tail" has not been created yet. Clojure often uses this when working with I/O streams.
Implementing laziness is quite simple. There is a class clojure.lang.LazySeq (which, of course, implements ISeq). Objects of this class hold a reference to a function (an IFn instance) which, when called, must return the actual sequence (most likely also lazy, though not necessarily). On the first call to any LazySeq method, the generating function is invoked (under synchronization), its result is cached, and the reference to the function itself is cleared (so that the garbage collector can reclaim it). Obviously, there is no way to get an element of a lazy sequence without computing all the previous ones; if you need exactly that kind of behavior, you can use the delay macro. For example, fetching the fifth element below forces (and prints) every element up to it:
(nth (map #(print % "") (iterate inc 0)) 5)
Generating lazy sequences in your own code is ridiculously easy: it is enough to wrap the code that calls cons in the lazy-seq macro. In fact, even this is rarely necessary - most standard functions (with very few exceptions) return lazy sequences, doing all the dirty work for us. There are, however, a few small nuances.
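For instance, here is a minimal sketch of a hand-written lazy sequence (the nats function is just an illustration, not part of the standard library):

(defn nats
  ([] (nats 0))
  ([n] (lazy-seq (cons n (nats (inc n))))))

(take 5 (nats)) ;=> (0 1 2 3 4)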
As it is computed, a lazy sequence effectively turns into an ordinary singly linked list, and if you hold on to a reference to its first element, you can run into memory problems. A classic example:

(def all-odds (filter odd? (range)))
(println (nth all-odds 20000000))
Here we are trying to find the 20,000,000th odd number. We do find it, but all the intermediate ISeq instances remain in memory. The correct solution:

(defn make-all-odds [] (filter odd? (range)))
(println (nth (make-all-odds) 20000000))
The rest function never returns nil; instead it may return an empty sequence. If we want to get nil, we should use next.

(first '(1 2 3)) ;=> 1
(rest '(1 2 3))  ;=> '(2 3)
(rest '(1))      ;=> ()
(rest ())        ;=> ()
(next '(1))      ;=> nil
(next ())        ;=> nil
This is done for the sake of greater laziness. Indeed, if the sequence is lazy, then in general we do not know whether it has any more elements. To check this, next has to compute at least the first two elements: the first is discarded, and the presence of the second tells us whether to return nil. Computing two elements is clearly not always what we want.
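The difference in eagerness is easy to see with a sequence that prints its elements as they are realized (a sketch; trace-seq is a made-up helper for this illustration):

(defn trace-seq [n]
  (lazy-seq
    (when (pos? n)
      (print n "")
      (cons n (trace-seq (dec n))))))

(do (rest (trace-seq 3)) nil) ; prints: 3
(do (next (trace-seq 3)) nil) ; prints: 3 2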
If lazy behavior is not required, the entire sequence can be forced with the dorun and doall functions. The first returns nil, the second returns the (now fully realized) sequence itself.
(dorun (map #(println % "") (range 5)))
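And doall, by contrast, hands the realized sequence back (a trivial sketch):

(doall (map inc (range 5))) ;=> (1 2 3 4 5)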
Block sequences (chunked sequences)
Lazy sequences are a very powerful tool, but in practice we often process a collection as a whole, and lazy-seq introduces noticeable overhead. Therefore, version 1.1 added an important optimization - chunked sequences. The idea is simple: when processing a lazy sequence we deal not with individual elements but with groups of them, and a certain number of elements is computed eagerly, "ahead of time".
Guess what this code prints?

(first (map #(print % "") (range 100)))

Answer:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
So, Clojure defines a special interface, clojure.lang.IChunkedSeq. Describing the Java side of this mechanism in detail does not make much sense, so I will omit it. The interface is implemented only by the sequences over vectors and, as we have just seen, by the result of the range function.
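A quick way to see which sequences are chunked (a sketch; note that a LazySeq wrapper reports false even when its realized contents are chunked):

(chunked-seq? (seq [1 2 3]))     ;=> true   (vector seqs are chunked)
(chunked-seq? (seq '(1 2 3)))    ;=> false  (plain lists are not)
(chunked-seq? (map inc [1 2 3])) ;=> false  (a LazySeq, even over a chunked source)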
The algorithm for processing such sequences:
- check whether the source is an IChunkedSeq (the chunked-seq? function);
- if not, run the normal version of the algorithm (element-wise processing);
- if yes, take the first chunk of the sequence (the chunk-first function);
- look at its size (count);
- create a new chunk buffer where the generated elements will be placed (the chunk-buffer function);
- do the actual useful work, adding elements to the buffer (the chunk-append function);
- wrap the new chunk in a ChunkedCons (the chunk-cons function).
For example, let's write our own implementation of #(filter odd? %). The simple version:

(defn filter-odd-1 [a]
  (lazy-seq
    (when-let [s (seq a)]
      (let [f (first s)
            r (rest s)]
        (if (odd? f)
          (cons f (filter-odd-1 r))
          (filter-odd-1 r))))))
The extended version (with chunk support):
(defn filter-odd-2 [a]
  (lazy-seq
    (when-let [s (seq a)]
      (if (chunked-seq? s)
        (let [f (chunk-first s)
              c (count f)
              b (chunk-buffer c)]
          (dotimes [i c]
            (let [x (nth f i)]
              (when (odd? x)
                (chunk-append b x))))
          (chunk-cons (chunk b) (filter-odd-2 (chunk-rest s))))
        (let [f (first s)
              r (rest s)]
          (if (odd? f)
            (cons f (filter-odd-2 r))
            (filter-odd-2 r)))))))
Of course, the code has become somewhat longer, and the duplication is plain to see: we had to write two separate implementations. But this is the price of speed. Fortunately, most of the standard functions already support chunks in full. Note that a sequence may consist of chunks of different lengths (chunks are never glued together, but are sometimes trimmed, as in the example above) and may even be a mixture of plain elements and chunks.
What if we want to turn an ordinary sequence into a chunked one? Simply convert it into a vector with the vec function. The reverse is a little more complicated:

(defn seq1 [s]
  (lazy-seq
    (when-let [x (seq s)]
      (cons (first x) (seq1 (rest x))))))

(first (map println (seq1 (range 1000))))
An alternative solution:

(defn seq1 [^clojure.lang.ISeq s]
  (reify clojure.lang.ISeq
    (first [_] (.first s))
    (more [_] (seq1 (.more s)))
    (next [_] (let [sn (.next s)] (and sn (seq1 sn))))
    (seq [_] (let [ss (.seq s)] (and ss (seq1 ss))))
    (count [_] (.count s))
    (cons [_ o] (.cons s o))
    (empty [_] (.empty s))
    (equiv [_ o] (.equiv s o))))
As one might expect, the reduce function takes advantage of chunked sequences: each chunk has a special method that performs the reduction in a plain Java loop without creating temporary objects. In general, reduce has other optimizations as well; anyone interested can look into clojure.core.protocols.
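For the curious, that chunk-level reduce can even be poked at directly via interop (a rough sketch; ArrayChunk and its reduce method are internal details that may change between versions, and the call may trigger a reflection warning):

;; Reduce the first chunk of a small vector's seq in a Java loop.
(.reduce (chunk-first (seq (vec (range 10)))) + 0) ;=> 45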
Instead of a conclusion
Clojure sequences are powerful and convenient. Their undoubted advantage is that much of the time you don't even have to think about such "trifles" as laziness or chunked processing - Clojure tries to do that for us.