📜 ⬆️ ⬇️

Sorting large amounts of data, implementation in Java

Recently on Habré there was article Sorting of a million 32-bit int'ov in 2 megabytes of memory on the Python . Interested, tried to implement in Java.

Of course, 32 lines failed. It turned out 235.
But it seemed to me that the result could well be used as the first post - do not judge strictly;)


At first, we got a trial version, which worked with arrays of integers, stored them in a binary file. After it earned (by the way, 2.3 seconds on p4-3.2), it was thought that a hard peg to architecture and types is not the right way. After all, it was not without reason that collections made in Java that work with any types of data.
')
After spending 2 hours, I got a completely working class, digesting any Comparable-types.

What it consists of:


Test - class to test performance. With unit tests, I didn't bother.

FileSort - in fact, the main worker, cook and carpenter.

FileSortStorageObject - an assistant that saves a list of objects in a temporary file.

FileSortStorage - assistant interface. In fact, serialization (ObjectOutputStream and ObjectInputStream) slows down the algorithm a little, so anyone can implement a different save method.

How it works:


Since there is a lot of data, there is no sense to represent them as a full array. It seemed to me that you can work with them with the help of an iterator.

1. With the help of an iterator, the part of the data that is stored in the memory is read (the number of records is specified).
2. The read block is sorted by the same quicksort (its memory overhead is not large).
3. Next, the sorted block is added to its temporary file, where it lies for the time being.
4. While there is data, repeat P1-3
5. Here we have N files, each of which contains a sorted portion of data.
6. An iterator is created that elementarily merges the existing files.

When merging, iterators are also used (what I do, I like them), which select the next record from each individual file. From the selected entries is selected minimum, which is returned to the outside. The next object from the corresponding file is immediately counted to the place of the entry issued to the outside.

So, the source. Suddenly someone wants to play


Main class:


package ru.dchekmarev.test.FileSort;

import java.io.*;
import java.util.*;

/**
*
*/

public class FileSort<T extends Comparable<T>> implements Iterable<T> {
private int bufferSize = 10000;
private List<FileSortStorage> partFiles = new LinkedList<FileSortStorage>();
private Iterator<T> source;
private List<T> part = new LinkedList<T>();
/**
* ,
*/

public FileSort() {
}
/**
* -
*/

public FileSort(Iterator<T> newSource) {
setSource(newSource);
}
/**
* -
*/

public FileSort(Iterator<T> newSource, Integer newSize) {
this(newSource);
setBufferSize(newSize);
}
/**
*
*/

public void setBufferSize( int newSize) {
bufferSize = newSize;
}
/**
* ,
*/

public void setSource(Iterator<T> newSource) {
source = newSource;
try {
sortParts();
} catch (IOException e) {
throw new RuntimeException(e);
}
Collections.sort(part);
}
/**
*
*/

public Iterator<T> iterator() {
if (partFiles.size() == 0) {
// ,
return part.iterator();
}
return new Iterator<T>() {
Long t = 0L;
List<T> items = new ArrayList<T>();
List<Iterator<T>> iterators = new ArrayList<Iterator<T>>();
Integer minIdx = null ;
// ,
// ,
{
iterators.add(part.iterator());
for (FileSortStorage f : partFiles) {
iterators.add(f.iterator());
}
for (Iterator<T> item : iterators) {
if (item.hasNext()) {
items.add(item.next());
} else {
throw new RuntimeException( "failed to get first for iterator" );
}
}
}
/**
* ,
*/

public boolean hasNext() {
if (minIdx == null ) {
for ( int i = 0; i < items.size(); i++) {
if (items.get(i) != null &&
(minIdx == null ||
items.get(i).compareTo(items.get(minIdx)) < 0)) {
minIdx = i;
}
}
}
return minIdx != null ;
}
/**
* - ,
*
*/

public T next() {
T res = null ;
if (hasNext()) {
res = items.get(minIdx);
if (iterators.get(minIdx).hasNext()) {
items.set(minIdx, iterators.get(minIdx).next());
} else {
items.set(minIdx, null );
}
}
minIdx = null ;
return res;
}
public void remove() {
throw new UnsupportedOperationException();
}
};
}
/**
*
*/

void sortParts() throws IOException {
while (source.hasNext()) {
part.add((T)source.next());
if (part.size() >= bufferSize && source.hasNext()) {
Collections.sort(part);
partFiles.add( new FileSortStorageObject(part));
part.clear();
}
}
}
}



Interface to save to disk:



package ru.dchekmarev.test.FileSort;

import java.util.List;
import java.io.IOException;

public interface FileSortStorage<T> extends Iterable<T> {
public void setObjects(List<T> objects) throws IOException;
}



The implementation of saving on disk serialized objects:



package ru.dchekmarev.test.FileSort;

import java.io.*;
import java.util.*;

/**
*
*/

public class FileSortStorageObject<T> implements FileSortStorage<T> {
private final File file;

/**
* ,
*/


public FileSortStorageObject(List<T> objects) throws IOException {
file = File.createTempFile( "FileSort" , "dat" );
setObjects(objects);
}
/**
*
*/

public void setObjects(List<T> objects) throws IOException {
ObjectOutputStream wr = new ObjectOutputStream( new FileOutputStream(file));
for (T item : objects) {
wr.writeObject(item);
}
wr.close();
}
/**
* -
*/


public Iterator<T> iterator() {
try {
return new Iterator<T>() {
private ObjectInputStream fr =
new ObjectInputStream( new FileInputStream(file));
T obj;
public boolean hasNext() {
if (obj == null ) {
try {
obj = (T)fr.readObject();
} catch (ClassNotFoundException e) {
throw new RuntimeException(e);
} catch (IOException e) {
obj = null ;
}
}
return obj != null ;
}
public T next() {
hasNext();
T res = obj;
obj = null ;
return res;
}
public void remove() {
throw new UnsupportedOperationException();
}
};
} catch (IOException e) {
throw new RuntimeException(e);
}
}
/**
*
*/


protected void finalize() {
file.delete();
}
}


And the test class, because you always want to see a clear result:



package ru.dchekmarev.test.FileSort;

import java.util.*;

public class Test {
public static void main(String[] args) {
System.out.println( "Test start" );
// -

FileSort<Double> sort = new FileSort<Double>(
// -
//
new Iterator<Double>() {
private int i = 0;
private Random rand = new Random();
public boolean hasNext() {
if (i >= 1000) {
System.out.println( "generator finish" );
}
return i < 1000;
}
public Double next() {
i++;
return rand.nextDouble();
}
public void remove() {
throw new UnsupportedOperationException();
}
});
int i = 0;
//

for (Double res : sort) {
if (++i % 10000 == 0) {
//
//
System.out.println(i);
}
System.out.println( " == " + res);
}
}
}



Code ... Yes, you could write literate.
Algorithm ... Yes, for sure it is not correct. But it works :)
my original

Source: https://habr.com/ru/post/65612/


All Articles