I recently ran into an interesting problem to solve for my side project: how can I efficiently select the top k elements from a very large list in Java / Groovy?
There are many recommendations about how to do that, but I didn’t find a comparison of the suggested implementations. This is a crucial problem for my application so I tried out a number of them and here are the results. I chose Groovy for it’s syntactical sugar with collections and closures, but you can do the same in plain Java. The code is on github, you can run it with a simple mvn compile groovy:execute if you want to play around yourself.
Task: select the top 5 elements from a list of 10 million
Plain sort: 6,500ms
Sorting the whole list and picking the top elements is okay for small lists, but it doesn’t scale for larger lists as the time for sorting a list grows with O(nlogn). For a list with 10 million numbers, it’s completely out of the competition.
topElements = unorderedList.sort()[numberCount-1..numberCount-k]
(QuickSelect: 2,500ms)
Also called Hoares Selection Algorithm, was mentioned in a few articles, so wanted to give it a try. However I only found implementations for educational purposes like this one. It’s probably not fair to compare this demo implementation with the highly optimized other ones that follow, so I put the result in brackets as I don’t want to discredit the algorithm.
[source too long to quote here, see article on brilliantsheep.com]
(Heap Select: 2,200ms)
Again, this is just a demonstration on how to implement a heap select based on Steve Hanov’s great article. I’ve ported his Python implementation to Groovy, which was easy enough. I’m sure there is a more efficient way to do a Heap Select – the old rule holds true: don’t try and implement it yourself if someone else brighter has done it already for you. Anyway, this was my naive attempt:
def heapSelect(List list, k) {
def heap = new PriorityQueue(k)
list.each{ item ->
if (heap.size() < k || item > heap.peek()) {
if (heap.size() == k)
heap.remove(heap.peek())
heap.offer(item)
}
}
return heap as List
}
PriorityQueue: 300ms
Ships with the JRE, so there’s no need for external libraries. It let’s you add a list of elements to a priority heap and then poll the top element from the heap one by one. Simple and fast!
heap = new PriorityQueue(unorderedList.size())
heap.addAll(unorderedList)
topElements = (1..k).collect{heap.poll()}
Guava Ordering: 170ms
Google Guava (formerly Google Collections) comes with an Ordering class that works even faster than the PriorityQueue for our task. Under the hood it seems to use QuickSort to sort only parts of a collection, however I haven’t dug too deep in their implementation.
An obstacle on using this could be that your data has to be in a structure that implements Iterable, e.g. an ArrayList. If you have a plain int[] and need to convert it first, you might be better off to go for a PriorityQueue in the first place. On the other hand, maybe an ArrayIterator can do this very efficiently.. haven’t tried it out though.
com.google.common.collect.Ordering.natural()
.greatestOf(arrayList, k)
By the way…
I often saw the suggestion to add the elements into a TreeMap which orders them for you. While this works fine it will never be the most efficient solution because it sorts all data in the map, whereas we’re only interested in the highest k items.
Please bear in mind that I put all this together at home on my own, so if you find any mistakes please leave a comment or drop me a mail and I’m happy to update this entry – and give you all the credit, of course ;)
Again: the code is on github – all it takes is mvn compile groovy:execute to run it on your machine.
14 Responses to “Selecting top k items from a list efficiently in Java / Groovy”
Sorry, the comment form is closed at this time.
Very useful! Thanks for sharing! One controversy though, is anyone brighter than yourself? ;)
For top x elements just pass once through the list/array and always keep the top x elements you need. You don’t need to sort anything.
That’s exactly what Heap Select does.
Heap sort works in different tway. After all, it sorts all the data. As stated above, you don’t need to sort all the set to select k max items. and you can’t outperform iteration without preparing data somehow.
While a ‘Heap Sort’ sorts all the data, the above mentioned ‘Heap Select’ does not. It iterates over the list once and keeps the top k elements on a Heap, just as the commenter ‘name’ suggested. The article I linked above (Steve Hanov’s blog) explains it very well.
Funny how you went full-circle.
Your implementation of Hoares Selection is the slowest of all, but in the end that same algorithm is the fastest! (Guava uses the partial quick-sort to find the k smallest/biggest items.)
In itself quicksort is a quite simple algorithm with a tight loop, it would be interesting if you’d do a follow-up post explaining what “mistake” made your implementation so slow! (Maybe you didn’t use the in-place variation?)
Also, did you do several runs or just one? Some of those algorithms may have very different best-, worst- and average performance so several runs nay be recommended. If your quick-sort is based upon a random pivot (not necessarily but it might) then the performance may even change between two runs with the same input data!
Thanks for pointing out that Guava uses quick-sort aswell – I thought it did but wasn’t sure. I realised that it’s not fair to compare a demo implementation with a fully optimised version in Guava – so I pointed that out in the article (hopefully explicit enough).
The implementation I tested isn’t mine – I found it on the site linked in the article. Would be interesting to see how one could optimise it, indeed!
The execution times mentioned in the article are rounded averages – I’ve run the script 50 times. That’s actually an important note because the algorithms have indeed quite different execution times based on the distribution of the random numbers.
Thanks for your comment!
Selecting top elements efficiently without sorting first. Sounds promising, I have to dig into this.
By the way…
The PriorityQueue example is flawed. It doesn’t select the k top elements. It selects k times the top element … and the Guava example (which is even faster) isn’t fully visible.
Thanks shogg, I’ve put a line break into the Guava example to make it fully visible.
I don’t see why PriorityQueue example is flawed – selecting k times the top element does the job and I don’t see how you can use the PriorityQueue differently to perform the job more efficiently – as in: return the top k in one shot (which would obviously be more efficient).
What shogg means is the code should read:
(1..k).collect{heap.poll()}PriorityQueue#peek returns the head element without removing it
Oh, very true. Thanks for that shogg and thursten, will correct that now.
package ro.octavian;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
public class FastSelect {
private static final List inputList = new ArrayList();
static {
Random gen = new Random();
for (int i = 0; i < 10000000; i++) {
inputList.add(gen.nextInt());
}
}
public static void main(String[] args) {
Integer[] inputArray = inputList.toArray(new Integer[inputList.size()]);
FastSelect fastSelect = new FastSelect();
long before = System.currentTimeMillis();
Integer[] outputArray = fastSelect.getTopElements(inputArray, 5);
long after = System.currentTimeMillis();
for (Integer outputElement : outputArray) {
System.out.println(outputElement);
}
System.out.println((after – before) + " ms");
}
public Integer[] getTopElements(Integer[] inputArray, int outputSize) {
Integer[] outputArray = new Integer[outputSize];
final int inputSize = inputArray.length;
for (int ii = 0; ii < inputSize; ii++) {
for (int oi = 0; oi < outputSize; oi++) {
if (outputArray[oi] == null) {
outputArray[oi] = inputArray[ii];
break;
} else if (outputArray[oi].compareTo(inputArray[ii]) oi; boi–) {
outputArray[boi] = outputArray[boi - 1];
}
outputArray[oi] = inputArray[ii];
break;
}
}
}
return outputArray;
}
}
Sorry, it cuts the code between lt gt, trying again:
public Integer[] getTopElements(Integer[] inputArray, int outputSize) {
Integer[] outputArray = new Integer[outputSize];
final int inputSize = inputArray.length;
for (int ii = 0; ii < inputSize; ii++) {
for (int oi = 0; oi < outputSize; oi++) {
if (outputArray[oi] == null) {
outputArray[oi] = inputArray[ii];
break;
} else if (outputArray[oi].compareTo(inputArray[ii]) oi; boi--) {
outputArray[boi] = outputArray[boi - 1];
}
outputArray[oi] = inputArray[ii];
break;
}
}
}
return outputArray;
}
Sorry mate but this doesn’t even compile. The parenthesis count isn’t right (too many closing parenthesis), you use variables that aren’t declared (‘boi’ ?), one if statement is just weird:
‘if (outputArray[oi].compareTo(inputArray[ii]) oi; boi–)’.