Science Fair Project Encyclopedia
In computer science, a selection algorithm is an algorithm for finding the kth smallest (or kth largest) number in a list. This includes the simple common cases of finding the minimum, maximum, and median elements.
One simple and widely used selection algorithm is to use a sort algorithm on the list, and then extract the kth element. This is particularly useful when we wish to make many different selections from a single list, in which case only one initial, expensive sort is needed, followed by many cheap extraction operations. When we need only to perform one selection, however, or when we need to strongly mutate the list between selections, this method can be costly, typically requiring at least O(n log n) time, where n is the length of the list.
Linear minimum/maximum algorithms
Worst-case linear algorithms to find minimums or maximums are obvious; we keep two variables, one referring to the index of the minimum/maximum element seen so far, and one holding its value. As we scan through the list, we update these whenever we encounter a more extreme element:
function minimum(a[1..n]) minIndex := 1 minValue := a for i from 2 to n if a[i] < minValue minIndex := i minValue := a[i] return minValue
function maximum(a[1..n]) maxIndex := 1 maxValue := a for i from 2 to n if a[i] > maxValue maxIndex := i maxValue := a[i] return maxValue
Note that there may be multiple minimum or maximum elements. Because the comparisons above are strict, these algorithms finds the minimum element with minimum index. By using non-strict comparisons (<= and >=) instead, we would find the minimum element with maximum index; this approach also has some small performance benefits.
Using the same idea, we can construct a simple, but inefficient general algorithm for finding the kth smallest or kth largest item in a list, requiring O(kn) time, which is effective when k is small. To accomplish this, we simply find the most extreme value and move it to the beginning until we reach our desired index. This can be seen as an incomplete selection sort. Here is the minimum-based algorithm:
function select(a[1..n], k) for i from 1 to k minIndex = i minValue = a[i] for j = i+1 to n if a[j] < minValue minIndex = j minValue = a[j] swap a[i] and a[minIndex] return a[k]
Other advantages of this method are:
- After locating the jth smallest element, it requires only O(j + (k-j)2) time to find the kth smallest element, or only O(k) for k ≤ j.
- It can be done with linked list data structures, whereas the one based on partition requires random access.
Linear general selection algorithms
Finding a worst-case linear algorithm for the general case of selecting the kth largest element is a much more difficult problem, but one does exist, and was published by Blum, Floyd, Pratt, Rivest, and Tarjan in their 1973 paper Time bounds for selection. It uses concepts based on those used in the quicksort sort algorithm, along with an innovation of its own.
In quicksort, there is a subprocedure called partition which can, in linear time, group the list into two parts, those less than a certain value, and those greater than or equal to a certain value. Here is pseudocode which performs a partition about the value
function partition(a, left, right, pivotIndex) pivotValue := a[pivotIndex] swap a[pivotIndex] and a[right] // Move pivot to end storeIndex := left for i from left to right-1 if a[i] ≤ pivotValue swap a[storeIndex] to a[i] storeIndex := storeIndex + 1 swap a[right] and a[storeIndex] // Move pivot to its final place return storeIndex
In quicksort, we recursively sort both branches, leading to best-case Ω(n log n) time. However, when doing selection, we already know which partition our desired element lies in, since the pivot is in its final sorted position, with all those preceding it in sorted order preceding it and all those following it in sorted order following it. Thus a single recursive call locates the desired element in the correct partition:
function select(a, k, left, right) select a pivot value a[pivotIndex] pivotNewIndex := partition(a, left, right, pivotIndex) if k = pivotNewIndex return a[k] else if k < pivotNewIndex return select(a, k, left, pivotNewIndex-1) else return select(a, k, pivotNewIndex+1, right)
Note the resemblance to quicksort; indeed, just as the minimum-based selection algorithm is a partial selection sort, this is a partial quicksort, generating and partitioning only O(log n) of its O(n) partitions. This simple procedure has expected linear performance, and, like quicksort, has quite good performance in practice. It is also an in-place algorithm, requiring only constant memory overhead, since the tail recursion can be eliminated with a loop like this:
function select(a, k, left, right) loop select a pivot value a[pivotIndex] pivotNewIndex := partition(a, left, right, pivotIndex) if k = pivotNewIndex return a[k] else if k < pivotNewIndex right := pivotNewIndex-1 else left := pivotNewIndex+1
Like quicksort, the performance of the algorithm is sensitive to the pivot that is chosen. If bad pivots are consistently chosen, this degrades to the minimum-based selection described previously. However, there is a way to consistently find very good pivots; this is the key to making the algorithm worst-case linear.
To accomplish this, we begin by dividing the list into groups of five elements. Any left over are ignored for now. Then, for each of these, we find the median of the group of five, an operation that can be made very fast by loading all five values into the register set and comparing them. We move all these medians into one contiguous block in the list, and proceed to invoke select recursively on this sublist of n/5 elements to find its median value. Then, we use this "median of medians" for our pivot.
What makes this a good pivot? Note that it is both less and greater than half the elements in our list of medians, or n/10 elements. Moreover, each of these elements is a median of 5, and so is both less and greater than 2 others outside the block, or 2(n/10) elements. Thus the median we chose splits the elements somewhere between 30%/70% and 70%/30%, which is more than good enough to assure worst-case linear behavior of the algorithm.
The trick, however, is ensuring that the added recursive call does not destroy this worst-case linear behavior. This is because the list of medians is 20% of the size of the list, and the other recursive call recurses on at most 70% of the list, so we have this recurrence relation for the running time:
T(n) ≤ T(n/5) + T(7n/10) + O(n)
The O(n) is for the partitioning work. But, because n/5 + 7n/10 is 9n/10 < n, it's not hard to find a linear formula for running time that can be plugged into this formula to make it true.
In practice, although this approach optimizes quite well, it is still typically outperformed by a considerable margin by the expected linear algorithm with fairly naive pivot choices, such as choosing randomly. The worst-case algorithm still has importance, however; it can be used, for example, to construct a worst-case O(n log n) quicksort algorithm, by using it to find the median at every step.
Selection as incremental sorting
One of the advantages of the sort-and-index approach, as mentioned, is its ability to amortize the sorting cost over many subsequent selections. However, sometimes the number of selections that will be done is not known in advance, and may be either small or large. In these cases, we can adapt the algorithms given above to simultaneously select an element while partially sorting the list, thus accelerating future selections.
Both the selection procedure based on minimum-finding and the one based on partitioning can be seen as a form of partial sort. The minimum-based algorithm sorts the list up to the given index, and so clearly speeds up future selections, especially of smaller indexes. The partition-based algorithm does not achieve the same behaviour automatically, but can be adapted to remember its previous pivot choices and reuse them wherever possible, avoiding costly partition operations, particularly the top-level one. The list becomes gradually more sorted as more partition operations are done incrementally; no pivots are ever "lost." If desired, this same pivot list could be passed on to quicksort to reuse, again avoiding many costly partition operations.
Using data structures to select in sublinear time
Given an unorganized list of data, linear time (Ω(n)) is required to find the minimum element, because we have to examine every element (otherwise, we might miss it.) If we organize the list, for example by keeping it sorted at all times, then selecting the kth largest element is trivial, but then insertion requires linear time, as do other operations such as combining two lists.
When only the minimum (or maximum) is needed, a better approach is to use a priority queue, which is typically able to find the minimum element in constant time, while all other operations, including insertion, are O(log n). More generally, a self-balancing binary search tree can easily be augmented to make it possible to both insert an element and find the kth largest element in O(log n) time.
- M. Blum, R.W. Floyd, V. Pratt, R. Rivest and R. Tarjan, "Time bounds for selection," J. Cornput. System Sci. 7 (1973) 448-461.
The contents of this article is licensed from www.wikipedia.org under the GNU Free Documentation License. Click here to see the transparent copy and copyright details