
Saturday, November 20, 2010

Merge Sort Improvement?



The C++ STL uses a trick to improve the performance of merge sort: when the number of elements in a recursive call drops to a small threshold (a few dozen, around 50 or so), it sorts that range with insertion sort instead of recursing further. However, this doesn't always seem to help much when we write the merge sort procedure manually.

This probably pays off in the STL because, given its generic OOP nature, STL methods carry noticeable per-call overhead, so spawning ever more recursive calls can cost more than running a small O(k^2) subroutine on each tiny range. As we have seen in the previous posts, the recursive calls of merge sort form a binary-tree-like structure, and the number of nodes grows dramatically with depth: a tree of depth d has about 2^d nodes, so the bottom few levels contain the vast majority of them. Moreover, allocating a new frame on the call stack always has some overhead. So, changing the algorithm a bit so that the bottom few levels are cut from the tree removes a huge number of nodes, i.e. a huge amount of call overhead, which can improve performance.


So the MERGE-SORT() algorithm is changed a bit:

MERGE-SORT(A, p, r):
    if r - p < k        ; k is the cutoff limit
        INSERTION-SORT(A, p, r)
        return
    if p < r
        then q := (r + p) / 2
             MERGE-SORT(A, p, q)
             MERGE-SORT(A, q+1, r)
             MERGE(A, p, q, r)

Now, what happens when we use this manually? I tried it on a huge range with k = 15, meaning that when the number of elements falls below 15, insertion sort is used. I found an interesting result in this experiment: insertion sort is better up to a certain range size, but beyond that it keeps getting slower. So the value of k is a trade-off between the overhead of recursive calls and insertion sort's quadratic running time. The graph shows the result of my experiment; your numbers will certainly vary if you try the same.


Here is a simple java tester I used for this comparison. Have a look:

import java.util.*;
import java.io.*;

public class Main {

    static final boolean useInsertion = false;  // set true to enable the insertion-sort cutoff
    static final int limit = 15;                // ranges smaller than this use insertion sort

    public static void main(String[] args) throws IOException {
        Scanner stdin = new Scanner(new FileReader("in.txt"));
        PrintWriter out = new PrintWriter(new FileWriter("out.txt"));
        int n = stdin.nextInt();
        int[] a = new int[n];
        for(int i = 0; i < n; i++) a[i] = stdin.nextInt();
        out.println("Elements: " + n);
        long start = System.nanoTime();
        mergeSort(a, 0, n-1);
        long end = System.nanoTime();
        for(int i = 0; i < n; i++) out.println(a[i]);
        out.println("Elapsed Time: " + (double)(end - start)/1000000.0 + "ms.");
        out.flush();
        stdin.close();
        out.close();
    }

    static void mergeSort(int[] a, int p, int r) {
        if(useInsertion && r - p < limit) {
            insertionSort(a, p, r);
            return;
        }
        if(p < r) {
            int q = (p + r) / 2;
            mergeSort(a, p, q);
            mergeSort(a, q+1, r);
            merge(a, p, q, r);
        }
    }

    static void merge(int[] a, int p, int q, int r) {
        int n1 = q-p+1, n2 = r-q;
        int[] L = new int[n1];
        int[] R = new int[n2];
        for(int i = 0; i < n1; i++) L[i] = a[p+i];
        for(int j = 0; j < n2; j++) R[j] = a[q+j+1];
        for(int k = p, i = 0, j = 0; k <= r; k++) {
            if(j >= n2 || (i < n1 && L[i] <= R[j])) a[k] = L[i++];
            else a[k] = R[j++];
        }
    }

    static void insertionSort(int[] a, int p, int r) {
        for(int i = p+1; i <= r; i++) {
            int t = a[i], j;                         // j must stay visible after the loop
            for(j = i - 1; j >= p && a[j] > t; j--)
                a[j+1] = a[j];                       // shift larger elements right
            a[j+1] = t;
        }
    }
}

Change the value of static final boolean useInsertion = false; to true to enable insertion sort, and change static final int limit = 15; to a suitable cutoff: it controls the range size below which insertion sort is applied.
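The tester reads its input from in.txt in the format shown in main(): a count n followed by n integers. A throwaway generator for such a file might look like this (the class name, seed, and constants here are mine, not from the post):

```java
import java.io.*;
import java.util.*;

public class GenInput {
    public static void main(String[] args) throws IOException {
        int n = 100000;                 // number of elements to generate
        Random rnd = new Random(12345); // fixed seed so runs are repeatable
        PrintWriter out = new PrintWriter(new FileWriter("in.txt"));
        out.println(n);                 // first line: element count, as the tester expects
        for (int i = 0; i < n; i++)
            out.println(rnd.nextInt(1000000));
        out.close();
    }
}
```

Run it once before the tester, then vary `n` and `limit` to reproduce the experiment.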

Keep digging!

Wednesday, September 29, 2010

Threaded Merge Sort



What?


In my previous post about Merge Sort, the algorithm for merge sort was discussed. Although the pictures show something that appears to be parallel, it actually is not. In fact, since the procedure introduced there is recursive, it works in a post-order fashion: first it solves the left half, then the right half, and only then does it work on the current range. The following picture shows the actual work-flow of the standard recursive merge sort algorithm:


The red lines and numbers show the actual call sequence of the merge sort algorithm. For reference, the basic algorithm is shown here again:

MERGE-SORT(A, p, r):
    if p < r
        then q := (r + p) / 2
             MERGE-SORT(A, p, q)
             MERGE-SORT(A, q+1, r)
             MERGE(A, p, q, r)

So, from the above example, we can see that no two calls are concurrent, i.e. none run in parallel. Each call is one of the three calls made in the MERGE-SORT(A, p, r) procedure stated above, also shown below (according to the picture, p=0, r=6):

Step-00: MERGE-SORT(A, 0, 6)
Step-01: MERGE-SORT(A, 0, 3)
Step-02: MERGE-SORT(A, 0, 1)
Step-03: MERGE-SORT(A, 0, 0)
Step-04: MERGE-SORT(A, 1, 1)
Step-05: MERGE(A, 0, 0, 1)
Step-06: MERGE-SORT(A, 2, 3)
Step-07: MERGE-SORT(A, 2, 2)
Step-08: MERGE-SORT(A, 3, 3)
Step-09: MERGE(A, 2, 2, 3)
Step-10: MERGE(A, 0, 1, 3)
Step-11: MERGE-SORT(A, 4, 6)
Step-12: MERGE-SORT(A, 4, 5)
Step-13: MERGE-SORT(A, 4, 4)
Step-14: MERGE-SORT(A, 5, 5)
Step-15: MERGE(A, 4, 4, 5)
Step-16: MERGE-SORT(A, 6, 6)
Step-17: MERGE(A, 4, 5, 6)
Step-18: MERGE(A, 0, 3, 6)

Also note that there are no overlapping calls at the same level of the tree, which is going to be a key factor.
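The call sequence above is easy to reproduce by instrumenting the recursion. Here is a small Java sketch (the class name is mine) that logs each MERGE-SORT call on entry and each MERGE at the moment it runs, mirroring the post-order work-flow:

```java
import java.util.*;

public class CallTrace {
    static List<String> log = new ArrayList<>();

    // Same structure as MERGE-SORT, but logging instead of sorting.
    static void mergeSort(int p, int r) {
        log.add("MERGE-SORT(A, " + p + ", " + r + ")");
        if (p < r) {
            int q = (p + r) / 2;
            mergeSort(p, q);
            mergeSort(q + 1, r);
            log.add("MERGE(A, " + p + ", " + q + ", " + r + ")");
        }
    }

    public static void main(String[] args) {
        mergeSort(0, 6);  // p=0, r=6, as in the picture
        for (int s = 0; s < log.size(); s++)
            System.out.printf("Step-%02d: %s%n", s, log.get(s));
    }
}
```

Running it prints exactly the 19 steps listed above, Step-00 through Step-18.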

How threading helps:


After a few observations, we can find some interesting properties of the merge sort algorithm. If we look closely at the merge sort tree structure, it becomes very clear that, in the recursive call tree, parent nodes depend on their children, but siblings are independent. This property lets us derive a threaded version of merge sort in which we really can solve the left and right sub-portions of a range in parallel by making the calls threaded.

As the algorithm here is recursive, it might look confusing at first glance, but it is not: nodes share data only with their ancestors, and in each call we wait until the child threads finish their work, so no concurrency problem appears. We also don't need to worry about mutual exclusion of shared data, because, as stated above, the two halves operate on disjoint parts of the array.

So, the threaded version of the algorithm will look like this:

THREADED-MERGE-SORT(A, p, r):
    if p < r
        then q := (r + p) / 2
             start_thread(left_thread, THREADED-MERGE-SORT(A, p, q))
             start_thread(right_thread, THREADED-MERGE-SORT(A, q+1, r))
             wait_for(left_thread)
             wait_for(right_thread)
             MERGE(A, p, q, r)

According to the above algorithm, the flow of program changes which is shown in the picture below:


Now, because multiple THREADED-MERGE-SORT() calls can run at a time, and thanks to the job scheduling of modern computers, the effective number of sequential iterations is dramatically reduced, which can make a real difference in heavy-duty areas like manipulating large data on drives.
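The same idea can also be sketched compactly in Java, the language of the first post's tester. This is a deliberately naive sketch (class name mine): it spawns one thread per recursive call, which is wasteful for large inputs; a practical version would stop spawning threads below some cutoff size and recurse plainly instead.

```java
import java.util.Arrays;

public class ThreadedMergeSort {
    static int[] a;

    // Spawn one thread per half, wait for both, then merge --
    // a direct translation of THREADED-MERGE-SORT.
    static void sort(int p, int r) {
        if (p >= r) return;
        int q = (p + r) / 2;
        Thread left  = new Thread(() -> sort(p, q));
        Thread right = new Thread(() -> sort(q + 1, r));
        left.start();
        right.start();
        try {
            left.join();   // wait_for(left_thread)
            right.join();  // wait_for(right_thread)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        merge(p, q, r);
    }

    static void merge(int p, int q, int r) {
        int[] L = Arrays.copyOfRange(a, p, q + 1);
        int[] R = Arrays.copyOfRange(a, q + 1, r + 1);
        for (int k = p, i = 0, j = 0; k <= r; k++) {
            if (j >= R.length || (i < L.length && L[i] <= R[j])) a[k] = L[i++];
            else a[k] = R[j++];
        }
    }

    public static void main(String[] args) {
        a = new int[] {5, 2, 9, 1, 7, 3, 8};
        sort(0, a.length - 1);
        System.out.println(Arrays.toString(a));  // prints [1, 2, 3, 5, 7, 8, 9]
    }
}
```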

A simple implementation:


Here is a simple implementation using the pthread library in C++:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

typedef struct { int i, j; } pair;

const int MAX = 1024;
int a[MAX];
pthread_attr_t attr;

void *threaded_merge_sort(void *param) {
    pair range = *(pair *)param, lrange, rrange;
    pthread_t lside, rside;
    int p = range.i, q, r = range.j, n1, n2, n = r-p+1, i, j, k;
    int *aleft, *aright;
    if(p < r) {
        q = (p + r) >> 1;
        lrange.i = p, lrange.j = q, rrange.i = q + 1, rrange.j = r;
        /* sort the two halves in parallel, then wait for both */
        pthread_create(&lside, &attr, threaded_merge_sort, (void *)&lrange);
        pthread_create(&rside, &attr, threaded_merge_sort, (void *)&rrange);
        pthread_join(lside, NULL);
        pthread_join(rside, NULL);
        /* merge the sorted halves */
        n1 = q - p + 1, n2 = r - q;
        aleft = (int *)malloc(sizeof(int) * n1);
        aright = (int *)malloc(sizeof(int) * n2);
        for(i = 0; i < n1; i++) aleft[i] = a[p+i];
        for(i = 0; i < n2; i++) aright[i] = a[q+1+i];
        for(k = i = j = 0; k < n; k++) {
            if(i >= n1 || (j < n2 && aleft[i] > aright[j])) a[k+p] = aright[j++];
            else a[k+p] = aleft[i++];
        }
        free(aleft);
        free(aright);
    }
    return NULL;
}

int main() {
    int n, i;
    pthread_t sorter;
    pair range;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    while(scanf("%d", &n)==1 && n) {
        for(i = 0; i < n; i++) scanf("%d", &a[i]);
        range.i = 0, range.j = n-1;
        pthread_create(&sorter, &attr, threaded_merge_sort, (void *)&range);
        pthread_join(sorter, NULL);
        for(i = 0; i < n; i++) printf("%d%c", a[i], (i==n-1? '\n' : ' '));
    }
    pthread_attr_destroy(&attr);
    return 0;
}

This may look a bit messy, but if you compare it with the original procedure, you will find only one real difference: we create threads instead of making plain recursive calls, that's it.

Give it a try! :)

Thursday, August 19, 2010

Merge Sort



Introduction:


Merge-sort is based on the divide-and-conquer paradigm. The Merge-sort algorithm can be described in general terms as consisting of the following three steps:

  1. Divide Step: If the given array A has zero or one element, return A; it is already sorted. Otherwise, divide A into two arrays, A1 and A2, each containing about half of the elements of A.

  2. Recursive Step: Recursively sort arrays A1 and A2.

  3. Conquer Step: Combine the elements back into A by merging the sorted arrays A1 and A2 into a sorted sequence.


The diagram below shows the divide and conquer (merging) steps stated above. We can visualize merge sort by means of a binary tree, where each internal node represents a recursive call and each external node represents an individual element of the given array A. Such a tree is called a merge-sort tree. The heart of the merge sort algorithm is the conquer step, which merges two sorted sequences into a single sorted sequence. This simple but amazing algorithm and a straightforward C++ implementation are presented below, and some cool links are added in the "Reference" section at the end of the post.

Algorithm:


Input: Array A[p...r], indices p, q, r (p ≤ q < r).
Output: Array A[p...r] in ascending order

MERGE-SORT(A, p, r):
    if p < r
        then q := (r + p) / 2
             MERGE-SORT(A, p, q)
             MERGE-SORT(A, q+1, r)
             MERGE(A, p, q, r)

MERGE(A, p, q, r):
    n1 := q - p + 1
    n2 := r - q
    create arrays L[1...n1+1] and R[1...n2+1]
    for i := 1 to n1
        do L[i] := A[p + i - 1]
    for j := 1 to n2
        do R[j] := A[q + j]
    L[n1 + 1] := INF
    R[n2 + 1] := INF
    i := 1
    j := 1
    for k := p to r
        do if L[i] <= R[j]
               then A[k] := L[i]
                    i := i + 1
               else A[k] := R[j]
                    j := j + 1
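The sentinel trick in MERGE above (the INF entries) can be rendered directly in Java, using Integer.MAX_VALUE to play the role of INF. This is a sketch, assuming no input element equals that value (class name mine); because the sentinels never "run out", the main loop needs no bounds checks on i and j:

```java
public class SentinelMerge {
    // MERGE with sentinels: one extra slot per half holds INF
    // (Integer.MAX_VALUE), so the copy-back loop only compares L[i] and R[j].
    // Assumes no array element equals Integer.MAX_VALUE.
    static void merge(int[] A, int p, int q, int r) {
        int n1 = q - p + 1, n2 = r - q;
        int[] L = new int[n1 + 1], R = new int[n2 + 1];
        for (int i = 0; i < n1; i++) L[i] = A[p + i];
        for (int j = 0; j < n2; j++) R[j] = A[q + 1 + j];
        L[n1] = Integer.MAX_VALUE;  // sentinel: INF
        R[n2] = Integer.MAX_VALUE;  // sentinel: INF
        for (int k = p, i = 0, j = 0; k <= r; k++) {
            if (L[i] <= R[j]) A[k] = L[i++];
            else A[k] = R[j++];
        }
    }

    public static void main(String[] args) {
        int[] A = {2, 5, 8, 1, 4, 9};  // two sorted halves: [2,5,8] and [1,4,9]
        merge(A, 0, 2, 5);
        System.out.println(java.util.Arrays.toString(A));  // prints [1, 2, 4, 5, 8, 9]
    }
}
```

The C++ implementation at the end of this post avoids the sentinels and instead checks the indices explicitly, which also works when MAX_VALUE can occur in the data.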


Analysis:


In sorting n objects, merge sort has an average and worst-case performance of O(n log n). If the running time of merge sort for a list of length n is T(n), then the recurrence T(n) = 2T(n/2) + n follows from the definition of the algorithm (apply the algorithm to two lists of half the size of the original list, and add the n steps taken to merge the resulting two lists) and the closed form follows from the master theorem.
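The closed form can also be seen by unrolling the recurrence directly (assuming for simplicity that n is a power of two):

T(n) = 2T(n/2) + n
     = 4T(n/4) + 2n
     = 8T(n/8) + 3n
     = ...
     = 2^k T(n/2^k) + k·n

Stopping at k = lg n, where n/2^k = 1:

T(n) = n·T(1) + n lg n = O(n lg n)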

Simple proof:
The merge sort algorithm builds a complete binary tree of recursive calls with depth d, and at each level a total of n elements is processed.
So, 2^d ≈ n, which implies d ≈ lg n.
Now, with about 2n operations per level (copying and merging), the total number of operations in the merge sort algorithm is:
2n · d ≈ 2n lg n = O(n lg n)

In the worst case, merge sort does a number of comparisons equal to or slightly smaller than (n ⌈lg n⌉ - 2^⌈lg n⌉ + 1), which is between (n lg n - n + 1) and (n lg n + n + O(lg n)).

In the worst case, merge sort does about 39% fewer comparisons than quicksort does in its average case. Merge sort makes fewer comparisons than quicksort except in extremely rare cases, where merge sort's worst case coincides with quicksort's best case. In terms of moves, merge sort's worst-case complexity is O(n log n), the same as quicksort's best case, and merge sort's best case takes about half as many iterations as its worst case. That said, depending on the machine's memory architecture, quicksort can sometimes outperform merge sort, though this is rare.

Recursive implementations of merge sort make 2n - 1 method calls in the worst case, compared to quicksort's n, so merge sort has roughly twice as much recursive overhead as quicksort. However, iterative, non-recursive implementations of merge sort, which avoid method call overhead, are not difficult to code. Merge sort's most common implementation does not sort in place; an auxiliary buffer the size of the input must be allocated to hold the sorted output.

Merge sort also has an often overlooked, but practically important, best-case property: if the merge step is skipped whenever the two halves are already in order (one extra comparison per call), the complexity on sorted input falls to O(n). Specifically, n-1 comparisons and zero moves are performed, which is the same as simply running through the input, checking whether it is pre-sorted.
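This best case relies on skipping the merge when the two halves are already in order, which takes a single comparison of the last element of the left half against the first element of the right half. A hedged Java sketch (class name mine) that counts those comparisons on already-sorted input:

```java
public class EarlyExitMergeSort {
    static int comparisons = 0;

    static void sort(int[] a, int p, int r) {
        if (p >= r) return;
        int q = (p + r) / 2;
        sort(a, p, q);
        sort(a, q + 1, r);
        comparisons++;
        if (a[q] <= a[q + 1]) return;  // halves already in order: skip the merge
        merge(a, p, q, r);
    }

    static void merge(int[] a, int p, int q, int r) {
        int n1 = q - p + 1, n2 = r - q;
        int[] L = new int[n1], R = new int[n2];
        for (int i = 0; i < n1; i++) L[i] = a[p + i];
        for (int j = 0; j < n2; j++) R[j] = a[q + 1 + j];
        for (int k = p, i = 0, j = 0; k <= r; k++) {
            if (j >= n2 || (i < n1 && L[i] <= R[j])) a[k] = L[i++];
            else a[k] = R[j++];
        }
    }

    public static void main(String[] args) {
        int[] a = new int[1024];
        for (int i = 0; i < a.length; i++) a[i] = i;  // already sorted
        sort(a, 0, a.length - 1);
        // One comparison per internal node of the recursion tree: n - 1 total.
        System.out.println("comparisons: " + comparisons);  // prints comparisons: 1023
    }
}
```

On sorted input, no merge ever runs, so only the n-1 early-exit comparisons are performed.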

Although heap sort has the same time bounds as merge sort, it requires only Θ(1) auxiliary space instead of merge sort's Θ(n), and is often faster in practical implementations. If in-place sorting is necessary, heap sort is a better choice: merge sort can be implemented as an in-place algorithm and still runs in O(n lg n) time, but the memory saving is rarely worth the added complexity of the program.

C++ Implementation:


Call Merge_Sort(A, start, end) to sort the closed range [start, end] of the array A.

#include <vector>

void Merge(int A[], int p, int q, int r) {
    int n1 = q - p + 1, n2 = r - q;
    std::vector<int> L(n1), R(n2);  // variable-length arrays are not standard C++
    for(int i = 0; i < n1; i++)
        L[i] = A[p + i];
    for(int j = 0; j < n2; j++)
        R[j] = A[q + j + 1];
    for(int k = p, i = 0, j = 0; k <= r; k++) {
        if(j >= n2 || (i < n1 && L[i] <= R[j])) A[k] = L[i++];
        else A[k] = R[j++];
    }
}

void Merge_Sort(int A[], int p, int r) {
    if(p < r) {
        int q = (p + r) / 2;
        Merge_Sort(A, p, q);
        Merge_Sort(A, q+1, r);
        Merge(A, p, q, r);
    }
}

For implementations in other languages, this page contains implementations of the merge sort procedure in almost all the living languages of the world: http://rosettacode.org/wiki/Sorting_algorithms/Merge_sort



Reference:



Thanks for reading.