Parallel Quicksort

Question

As my parallel mergesort implementation was very memory dependent, I wanted to write a parallel quicksort one. Here it comes:

template<class In>
In partition(In b, In e) {
    // create uniform distribuiton for random engine
    std::uniform_int_distribution<int> rand_distribution(0, std::distance(b, e - 1));

    // call static random engine to get index of random pivot
    tbb::spin_mutex::scoped_lock lock;
    lock.acquire(mutex);
    auto rand_pivot_index = rand_distribution(rand_engine);
    lock.release();

    // put pivot to last place in range
    std::swap(*(b + rand_pivot_index), *(e - 1));

    // save pivot value
    auto pivot = *(e - 1);
    auto pivotiterator = b;

    // sort the range
    for(auto it = b; it != e - 1; ++it) {
        if(*it < pivot) {
            std::swap(*it, *pivotiterator);
            ++pivotiterator;
        }
    }

    // put pivot to its according position and return position
    std::swap(*(pivotiterator), *(e - 1));
    return pivotiterator;
}


template<class In>
void quick_sort_parallel(In b, In e) {
    if (b != e) {
        In m = partition(b, e);

        // switch to different sort on worst case or smaller ranges
        if(std::distance(b, m) > 500) {
            tbb::parallel_invoke([&] { quick_sort_parallel(b, m); }, 
                                 [&] { quick_sort_parallel(m + 1, e); });
        }
        else 
            merge_sort_parallel(b, e); //defined somewhere else, pretty standard
    }
}

cmh · Accepted Answer · 2012-04-13 16:45:12Z

Code

I would suggest using RAII for your lock.

Instead of:

std::uniform_int_distribution<int> rand_distribution(0, std::distance(b, e - 1));

// call static random engine to get index of random pivot
tbb::spin_mutex::scoped_lock lock;
lock.acquire(mutex);
auto rand_pivot_index = rand_distribution(rand_engine);
lock.release();

// put pivot to last place in range
std::swap(*(b + rand_pivot_index), *(e - 1));

Use:

std::uniform_int_distribution<int> rand_distribution(0, std::distance(b, e - 1));

// call static random engine to get index of random pivot
int rand_pivot_index;
{
    tbb::spin_mutex::scoped_lock lock(mutex);
    rand_pivot_index = rand_distribution(rand_engine);
}

// put pivot to last place in range
std::swap(*(b + rand_pivot_index), *(e - 1));

This ensures that if rand_distribution throws, the lock is released correctly.

Algorithm

Choice of pivot is important in any quick-sort algorithm, and even more so in the parallel case, when you want to garuntee a good distribution of load over cores by avoiding unbalanced recursion.

A good basic strategy is to choose the median of k random elements as your pivot.

Taking this further, using a median of medians algorithm can guarantee your quicksort a O(n log n) running time.

The correct approach and value of k will largely depend on the inputs your function will expect to take (e.g. adapt k based on size on input).

Of course RAII is the way to manage locks. Today I would do it all a little bit different anyway. Concerning the median of median thing, I don't understand how you can guarantee O(n log n) on an array of only dupes. — inf, Commented Apr 14, 2012 at 7:54
If your input can have non distinct elements then you also need your partition step to split into three. 1: (< pivot), 2: (= pivot), 3: (> pivot). Then recurse on 1 and 3. This will get you to O(n log n). — cmh, Commented Apr 14, 2012 at 9:22
@MORTAL from the original question - it's an unimportant implementation detail. — cmh, Commented Dec 18, 2014 at 17:27

Morwenn · Accepted Answer · 2016-04-06 13:56:28Z

Actually, how your pseudo-random number generator is declared does matter, for depending on how it is declared, you may avoid using a lock to generate random numbers. Generating a random number is alreadu an expensive operation, so locking them makes it really expensive.

If you make your pseudo-random number generator thread_local, you don't need to lock anything anymore before generating a random number, it will be inherently thread-safe:

thread_local std::random_device rd;
thread_local std::mt19937 gen(rd());

Moreover, if you also make thread_local your std::uniform_int_distribution, you won't have to update it every time you call partition, but only once in each thread. That said, you can't initialize it with its "range" anymore, you have to pass it when you call it:

// Initialize it once
thread_local std::uniform_int_distribution<int> rand_distribution{};

// Call it
using param_type = std::uniform_int_distribution<int>::param_type;
auto rand_pivot_index = rand_distribution(gen, {0, std::distance(b, e - 1)});

If I didn't make any error while writing the code (untested), this should be thread-safe, lock-free and faster than your original solution :)

That's a good idea these days. Guess thread_local wasn't even supported back in 2012. — inf, Commented Apr 6, 2016 at 14:00
@inf Oh yeah, the question is becoming a bit old. I know that GCC and CLang have supported an almost equivalent __thread for a while, but I guess it took a bit longer to MSVC :) — Morwenn, Commented Apr 6, 2016 at 14:01

Stack Exchange Network

Parallel Quicksort

2 Answers 2

Code

Algorithm

Not the answer you're looking for? Browse other questions tagged
c++
multithreading
c++11
quick-sort
sorting
or ask your own question.

Linked

Hot Network Questions

Parallel Quicksort

2 Answers 2

Code

Algorithm

Not the answer you're looking for? Browse other questions tagged c++multithreadingc++11quick-sortsorting or ask your own question.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
c++
multithreading
c++11
quick-sort
sorting
or ask your own question.