15
\$\begingroup\$

Pathological Sorting

Your boss has demanded that you develop a sorting algorithm to improve the performance of your company's application. However, having written the application, you know that you're unlikely to be able to make it significantly faster. Not wanting to disappoint your boss, you've decided to develop a new algorithm that works even better than *sort on certain sets of data. Of course, you can't make it obvious that the algorithm only works on some cases, so you want to make it obscure as possible.

The objective of this contest is to write a sorting routine in the language of your choice that performs better on certain sets of data than others, with repeatable results. The more specific the classification that determines speed, the better. The algorithm must do sorting of some kind, so an algorithm that depends on the data already being completely sorted (as in, an algorithm that does nothing), or an algorithm that depends on the data being completely sorted in reverse, are both invalid. The sorting algorithm must correctly sort any set of data.

After presenting your routine, please include an explanation of why it only works on certain sets of data, and include test runs on at least one set of good (fast) data and one set of bad (slow) data. The point here is to be able to prove to your boss that you've stumbled upon a better way to sort, so more test data is better. Of course, you're only going to show your boss the test results from the good data, so the flaw in the required testing data can't be too obvious. If applicable to your language, please show that your algorithm is faster than your language's built-in sorting algorithm.

For example, one might submit an insertion sort algorithm, with the good data being data that is already nearly sorted, and bad data being completely random data, since insertion sort approaches O(n) on nearly-sorted data. However, this isn't very good, since my boss would probably notice that all of the testing data is nearly sorted to begin with.

This is a , so the answer with the most votes after 7 days (May 21) wins.

If no one beats me to it, I'd like to submit a community wiki answer that takes advantage of uniformly distributed data sets.

\$\endgroup\$
1

2 Answers 2

9
\$\begingroup\$

It's been a pretty long time, but I remember back in Algorithms 101 we were taught some sorting algorithm that used randomization. I wasn't a very good student so I don't really remember how it went or why it worked quickly on average.

Nevertheless, I've decided that this problem calls for a solution that uses randomization, which will hopefully work in my favor on average.

import random

def arrayIsSorted (arr) :
    for i in range(len(arr)-1) :
        if arr[i]>arr[i+1] : return False
    return True

def rSort (arr) :
    random.seed (42)
    counter = 0
    while not arrayIsSorted(arr) :
        random.shuffle (arr)
        counter+=1
    print ("Sorted in %d iterations." % counter)
    return arr

Since true randomization is important, I make sure to seed the RNG with the answer to Life, the Universe and Everything. After a bit of testing it turns out that that was a smart move! Check out how fast these 2 completely arbitrary lists get sorted:

rSort ([6,1,4,2,3,7,5])
rSort ([8,9,6,1,4,7,2,3,5])

Both of these get sorted in only 1 iteration - you couldn't possibly ask for a faster function than that!

Now, admittedly, some other lists produce slightly worse results...

rSort ([5,1,4,2,3,7,6])
rSort ([8,9,6,1,4,7,2,5,3])

These get sorted in 4,176 and 94,523 iterations respectively, which actually takes more than a second... but let's just keep that fact to ourselves so as not to distract anyone from how amazing this algorithm is!

Edit:

I've been asked to prove my algorithm's efficiency on a list of 100 items, so here you go:

rSort ([70, 6, 52, 97, 85, 61, 62, 48, 30, 3, 11, 88, 39, 91, 98, 8, 54, 92, 44, 65, 69, 21, 58, 41, 60, 76, 27, 82, 93, 81, 20, 94, 22, 29, 49, 95, 40, 19, 55, 42, 43, 1, 0, 67, 35, 15, 51, 31, 16, 25, 5, 53, 37, 74, 86, 12, 13, 72, 56, 32, 47, 46, 59, 33, 80, 4, 45, 63, 57, 89, 7, 77, 14, 10, 34, 87, 18, 79, 9, 66, 24, 99, 64, 26, 78, 38, 90, 28, 83, 75, 68, 2, 17, 73, 96, 71, 23, 84, 36, 50])

Even this long and completely arbitrary list gets sorted instantly! Truly I must have stumbled upon the best sorting algorithm in the world!

\$\endgroup\$
11
  • 3
    \$\begingroup\$ Can we get some test results on slightly larger datasets? Maybe one with 100 elements? ;) \$\endgroup\$
    – Geobits
    Commented May 14, 2014 at 19:52
  • \$\begingroup\$ @Geobits No problem, here it is :) \$\endgroup\$
    – Tal
    Commented May 14, 2014 at 20:11
  • 1
    \$\begingroup\$ @Geobits Yes it does. Eventually. \$\endgroup\$
    – Tal
    Commented May 14, 2014 at 20:17
  • 3
    \$\begingroup\$ It's a stretch, but it could be argued that it uses bogosort, which will eventually sort the array, given enough time. I'm willing to bet that 'shuffle and repeat' qualifies as asorting, albeit not good sorting. \$\endgroup\$
    – millinon
    Commented May 14, 2014 at 20:23
  • 1
    \$\begingroup\$ If it was true random shuffles, maybe. PRNGs have a cycle, so I can't see how you could guarantee all permutations are tried. \$\endgroup\$
    – Geobits
    Commented May 14, 2014 at 20:26
2
\$\begingroup\$

If you can create your own data, then it's pretty straightforward - get data that looks random, but includes a key for faster sorting. All other data uses the original sorting method, so the average times are better.

One easy way is to make sure each data item has a unique key, and then just hash the keys. Take for example a list with the numbers from 1-10,000, all multiplied by 16, and with a random number from 0-15 added to it (see fillArray() below). They will look random, but each one has a unique sequential key. For sorting, divide by 16 (in C the >>4 is very fast) and then just place the number into an array using the resulting key as the index. One pass and you're done. In testing, I found quicksort was 30 times slower on ten million numbers.

void fillArray(int *a,int len)
{
  for (int i=0;i<len;++i)
    a[i]=(i<<4)|(rand()&0xF);
  // shuffle later
}
void sortArray(int *a,int len)
{
  int key=0;
  int *r=new int[len];
  for (int i=0;i<len;++i)
  {
    key=a[i]>>4;
    r[key]=a[i];
  }
  memcpy(a,r,len*sizeof(int));
  delete[] r;
}
void shuffleArray(int *a,int len)
{
  int swap=0, k=0;
  for (int i=0;i<len;++i)
  {
    k=rand()%len;
    swap=a[k];
    a[k]=a[i];
    a[i]=swap;
  }
}
int qCompare(const void*a,const void*b)
{
  int result=*((int*)a)-*((int*)b);
  return result;
}
void main()
{
  int aLen=10000;
  int *a=new int[aLen];
  srand (time(NULL));
  fillArray(a,aLen);
  // time them
  long t0=0, d0=0, d1=0;
  // qsort
  shuffleArray(a,aLen);
  t0=::GetTickCount();
  qsort(a,aLen,sizeof(int),&qCompare);
  d0=::GetTickCount()-t0;
  // oursort
  shuffleArray(a,aLen);
  t0=::GetTickCount();
  sortArray(a,aLen);
  d1=::GetTickCount()-t0;
  delete[] a;
}

Anything that has a unique key can be sorted this way - if you have the memory to store it, of course. For example, many databases use a unique numeric customer id - if the list is small/sequential enough this could be held in memory. Or some other way to translate a record into a unique number. For more info, research Hash Sorts, since that's what this is...

\$\endgroup\$

Not the answer you're looking for? Browse other questions tagged or ask your own question.