0

What is the best way to sort a dictionary with 1Gbyte size(255 char for each word) with 2G of RAM? I have already tried quicksort and didn't get the acceptable result. This the quicksort code:

#include <iostream>
#include <fstream>
#include <cstring>

#define MAXL 4000000
using namespace std;
void swap(char *&ch1,char *&ch2)
{
    char *temp = ch1;
    ch1 = ch2;
    ch2 = temp;
}

int partition (char **arr, int low, int high)
{
    string pivot = arr[high];    // pivot
    int i = (low - 1);  // Index of smaller element

    for (int j = low; j <= high- 1; j++)
    {
        // If current element is smaller than or
        // equal to pivot
        if (arr[j] <= pivot)
        {
            i++;    // increment index of smaller element
            swap(arr[i], arr[j]);
        }
    }
    swap(arr[i + 1], arr[high]);
    return (i + 1);
}

void quickSort(char **arr, int low, int high)
{
    if (low < high)
    {
        int pi = partition(arr, low, high);

        // Separately sort elements before
        // partition and after partition
        quickSort(arr, low, pi - 1);
        quickSort(arr, pi + 1, high);
    }
}


int main()
{
    fstream file("input.txt",ios::in|ios::out|ios::app);
    fstream o("output.txt",ios::out);

    char **arr = new char*[MAXL];
    for(int i=0;i<MAXL;i++)
        arr[i] = new char[255];

    long long i=0;
    while(file)
    {
//words are sepearated by spcae
        file.getline(arr[i],256,' ');
        i++;
    }

    file.close();
    quickSort(arr, 0, i-2);

    for(long long j=0;j<i-1;j++)
    {
        o << arr[j] << "\n";
    }
}

It takes more than 10 minutes to sort the mentioned list but it shouldn't take more than 20 seconds. (MAXL is the number of words in the 1G file and input words are stored in a text file)

15
  • 6
    What was wrong with quicksort ?
    – MBo
    Commented Nov 21, 2018 at 18:59
  • Any decent sorting algorithm is in-place, which means that 1GB is enough. Overhead will be insignificant.
    – user1196549
    Commented Nov 21, 2018 at 19:12
  • @YvesDaoust ANY decent sorting algorithm? The best ones for large datasets tend to be based on mergesort, which is not in-place.
    – btilly
    Commented Nov 21, 2018 at 19:21
  • @YvesDaoust The point that you're ignoring is that, while inappropriate for some purposes, it is a decent sorting algorithm that is not in place. As another example look at Timsort, which is not only decent, but so good on real-world data that it has beaten everything else to become the default in Python, Java, and many other languages. Again, not an in place algorithm. (In part because it can resort to mergesort as a fallback.)
    – btilly
    Commented Nov 21, 2018 at 20:04
  • The point being that you can't just go to a standard library, take the recommended algorithm, and assume that it is in place. Because there is a very good chance that it isn't.
    – btilly
    Commented Nov 21, 2018 at 20:06

2 Answers 2

2

If you can't fit it all in memory, a file-based merge sort will work well.

4
  • How can 1GB data fail to fit in 2GB memory ?
    – user1196549
    Commented Nov 21, 2018 at 19:29
  • 1
    @YvesDaoust Because there is overhead in the in memory representation of raw data. For example if you're using Python, per stackoverflow.com/questions/18596342/… every string has 37 bytes of overhead. So a 1 GB file of words will actually take several GB to load into memory.
    – btilly
    Commented Nov 21, 2018 at 20:15
  • @btilly: the question says 1 GB size. I take it literally.
    – user1196549
    Commented Nov 21, 2018 at 20:18
  • it fits in the memory
    – user10080061
    Commented Nov 21, 2018 at 20:18
0

In-place algorithms are your solution. Find more here:

As another example, many sorting algorithms rearrange arrays into sorted order in-place, including bubble sort, comb sort, selection sort, insertion sort, heapsort, and Shell sort. These algorithms require only a few pointers, so their space complexity is O(log n).

5
  • QuickSort also requires O(log n) when it is properly implemented
    – MBo
    Commented Nov 21, 2018 at 18:58
  • @MBo Curious about where the log n here comes from. Is it the description of the size of the input?
    – mrmcgreg
    Commented Nov 21, 2018 at 19:03
  • 1
    @mrmcgreg Yes. log(n) is size of stack for quicksort. But log(n) space for mentioned algorithms is (too) formal definition based on bit size of variables needed for n range.
    – MBo
    Commented Nov 21, 2018 at 19:11
  • Several of the cited algorithms require O(1) space. I can't adhere to the last sentence.
    – user1196549
    Commented Nov 21, 2018 at 19:31
  • time matters in this sort
    – user10080061
    Commented Nov 22, 2018 at 10:02