7

I have a list containing ~50k elements of a custom data type (the latter is probably not important for my question) I'm sorting the list using pythons builtin list.sort() method.

myList: List[Foo] = ...
myList.sort(key=Foo.x)

Since the sorting takes a couple of minutes, I would like to have a progress bar for the sorting process. I haven't found any solutions online.

Is this even possible? I'm aware sorting algorithms may be complex and it might not be possible to measure the sorting progress at all. However, it would be fine for my usecase to have a "rough" measurement, like 25%, 50%, 75%...

10
  • 3
    Depends on the sorting algorithm. However, since this is 50k elements, it should take less than a second. Maybe pre-cache the keys since that's probably the bottleneck. Commented Jul 11, 2023 at 10:33
  • @MateenUlhaq That's where the custom data type comes in. I simplified it to Foo.x, but there is actually function call and a fair amount of processing involved in determining x. Caching the result is a good idea.
    – Zciurus
    Commented Jul 11, 2023 at 12:41
  • @MateenUlhaq How would "pre-cache the keys" help? Commented Jul 29, 2023 at 13:54
  • What are your resulting keys? Are they simple values like numbers that compare quickly (so most of the time is spent computing them before the actual sorting begins)? Or are they complex objects that compare slowly? Commented Jul 29, 2023 at 14:01
  • 1
    @KellyBundy Sorting randomly generated 50k primitive values takes only 0.04s on my PC... so presumably, the bottleneck is during the call of key=Foo.x. Assuming this is pure, if this is called more than once on the same Foo object during the program, the computation time may be reduced via caching or similar methods. Commented Jul 29, 2023 at 23:11

3 Answers 3

14
+100

Given the interface provided by sort, you don't have too many options for hooking into the actual sorting algoithm. However, if 50K keys is slow, it is most likely that invoking the key function is slow, which is computed prior to the actual sorting.

From the docs:

The key corresponding to each item in the list is calculated once and then used for the entire sorting process.

So if you keep count of how many times the key method is invoked, you could gain a rough estimate for the whole sorting process. To do so, you could create a wrapper for the key function to manage this book keeping:

def progress_sort(data, *, key=lambda v: v, on_increment=None):
    total = len(data)

    if on_increment is None:
        start = time.time()

        def on_increment(c):
            print(f"{time.time() - start}: {c/total * 100}%")

    count = 0

    def progress_key(val):
        nonlocal count
        if count % int(total / 10) == 0:
            on_increment(count)
        count += 1
        return key(val)

    data.sort(key=progress_key)
    on_increment(total)

Example with some dummy data and a slow key method

def slow_key(val):
    time.sleep(1.0/500_000)
    return val

data = [random.randint(-50_000, 50_000)/1.0 for i in range(50_000)]
progress_sort(data, key=slow_key)
0.0: 0.0%
0.5136210918426514: 10.0%
1.0435900688171387: 20.0%
1.6074442863464355: 30.0%
2.156496524810791: 40.0%
2.9734878540039062: 50.0%
3.4794368743896484: 60.0%
4.016523599624634: 70.0%
4.558118104934692: 80.0%
5.047779083251953: 90.0%
5.545809030532837: 100.0%

This method could then be combined with whatever type of library you wish to use for updating status. You may wish to further configure the data supplied to the provided hooks, however, the principle remains the same.

Here's an example that uses tqdm:

def slow_key(val):
    time.sleep(1.0/500_000)
    return val


data = [random.randint(-50_000, 50_000)/1.0 for i in range(50_001)]

with tqdm(total=len(data), desc="sorting") as pbar:
    progress_sort(data, key=slow_key, on_increment=lambda c: pbar.update(c - pbar.n))
    pbar.set_description("Finished")
sorting:  80%|███████▉  | 40000/50001 [00:05<00:01, 5802.30it/s]
Finished: 100%|██████████| 50001/50001 [00:07<00:00, 6489.14it/s]
3
  • i assumed the same slowness in the key., but I decided that generating a key list outside of the sort as a preliminary step was a good alternative option worthy of it's own answer.
    – toppk
    Commented Jul 30, 2023 at 7:40
  • 2
    @toppk Note that this is what sort is doing already. Pre-computing the key table is somewhat redundant to what the cpython implementation will already perform for you! :) github.com/python/cpython/blob/…
    – flakes
    Commented Jul 30, 2023 at 7:44
  • By now I think it would be ideal to have two progress bars, one for the compute-keys phase (as you did) and one for the actual sorting phase. Either phase could be fast or slow. OP didn't answer my question but I guess their acceptance of this indicates that you were right about their case, that computing keys is much slower for them. A progress bar for the actual sorting is trickier and more interesting, though, as it's not just linear but somewhere between linear and linearithmic. Commented Aug 5, 2023 at 14:36
2

I also assumed that the key determination was the slow part of the sort, which is expected with such a relatively small list size (50k). This answer's approach takes the solution of crafting an intermediate list of just the keys and the object reference. This can be quantified and progress can be shown after each object has its key determined. For this demonstration this was made slow by using a key routine with a 100ms sleep inside of it.

At the end, the real sort, will hopefully be able to be run extremely quickly as the keys have all been precalculated.

#!/usr/bin/python

import time
import random
from operator import attrgetter, itemgetter
class Custom:
    def __init__(self):
        self._key = random.randint(1,50000)

    @property
    def key(self):
        # print('slow key')
        time.sleep(0.1)
        return self._key

    def __repr__(self):
        return f"Custom(key={self._key})"

mylist = [Custom() for i in range(40)]

print(mylist)
mylist2 = []
display_inc = 5
display = 0
for x, ele in enumerate(mylist):
    mylist2.append((ele.key,ele))
    if x/len(mylist) * 100 >= display:
        print(f"{display}% done")
        # stop displaying after 90
        if display >= 90:
            display = 110
        display += display_inc
print("95% done")
mylist2.sort(key=itemgetter(0))
mylist = [i[1] for i in mylist2]
print("100% done")
print(mylist)

returns

$ python slowsort.py 
[Custom(key=22549), Custom(key=5431), Custom(key=8895), Custom(key=10837), Custom(key=12652), Custom(key=43897), Custom(key=24724), Custom(key=16014), Custom(key=46022), Custom(key=25979), Custom(key=45115), Custom(key=45442), Custom(key=42306), Custom(key=17611), Custom(key=25113), Custom(key=12924), Custom(key=21902), Custom(key=1661), Custom(key=6475), Custom(key=41993), Custom(key=40334), Custom(key=44407), Custom(key=20747), Custom(key=7635), Custom(key=38258), Custom(key=45187), Custom(key=13048), Custom(key=18952), Custom(key=46592), Custom(key=10790), Custom(key=24978), Custom(key=5349), Custom(key=47924), Custom(key=12413), Custom(key=7147), Custom(key=17528), Custom(key=3035), Custom(key=16639), Custom(key=17059), Custom(key=25630)]
0% done
5% done
10% done
15% done
20% done
25% done
30% done
35% done
40% done
45% done
50% done
55% done
60% done
65% done
70% done
75% done
80% done
85% done
90% done
95% done
100% done
[Custom(key=1661), Custom(key=3035), Custom(key=5349), Custom(key=5431), Custom(key=6475), Custom(key=7147), Custom(key=7635), Custom(key=8895), Custom(key=10790), Custom(key=10837), Custom(key=12413), Custom(key=12652), Custom(key=12924), Custom(key=13048), Custom(key=16014), Custom(key=16639), Custom(key=17059), Custom(key=17528), Custom(key=17611), Custom(key=18952), Custom(key=20747), Custom(key=21902), Custom(key=22549), Custom(key=24724), Custom(key=24978), Custom(key=25113), Custom(key=25630), Custom(key=25979), Custom(key=38258), Custom(key=40334), Custom(key=41993), Custom(key=42306), Custom(key=43897), Custom(key=44407), Custom(key=45115), Custom(key=45187), Custom(key=45442), Custom(key=46022), Custom(key=46592), Custom(key=47924)]

For some reason if the actual sort on the precalculated keys were slow, then you could partition the list into smaller lists, then resorted into one bigger list but that's kinda messy, so I would want to understand if that would be necessary. Hopefully the significant slowness is in the key generation.

-3

Yes, it is possible to create a progress bar for the sorting process in Python. As you correctly pointed out, sorting algorithms can be complex, and it may not be feasible to precisely measure the progress in all cases. However, a rough estimation of the progress can be achieved by using a wrapper class for the sort keys, as you suggested.

To implement this, you can create a wrapper class that holds the original sort key and additional information about the progress. The wrapper class will keep track of the number of elements compared so far and the total number of comparisons made. Based on this information, you can estimate the progress.

Here's a basic outline of how you can achieve this:

import time
from typing import List

class ProgressSortKey:
    def __init__(self, key, total_elements):
        self.key = key
        self.total_elements = total_elements
        self.comparisons_done = 0
        self.elements_compared = 0

    def __lt__(self, other):
        self.comparisons_done += 1
        self.elements_compared += 1
        return self.key < other.key

def sort_with_progress(lst):
    total_elements = len(lst)
    progress_lst = [ProgressSortKey(item, total_elements) for item in lst]

    # Perform the sorting using the modified sort keys
    progress_lst.sort()

    # Return the original objects without the progress information
    return [item.key for item in progress_lst]

# Example usage:
myList: List[Foo] = ...
sorted_list = sort_with_progress(myList)

Now, you can use the sort_with_progress function to sort your list and get an estimate of the progress. Keep in mind that this is a rough estimation and may not be perfectly accurate, especially for very complex sorting algorithms or with very large datasets. However, for your use case, it should give you a good enough indication of the sorting progress.

To visualize the progress as a progress bar, you can create a simple function that prints the progress in the desired format:

def print_progress_bar(iteration, total, prefix='', suffix='', decimals=1, length=50, fill='█'):
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filled_length = int(length * iteration // total)
    bar = fill * filled_length + '-' * (length - filled_length)
    print(f'\r{prefix} |{bar}| {percent}% {suffix}', end='\r')
    # Print new line when progress is complete
    if iteration == total:
        print()

# Example usage for sorting:
for i, item in enumerate(sort_with_progress(myList)):
    # Do something with the sorted item
    # ...

    # Update and print the progress bar
    print_progress_bar(i + 1, len(myList), prefix='Progress:', suffix='Complete', length=50)

This will print a progress bar that updates as the sorting progresses. The bar will be complete when the sorting is done.

Keep in mind that the exact implementation and behavior might depend on the specifics of your custom data type Foo and how the sorting algorithm performs on it, but this outline should give you a starting point to build your progress bar.

1

Not the answer you're looking for? Browse other questions tagged or ask your own question.