47

I have a python list where elements can repeat.

>>> a = [1,2,2,3,3,4,5,6]

I want to get the first n unique elements from the list. So, in this case, if I want the first 5 unique elements, they would be:

[1,2,3,4,5]

I have come up with a solution using generators:

def iterate(itr, upper=5):
    count = 0
    for index, element in enumerate(itr):
        if index == 0:
            count += 1
            yield element
        elif element not in itr[:index] and count < upper:
            count += 1
            yield element

In use:

>>> i = iterate(a, 5)
>>> [e for e in i]
[1,2,3,4,5]

I have doubts about this being the most optimal solution. Is there an alternative strategy that I can implement to write it in a more Pythonic and efficient way?

9
  • 5
    Try: set(a)[:n] Commented Dec 21, 2018 at 16:11
  • 15
    @TonyPellerin That does not guarantee you get the first 5 elements Commented Dec 21, 2018 at 16:14
  • 2
    Your code is Pythonic enough; it is just inefficient. element not in itr[:index] is not efficient - use a set Commented Dec 21, 2018 at 16:16
  • 3
    Is the list always sorted? Commented Dec 21, 2018 at 16:16
  • 5
    for the future: if your code works and you need to improve it, it is better to post it on codereview.stackexchange.com Commented Dec 21, 2018 at 16:49

13 Answers

52

I would use a set to remember what was seen and return from the generator when you have seen enough:

a = [1, 2, 2, 3, 3, 4, 5, 6]

def get_unique_N(iterable, N):
    """Yields (in order) the first N unique elements of iterable.
    Might yield fewer if the data runs out first."""
    seen = set()
    for e in iterable:
        if e in seen:
            continue
        seen.add(e)
        yield e
        if len(seen) == N:
            return

k = get_unique_N([1, 2, 2, 3, 3, 4, 5, 6], 4)
print(list(k))

Output:

[1, 2, 3, 4]

According to PEP 479 you should return from generators rather than raise StopIteration - thanks to @khelwood & @iBug for pointing that out - you never stop learning.

With 3.6 you get a DeprecationWarning; with 3.7 it raises a RuntimeError - see the PEP's Transition Plan if you are still using raise StopIteration.
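
For illustration, a minimal sketch of the difference (hypothetical function names; the RuntimeError behaviour is CPython 3.7+):

def bad_gen():
    # pre-PEP 479 style: ending a generator by raising StopIteration
    yield 1
    raise StopIteration  # DeprecationWarning in 3.6, RuntimeError in 3.7+

def good_gen():
    # PEP 479 style: simply return from the generator
    yield 1
    return

print(list(good_gen()))  # [1]
# list(bad_gen()) raises "RuntimeError: generator raised StopIteration" on 3.7+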


Your solution using elif element not in itr[:index] and count<upper: uses O(k) lookups - with k being the length of the slice - while using a set reduces this to O(1) lookups at the cost of extra memory, because the set has to be kept as well. It is a speed vs. memory tradeoff - which is better is application/data dependent.

Consider [1, 2, 3, 4, 4, 4, 4, 5] vs [1] * 1000 + [2] * 1000 + [3] * 1000 + [4] * 1000 + [5] * 1000 + [6]:

For 6 unique values (in the longer list):

  • your solution would need lookups of O(1)+O(2)+...+O(5001)
  • mine needs 5001*O(1) lookups plus the memory for set({1, 2, 3, 4, 5, 6})
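
If you want to measure the tradeoff on your own data, a rough timeit sketch along these lines can compare the two strategies (with_slices and with_set are stand-in names for the question's and this answer's approaches):

from timeit import timeit

a = [1] * 1000 + [2] * 1000 + [3] * 1000 + [4] * 1000 + [5] * 1000 + [6]

def with_slices(itr, upper=6):
    # the question's strategy: scan the prefix slice for every element
    count = 0
    for index, element in enumerate(itr):
        if element not in itr[:index] and count < upper:
            count += 1
            yield element

def with_set(itr, upper=6):
    # this answer's strategy: O(1) membership checks against a set
    seen = set()
    for element in itr:
        if element not in seen:
            seen.add(element)
            yield element
            if len(seen) == upper:
                return

print(timeit(lambda: list(with_slices(a)), number=5))  # quadratic scanning
print(timeit(lambda: list(with_set(a)), number=5))     # linear
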
5
  • 1
    Instead of if e in seen: continue, yield e and return, you could also just return list(seen) at the end.
    – mkrieger1
    Commented Dec 21, 2018 at 16:19
  • 2
    @mkrieger1 That would not guarantee that the items returned would be in the same order they were encountered.
    – khelwood
    Commented Dec 21, 2018 at 16:20
  • 2
  • yielding preserves order :) list(set) does not Commented Dec 21, 2018 at 16:22
  • Isn't there something like an ordered set?
    – mkrieger1
    Commented Dec 21, 2018 at 16:23
  • 1
    @mkrieger1 yeah, sure, but no built-in ones. You can always use an OrderedDict like a set, or just a plain dict in Python 3.7+ Commented Dec 21, 2018 at 16:26
25

You can adapt the popular itertools unique_everseen recipe:

def unique_everseen_limit(iterable, limit=5):
    seen = set()
    seen_add = seen.add
    for element in iterable:
        if element not in seen:
            seen_add(element)
            yield element
        if len(seen) == limit:
            break

a = [1,2,2,3,3,4,5,6]

res = list(unique_everseen_limit(a))  # [1, 2, 3, 4, 5]

Alternatively, as suggested by @Chris_Rands, you can use itertools.islice to extract a fixed number of values from an unbounded generator:

from itertools import islice

def unique_everseen(iterable):
    seen = set()
    seen_add = seen.add
    for element in iterable:
        if element not in seen:
            seen_add(element)
            yield element

res = list(islice(unique_everseen(a), 5))  # [1, 2, 3, 4, 5]

Note the unique_everseen recipe is available in third-party libraries via more_itertools.unique_everseen or toolz.unique, so you could use:

from itertools import islice
from more_itertools import unique_everseen
from toolz import unique

res = list(islice(unique_everseen(a), 5))  # [1, 2, 3, 4, 5]
res = list(islice(unique(a), 5))           # [1, 2, 3, 4, 5]
4
  • 1
    The alternative would be creating an infinite generator and then itertools.islice(gen, limit) Commented Dec 21, 2018 at 16:33
  • Why not drop line 3 in your first block of code and do seen.add(element) instead?
    – gosuto
    Commented Dec 27, 2018 at 11:04
  • @jorijnsmit It's an optimisation: one less lookup in each iteration of the for loop. You should notice the difference in very large loops. (A timing sketch follows these comments.)
    – jpp
    Commented Dec 27, 2018 at 11:28
  • This 2nd solution is the fastest one as can be seen here. Commented May 4, 2021 at 20:55
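
To check the effect of the seen_add = seen.add micro-optimisation discussed above, a quick hypothetical timeit comparison:

from timeit import timeit

# attribute lookup on every iteration
t_plain = timeit("for i in range(10**6): seen.add(i)",
                 setup="seen = set()", number=1)

# bound method captured once, one less lookup per iteration
t_bound = timeit("for i in range(10**6): seen_add(i)",
                 setup="seen = set(); seen_add = seen.add", number=1)

print(t_plain, t_bound)  # t_bound is typically slightly smaller
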
9

If your objects are hashable (ints are hashable), you can write a utility function using the fromkeys method of the collections.OrderedDict class (or, starting from Python 3.7, a plain dict, since dicts became officially ordered), like

from collections import OrderedDict


def nub(iterable):
    """Returns unique elements preserving order."""
    return OrderedDict.fromkeys(iterable).keys()

and then the implementation of iterate can be simplified to

from itertools import islice


def iterate(itr, upper=5):
    return islice(nub(itr), upper)

or, if you always want a list as output:

def iterate(itr, upper=5):
    return list(nub(itr))[:upper]

Improvements

As @Chris_Rands mentioned, this solution walks through the entire collection; we can improve on this by writing the nub utility as a generator, as others have already done:

def nub(iterable):
    seen = set()
    add_seen = seen.add
    for element in iterable:
        if element in seen:
            continue
        yield element
        add_seen(element)
1
  • 1
    I was thinking of this, definitely short, but it is O(N) Commented Dec 21, 2018 at 16:21
7

Here is a Pythonic approach using itertools.takewhile():

In [95]: from itertools import takewhile

In [96]: seen = set()

In [97]: set(takewhile(lambda x: seen.add(x) or len(seen) <= 4, a))
Out[97]: {1, 2, 3, 4}
5
  • 6
    By which definition is this abuse of the or operator considered Pythonic?
    – cdlane
    Commented Dec 21, 2018 at 16:42
  • 2
    @cdlane By the definition in which this use of or is misuse.
    – Mazdak
    Commented Dec 21, 2018 at 16:47
  • 1
    I think a proper function should be used instead of a lambda. Here seen.add does not return a boolean value, yet it is still being used for truth checking. Your implementation saves us writing a generator function, which is a welcome suggestion. But the predicate function should be more explicit (see the sketch after these comments).
    – xssChauhan
    Commented Dec 21, 2018 at 16:50
  • 3
    We've different concepts of Pythonic: To be Pythonic is to use the Python constructs and data structures with clean, readable idioms.
    – cdlane
    Commented Dec 21, 2018 at 17:08
  • 3
    I disagree this is Pythonic; seen.add or len(seen) <= 4 shouldn't be used in a function like takewhile, for the same reasons you wouldn't use it in map or filter Commented Dec 21, 2018 at 17:38
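
Following the comments, a sketch of the same idea with an explicit predicate function instead of the lambda (make_predicate is a hypothetical helper):

from itertools import takewhile

def make_predicate(limit):
    """Record each element and report whether we are still
    within the first `limit` unique values."""
    seen = set()
    def still_collecting(x):
        seen.add(x)
        return len(seen) <= limit
    return still_collecting

a = [1, 2, 2, 3, 3, 4, 5, 6]
print(set(takewhile(make_predicate(4), a)))  # {1, 2, 3, 4}
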
6

You can use an OrderedDict or, from Python 3.7, an ordinary dict, since both are implemented to preserve insertion order. Note that this won't work with sets.

N = 3
a = [1, 2, 2, 3, 3, 3, 4]
d = {x: True for x in a}
list(d.keys())[:N]
2
  • 1
    In 3.6 order-preserving dicts were an implementation detail (in the reference implementation... not sure how alternative interpreters handled it). It wasn't official until 3.7.
    – glibdud
    Commented Dec 21, 2018 at 16:32
  • I think d = dict.fromkeys(a) would be better.
    – user3064538
    Commented Jan 15, 2019 at 22:02
5

There are really amazing answers to this question - fast, compact and brilliant! The reason I am posting this code is that I believe there are plenty of cases where you don't care about losing a microsecond, nor do you want additional libraries in your code, just to solve a simple task once.

a = [1,2,2,3,3,4,5,6]
res = []
for x in a:
    if x not in res:  # yes, not optimal, but doesn't need an additional dict
        res.append(x)
        if len(res) == 5:
            break
print(res)
11
  • me likey. Straightforward, verbose, and a few lines shorter.
    – pangyuteng
    Commented Dec 21, 2018 at 16:20
  • 2
    Use set rather than list for O(1) lookup. (A set-based sketch follows these comments.)
    – jpp
    Commented Dec 21, 2018 at 16:20
  • 3
    @teng ... inefficient. Commented Dec 21, 2018 at 16:20
  • 1
    @teng similarly inefficient. Commented Dec 21, 2018 at 16:23
  • 1
    @grapes but this is time-inefficient. Also, who cares about the line numbers? Do you suffer from a lack of lines? Didn't see your response to me. Yes, I agree, this implementation would work and is at least correct. I didn't downvote, btw. Commented Dec 21, 2018 at 16:30
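
Taking jpp's comment into account, the same loop with an auxiliary set keeps it just as plain while making each membership check O(1) (a sketch, not the author's original):

a = [1, 2, 2, 3, 3, 4, 5, 6]
res = []
seen = set()               # O(1) lookups instead of scanning res
for x in a:
    if x not in seen:
        seen.add(x)
        res.append(x)      # res preserves the original order
        if len(res) == 5:
            break
print(res)  # [1, 2, 3, 4, 5]
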
5

Assuming the elements are ordered as shown, this is an opportunity to have fun with the groupby function in itertools:

from itertools import groupby, islice

def first_unique(data, upper):
    return islice((key for (key, _) in groupby(data)), 0, upper)

a = [1, 2, 2, 3, 3, 4, 5, 6]

print(list(first_unique(a, 5)))

Updated to use islice instead of enumerate per @juanpa.arrivillaga. You don't even need a set to keep track of duplicates.

3
  • You might as well use islice Commented Dec 21, 2018 at 17:39
  • So groupby retains order, nice, but is it an implementation detail or a feature?
    – kubanczyk
    Commented Dec 22, 2018 at 9:39
  • 1
    @kubanczyk, yes groupby is mostly used with sorted data, where it becomes an aggregator. If the OP's data weren't sorted, groupby wouldn't work for this problem. However, groupby can be used with unsorted data to solve some other problems. In that case it can be used to detect when data changes.
    – cdlane
    Commented Dec 22, 2018 at 22:15
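
To make the last comment concrete: groupby only collapses adjacent duplicates, so on unsorted data it detects changes rather than collecting true uniques:

from itertools import groupby

# duplicates that are not adjacent survive as separate groups
print([key for key, _ in groupby([1, 1, 2, 1, 3, 3])])
# [1, 2, 1, 3] -- the second 1 reappears, so the input must be sorted/grouped
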
4

Using set with sorted + key

sorted(set(a), key=list(a).index)[:5]
Out[136]: [1, 2, 3, 4, 5]
2
  • 2
    This is inefficient. Commented Dec 21, 2018 at 16:25
  • 4
    @xssChauhan this will return it in order, but this is inefficient O(n^2 * log n) I believe. You can do this in O(N) Commented Dec 21, 2018 at 16:29
4

Given

import itertools as it


a = [1, 2, 2, 3, 3, 4, 5, 6]

Code

A simple list comprehension (similar to @cdlane's answer).

[k for k, _ in it.groupby(a)][:5]
# [1, 2, 3, 4, 5]

Alternatively, in Python 3.6+:

list(dict.fromkeys(a))[:5]
# [1, 2, 3, 4, 5]
2

Profiling Analysis

Solutions

Which solution is the fastest? There are two clear favorite answers (comprising three solutions) that captured most of the votes.

  1. The solution by Patrick Artner - denoted as PA.
  2. The first solution by jpp - denoted as jpp1.
  3. The second solution by jpp - denoted as jpp2.

This is because they claim to run in O(N), while others here run in O(N^2) or do not guarantee the order of the returned list.

Experiment setup

For this experiment, 3 variables were considered:

  1. N elements. The number of first N elements the function is searching for.
  2. List length. The longer the list the further the algorithm has to look to find the last element.
  3. Repeat limit. How many times an element can repeat before the next element occurs in the list. Repeats are uniformly distributed between 1 and the repeat limit.

The assumptions for data generation were as follows. How strict these are depends on the algorithm used; they are more a note on how the data was generated than a limitation on the algorithms themselves.

  1. The elements never occur again after their repeated sequence first appears in the list.
  2. The elements are numeric and increasing.
  3. The elements are of type int.

So in a list like [1,1,1,2,2,3,4,...], the values 1, 2 and 3 would never appear again. The next element after 4 would be 5, but there could be a random number of 4s, up to the repeat limit, before we see 5.

A new dataset was created for each combination of variables and re-generated 20 times. The Python timeit function was used to profile each algorithm 50 times on each dataset. The mean time of the 20x50=1000 runs (for each combination) is reported here. Since the algorithms are generators, their outputs were converted to a list to get the execution time.
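
The benchmarking code itself was not posted; a minimal sketch of a harness in this spirit (generate_data and time_solution are hypothetical helpers) might look like:

import random
from timeit import timeit

def generate_data(n_unique, repeat_limit):
    """Each value 1..n_unique repeats a uniform-random number of
    times (1..repeat_limit) before the next value starts."""
    data = []
    for value in range(1, n_unique + 1):
        data.extend([value] * random.randint(1, repeat_limit))
    return data

def time_solution(func, data, n, repeats=50):
    # generators must be drained into a list to measure the real work
    return timeit(lambda: list(func(data, n)), number=repeats) / repeats

# usage, assuming e.g. get_unique_N from the accepted answer is defined:
# data = generate_data(n_unique=200, repeat_limit=100)
# print(time_solution(get_unique_N, data, n=5))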

Results

As expected, the more elements searched for, the longer it takes. This graph shows that the execution time is indeed O(N), as claimed by the authors (the straight line supports this).

Fig 1. Varying the first N elements searched for.

None of the three solutions consume computation time beyond what is required. The image below shows what happens when the list is limited in size, rather than by N elements. Lists of length 10k, with elements repeating at most 100 times (and thus repeating 50 times on average), would on average run out of unique elements by 200 (10000/50). If any of these graphs showed an increase in computation time beyond 200, that would be a cause for concern.

Fig 2. The effect of first N elements chosen > number of unique elements.

The figure below again shows that processing time increases (at a rate of O(N)) the more data the algorithm has to sift through. The rate of increase is the same as when the first N elements were varied. This is because stepping through the list is the common execution block in both cases, and it is the execution block that ultimately decides how fast the algorithm is.

Fig 3. Varying the repeat limit.

Conclusion

The 2nd solution posted by jpp is the fastest of the 3 in all cases. It is only slightly faster than the solution posted by Patrick Artner, and almost twice as fast as jpp's first solution.

2
  • This is very useful information. Would it also be possible to add a memory consumption analysis? That way a user could also make a decision considering both their constraints.
    – xssChauhan
    Commented May 5, 2021 at 8:43
  • I agree; however, in this case the information stored by all 3 of the functions is very similar. Furthermore, the dataset processed will be much larger than the information stored, so the memory used by the function is negligible in comparison. Commented May 10, 2021 at 7:12
1

Why not use something like this?

>>> a = [1, 2, 2, 3, 3, 4, 5, 6]
>>> list(set(a))[:5]
[1, 2, 3, 4, 5]
2
  • 4
    If order is not a strict requirement, then this works. Keep in mind, sets are unordered.
    – pylang
    Commented Dec 25, 2018 at 2:17
  • 2
    This is wrong as it may or may not return the first five unique elements.
    – Baum mit Augen
    Commented Jan 2, 2019 at 9:37
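
A quick illustration of the problem the comments describe; with small ints a CPython set happens to iterate in sorted order, but in general the original order is lost (the exact output varies by run):

a = ['banana', 'apple', 'cherry', 'apple', 'date']
print(list(set(a))[:3])
# e.g. ['cherry', 'date', 'banana'] -- not the first three unique elements
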
0

Example list:

a = [1, 2, 2, 3, 3, 4, 5, 6]

The function returns all unique items, or just the requested count of unique items, from the list.

1st argument - the list to work with; 2nd argument (optional) - the count of unique items (by default None, meaning that all unique elements will be returned)

def unique_elements(lst, number_of_elements=None):
    return list(dict.fromkeys(lst))[:number_of_elements]

Here is an example of how it works. The list name is "a", and we need to get 2 unique elements:

print(unique_elements(a, 2))

Output:

[1, 2]

0
a = [1, 2, 2, 3, 3, 4, 5, 6]

from collections import defaultdict

def function(lis, n):
    dic = defaultdict(int)
    sol = []                # a list, not a set, so the original order is preserved
    for i in lis:
        if not dic[i]:      # a defaultdict never raises KeyError here
            sol.append(i)
            dic[i] = 1
            if len(sol) >= n:
                break
    return sol

print(function(a, 3))

Output:

[1, 2, 3]
