47

I have a python list where elements can repeat.

>>> a = [1,2,2,3,3,4,5,6]

I want to get the first n unique elements from the list. So, in this case, if I want the first 5 unique elements, they would be:

[1,2,3,4,5]

I have come up with a solution using generators:

def iterate(itr, upper=5):
    count = 0
    for index, element in enumerate(itr):
        if index == 0:
            count += 1
            yield element
        elif element not in itr[:index] and count < upper:
            count += 1
            yield element

In use:

>>> i = iterate(a, 5)
>>> [e for e in i]
[1,2,3,4,5]

I have doubts about this being the most optimal solution. Is there an alternative strategy that I can implement to write it in a more Pythonic and efficient way?

9
  • 5
    Try: set(a)[:n] Commented Dec 21, 2018 at 16:11
  • 15
    @TonyPellerin That does not guarantee you get the first 5 elements Commented Dec 21, 2018 at 16:14
  • 2
    Your code is Pythonic enough; it is just inefficient. element not in itr[:index] is not efficient - use a set Commented Dec 21, 2018 at 16:16
  • 3
    Is the list always sorted? Commented Dec 21, 2018 at 16:16
  • 5
    for the future: if your code works and you need to improve it, it is better to post it on codereview.stackexchange.com Commented Dec 21, 2018 at 16:49

13 Answers

52

I would use a set to remember what was seen and return from the generator when you have seen enough:

a = [1, 2, 2, 3, 3, 4, 5, 6]

def get_unique_N(iterable, N):
    """Yields (in order) the first N unique elements of iterable.
    Might yield fewer if the data runs out first."""
    seen = set()
    for e in iterable:
        if e in seen:
            continue
        seen.add(e)
        yield e
        if len(seen) == N:
            return

k = get_unique_N([1, 2, 2, 3, 3, 4, 5, 6], 4)
print(list(k))

Output:

[1, 2, 3, 4]

According to PEP 479 you should return from generators rather than raise StopIteration - thanks to @khelwood & @iBug for pointing that out - you never stop learning.

With 3.6 you get a DeprecationWarning; with 3.7 it raises a RuntimeError - see the PEP's Transition Plan if you are still using raise StopIteration.
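
For illustration, a minimal sketch of the difference (hypothetical function names; the RuntimeError behaviour is CPython 3.7+):

def bad_gen():
    # pre-PEP 479 style: ending a generator by raising StopIteration
    yield 1
    raise StopIteration  # DeprecationWarning in 3.6, RuntimeError in 3.7+

def good_gen():
    # PEP 479 style: simply return from the generator
    yield 1
    return

print(list(good_gen()))  # [1]
# list(bad_gen()) raises "RuntimeError: generator raised StopIteration" on 3.7+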


Your solution using elif element not in itr[:index] and count<upper: uses O(k) lookups - with k being the length of the slice - while using a set reduces this to O(1) lookups at the cost of extra memory, because the set has to be kept as well. It is a speed vs. memory tradeoff - which is better is application/data dependent.

Consider [1, 2, 3, 4, 4, 4, 4, 5] vs [1] * 1000 + [2] * 1000 + [3] * 1000 + [4] * 1000 + [5] * 1000 + [6]:

For 6 unique values (in the longer list):

  • your solution would need lookups of O(1)+O(2)+...+O(5001)
  • mine needs 5001*O(1) lookups plus the memory for set({1, 2, 3, 4, 5, 6})
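
If you want to measure the tradeoff on your own data, a rough timeit sketch along these lines can compare the two strategies (with_slices and with_set are stand-in names for the question's and this answer's approaches):

from timeit import timeit

a = [1] * 1000 + [2] * 1000 + [3] * 1000 + [4] * 1000 + [5] * 1000 + [6]

def with_slices(itr, upper=6):
    # the question's strategy: scan the prefix slice for every element
    count = 0
    for index, element in enumerate(itr):
        if element not in itr[:index] and count < upper:
            count += 1
            yield element

def with_set(itr, upper=6):
    # this answer's strategy: O(1) membership checks against a set
    seen = set()
    for element in itr:
        if element not in seen:
            seen.add(element)
            yield element
            if len(seen) == upper:
                return

print(timeit(lambda: list(with_slices(a)), number=5))  # quadratic scanning
print(timeit(lambda: list(with_set(a)), number=5))     # linear
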
5
  • 1
    Instead of if e in seen: continue, yield e and return, you could also just return list(seen) at the end.
    – mkrieger1
    Commented Dec 21, 2018 at 16:19
  • 2
    @mkrieger1 That would not guarantee that the items returned would be in the same order they were encountered.
    – khelwood
    Commented Dec 21, 2018 at 16:20
  • 2
  • yielding preserves order :) list(set) does not Commented Dec 21, 2018 at 16:22
  • Isn't there something like an ordered set?
    – mkrieger1
    Commented Dec 21, 2018 at 16:23
  • 1
    @mkrieger1 yeah, sure, but no built-in ones. You can always use an OrderedDict like a set, or just a plain dict in Python 3.7+ Commented Dec 21, 2018 at 16:26
25

You can adapt the popular itertools unique_everseen recipe:

def unique_everseen_limit(iterable, limit=5):
    seen = set()
    seen_add = seen.add
    for element in iterable:
        if element not in seen:
            seen_add(element)
            yield element
        if len(seen) == limit:
            break

a = [1,2,2,3,3,4,5,6]

res = list(unique_everseen_limit(a))  # [1, 2, 3, 4, 5]

Alternatively, as suggested by @Chris_Rands, you can use itertools.islice to extract a fixed number of values from an unbounded generator:

from itertools import islice

def unique_everseen(iterable):
    seen = set()
    seen_add = seen.add
    for element in iterable:
        if element not in seen:
            seen_add(element)
            yield element

res = list(islice(unique_everseen(a), 5))  # [1, 2, 3, 4, 5]

Note the unique_everseen recipe is available in third-party libraries via more_itertools.unique_everseen or toolz.unique, so you could use:

from itertools import islice
from more_itertools import unique_everseen
from toolz import unique

res = list(islice(unique_everseen(a), 5))  # [1, 2, 3, 4, 5]
res = list(islice(unique(a), 5))           # [1, 2, 3, 4, 5]
4
  • 1
    The alternative would be creating an infinite generator and then itertools.islice(gen, limit) Commented Dec 21, 2018 at 16:33
  • Why not drop line 3 in your first block of code and do seen.add(element) instead?
    – gosuto
    Commented Dec 27, 2018 at 11:04
  • @jorijnsmit It's an optimisation: one less lookup in each iteration of the for loop. You should notice the difference in very large loops. (A timing sketch follows these comments.)
    – jpp
    Commented Dec 27, 2018 at 11:28
  • This 2nd solution is the fastest one as can be seen here. Commented May 4, 2021 at 20:55
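
To check the effect of the seen_add = seen.add micro-optimisation discussed above, a quick hypothetical timeit comparison:

from timeit import timeit

# attribute lookup on every iteration
t_plain = timeit("for i in range(10**6): seen.add(i)",
                 setup="seen = set()", number=1)

# bound method captured once, one less lookup per iteration
t_bound = timeit("for i in range(10**6): seen_add(i)",
                 setup="seen = set(); seen_add = seen.add", number=1)

print(t_plain, t_bound)  # t_bound is typically slightly smaller
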
9

If your objects are hashable (ints are hashable), you can write a utility function using the fromkeys method of the collections.OrderedDict class (or, starting from Python 3.7, a plain dict, since dicts became officially ordered), like

from collections import OrderedDict


def nub(iterable):
    """Returns unique elements preserving order."""
    return OrderedDict.fromkeys(iterable).keys()

and then the implementation of iterate can be simplified to

from itertools import islice


def iterate(itr, upper=5):
    return islice(nub(itr), upper)

or, if you always want a list as output:

def iterate(itr, upper=5):
    return list(nub(itr))[:upper]

Improvements

As @Chris_Rands mentioned, this solution walks through the entire collection; we can improve on this by writing the nub utility as a generator, as others have already done:

def nub(iterable):
    seen = set()
    add_seen = seen.add
    for element in iterable:
        if element in seen:
            continue
        yield element
        add_seen(element)
1
  • 1
    I was thinking of this, definitely short, but it is O(N) Commented Dec 21, 2018 at 16:21
7

Here is a Pythonic approach using itertools.takewhile():

In [95]: from itertools import takewhile

In [96]: seen = set()

In [97]: set(takewhile(lambda x: seen.add(x) or len(seen) <= 4, a))
Out[97]: {1, 2, 3, 4}
5
  • 6
    By which definition is this abuse of the or operator considered Pythonic?
    – cdlane
    Commented Dec 21, 2018 at 16:42
  • 2
    @cdlane By the definition in which this use of or is misuse.
    – Mazdak
    Commented Dec 21, 2018 at 16:47
  • 1
    I think a proper function should be used instead of a lambda. Here seen.add does not return a boolean value, yet it is still being used for truth checking. Your implementation saves us writing a generator function, which is a welcome suggestion. But the predicate function should be more explicit (see the sketch after these comments).
    – xssChauhan
    Commented Dec 21, 2018 at 16:50
  • 3
    We've different concepts of Pythonic: To be Pythonic is to use the Python constructs and data structures with clean, readable idioms.
    – cdlane
    Commented Dec 21, 2018 at 17:08
  • 3
    I disagree this is Pythonic; seen.add or len(seen) <= 4 shouldn't be used in a function like takewhile, for the same reasons you wouldn't use it in map or filter Commented Dec 21, 2018 at 17:38
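
Following the comments, a sketch of the same idea with an explicit predicate function instead of the lambda (make_predicate is a hypothetical helper):

from itertools import takewhile

def make_predicate(limit):
    """Record each element and report whether we are still
    within the first `limit` unique values."""
    seen = set()
    def still_collecting(x):
        seen.add(x)
        return len(seen) <= limit
    return still_collecting

a = [1, 2, 2, 3, 3, 4, 5, 6]
print(set(takewhile(make_predicate(4), a)))  # {1, 2, 3, 4}
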
6

You can use an OrderedDict or, from Python 3.7, an ordinary dict, since both are implemented to preserve insertion order. Note that this won't work with sets.

N = 3
a = [1, 2, 2, 3, 3, 3, 4]
d = {x: True for x in a}
list(d.keys())[:N]
2
  • 1
    In 3.6 order-preserving dicts were an implementation detail (in the reference implementation... not sure how alternative interpreters handled it). It wasn't official until 3.7.
    – glibdud
    Commented Dec 21, 2018 at 16:32
  • I think d = dict.fromkeys(a) would be better.
    – user3064538
    Commented Jan 15, 2019 at 22:02
5

There are really amazing answers to this question - fast, compact and brilliant! The reason I am posting this code is that I believe there are plenty of cases where you don't care about losing a microsecond, nor do you want additional libraries in your code, just to solve a simple task once.

a = [1,2,2,3,3,4,5,6]
res = []
for x in a:
    if x not in res:  # yes, not optimal, but doesn't need an additional dict
        res.append(x)
        if len(res) == 5:
            break
print(res)
11
  • me likey. Straightforward, verbose, and a few lines shorter.
    – pangyuteng
    Commented Dec 21, 2018 at 16:20
  • 2
    Use set rather than list for O(1) lookup. (A set-based sketch follows these comments.)
    – jpp
    Commented Dec 21, 2018 at 16:20
  • 3
    @teng ... inefficient. Commented Dec 21, 2018 at 16:20
  • 1
    @teng similarly inefficient. Commented Dec 21, 2018 at 16:23
  • 1
    @grapes but this is time-inefficient. Also, who cares about the line numbers? Do you suffer from a lack of lines? Didn't see your response to me. Yes, I agree, this implementation would work and is at least correct. I didn't downvote, btw. Commented Dec 21, 2018 at 16:30
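
Taking jpp's comment into account, the same loop with an auxiliary set keeps it just as plain while making each membership check O(1) (a sketch, not the author's original):

a = [1, 2, 2, 3, 3, 4, 5, 6]
res = []
seen = set()               # O(1) lookups instead of scanning res
for x in a:
    if x not in seen:
        seen.add(x)
        res.append(x)      # res preserves the original order
        if len(res) == 5:
            break
print(res)  # [1, 2, 3, 4, 5]
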
5

Assuming the elements are ordered as shown, this is an opportunity to have fun with the groupby function in itertools:

from itertools import groupby, islice

def first_unique(data, upper):
    return islice((key for (key, _) in groupby(data)), 0, upper)

a = [1, 2, 2, 3, 3, 4, 5, 6]

print(list(first_unique(a, 5)))

Updated to use islice instead of enumerate per @juanpa.arrivillaga. You don't even need a set to keep track of duplicates.

3
  • You might as well use islice Commented Dec 21, 2018 at 17:39
  • So groupby retains order, nice, but is it an implementation detail or a feature?
    – kubanczyk
    Commented Dec 22, 2018 at 9:39
  • 1
    @kubanczyk, yes groupby is mostly used with sorted data, where it becomes an aggregator. If the OP's data weren't sorted, groupby wouldn't work for this problem. However, groupby can be used with unsorted data to solve some other problems. In that case it can be used to detect when data changes.
    – cdlane
    Commented Dec 22, 2018 at 22:15
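
To make the last comment concrete: groupby only collapses adjacent duplicates, so on unsorted data it detects changes rather than collecting true uniques:

from itertools import groupby

# duplicates that are not adjacent survive as separate groups
print([key for key, _ in groupby([1, 1, 2, 1, 3, 3])])
# [1, 2, 1, 3] -- the second 1 reappears, so the input must be sorted/grouped
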
4

Using set with sorted + key

sorted(set(a), key=list(a).index)[:5]
Out[136]: [1, 2, 3, 4, 5]
2
  • 2
    This is inefficient. Commented Dec 21, 2018 at 16:25
  • 4
    @xssChauhan this will return it in order, but this is inefficient O(n^2 * log n) I believe. You can do this in O(N) Commented Dec 21, 2018 at 16:29
4

Given

import itertools as it


a = [1, 2, 2, 3, 3, 4, 5, 6]

Code

A simple list comprehension (similar to @cdlane's answer).

[k for k, _ in it.groupby(a)][:5]
# [1, 2, 3, 4, 5]

Alternatively, in Python 3.6+:

list(dict.fromkeys(a))[:5]
# [1, 2, 3, 4, 5]
2

Profiling Analysis

Solutions

Which solution is the fastest? There are two clear favorite answers (comprising three solutions) that captured most of the votes.

  1. The solution by Patrick Artner - denoted as PA.
  2. The first solution by jpp - denoted as jpp1.
  3. The second solution by jpp - denoted as jpp2.

This is because they claim to run in O(N), while others here run in O(N^2) or do not guarantee the order of the returned list.

Experiment setup

For this experiment, 3 variables were considered:

  1. N elements. The number of first N elements the function is searching for.
  2. List length. The longer the list the further the algorithm has to look to find the last element.
  3. Repeat limit. How many times an element can repeat before the next element occurs in the list. Repeats are uniformly distributed between 1 and the repeat limit.

The assumptions for data generation were as follows. How strict these are depends on the algorithm used; they are more a note on how the data was generated than a limitation on the algorithms themselves.

  1. The elements never occur again after their repeated sequence first appears in the list.
  2. The elements are numeric and increasing.
  3. The elements are of type int.

So in a list like [1,1,1,2,2,3,4,...], the values 1, 2 and 3 would never appear again. The next element after 4 would be 5, but there could be a random number of 4s, up to the repeat limit, before we see 5.

A new dataset was created for each combination of variables and re-generated 20 times. The Python timeit function was used to profile each algorithm 50 times on each dataset. The mean time of the 20x50=1000 runs (for each combination) is reported here. Since the algorithms are generators, their outputs were converted to a list to get the execution time.
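
The benchmarking code itself was not posted; a minimal sketch of a harness in this spirit (generate_data and time_solution are hypothetical helpers) might look like:

import random
from timeit import timeit

def generate_data(n_unique, repeat_limit):
    """Each value 1..n_unique repeats a uniform-random number of
    times (1..repeat_limit) before the next value starts."""
    data = []
    for value in range(1, n_unique + 1):
        data.extend([value] * random.randint(1, repeat_limit))
    return data

def time_solution(func, data, n, repeats=50):
    # generators must be drained into a list to measure the real work
    return timeit(lambda: list(func(data, n)), number=repeats) / repeats

# usage, assuming e.g. get_unique_N from the accepted answer is defined:
# data = generate_data(n_unique=200, repeat_limit=100)
# print(time_solution(get_unique_N, data, n=5))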

Results

As expected, the more elements searched for, the longer it takes. This graph shows that the execution time is indeed O(N), as claimed by the authors (the straight line supports this).

Fig 1. Varying the first N elements searched for.

None of the three solutions consume computation time beyond what is required. The image below shows what happens when the list is limited in size, rather than by N elements. Lists of length 10k, with elements repeating at most 100 times (and thus repeating 50 times on average), would on average run out of unique elements by 200 (10000/50). If any of these graphs showed an increase in computation time beyond 200, that would be a cause for concern.

Fig 2. The effect of first N elements chosen > number of unique elements.

The figure below again shows that processing time increases (at a rate of O(N)) the more data the algorithm has to sift through. The rate of increase is the same as when the first N elements were varied. This is because stepping through the list is the common execution block in both cases, and it is the execution block that ultimately decides how fast the algorithm is.

Fig 3. Varying the repeat limit.

Conclusion

The 2nd solution posted by jpp is the fastest of the 3 in all cases. It is only slightly faster than the solution posted by Patrick Artner, and almost twice as fast as jpp's first solution.

2
  • This is very useful information. Would it also be possible to add a memory consumption analysis? That way a user could also make a decision considering both their constraints.
    – xssChauhan
    Commented May 5, 2021 at 8:43
  • I agree; however, in this case the information stored by all 3 of the functions is very similar. Furthermore, the dataset processed will be much larger than the information stored, so the memory used by the function is negligible in comparison. Commented May 10, 2021 at 7:12
1

Why not use something like this?

>>> a = [1, 2, 2, 3, 3, 4, 5, 6]
>>> list(set(a))[:5]
[1, 2, 3, 4, 5]
2
  • 4
    If order is not a strict requirement, then this works. Keep in mind, sets are unordered.
    – pylang
    Commented Dec 25, 2018 at 2:17
  • 2
    This is wrong as it may or may not return the first five unique elements.
    – Baum mit Augen
    Commented Jan 2, 2019 at 9:37
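
A quick illustration of the problem the comments describe; with small ints a CPython set happens to iterate in sorted order, but in general the original order is lost (the exact output varies by run):

a = ['banana', 'apple', 'cherry', 'apple', 'date']
print(list(set(a))[:3])
# e.g. ['cherry', 'date', 'banana'] -- not the first three unique elements
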
0

Example list:

a = [1, 2, 2, 3, 3, 4, 5, 6]

The function returns all unique items, or just the requested count of unique items, from the list.

1st argument - the list to work with; 2nd argument (optional) - the count of unique items (by default None, meaning that all unique elements will be returned)

def unique_elements(lst, number_of_elements=None):
    return list(dict.fromkeys(lst))[:number_of_elements]

Here is an example of how it works. The list name is "a", and we need to get 2 unique elements:

print(unique_elements(a, 2))

Output:

[1, 2]

0
a = [1, 2, 2, 3, 3, 4, 5, 6]

from collections import defaultdict

def function(lis, n):
    dic = defaultdict(int)
    sol = []                # a list, not a set, so the original order is preserved
    for i in lis:
        if not dic[i]:      # a defaultdict never raises KeyError here
            sol.append(i)
            dic[i] = 1
            if len(sol) >= n:
                break
    return sol

print(function(a, 3))

Output:

[1, 2, 3]
