109

I want to remove duplicates from list, without changing order of unique elements in the list.

Jon Skeet & others have suggested to use the following:

list = list.Distinct().ToList();

Reference:

Is it guaranteed that the order of unique elements would be same as before? If yes, please give a reference that confirms this as I couldn't find anything on it in documentation.

5
  • 7
    @ColonelPanic - official documentation here msdn.microsoft.com/en-us/library/bb348436(v=vs.110).aspx states explicitly "Distinct() method returns an unordered sequence that contains no duplicate values".
    – Evk
    Commented Nov 22, 2017 at 20:51
  • @Evk 'Unordered sequence' is not same as 'original ordering of sequence'.
    – Nitesh
    Commented Nov 23, 2017 at 11:19
  • 4
    I consider "unoredered" to mean "in no particular order", which also implies "not necessary in original order of sequence".
    – Evk
    Commented Nov 26, 2017 at 9:55
  • I just had a problem regarding distinct with oracle12 Entity Framework 6. In my case I had orderby before disinct in my linq clause and the order was gone. select().OrderBy().Distinct().ToList() did not work while select().OrderBy().Distinct().ToList() worked.
    – Karl
    Commented Apr 5, 2018 at 6:36
  • 5
    @Karl, these expressions are the same. :)
    – pvgoran
    Commented Sep 24, 2018 at 4:38

7 Answers 7

87

It's not guaranteed, but it's the most obvious implementation. It would be hard to implement in a streaming manner (i.e. such that it returned results as soon as it could, having read as little as it could) without returning them in order.

You might want to read my blog post on the Edulinq implementation of Distinct().

Note that even if this were guaranteed for LINQ to Objects (which personally I think it should be) that wouldn't mean anything for other LINQ providers such as LINQ to SQL.

The level of guarantees provided within LINQ to Objects is a little inconsistent sometimes, IMO. Some optimizations are documented, others not. Heck, some of the documentation is flat out wrong.

7
  • I am accepting it because 1) It clearly answers my concern whether its guaranteed or not 2) The linked post delves deeper into undocumented aspects of Distinct 3) The linked post also has a sample implementation that can be used as reference to implement a Distinct on Lists with that guarantee.
    – Nitesh
    Commented Jan 19, 2011 at 13:28
  • I note that if an input collection is already known to be sorted then .Distinct() could operate in a streaming manner (by simply skipping the current value if it's the same as the last value). This approach could also work for any finite, in-memory IReadOnlyCollection<T> by iterating through it once to check if it is already sorted. don't believe that .Distinct() checks for IOrderedEnumerable<T>, grrr. Despite Linq having been around for 13 years now, I'm surprised it isn't smarter about applying these kinds of optimizations.
    – Dai
    Commented Mar 5, 2021 at 10:57
  • It would be nice if List<T> had a bool IsSorted property which is true after List<T>.Sort() runs and turns false if any reordering happens - that way .Distinct() could be optimized for that. Oh well....
    – Dai
    Commented Mar 5, 2021 at 10:59
  • 1
    @Dai: That sounds generally infeasible - just because something was sorted doesn't mean that the data inside the relevant objects is the same as it was at the time of sorting.
    – Jon Skeet
    Commented Mar 5, 2021 at 11:02
  • @JonSkeet Sorry, I've been working with immutable types a lot lately, I forgot about how mutability of data means we can't have nice things. Though it could still work for record-types in C# 9.0.
    – Dai
    Commented Mar 5, 2021 at 11:04
32

In the .NET Framework 3.5, disassembling the CIL of the Linq-to-Objects implementation of Distinct() shows that the order of elements is preserved - however this is not documented behavior.

I did a little investigation with Reflector. After disassembling System.Core.dll, Version=3.5.0.0 you can see that Distinct() is an extension method, which looks like this:

public static class Emunmerable
{
    public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source)
    {
        if (source == null)
            throw new ArgumentNullException("source");

        return DistinctIterator<TSource>(source, null);
    }
}

So, interesting here is DistinctIterator, which implements IEnumerable and IEnumerator. Here is simplified (goto and lables removed) implementation of this IEnumerator:

private sealed class DistinctIterator<TSource> : IEnumerable<TSource>, IEnumerable, IEnumerator<TSource>, IEnumerator, IDisposable
{
    private bool _enumeratingStarted;
    private IEnumerator<TSource> _sourceListEnumerator;
    public IEnumerable<TSource> _source;
    private HashSet<TSource> _hashSet;    
    private TSource _current;

    private bool MoveNext()
    {
        if (!_enumeratingStarted)
        {
            _sourceListEnumerator = _source.GetEnumerator();
            _hashSet = new HashSet<TSource>();
            _enumeratingStarted = true;
        }

        while(_sourceListEnumerator.MoveNext())
        {
            TSource element = _sourceListEnumerator.Current;

             if (!_hashSet.Add(element))
                 continue;

             _current = element;
             return true;
        }

        return false;
    }

    void IEnumerator.Reset()
    {
        throw new NotSupportedException();
    }

    TSource IEnumerator<TSource>.Current
    {
        get { return _current; }
    }

    object IEnumerator.Current
    {        
        get { return _current; }
    }
}

As you can see - enumerating goes in order provided by source enumerable (list, on which we are calling Distinct). Hashset is used only for determining whether we already returned such element or not. If not, we are returning it, else - continue enumerating on source.

So, it is guaranteed, that Distinct() will return elements exactly in same order, which are provided by collection to which Distinct was applied.

10
  • 9
    Is it a well documented behavior? Commented Jan 19, 2011 at 11:49
  • 5
    The linked answer contains a reference to documentation that says: "The result sequence is unordered."
    – mgronber
    Commented Jan 19, 2011 at 12:22
  • 6
    @lazyberezovsky: The question asks about guarantees, not common implementation. (As I already said, I'd be surprised if the implementation ever does change across platforms/versions, but that doesn't amount to a guarantee.)
    – LukeH
    Commented Jan 19, 2011 at 13:36
  • 5
    @lazyberezovsky: I am from C\C++ where a lot of things are undefined and its very common to ask for if something is guaranteed. Also I am using Distinct() in a Silverlight application, which is on both Mac and Windows thats why we cannot settle on 'common implementation' it must be guaranteed.
    – Nitesh
    Commented Jan 19, 2011 at 13:55
  • 45
    @lazyberezovsky: When people talk about guarantees, they normally mean documented behaviour which is reasonable to rely upon. For example, the docs for GroupBy do specify behaviour, but the docs for Distinct don't.
    – Jon Skeet
    Commented Jan 19, 2011 at 13:57
16

According to the documentation the sequence is unordered.

3
  • 3
    Additional info to find it: In the link, refer to the "Remarks" section. "The result sequence is unordered." Commented May 6, 2014 at 23:46
  • Link refers to Enumerable.Distinct rather than the .Distinct extension method Commented Nov 19, 2021 at 1:07
  • 1
    Those are the extension methods as can be seen on the method signatures. The Enumerable class provides a set of static methods for querying objects that implement IEnumerable<T>. :-)
    – mgronber
    Commented Dec 8, 2021 at 14:15
7

Yes, Enumerable.Distinct preserves order. Assuming the method to be lazy "yields distinct values are soon as they are seen", it follows automatically. Think about it.

The .NET Reference source confirms. It returns a subsequence, the first element in each equivalence class.

foreach (TSource element in source)
    if (set.Add(element)) yield return element;

The .NET Core implementation is similar.

Frustratingly, the documentation for Enumerable.Distinct is confused on this point:

The result sequence is unordered.

I can only imagine they mean "the result sequence is not sorted." You could implement Distinct by presorting then comparing each element to the previous, but this would not be lazy as defined above.

4
  • 12
    The source isn't the specification. What you found is a coincidence and could be invalid after the next update. Commented Nov 24, 2017 at 13:08
  • 1
    @HenkHolterman In general, I'd agree, implementations can change. For example, .NET 4.5 changed the sorting algorithm behind Array.Sort. In this particular case however, any sensible implementation of Enumerable.Distinct will surely be lazy ("yields distinct values are soon as they are seen"), and the order-preserving property follows from that. Lazy evaluation is a core tenet of LINQ to Objects; to rescind it would be unthinkable. Commented Nov 30, 2017 at 12:17
  • 1
    I've seen implementations using .net 4.6 where calling dbQuery.OrderBy(...).Distinct().ToList() does not return a list in the order specified by the order by predicate - removing the Distinct (which happened to be redundant) fixed the bug in my case Commented Jun 20, 2018 at 16:28
  • 1
    @RowlandShawMight Queryable is different from Enumerable. You should check the query that is generated.
    – Wouter
    Commented Jul 1, 2021 at 10:30
5

A bit late to the party, but no one really posted the best complete code to accomplish this IMO, so let me offer this (which is essentially identical to what .NET Framework does with Distinct())*:

    public static IEnumerable<T> DistinctOrdered<T>(this IEnumerable<T> items)
    {
        HashSet<T> returnedItems = new HashSet<T>();
        foreach (var item in items)
        {
            if (returnedItems.Add(item))
                yield return item;
        }                       
    }

This guarantees the original order without reliance on undocumented or assumed behavior. I also believe this is more efficient than using multiple LINQ methods though I'm open to being corrected here.

(*) The .NET Framework source uses an internal Set class, which appears to be substantively identical to HashSet.

1
  • 3
    It doesn't matter because the order will be dictated by the enumerable being provided. The HashSet just tells us whether the item has been returned yet and if so skips it. Commented Nov 20, 2021 at 11:31
1

By default when use Distinct linq operator uses Equals method but you can use your own IEqualityComparer<T> object to specify when two objects are equals with a custom logic implementing GetHashCode and Equals method. Remember that:

GetHashCode should not used heavy cpu comparision ( eg. use only some obvious basic checks ) and its used as first to state if two object are surely different ( if different hash code are returned ) or potentially the same ( same hash code ). In this latest case when two object have the same hashcode the framework will step to check using the Equals method as a final decision about equality of given objects.

After you have MyType and a MyTypeEqualityComparer classes follow code not ensure the sequence maintain its order:

var cmp = new MyTypeEqualityComparer();
var lst = new List<MyType>();
// add some to lst
var q = lst.Distinct(cmp);

In follow sci library I implemented an extension method to ensure Vector3D set maintain the order when use a specific extension method DistinctKeepOrder:

relevant code follows:

/// <summary>
/// support class for DistinctKeepOrder extension
/// </summary>
public class Vector3DWithOrder
{
    public int Order { get; private set; }
    public Vector3D Vector { get; private set; }
    public Vector3DWithOrder(Vector3D v, int order)
    {
        Vector = v;
        Order = order;
    }
}

public class Vector3DWithOrderEqualityComparer : IEqualityComparer<Vector3DWithOrder>
{
    Vector3DEqualityComparer cmp;

    public Vector3DWithOrderEqualityComparer(Vector3DEqualityComparer _cmp)
    {
        cmp = _cmp;
    }

    public bool Equals(Vector3DWithOrder x, Vector3DWithOrder y)
    {
        return cmp.Equals(x.Vector, y.Vector);
    }

    public int GetHashCode(Vector3DWithOrder obj)
    {
        return cmp.GetHashCode(obj.Vector);
    }
}

In short Vector3DWithOrder encapsulate the type and an order integer, while Vector3DWithOrderEqualityComparer encapsulates original type comparer.

and this is the method helper to ensure order maintained

/// <summary>
/// retrieve distinct of given vector set ensuring to maintain given order
/// </summary>        
public static IEnumerable<Vector3D> DistinctKeepOrder(this IEnumerable<Vector3D> vectors, Vector3DEqualityComparer cmp)
{
    var ocmp = new Vector3DWithOrderEqualityComparer(cmp);

    return vectors
        .Select((w, i) => new Vector3DWithOrder(w, i))
        .Distinct(ocmp)
        .OrderBy(w => w.Order)
        .Select(w => w.Vector);
}

Note : further research could allow to find a more general ( uses of interfaces ) and optimized way ( without encapsulate the object ).

1

This highly depends on your linq-provider. On Linq2Objects you can stay on the internal source-code for Distinct, which makes one assume the original order is preserved.

However for other providers that resolve to some kind of SQL for example, that isn´t neccessarily the case, as an ORDER BY-statement usually comes after any aggregation (such as Distinct). So if your code is this:

myArray.OrderBy(x => anothercol).GroupBy(x => y.mycol);

this is translated to something similar to the following in SQL:

SELECT * FROM mytable GROUP BY mycol ORDER BY anothercol;

This obviously first groups your data and sorts it afterwards. Now you´re stuck on the DBMS own logic of how to execute that. On some DBMS this isn´t even allowed. Imagine the following data:

mycol anothercol
1     2
1     1
1     3
2     1
2     3

when executing myArr.OrderBy(x => x.anothercol).GroupBy(x => x.mycol) we assume the following result:

mycol anothercol
1     1
2     1

But the DBMS may aggregate the anothercol-column so, that allways the value of the first row is used, resulting in the following data:

mycol anothercol
1    2
2    1

which after ordering will result in this:

mycol anothercol
2    1
1    2

This is similar to the following:

SELECT mycol, First(anothercol) from mytable group by mycol order by anothercol;

which is the completely reverse order than what you expected.

You see the execution-plan may vary depending on what the underlying provider is. This is why there´s no guarantee about that in the docs.

Not the answer you're looking for? Browse other questions tagged or ask your own question.