public class TestObject
{
    public string TestValue { get; set; }
    public bool IsDuplicate { get; set; }
}

List<TestObject> testList = new List<TestObject>
{
    new TestObject { TestValue = "Matt" },
    new TestObject { TestValue = "Bob" },
    new TestObject { TestValue = "Alice" },
    new TestObject { TestValue = "Matt" },
    new TestObject { TestValue = "Claire" },
    new TestObject { TestValue = "Matt" }
};

Imagine testList is actually millions of objects long.

What's the fastest way to ensure that two of those three TestObjects with a TestValue of "Matt" get their IsDuplicate set to true? No matter how many instances of a given value there are, only one should come out of the process with IsDuplicate set to false.

I am not averse to doing this via threading. And the collection doesn't have to be a list if converting it to another collection type is faster.

I need to keep duplicates and mark them as such, not remove them from the collection.

To expand, this is (as you might imagine) a simple expression of a much more complex problem. The objects in question already have an ordinal which I can use to order them.

After matching initial duplicates on exact string equality, I'm going to have to go back through the collection again and re-try the remainder using some fuzzy matching logic. The collection that exists at the start of this process won't be changed during the deduplication, or afterwards.

Eventually the original collection is going to be written out to a file, with likely duplicates flagged.
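For the second, fuzzy pass mentioned above, one possible shape is to compare only the items that survived the exact-match pass. This is just a sketch: FuzzyMatch is a hypothetical predicate standing in for whatever similarity logic applies, and the nested loop is O(n²), so for millions of survivors it would need blocking or indexing to be practical.

```csharp
// Second pass: compare only items that survived the exact-match pass.
// FuzzyMatch is a placeholder for the real similarity logic.
var survivors = testList.Where(x => !x.IsDuplicate).ToList();

for (int i = 0; i < survivors.Count; i++)
{
    if (survivors[i].IsDuplicate) continue; // already claimed in this pass

    for (int j = i + 1; j < survivors.Count; j++)
    {
        if (!survivors[j].IsDuplicate &&
            FuzzyMatch(survivors[i].TestValue, survivors[j].TestValue))
        {
            survivors[j].IsDuplicate = true;
        }
    }
}
```

Because the ordinal is preserved on each object, the original collection can still be written out in its original order afterwards.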

  • I'm not sure if it's the case, but if you just need distinct TestObject entities, then use a HashSet. It will serve you best, as it's made to contain only unique instances of a specific type. Commented May 26, 2016 at 14:34
  • I was thinking the same, @Anatolyevich; however, it doesn't allow the collection to contain the duplicates and mark them. I'm assuming that's what the OP wanted.
    – Draken
    Commented May 26, 2016 at 14:35
  • 2
    @Nasreddine hastily scribbled pseudocode :) And yes, I need to keep duplicates and mark them.
    – Bob Tway
    Commented May 26, 2016 at 14:36
  • 1
    What is the meaning of the duplicates? Does that mean that you want to preserve order and that order is important for further processing of the collection? What happens with the collection after you mark the duplicates? How are you going to work with those duplicates? Have you considered having a separate HashSet just for the duplicate checking, e.g. when you add a new item, you check if it's already in the HashSet, and if it is, you mark it as duplicate immediately?
    – Luaan
    Commented May 26, 2016 at 14:39
  • 1
    What happens if there's a 3rd Matt in the list?
    – dotNET
    Commented May 26, 2016 at 14:41

5 Answers


As others mentioned, the correct approach here would be to use the HashSet class.

var hashSet = new HashSet<string>();

foreach (var obj in testList)
{
    if (!hashSet.Add(obj.TestValue))
    {
        obj.IsDuplicate = true;
    }
}

When you add a value for the first time, it is added successfully and HashSet.Add() returns true, so you make no change to the item. When you try to add the same value a second time, HashSet.Add() returns false and you mark the item as a duplicate.

The list will be in the following state after the marking method has run:

Matt
Bob
Alice
Matt DUPLICATE
Claire
Matt DUPLICATE
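If values that differ only in casing should also count as duplicates (an assumption; the question only shows exact matches), HashSet&lt;string&gt; accepts an IEqualityComparer&lt;string&gt; in its constructor:

```csharp
// Treat "Matt" and "MATT" as the same value by passing a comparer
// to the HashSet constructor; drop the comparer for exact matching.
var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

foreach (var obj in testList)
{
    // Add returns false when the value is already present.
    obj.IsDuplicate = !seen.Add(obj.TestValue);
}
```

The same single-pass structure works for any comparer, so a custom IEqualityComparer could also host the simpler parts of the fuzzy-matching logic.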

This is probably quite performant:

foreach (var dupe in testList.GroupBy(x => x.TestValue).SelectMany(g => g.Skip(1)))
    dupe.IsDuplicate = true;

[EDIT] This method turns out to be about a third of the speed of the accepted answer above, so that one should be used. This answer is merely of academic interest.
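Since the question mentions threading: the grouping approach can be parallelised with PLINQ. A sketch, with the caveat that each element is written by exactly one task (so no locking is needed), and that without AsOrdered() the "first" element of each group is not guaranteed to be the one with the lowest ordinal — which is fine here, since any one instance per value may remain unmarked:

```csharp
testList.AsParallel()
        .GroupBy(x => x.TestValue)
        .ForAll(g =>
        {
            // Mark every element after the first in each group.
            foreach (var dupe in g.Skip(1))
                dupe.IsDuplicate = true;
        });
```

For a single in-memory pass, hashing is usually cheap enough that the accepted answer wins anyway; parallelism is more likely to pay off once the fuzzy-matching pass is added.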


I would probably check for duplicates while building the collection, to avoid looping twice over millions of elements. If that scenario is possible, then I would use a Dictionary&lt;string, List&lt;TestObject&gt;&gt;:

Dictionary<string, List<TestObject>> myList = new Dictionary<string, List<TestObject>>();
while (NotEndOfData())
{
    TestObject obj = GetTestObject();
    if (myList.ContainsKey(obj.TestValue))
    {
        obj.IsDuplicate = true;
        myList[obj.TestValue].Add(obj);
    }
    else
    {
        obj.IsDuplicate = false;
        myList.Add(obj.TestValue, new List<TestObject> { obj });
    }
}
SortedSet<string> sorted = new SortedSet<string>();
for (int i = 0; i < testList.Count; i++)
  testList[i].IsDuplicate = !sorted.Add(testList[i].TestValue);

As you have allowed in the question, I'd change testList to an array instead of a list, to make indexed access faster.


Since you indicated that you have a property holding the ordinal of your items, we can use that property to restore the original sort order after marking the duplicates.

The code below is self-explanatory, but let me know if you need any further explanation.

I have assumed that the property name is SortOrder. Modify the code accordingly.

void MarkDuplicates()
{
    testList = testList.OrderBy(f => f.TestValue).ThenBy(f => f.SortOrder).ToList();
    for (int i = 1; i < testList.Count; i++) 
    {
        if (testList[i].TestValue == testList[i - 1].TestValue) testList[i].IsDuplicate = true;
    }
    testList = testList.OrderBy(f => f.SortOrder).ToList();
}

I'm not a performance expert, but you can time the various solutions provided here and check the performance for yourself.
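For a rough comparison, System.Diagnostics.Stopwatch is enough. A minimal harness sketch, where MarkDuplicates stands in for whichever candidate method is being timed (remember to reset the IsDuplicate flags between runs, and discard the first timing to exclude JIT warm-up):

```csharp
using System.Diagnostics;

// Reset flags so each run starts from the same state.
foreach (var obj in testList)
    obj.IsDuplicate = false;

var sw = Stopwatch.StartNew();
MarkDuplicates();   // the candidate method under test
sw.Stop();
Console.WriteLine($"Elapsed: {sw.ElapsedMilliseconds} ms");
```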
