public class TestObject
{
    // Properties must be public for the object initializers below to compile.
    public string TestValue { get; set; }
    public bool IsDuplicate { get; set; }
}
List<TestObject> testList = new List<TestObject>
{
    new TestObject { TestValue = "Matt" },
    new TestObject { TestValue = "Bob" },
    new TestObject { TestValue = "Alice" },
    new TestObject { TestValue = "Matt" },
    new TestObject { TestValue = "Claire" },
    new TestObject { TestValue = "Matt" }
};
Imagine testList is actually millions of objects long.
What's the fastest way to ensure that two of those three TestObjects with a TestValue of Matt get their IsDuplicate set to true? No matter how many instances of a given value there are, only one should come out of the process with an IsDuplicate of false.
I am not averse to doing this via threading. And the collection doesn't have to be a list if converting it to another collection type is faster.
I need to keep duplicates and mark them as such, not remove them from the collection.
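To make the requirement concrete, here is a minimal sketch of one way to do the exact-match pass, assuming the first occurrence in iteration order is the one that stays unflagged. DuplicateMarker and MarkExactDuplicates are hypothetical names introduced only for illustration, and this is not claimed to be the fastest option.

```csharp
using System;
using System.Collections.Generic;

public class TestObject
{
    public string TestValue { get; set; }
    public bool IsDuplicate { get; set; }
}

// Hypothetical helper, not part of the original code.
public static class DuplicateMarker
{
    // Flags every repeat of a TestValue after its first occurrence.
    // Nothing is removed from the collection.
    public static void MarkExactDuplicates(IEnumerable<TestObject> items)
    {
        // StringComparer.Ordinal gives exact, case-sensitive string equality.
        var seen = new HashSet<string>(StringComparer.Ordinal);

        foreach (var item in items)
        {
            // HashSet<T>.Add returns false when the value is already present,
            // so only the first occurrence keeps IsDuplicate == false.
            item.IsDuplicate = !seen.Add(item.TestValue);
        }
    }
}
```

Calling DuplicateMarker.MarkExactDuplicates(testList) on the sample list above would leave the first Matt unflagged and flag the other two. It is a single pass over the list with one hash lookup per item.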
To expand, this is (as you might imagine) a simple expression of a much more complex problem. The objects in question already have an ordinal which I can use to order them.
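If the ordinal should decide which instance stays unflagged, something like the following sketch might work without sorting the collection first. The Ordinal property name, the OrderedTestObject type, and the helper names are assumptions made for illustration. The sketch also assumes TestValue is never null and that the lowest ordinal wins.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape: Ordinal is an assumed name for the existing ordering key.
public class OrderedTestObject
{
    public int Ordinal { get; set; }
    public string TestValue { get; set; }
    public bool IsDuplicate { get; set; }
}

// Hypothetical helper, not part of the original code.
public static class OrdinalDuplicateMarker
{
    // For each TestValue, the object with the lowest Ordinal stays unflagged,
    // regardless of the order in which the items are enumerated.
    public static void MarkKeepingLowestOrdinal(IEnumerable<OrderedTestObject> items)
    {
        // Maps each value to the lowest-ordinal object seen so far.
        // Assumes TestValue is non-null (Dictionary rejects null keys).
        var keepers = new Dictionary<string, OrderedTestObject>(StringComparer.Ordinal);

        foreach (var item in items)
        {
            if (!keepers.TryGetValue(item.TestValue, out var keeper))
            {
                // First time this value appears.
                item.IsDuplicate = false;
                keepers[item.TestValue] = item;
            }
            else if (item.Ordinal < keeper.Ordinal)
            {
                // A lower ordinal turned up: demote the previous keeper.
                keeper.IsDuplicate = true;
                item.IsDuplicate = false;
                keepers[item.TestValue] = item;
            }
            else
            {
                item.IsDuplicate = true;
            }
        }
    }
}
```

This keeps one dictionary entry per distinct value, so memory grows with the number of distinct values rather than with the total length of the collection.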
After matching initial duplicates on exact string equality, I'm going to have to go back through the collection again and re-try the remainder using some fuzzy matching logic. The collection that exists at the start of this process won't be changed during the deduplication, or afterwards.
Eventually the original collection is going to be written out to a file, with likely duplicates flagged.
HashSet just for the duplicate checking, e.g. when you add a new item, you check if it's already in the HashSet, and if it is, you mark it as duplicate immediately?
Matt in the list?