Bug
Your logic is flawed when computing the union of the sets: it doesn't include data from set1
in the union of sets. This leads to incorrect results in many cases.
Style
C# conventions recommend using PascalCase for method names. This makes the code easier to read for anyone who has some C# experience.
Method names should have a verb describing the action they perform. Again, this improves readability.
Remove unused using
statements. These just add useless noise to the code.
Side effects
Glossing over the bugged logic, your method mutates its inputs, which means it isn't possible to do any further work with your sets after computing their Jaccard index, which will most likely cause issues.
Prefer using LINQ
methods Union
and Intersect
over HashSet
's UnionWith
and IntersectWith
, as the former return a new IEnumerable
instead of mutating the calling instance.
Make your method static
Your method doesn't require any class members, and should be made static in order to use it without instantiating a class.
Make it generic
There is no reason to limit your method to work with HashSet<int>
, as the logic doesn't rely on the underlying data type.
Fewer operations
Set operations are the most expensive operations in your code. You can save computing the set union and just work with the counts of inputs and intersection.
Document your code
This makes the code easier to reuse by other people, or yourself in the future.
Include tests
If your code was properly tested, you would have caught your bug.
Sample code:
JaccardIndex.cs
namespace JaccardIndex
{
public static class JaccardIndex
{
/// <summary>
/// Compute the Jaccard index between two sets.
/// The Jaccard index gauges the similarity between two sets, and ranges from 0.0 (disjoint sets) to 1.0 (identical sets).
/// </summary>
/// <typeparam name="T">Underlying data types of input sets</typeparam>
/// <param name="set1">1st input set</param>
/// <param name="set2">2nd imput set</param>
/// <returns>Jaccard index</returns>
public static double ComputeJaccardIndex<T>(HashSet<T> set1, HashSet<T> set2)
{
var intesectionCount = (double)set1.Intersect(set2).Count();
return intesectionCount / (set1.Count() + set2.Count() - intesectionCount);
}
}
}
JaccardIndexTests.cs
using Xunit;
using static JaccardIndex.JaccardIndex;
namespace JaccardIndexTests
{
public class JaccardIndexTests
{
/// <summary>
/// Verify that some knwon values are computed correctly
/// </summary>
[Theory]
[InlineData(new int[] { 1, 2, 3 }, new int[] { 1 }, 1.0 / 3.0)]
[InlineData(new int[] { 1 }, new int[] { 1 }, 1.0 )]
[InlineData(new int[] { 1 }, new int[] { 2 }, 0.0)]
[InlineData(new int[] { 1, 2, 3 }, new int[] { 1, 2, 4 }, 2.0 / 4.0)]
[InlineData(new int[] { 1, 2, 3 }, new int[] { 4, 5, 6 }, 0.0)]
public void Samples(int[] data1, int[] data2, double expected)
{
var set1 = new HashSet<int>(data1);
var set2 = new HashSet<int>(data2);
Assert.Equal(expected, ComputeJaccardIndex(set1, set2));
}
/// <summary>
/// Jaccard index of a should be commutative
/// </summary>
[Theory]
[InlineData(new int[] { 1, 2, 3 }, new int[] { 1 })]
[InlineData(new int[] { 1 }, new int[] { 1 })]
[InlineData(new int[] { 1 }, new int[] { 2 })]
[InlineData(new int[] { 1, 2, 3 }, new int[] { 1, 2, 4 })]
[InlineData(new int[] { 1, 2, 3 }, new int[] { 4, 5, 6 })]
public void Symmetry(int[] data1, int[] data2)
{
var set1 = new HashSet<int>(data1);
var set2 = new HashSet<int>(data2);
Assert.Equal(ComputeJaccardIndex(set1, set2), ComputeJaccardIndex(set2, set1));
}
/// <summary>
/// Jaccard index of a set with itelf should be 1
/// </summary>
[Theory]
[InlineData(new int[] { 1 })]
[InlineData(new int[] { 1, 2, 3 })]
[InlineData(new int[] { 1, 2, 3, 4, 5, 6 })]
public void WithSelf(int[] data)
{
var set = new HashSet<int>(data);
Assert.Equal(1, ComputeJaccardIndex(set, set));
}
/// <summary>
/// Jaccard index of a set with empty set should be 0
/// </summary>
[Theory]
[InlineData(new int[] { 1 })]
[InlineData(new int[] { 1, 2, 3 })]
[InlineData(new int[] { 1, 2, 3, 4, 5, 6 })]
public void WithEmpty(int[] data)
{
var set = new HashSet<int>(data);
Assert.Equal(0, ComputeJaccardIndex(set, new HashSet<int>()));
Assert.Equal(0, ComputeJaccardIndex(new HashSet<int>(), set));
}
/// <summary>
/// Jaccard index of two empty sets is undefined, should return NaN
/// </summary>
[Fact]
public void BothEmpty()
{
Assert.True(double.IsNaN(ComputeJaccardIndex(new HashSet<int>(), new HashSet<int>())));
}
/// <summary>
/// Jaccard index computation should not mutate the input data
/// </summary>
[Fact]
public void NoDataMutation()
{
var set1 = new HashSet<int>() { 1, 2, 3 };
var set2 = new HashSet<int>() { 1, 2, 4, 5 };
var set1Copy = new HashSet<int>(set1);
var set2Copy = new HashSet<int>(set2);
ComputeJaccardIndex(set1, set2);
Assert.Equal(set1Copy, set1);
Assert.Equal(set2Copy, set2);
}
}
}
Note that test coverage could probably be improved, by testing operations with other data types or large sets, but this is a good start IMO.
struct
or for anything that implementsIEquatable
? Or should anIEqualityComparer
be an optional parameter? \$\endgroup\$IEquatable
interface (it seems to make sure that the two sets have equal types if I'm reading its Microsoft entry correctly). Yeah, I would prefer that the two input sets are of the same type. \$\endgroup\$