C# console app that computes the Jaccard Index

Question

So I wrote a basic C# console application that computes the Jaccard Index. This has its applications in cybersecurity and machine learning. Given two sets X and Y, the basic formula is:

$$ J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} $$

where the bars in this context refers to the number of items in the set or the result of its intersection or union, not the absolute value. The jaccard(HashSet<int>, HashSet<int>) function takes in two HashSet<int> types and then implements this formula, and the Main(string[]) give the jaccard an example. The following code is my implementation of it:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
 
public class Jaccard {

    public static void Main(string[] args) {
        HashSet<int> testSet1 = new HashSet<int>() {1, 2, 3};
        HashSet<int> testSet2 = new HashSet<int>() {1, 2, 3};
        Console.WriteLine("Test Jaccard is: {0}", jaccard(testSet1, testSet2));
    }
    
    public static double jaccard(HashSet<int> set1, HashSet<int> set2) {
        set1.IntersectWith(set2);
        set2.UnionWith(set2);
        // length of set intersection ÷ length of set union
        return set1.Count / set2.Count;
    }
}

Questions:

Is there a way to implement HashSet<???> to take in a object of any type? Or do I have to just overload the jaccard with different types?
From an aesthetic and style perspective, how is my code?
From a security perspective, how is my code? Do you see any significant code injection or memory corruption vulnerabilities (I know that C# protects against memory corruption, but you never know ;-)?
If I should refactor my code, what do you recommend?

You may view my GitHub repo where the source code is stored here: https://github.com/Alekseyyy/InfoSec/blob/master/reversing/jaccard.cs

Do you want the method to be generic for any struct or for anything that implements IEquatable? Or should an IEqualityComparer be an optional parameter? — Rick Davin, Commented Aug 3, 2023 at 15:21
I'm not familiar with IEquatable interface (it seems to make sure that the two sets have equal types if I'm reading its Microsoft entry correctly). Yeah, I would prefer that the two input sets are of the same type. — Aleksey, Commented Aug 3, 2023 at 22:01

gazoh · Accepted Answer · 2023-08-03 09:31:27Z

Bug

Your logic is flawed when computing the union of the sets: it doesn't include data from set1 in the union of sets. This leads to incorrect results in many cases.

Style

C# conventions recommend using PascalCase for method names. This makes the code easier to read for anyone who has some C# experience.

Method names should have a verb describing the action they perform. Again, this improves readability.

Remove unused using statements. These just add useless noise to the code.

Side effects

Glossing over the bugged logic, your method mutates its inputs, which means it isn't possible to do any further work with your sets after computing their Jaccard index, which will most likely cause issues.

Prefer using LINQ methods Union and Intersect over HashSet's UnionWith and IntersectWith, as the former return a new IEnumerable instead of mutating the calling instance.

Make your method static

Your method doesn't require any class members, and should be made static in order to use it without instantiating a class.

Make it generic

There is no reason to limit your method to work with HashSet<int>, as the logic doesn't rely on the underlying data type.

Fewer operations

Set operations are the most expensive operations in your code. You can save computing the set union and just work with the counts of inputs and intersection.

Document your code

This makes the code easier to reuse by other people, or yourself in the future.

Include tests

If your code was properly tested, you would have caught your bug.

Sample code:

JaccardIndex.cs

namespace JaccardIndex
{
    public static class JaccardIndex
    {
        /// <summary>
        /// Compute the Jaccard index between two sets.
        /// The Jaccard index gauges the similarity between two sets, and ranges from 0.0 (disjoint sets) to 1.0 (identical sets).
        /// </summary>
        /// <typeparam name="T">Underlying data types of input sets</typeparam>
        /// <param name="set1">1st input set</param>
        /// <param name="set2">2nd imput set</param>
        /// <returns>Jaccard index</returns>
        public static double ComputeJaccardIndex<T>(HashSet<T> set1, HashSet<T> set2)
        {
            var intesectionCount = (double)set1.Intersect(set2).Count();
            return intesectionCount / (set1.Count() + set2.Count() - intesectionCount);
        }
    }
}

JaccardIndexTests.cs

using Xunit;
using static JaccardIndex.JaccardIndex;

namespace JaccardIndexTests
{
    public class JaccardIndexTests
    {
        /// <summary>
        /// Verify that some knwon values are computed correctly
        /// </summary>
        [Theory]
        [InlineData(new int[] { 1, 2, 3 }, new int[] { 1 }, 1.0 / 3.0)]
        [InlineData(new int[] { 1 }, new int[] { 1 }, 1.0 )]
        [InlineData(new int[] { 1 }, new int[] { 2 }, 0.0)]
        [InlineData(new int[] { 1, 2, 3 }, new int[] { 1, 2, 4 }, 2.0 / 4.0)]
        [InlineData(new int[] { 1, 2, 3 }, new int[] { 4, 5, 6 }, 0.0)]
        public void Samples(int[] data1, int[] data2, double expected)
        {
            var set1 = new HashSet<int>(data1);
            var set2 = new HashSet<int>(data2);
            Assert.Equal(expected, ComputeJaccardIndex(set1, set2));
        }
        /// <summary>
        /// Jaccard index of a should be commutative
        /// </summary>
        [Theory]
        [InlineData(new int[] { 1, 2, 3 }, new int[] { 1 })]
        [InlineData(new int[] { 1 }, new int[] { 1 })]
        [InlineData(new int[] { 1 }, new int[] { 2 })]
        [InlineData(new int[] { 1, 2, 3 }, new int[] { 1, 2, 4 })]
        [InlineData(new int[] { 1, 2, 3 }, new int[] { 4, 5, 6 })]
        public void Symmetry(int[] data1, int[] data2)
        {
            var set1 = new HashSet<int>(data1);
            var set2 = new HashSet<int>(data2);
            Assert.Equal(ComputeJaccardIndex(set1, set2), ComputeJaccardIndex(set2, set1));
        }
        /// <summary>
        /// Jaccard index of a set with itelf should be 1
        /// </summary>
        [Theory]
        [InlineData(new int[] { 1 })]
        [InlineData(new int[] { 1, 2, 3 })]
        [InlineData(new int[] { 1, 2, 3, 4, 5, 6 })]
        public void WithSelf(int[] data)
        {
            var set = new HashSet<int>(data);
            Assert.Equal(1, ComputeJaccardIndex(set, set));
        }
        /// <summary>
        /// Jaccard index of a set with empty set should be 0
        /// </summary>
        [Theory]
        [InlineData(new int[] { 1 })]
        [InlineData(new int[] { 1, 2, 3 })]
        [InlineData(new int[] { 1, 2, 3, 4, 5, 6 })]
        public void WithEmpty(int[] data)
        {
            var set = new HashSet<int>(data);
            Assert.Equal(0, ComputeJaccardIndex(set, new HashSet<int>()));
            Assert.Equal(0, ComputeJaccardIndex(new HashSet<int>(), set));
        }
        /// <summary>
        /// Jaccard index of two empty sets is undefined, should return NaN
        /// </summary>
        [Fact]
        public void BothEmpty()
        {
            Assert.True(double.IsNaN(ComputeJaccardIndex(new HashSet<int>(), new HashSet<int>())));
        }
        /// <summary>
        /// Jaccard index computation should not mutate the input data
        /// </summary>
        [Fact]
        public void NoDataMutation()
        {
            var set1 = new HashSet<int>() { 1, 2, 3 };
            var set2 = new HashSet<int>() { 1, 2, 4, 5 };

            var set1Copy = new HashSet<int>(set1); 
            var set2Copy = new HashSet<int>(set2);

            ComputeJaccardIndex(set1, set2);

            Assert.Equal(set1Copy, set1);
            Assert.Equal(set2Copy, set2);
        }
    }
}

Note that test coverage could probably be improved, by testing operations with other data types or large sets, but this is a good start IMO.

Thanks for them protips. I especially liked your inclusion of xunit and unit testing introduction. Other users pointed out the bug where I ran a UnionWith on set2 twice (I'm such a silly sausage and need to fix that lol). — Aleksey, Commented Aug 3, 2023 at 21:58

RobH · Accepted Answer · 2023-08-03 19:48:52Z

It looks to me like your code is wrong:

set2.UnionWith(set2); // set2 unioned with itself is a no-op.

The fact that HashSet mutates itself makes naming them a bit tricky as they often change purpose in a method without a nice way of encoding that. In this case, you'd benefit from using the alternate formulation of the Jaccard index as you only need to do one set operation (the intersection). Note that the count of the union is simply the count of the individual sets minus the count of intersection.

// Untested!
public static double CalculateJaccardIndex<T>(HashSet<T> x, HashSet<T> y) 
{
    var xCount = (double)x.Count;
    var yCount = (double)y.Count;

    // As pointed out by gazoh, mutating the input is not ideal, create a new set and mutate that.
    var intersection = new HashSet<T>(x);
    intersection.IntersectWith(y);
    var intersectionCount = (double)intersection.Count;

    return intersectionCount / (xCount + yCount - intersectionCount);
}

I've made the method generic so it works on more than just ints
I've made sure the calculation is done using floating point, not integer, division
I've renamed it to follow the C# convention of PascalCase for methods
I've renamed set1 and set2 to x and y to match the definition you shared

You may want to consider writing some unit tests for this kind of thing to ensure the implementation is correct.

Oh, crumpets. You're right, I should have used more test cases (regarding the set arithmetic and me unioning set2 with itself. Thanks for coding protips :D — Aleksey, Commented Aug 2, 2023 at 16:30

Stack Exchange Network

C# console app that computes the Jaccard Index

2 Answers 2

Bug

Style

Side effects

Make your method static

Make it generic

Fewer operations

Document your code

Include tests

Sample code:

Not the answer you're looking for? Browse other questions tagged
c#
.net
hash-map
mathematics
or ask your own question.

Hot Network Questions

C# console app that computes the Jaccard Index

2 Answers 2

Bug

Style

Side effects

Make your method static

Make it generic

Fewer operations

Document your code

Include tests

Sample code:

Not the answer you're looking for? Browse other questions tagged c#.nethash-mapmathematics or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
c#
.net
hash-map
mathematics
or ask your own question.