Here's a skeleton of a class I built that loops through data and deduplicates it. It's in C#, but the principles of the question aren't language-specific.
public static void DedupeFile(FileContents fc)
{
    BuildNameKeys(fc);
    SetExactDuplicates(fc);
    FuzzyMatching(fc);
}
// algorithm to calculate fuzzy similarity between surname strings
public static bool SurnameMatch(string surname1, string surname2)
// algorithm to calculate fuzzy similarity between forename strings
public static bool ForenameMatch(string forename1, string forename2)
// algorithm to calculate fuzzy similarity between title strings
public static bool TitleMatch(string title1, string title2)
// used by above fn to recognise that "Mr" isn't the same as "Ms" etc
public static bool MrAndMrs(string title1, string title2)
// gives each row a unique key based on name
public static void BuildNameKeys(FileContents fc)
// loops round data to find exact duplicates
public static void SetExactDuplicates(FileContents fc)
// threads looping round file to find fuzzy duplicates
public static void FuzzyMatching(FileContents fc, int maxParallels = 32)
Now, in actual usage, only the first function needs to be public. All the rest are used only inside this class and nowhere else.
Strictly, that means they should of course be private. However, I've left them public for ease of unit testing. Some people will no doubt tell me I should be testing them via the public interface, but that's partly why I picked this class: it's an excellent example of where that approach gets awkward. The fuzzy matching functions are great candidates for unit tests, and yet a test on that single "public" function would be near-useless.
This class won't ever get used outside a small team at this office, and I don't believe that the structural understanding imparted by making the other methods private is worth the extra faff of packing my tests with code to access private methods directly.
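To make the "extra faff" concrete, here is a minimal sketch of the reflection code a test would need in order to call a private static method. The class name `Deduper`, the helper `InvokeTitleMatch`, and the body of `TitleMatch` are placeholders invented for illustration; the real fuzzy logic is not shown here.

```csharp
using System;
using System.Reflection;

public static class Deduper // hypothetical class name
{
    // Placeholder body: a simple case-insensitive comparison stands in
    // for the real fuzzy title matching.
    private static bool TitleMatch(string title1, string title2)
    {
        return string.Equals(title1, title2, StringComparison.OrdinalIgnoreCase);
    }
}

public static class PrivateAccessDemo
{
    // Looks up the private static method by name and invokes it via reflection.
    public static bool InvokeTitleMatch(string title1, string title2)
    {
        MethodInfo method = typeof(Deduper).GetMethod(
            "TitleMatch",
            BindingFlags.NonPublic | BindingFlags.Static);

        if (method == null)
            throw new MissingMethodException("Deduper", "TitleMatch");

        // Static method, so the target instance is null.
        return (bool)method.Invoke(null, new object[] { title1, title2 });
    }

    public static void Main()
    {
        Console.WriteLine(InvokeTitleMatch("Mr", "mr"));
        Console.WriteLine(InvokeTitleMatch("Mr", "Ms"));
    }
}
```

Every private method under test needs a wrapper along these lines, and the method name is a string, so a rename breaks the test at run time rather than at compile time.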
Is this "all public" approach reasonable for classes in internal software? Or is there a better approach?
I am aware there is already a question on "How do you unit test private methods?", but this question is about whether there are scenarios where it's worthwhile bypassing those techniques in favour of simply leaving methods public.
EDIT: For those interested, I added the full code on CodeReviewSE as restructuring this class seemed too good a learning opportunity to miss.
TestDedupeFile_WhenCalledWithFuzzyTitles_MatchesThem. You would have as many tests for the public method as you would if you wrote one for each private method.
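As a rough sketch of what such a test through the public method could look like: everything below other than the names `DedupeFile`, `TitleMatch`, `FileContents` and the test name is an assumption. The `Row` type, the `IsDuplicate` flag, the class name `Deduper` and the simplified matching rule are stand-ins, since the real implementation is not shown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the real data types.
public class Row
{
    public string Title;
    public string Surname;
    public bool IsDuplicate;
}

public class FileContents
{
    public List<Row> Rows = new List<Row>();
}

public static class Deduper // hypothetical class name
{
    // The only public entry point; flags later rows that match an earlier one.
    public static void DedupeFile(FileContents fc)
    {
        for (int i = 0; i < fc.Rows.Count; i++)
            for (int j = i + 1; j < fc.Rows.Count; j++)
                if (fc.Rows[i].Surname == fc.Rows[j].Surname &&
                    TitleMatch(fc.Rows[i].Title, fc.Rows[j].Title))
                    fc.Rows[j].IsDuplicate = true;
    }

    // Placeholder rule: ignore case and a trailing full stop.
    private static bool TitleMatch(string title1, string title2)
    {
        return string.Equals(title1.TrimEnd('.'), title2.TrimEnd('.'),
                             StringComparison.OrdinalIgnoreCase);
    }
}

public static class DedupeFileTests
{
    // Exercises the private title matching only through DedupeFile.
    public static void TestDedupeFile_WhenCalledWithFuzzyTitles_MatchesThem()
    {
        var fc = new FileContents();
        fc.Rows.Add(new Row { Title = "Mr", Surname = "Smith" });
        fc.Rows.Add(new Row { Title = "mr.", Surname = "Smith" });

        Deduper.DedupeFile(fc);

        if (!fc.Rows[1].IsDuplicate)
            throw new Exception("Expected fuzzy titles to match");
    }

    public static void Main()
    {
        TestDedupeFile_WhenCalledWithFuzzyTitles_MatchesThem();
        Console.WriteLine("ok");
    }
}
```

The test fixes every other field so that only the title differs between the two rows, which is what lets a test on the public method stand in for a direct test of the private one.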