Question (score 5)

Here is a simplified implementation to obtain the duplicate words in a text using lambda expressions.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class FindDuplicateWordsInText {

    public static Set<String> findDuplicateWordsInText(String text) {
        String[] words = text.split(" ");
        Set<String> duplicatesRemovedSet = new HashSet<>();
        // Set.add() returns false when the element is already present,
        // so the filter keeps only the repeated words.
        Set<String> duplicatesSet = Arrays.stream(words)
                .filter(string -> !duplicatesRemovedSet.add(string))
                .collect(Collectors.toSet());
        return duplicatesSet;
    }
}
  • what is your main concern? – Malachi (Aug 12, 2015)

4 Answers

Answer (score 4)

Your use of the boolean return value of the Set.add() call is a clever way to check for your duplicates. The concept you have is good, and I can't think of a faster way.
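For readers unfamiliar with the idiom, Set.add() returns false when the element is already present, which is exactly what lets the filter select duplicates:

Set<String> seen = new HashSet<>();
System.out.println(seen.add("word")); // true  -- first insertion succeeds
System.out.println(seen.add("word")); // false -- already present, i.e. a duplicate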

Additionally, I like how you have used interface-based types on the left-hand side of assignments (Set<String>) and the concrete classes on the right (new HashSet<>()) .... people often put the concrete type on the left too, and it's good to see that you did not.

In terms of the Java streaming API, though, I can't help but feel that you missed an opportunity to improve the process by streaming the split.... The Pattern class has a splitAsStream method, which, being lazy, would reduce the latency before the first words become available....

As an aside, words should probably be split on contiguous whitespace, not just a single space (i.e. "\\s+" instead of " ").
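A quick illustration of the difference, using a hypothetical input with a doubled space:

String text = "foo  bar foo";
System.out.println(Arrays.toString(text.split(" ")));    // [foo, , bar, foo] -- note the empty string
System.out.println(Arrays.toString(text.split("\\s+"))); // [foo, bar, foo]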

Here's your code done differently:

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Compile once; "\\s+" matches runs of any whitespace, not just spaces.
private static final Pattern SPACE = Pattern.compile("\\s+");

public static Set<String> findDuplicateWordsInText(String text) {
    Set<String> duplicatesRemovedSet = new HashSet<>();
    return SPACE.splitAsStream(text)
            .filter(string -> !duplicatesRemovedSet.add(string))
            .collect(Collectors.toSet());
}
Answer (score 3)

Avoid side effects when using streams

You seem to be using a stream as if it were a loop, accumulating state outside the stream.

The filter() operation performs a side effect, which is strongly discouraged by the Stream API documentation:

Side-effects in behavioral parameters to stream operations are, in general, discouraged, as they can often lead to unwitting violations of the statelessness requirement, as well as other thread-safety hazards.

If the behavioral parameters do have side-effects, unless explicitly stated, there are no guarantees as to:

  • the visibility of those side-effects to other threads;
  • that different operations on the "same" element within the same stream pipeline are executed in the same thread; and
  • that behavioral parameters are always invoked, since a stream implementation is free to elide operations (or entire stages) from a stream pipeline if it can prove that it would not affect the result of the computation.

The stream pipeline in the code you shared is written on the assumption that it will only ever run sequentially, but that shouldn't be a requirement: even if someone switches your stream to parallel, it should still produce a correct result, yet this solution might fail. A sketch of the failure mode follows.
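Here is a minimal, self-contained sketch (the names are illustrative, not from the original post) of the same pipeline with .parallel() added; because HashSet is not thread-safe, the result may silently vary between runs:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class SideEffectHazard {
    public static void main(String[] args) {
        // 100,000 words: two distinct values, each repeated heavily
        String[] words = new String[100_000];
        Arrays.fill(words, 0, 50_000, "a");
        Arrays.fill(words, 50_000, 100_000, "b");

        Set<String> seen = new HashSet<>();          // shared mutable state
        Set<String> duplicates = Arrays.stream(words)
                .parallel()                          // the only change
                .filter(s -> !seen.add(s))           // racy side effect
                .collect(Collectors.toSet());

        // Sequentially this always prints [a, b]; in parallel the
        // unsynchronized HashSet may lose updates or corrupt its internal
        // structure, so the output is no longer guaranteed.
        System.out.println(duplicates);
    }
}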

While using the functional features of Java, try to stick with pure functions; it will help make the code cleaner and more robust.

Naming

  • Use concise (yet clear) names. Method name findDuplicateWordsInText can be shortened to findDuplicates without sacrificing a single shred of its clarity.

  • Avoid ambiguous/unclear names. The variable name duplicatesRemovedSet is misleading because this set is going to store every unique word, both duplicated and not duplicated (and as I said you shouldn't be using this Set with the Stream in such a way).

  • Class names are nouns communicating the purpose of a class.

Refactoring

To determine duplicated stream elements, you can generate a Map associating each word with its number of occurrences and then filter out the elements that occur only once.

One of the ways to build the map of element frequencies is to make use of the built-in collector groupingBy() in conjunction with the collector counting() as its downstream.

import static java.util.function.Function.identity;
import static java.util.stream.Collectors.*;

import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

public class DuplicateFinder {
    private static final Pattern WHITE_SPACE = Pattern.compile("\\s+");

    private DuplicateFinder() {}

    public static Set<String> findDuplicates(String text) {
        return findDuplicates(getCountByWord(text));
    }

    // Map each distinct word to the number of times it occurs.
    private static Map<String, Long> getCountByWord(String text) {
        return WHITE_SPACE.splitAsStream(text)
            .collect(groupingBy(identity(), counting()));
    }

    // Keep only the words that occur more than once.
    private static Set<String> findDuplicates(Map<String, Long> wordCount) {
        return wordCount.entrySet()
            .stream()
            .filter(entry -> entry.getValue() > 1)
            .map(Map.Entry::getKey)
            .collect(toSet());
    }
}
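A quick usage sketch (the input string is hypothetical):

Set<String> duplicates = DuplicateFinder.findDuplicates("to be or not to be");
System.out.println(duplicates); // [to, be] -- iteration order is not guaranteed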

Notes:

  • One might be tempted to combine the two methods into a single one by chaining the second stream after the collect producing a Map (there are plenty of answers on Stack Overflow showing such chains combining multiple streams). Avoid doing so. It makes the code more cluttered and difficult to read.

  • It might also be tempting to modify the Map of frequencies as shown below and return its key set.

wordCount.values().removeIf(c -> c == 1);
return wordCount.keySet();

I would advise against doing this. The reason is two-fold:

  1. A key set is a view backed by the Map, and while the key set is in use it keeps the whole map with its internal storage alive. If there are only a few duplicates, the removal of entries with the value of 1 will leave behind an intact, almost unpopulated array of the map's buckets (removal will not cause it to shrink). Hence, modifying the Map and using its key set instead of allocating a new set is not necessarily beneficial; see the sketch after this list.
  2. The collector groupingBy() gives no guarantees about the mutability of the Map it returns, so it's not a good idea to modify this Map in the first place.
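Here is a small sketch of the view behaviour described in point 1, using a hand-built map purely for illustration:

Map<String, Long> wordCount = new HashMap<>(Map.of("a", 2L, "b", 1L));
Set<String> keys = wordCount.keySet();    // a live view, not a copy
wordCount.values().removeIf(c -> c == 1);
System.out.println(keys);                 // [a] -- the removal shows through the view
wordCount.put("c", 3L);
System.out.println(keys);                 // now contains "c" too: still backed by the map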
  • There's another way to generate a Map of frequencies by using Collector toMap():
.collect(toMap(
    identity(),
    s -> 1,
    Integer::sum
));

For this particular task, both toMap and the combination groupingBy + counting are equally good, because it's very unlikely that Strings several megabytes in size are floating around in the memory of your application. Hence, we are free to use whichever version we deem more readable.

But in general, utilizing groupingBy + counting might be beneficial from a performance perspective when there's a significant number of elements in the stream and many of them might be duplicates. The collector counting() is tailored for such scenarios, since it operates internally on a primitive long value. On the other hand, employing toMap will result in generating unnecessary Integer wrappers, especially when element frequencies are far greater than 127 (the largest cached Integer value).
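The cache boundary is easy to demonstrate, since autoboxing goes through Integer.valueOf(), which reuses instances only in the range -128..127:

Integer a = 127, b = 127;
System.out.println(a == b); // true  -- both refer to the same cached instance
Integer c = 128, d = 128;
System.out.println(c == d); // false -- two freshly allocated wrapper objects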

And if we expect little to no duplicates (and the number of stream elements is significant), toMap might be slightly more advantageous, since it'll return a Map with a smaller memory footprint, and we're piping each element through a single collector instead of two.

That said, in most cases these differences are not that significant, and as the old mantra says, "premature optimization is the root of all evil". Hence, not a major thing to worry about, but worth knowing.

Answer (score 2)

The implementation seems fine. I would change the return type of the method to be a bit more general. Callers don't care about whether the duplicates are stored in a Set or not.

import java.util.*;
import java.util.stream.Collectors;

public static Collection<String> findDuplicateWordsInText(String text) {
    String[] words = text.split(" ");
    Set<String> duplicatesRemovedSet = new HashSet<>();
    List<String> duplicates = Arrays.stream(words)
            .filter(string -> !duplicatesRemovedSet.add(string))
            .collect(Collectors.toList());
    return duplicates;
}

But honestly, using stream/filter doesn't make it much clearer. Why not stick with the standard approach?

public static Collection<String> findDuplicateWordsInText(String text) {
    String[] words = text.split(" ");
    List<String> duplicates = new ArrayList<>();
    Set<String> duplicatesRemovedSet = new HashSet<>();
    for (String word : words) {
        if (!duplicatesRemovedSet.add(word)) {
            duplicates.add(word);
        }
    }
    return duplicates;
}
  • A Set has no duplicates. :) – h.j.k. (Aug 12, 2015)

  • "Callers don't care about whether the duplicates are stored in a Set or not" – what if I want to find duplicate words and then pass them into another method that accepts a Set<String>? Callers benefit when a method has general input types and a specific output type. (Aug 13, 2015)
Answer (score 2)

Set.add(...) has side effects by nature, which means that your filter method isn't "pure". In the given example, I don't see a concrete downside, but if you later wanted to make the stream parallel, you'd need to use a concurrent Set.
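For instance, a parallel-safe variant might swap in ConcurrentHashMap.newKeySet() for the shared set; this is a sketch, not part of the original answer, and the stateful predicate is still discouraged by the Stream documentation even though the set itself is now thread-safe:

Set<String> seen = ConcurrentHashMap.newKeySet(); // thread-safe Set view
Set<String> duplicates = Arrays.stream(text.split(" "))
        .parallel()
        .filter(word -> !seen.add(word))          // add() is atomic here
        .collect(Collectors.toSet());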

For comparison, here's a side-effect free solution using Guava's Multiset:

import java.util.Set;
import java.util.stream.Collectors;

import com.google.common.base.Splitter;
import com.google.common.collect.ImmutableMultiset;
import com.google.common.collect.Multiset;

public static Set<String> findDuplicateWordsInText(String text) {
  // Count every word up front, then keep the ones that occur more than once.
  Multiset<String> wordMultiset =
      ImmutableMultiset.copyOf(Splitter.on(' ').split(text));
  return wordMultiset.entrySet()
      .stream()
      .filter(entry -> entry.getCount() > 1)
      .map(Multiset.Entry::getElement)
      .collect(Collectors.toSet());
}
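For example, with a hypothetical input:

System.out.println(findDuplicateWordsInText("the quick the lazy quick"));
// prints something like [the, quick] -- set order is unspecified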

It probably doesn't perform as well, but the intent should (hopefully) come across clearly.
