8
\$\begingroup\$

I have written a function which counts how many unique words are included in a given text.

The source-code:

const computeCountUniqueWords = (strToExamine) => {
  let parts = strToExamine.split(" ");

  let validWords = parts.filter((word) => {
    return /^\w/.test(word);
  });

  let uniqueWords = new Set(validWords);

  return uniqueWords.size;
}

let text1 = "Lorem ipsum dolor sit amet consectetuer adipiscing elit aenean commodo ligula eget dolor.";
let text2 = "Etiam ultricies nisi vel augue. Curabitur ullamcorper";

console.log(`Text 1 has ${computeCountUniqueWords(text1)} words`);
console.log(`Text 2 has ${computeCountUniqueWords(text2)} words.`);

I think it has become quite neat and short.

Nevertheless: Is there a better way to solve the described task?

Moveover: Is my checking with the regular expression sufficient? Or should it be enhanced?

Looking forward to reading your answers.

\$\endgroup\$

3 Answers 3

8
\$\begingroup\$

Two problems

Your code has two problems.

  1. It does not remove punctuation from the words resulting in the same words not matching. Eg text1 has 12 unique words not 13. You count dolor and dolor. as different words.
  2. You are ignoring capitalization. You would count Dolor and dolor as different words rather than the same.

String.match

Update I was not paying attention on first posting. There is no /*RegExp.match*/

The better solution is to use the String.match to convert the matches to an array and then directly create the set from that. The code is just one line and the performance is 2.5 times faster than using String.replace as shown.

thus the ideal solution becomes...

const uniqueWords = txt => new Set(txt.toLowerCase().match(/\w+/g)).size;

String.replace

You can use a string replace to iterate over RegExp matches. It's a little hacky but I much prefer it over other methods such a using a RegExp to split the string, or RegExp.match, or RegExp.exec which have an ugly interface and are slower than String.replace.

Converting the text to lowercase using String.toLowerCase solves the capitalization problem

const countUniqueWords = text => {
    const words = new Set();
    text.toLowerCase().replace(/\w+/g, word => words.add(word));
    return words.size;        
}




const a = "Lorem ipsum dolor sit amet consectetuer adipiscing elit aenean commodo ligula eget dolor.";
const b = "Etiam ultricies nisi vel augue. Curabitur ullamcorper.";
info1.textContent = `A has ${countUniqueWords(a)} unique words`;
info2.textContent = `B has ${countUniqueWords(b)} unique words.`;
 
<code>A: "Lorem ipsum dolor sit amet consectetuer adipiscing elit aenean commodo ligula eget dolor."</code></br>
<code id="info1"></code><br>
<code>B: "Etiam ultricies nisi vel augue. Curabitur ullamcorper."</code></br>
<code id="info2"></code>

\$\endgroup\$
3
  • \$\begingroup\$ As a n00b, what's wrong with the straightforward text.toLowerCase().match(/\w+/g)? \$\endgroup\$
    – JollyJoker
    Commented Nov 12, 2018 at 9:52
  • 1
    \$\begingroup\$ @JollyJoker there is nothing wrong with it, don't know what I was thinking,, my answer is flawed. Sorry I will delete it. \$\endgroup\$
    – Blindman67
    Commented Nov 12, 2018 at 14:39
  • 1
    \$\begingroup\$ @JollyJoker Can't delete the answer, so havre updated the answer correcting the errors... \$\endgroup\$
    – Blindman67
    Commented Nov 12, 2018 at 15:24
6
\$\begingroup\$

I like the simplicity of your function, it's pretty easy to understand the implementation.

Couple things I would consider:

A function "computeCountUniqueWords" does a very specific task that counts the number of unique words given a string. But depending on the context of this code(for example, where it's being used, maybe just as a utility library), I probably would prefer a more general function that just gets the unique words in an array and implement "computeCountUniqueWords" based on it. It's a little more functional and have a more general utility. For example:

const getUniqueWords = (string) => {
  ...
  return uniqueWords; 
}

const computeCountUniqueWords = (string) => {
  return getUniqueWords(string).length;
}

Actually a lot of the times, you'll realize that the code reads very well without "computeCountUniqueWords" function by just calling getUniqueWords(paragraph).length.

Second thing to consider is what type of string data this function will run on. If performance is not a consideration such as we are processing a small number of strings(even possibly in the millions of words), I would keep the function written as is for the sake for readability and simplicity.

But if this is used on the scale of google crawler or something done frequently such as on mouse move event then the implementation as it stands would be inefficient and not ideal. If we think about it, we can do the 3 operations (splitting strings based on spaces, testing if the string is a valid word and remove uniqueness in one loop through the input string). As it stands, it's probably looping through the input string 3 times which can be a big deal for a very large input string or in a DOM environment, it could hurt the frames per second of the page.

\$\endgroup\$
2
\$\begingroup\$

Here's the match solution that Blindman67 wanted to avoid for reasons still unclear to me.

If you find it hard to read you could split out something like const words = s => s.match(/\w+/g) or const lowerWords = s => s.toLowerCase().match(/\w+/g)

const countUniqueWords = s => new Set(s.toLowerCase().match(/\w+/g)).size


const a = "Lorem ipsum dolor sit amet consectetuer adipiscing elit aenean commodo ligula eget dolor.";
const b = "Etiam ultricies nisi vel augue. Curabitur ullamcorper.";
info1.textContent = `A has ${countUniqueWords(a)} unique words`;
info2.textContent = `B has ${countUniqueWords(b)} unique words.`;
 
<code>A: "Lorem ipsum dolor sit amet consectetuer adipiscing elit aenean commodo ligula eget dolor."</code></br>
<code id="info1"></code><br>
<code>B: "Etiam ultricies nisi vel augue. Curabitur ullamcorper."</code></br>
<code id="info2"></code>

\$\endgroup\$

Not the answer you're looking for? Browse other questions tagged or ask your own question.