3

I have N strings and M search-replace pairs. Each of the strings contains exactly one of the search pair and the whole string needs to be replaced by the replace pair.

Say you have returns,between,paragraphs and turn => foo, tween => bar, rag => baz then your output is foo, bar, baz.

N can be a real big number while M is small. What's an efficient algorithm for this?

3 Answers 3

1

The most efficient algorithm would be to first construct a finite state machine that a) recognizes any of your keys, and b) has a different end state for each key, i. e. producing the index of the key that was recognized.

Part a) is as easy as calling regcomp() appropriately. Unfortunately, this won't produce the index you need right away (part b)), it will just provide you with the beginning and end position of the recognized string.

So, unless you want to go through the trouble of reimplementing a regex compilation routine, I guess your best bet is to subsequently look up the key from a hash table. However, again it is difficult to use a standard hash table implementation without triggering memory allocation by passing the key as a string. Of course, you can try to use a perfect hash for the lookup. Nevertheless, any compromise that takes you away from a finite state machine with the two properties a) and b) will incur a heavy slowdown.

0
  1. Create an empty list to store the results
  2. Make a key/value pair list like a HasmMap or Dictionary with the key/value pairs turn => -foo, tween => bar, rag => baz.
  3. Make a List of the input strings returns,between,paragraphs, etc
  4. Iterate the map of key/values
  5. Inside that loop iterate the list of input strings
  6. For every input string that contains the key as a substring, return the value for that key add it to the results list you created in step 1 and break out of the inner loop to continue with the next outer iteration
  7. Optionally iterate the results list to concatenate a comma-separate list like foo, bar, baz
2
  • 1
    Isn't that sort of a naive implementation? I was thinking perhaps use Aho-Corasick or Commentz-Walter algorithm to do the matching...
    – chx
    Commented Aug 19, 2016 at 5:54
  • 1
    @chx If you already know these algorithms, why didn't you mention that in the question? Why don't they fit your problem? One point to keep in mind though is that your M is small, and that a naïve algorithm may be able to outperform a more sophisticated one for small inputs. In particular, trie based algorithms will involve a lot of pointer indirection. If you're not looking for “good enough” performance but want to perfectly tune your application, consider benchmarking this.
    – amon
    Commented Aug 19, 2016 at 7:01
0

I would first go with the Aho–Corasick algorithm which has O(n+m) complexity in time, and O(m) in space (n: length of input; m: combined length of the patterns) and measure if that's "good enough" (your call) -- especially since you already know that you can expect exactly one occurrence of one of the patterns in any given input.

'HTH,

Not the answer you're looking for? Browse other questions tagged or ask your own question.