Replace strings based on substring match

Question

I have N strings and M search-replace pairs. Each string contains exactly one of the search keys as a substring, and the whole string needs to be replaced by the corresponding replacement.

Say you have the strings returns, between, paragraphs and the pairs turn => foo, tween => bar, rag => baz; then your output is foo, bar, baz.

N can be a really big number while M is small. What's an efficient algorithm for this?

cmaster - reinstate monica · Accepted Answer · 2016-09-18 14:42:24Z

The most efficient algorithm would be to first construct a finite state machine that a) recognizes any of your keys, and b) has a different end state for each key, i.e. one that produces the index of the key that was recognized.

Part a) is as easy as calling regcomp() appropriately. Unfortunately, this won't give you the index you need for part b) right away; it will only provide the beginning and end positions of the recognized string.

So, unless you want to go through the trouble of reimplementing a regex compilation routine, I guess your best bet is to subsequently look up the key from a hash table. However, again it is difficult to use a standard hash table implementation without triggering memory allocation by passing the key as a string. Of course, you can try to use a perfect hash for the lookup. Nevertheless, any compromise that takes you away from a finite state machine with the two properties a) and b) will incur a heavy slowdown.
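The fallback described here (let a compiled regex find the key, then look the matched text up in a hash table) can be sketched as follows. This is only an illustration in Python rather than C: the names `REPLACEMENTS`, `KEY_PATTERN`, and `replace_whole` are made up for the example, and Python's `re` module is a backtracking matcher rather than the finite state machine the answer asks for, so it shows the shape of the approach and not its performance.

```python
import re

# Hypothetical data, taken from the example in the question.
REPLACEMENTS = {"turn": "foo", "tween": "bar", "rag": "baz"}

# One alternation over all keys. re.escape protects keys that contain
# regex metacharacters; longer keys are listed first so that a key is
# never shadowed by another key that is its prefix.
KEY_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True))
)


def replace_whole(text):
    """Return the replacement for the key found in text, or None if no key occurs."""
    match = KEY_PATTERN.search(text)
    if match is None:
        return None
    # The regex only reports which substring matched, so the replacement
    # still has to be looked up in the hash table.
    return REPLACEMENTS[match.group(0)]


print([replace_whole(s) for s in ["returns", "between", "paragraphs"]])
```

The dictionary lookup on `match.group(0)` is the extra step the answer warns about: it creates a new string for the key, which is the allocation a state machine with distinct end states would avoid.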

Tulains Córdova · Accepted Answer · 2016-08-18 23:28:03Z

1. Create an empty list to store the results.
2. Make a key/value structure such as a HashMap or Dictionary with the pairs turn => foo, tween => bar, rag => baz.
3. Make a list of the input strings returns, between, paragraphs, etc.
4. Iterate the list of input strings.
5. Inside that loop, iterate the map of key/value pairs.
6. For the first key that the input string contains as a substring, add the value for that key to the results list you created in step 1 and break out of the inner loop to continue with the next input string.
7. Optionally iterate the results list to build a comma-separated list like foo, bar, baz.
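A minimal Python sketch of this naive approach, assuming the outer loop runs over the input strings and the inner loop over the pairs, so that every input yields exactly one result. The function name `replace_all` is hypothetical, and leaving an unmatched string unchanged is an assumption of the sketch; the question guarantees that this case does not occur.

```python
def replace_all(strings, pairs):
    """For each input string, emit the value of the first key it contains."""
    results = []
    for s in strings:
        for key, value in pairs.items():
            if key in s:  # plain substring test
                results.append(value)
                break  # one key per string, so stop at the first hit
        else:
            # Assumption: no key matched, so keep the string unchanged.
            results.append(s)
    return results


pairs = {"turn": "foo", "tween": "bar", "rag": "baz"}
print(", ".join(replace_all(["returns", "between", "paragraphs"], pairs)))
```

This performs at most N × M substring tests, which may be acceptable when M is small.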

Isn't that sort of a naive implementation? I was thinking perhaps use Aho-Corasick or Commentz-Walter algorithm to do the matching...
– chx
Commented Aug 19, 2016 at 5:54

@chx If you already know these algorithms, why didn't you mention that in the question? Why don't they fit your problem? One point to keep in mind though is that your M is small, and that a naïve algorithm may be able to outperform a more sophisticated one for small inputs. In particular, trie based algorithms will involve a lot of pointer indirection. If you're not looking for “good enough” performance but want to perfectly tune your application, consider benchmarking this.
– amon
Commented Aug 19, 2016 at 7:01

YSharp · Accepted Answer · 2016-09-30 17:08:37Z

I would first go with the Aho–Corasick algorithm, which has O(n+m) complexity in time and O(m) in space (n: length of the input; m: combined length of the patterns), and measure whether that's "good enough" (your call), especially since you already know that you can expect exactly one occurrence of one of the patterns in any given input.
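For reference, a compact Python sketch of an Aho–Corasick automaton that stops at the first match and returns the index of the matched pattern. The function names `build_automaton` and `find_key` and the list-based node layout are choices made for this illustration, and the sketch assumes that all patterns are non-empty.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto/fail/output tables for the given non-empty patterns."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # failure link of each node
    out = [-1]    # index of a pattern ending at this node, or -1
    for idx, pattern in enumerate(patterns):
        node = 0
        for ch in pattern:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append(-1)
            node = nxt
        out[node] = idx
    # Breadth-first pass: depth-1 nodes keep fail = 0; deeper nodes
    # derive their failure link from the link of their parent.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            if out[child] == -1:
                # Inherit a pattern that ends at a proper suffix of this node.
                out[child] = out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_key(text, automaton):
    """Return the index of the first pattern found in text, or -1."""
    goto, fail, out = automaton
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        if out[node] != -1:
            return out[node]
    return -1


keys = ["turn", "tween", "rag"]
values = ["foo", "bar", "baz"]
automaton = build_automaton(keys)
print([values[find_key(s, automaton)] for s in ["returns", "between", "paragraphs"]])
```

The automaton is built once from the M keys and reused for all N inputs, and `find_key` returns as soon as it reaches a matching state, which fits the guarantee of exactly one occurrence per input.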

HTH,
