Best way to find a substring in a string

Question

I have a problem where I am trying to search for a substring in string. That substring may or may not be in the string.

str = "hello how are you?"
substr = "how are"

Two ways which I know if can be done are:

string.indexOf("how are")
regex

But, is there any other "optimized" way? What would you have done?

Can Ruby provide a better answer? Since we use jRuby the answer can be in Ruby or Java.

Know it's old: But "best way" depends a lot on what the goal is. If it is ease of development, str.include?(substring) is great, but if it's speed then there might be some other approach that is better. — Automatico, Commented May 18, 2014 at 15:59

the Tin Man · Accepted Answer · 2015-11-21 22:52:48Z

11

In Ruby, use the String#include? method:

str = "hello how are you?"
substr = "how are"
str.include? substr

which returns true.

edited Nov 21, 2015 at 22:52

the Tin Man

160k44 gold badges218 silver badges306 bronze badges

answered Oct 6, 2010 at 19:45

Jas

1,1415 silver badges16 bronze badges

I know that I'm late here, but remember that "include?" is case sensitive.
– HomeworkHopper
Commented May 12, 2020 at 16:17

Add a comment |

the Tin Man · Accepted Answer · 2020-05-10 06:20:30Z

"What would you have done?"

I'd do a benchmark and try comparing different ways of accomplishing the same thing to learn which is fastest.

In older versions of Ruby we'd see the regex-based searches running more slowly. The new engine in 1.9.2, which I'm using for the benchmark, makes a big difference. In particular, unanchored searches used to be a lot slower than anchored. Now it's a wash whether you use regex or a fixed string searches for the most part. Using match() without precompiling the regex is a painful hit for speed so if you're doing a lot of loops with the same pattern then it makes sense to assign the pattern to a variable and reference the variable.

The times displayed are how long each test took to perform "n" (750,000) iterations, so lower numbers are better.

require 'benchmark'

LOREM = %q{Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut et convallis purus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Cras interdum nibh et nunc pellentesque vestibulum. Donec elementum felis malesuada urna vehicula consectetur commodo urna accumsan. Phasellus bibendum euismod tincidunt. Sed pellentesque cursus faucibus. Etiam bibendum tincidunt nibh eget ultrices. Fusce imperdiet, felis id consequat imperdiet, justo est ultrices elit, sed vestibulum dui nibh vel felis. Sed feugiat, libero quis consequat semper, magna tellus facilisis enim, rutrum adipiscing eros mauris commodo metus. Sed lobortis aliquet augue ac sodales. Quisque pharetra odio vel augue tempus porttitor.}

REGEX1 = %r{/porttitor\.$/}
REGEX2 = %r{/porttitor\./}
REGEX3 = %r{/porttitor\.\Z/}

n = 750_000
puts "word in string"
Benchmark.bm(15) do |x|
  x.report('string[""]:')   { n.times { LOREM['porttitor.']          } }
  x.report('string[//]:')   { n.times { LOREM[/porttitor\./]         } } # unanchored regex
  x.report('string[/$/]:')  { n.times { LOREM[/porttitor\.$/]        } } # anchored regex
  x.report('string[/\Z/]:') { n.times { LOREM[/porttitor\.\Z/]       } } # anchored regex
  x.report('index():')      { n.times { LOREM.index('porttitor.')    } }
  x.report('include?():')   { n.times { LOREM.include?('porttitor.') } }
  x.report('match($):')     { n.times { LOREM.match(/porttitor\.$/)  } }
  x.report('match(\Z):')    { n.times { LOREM.match(/porttitor\.\Z/) } }
  x.report('match():')      { n.times { LOREM.match(/porttitor\./)   } }
  x.report('match2($):')    { n.times { LOREM.match(REGEX1)          } } # compiled regex w/ anchor
  x.report('match2():')     { n.times { LOREM.match(REGEX2)          } } # compiled report w/out anchor
  x.report('match2(\Z):')   { n.times { LOREM.match(REGEX3)          } } # compiled regex w/ anchor
end

puts
puts "word not in string"
Benchmark.bm(15) do |x|
  x.report('string[""]:')   { n.times { LOREM['porttit0r.']          } }
  x.report('string[//]:')   { n.times { LOREM[/porttit0r\./]         } } # unanchored regex
  x.report('string[/$/]:')  { n.times { LOREM[/porttit0r\.$/]        } } # anchored regex
  x.report('string[/\Z/]:') { n.times { LOREM[/porttit0r\.\Z/]       } } # anchored regex
  x.report('index():')      { n.times { LOREM.index('porttit0r.')    } }
  x.report('include?():')   { n.times { LOREM.include?('porttit0r.') } }
  x.report('match($):')     { n.times { LOREM.match(/porttit0r\.$/)  } }
  x.report('match(\Z):')    { n.times { LOREM.match(/porttit0r\.\Z/) } }
  x.report('match():')      { n.times { LOREM.match(/porttit0r\./)   } }
end

With the output:

word in string
                      user     system      total        real
string[""]:       0.670000   0.000000   0.670000 (  0.675319)
string[//]:       0.700000   0.000000   0.700000 (  0.706148)
string[/$/]:      0.720000   0.000000   0.720000 (  0.716853)
string[/\Z/]:     0.530000   0.000000   0.530000 (  0.527568)
index():          0.630000   0.000000   0.630000 (  0.638562)
include?():       0.610000   0.000000   0.610000 (  0.603223)
match($):         1.690000   0.000000   1.690000 (  1.696045)
match(\Z):        1.520000   0.010000   1.530000 (  1.532107)
match():          1.700000   0.000000   1.700000 (  1.698748)
match2($):        0.840000   0.000000   0.840000 (  0.847590)
match2():         0.840000   0.000000   0.840000 (  0.840969)
match2(\Z):       0.840000   0.000000   0.840000 (  0.835557)

word not in string
                      user     system      total        real
string[""]:       0.570000   0.000000   0.570000 (  0.578120)
string[//]:       0.740000   0.000000   0.740000 (  0.734751)
string[/$/]:      0.730000   0.000000   0.730000 (  0.735599)
string[/\Z/]:     0.560000   0.000000   0.560000 (  0.563673)
index():          0.620000   0.000000   0.620000 (  0.619451)
include?():       0.570000   0.000000   0.570000 (  0.574413)
match($):         0.910000   0.010000   0.920000 (  0.910059)
match(\Z):        0.730000   0.000000   0.730000 (  0.726533)
match():          0.950000   0.000000   0.950000 (  0.960865)

For reference, here's some numbers using Ruby 1.8.7, which is the default for Snow Leopard:

word in string
                     user     system      total        real
string[""]:      1.130000   0.000000   1.130000 (  1.130687)
string[//]:      1.170000   0.000000   1.170000 (  1.165692)
string[/$/]:     1.180000   0.000000   1.180000 (  1.184954)
string[/\Z/]:    1.180000   0.000000   1.180000 (  1.179168)
index():         1.070000   0.000000   1.070000 (  1.077791)
include?():      1.060000   0.000000   1.060000 (  1.056430)
match($):        1.470000   0.010000   1.480000 (  1.472797)
match(\Z):       1.480000   0.000000   1.480000 (  1.490172)
match():         1.480000   0.000000   1.480000 (  1.478146)
match2($):       0.650000   0.000000   0.650000 (  0.653029)
match2():        0.570000   0.000000   0.570000 (  0.574384)
match2(\Z):      0.640000   0.000000   0.640000 (  0.646688)

word not in string
                     user     system      total        real
string[""]:      1.040000   0.000000   1.040000 (  1.038885)
string[//]:      0.510000   0.000000   0.510000 (  0.507031)
string[/$/]:     0.510000   0.000000   0.510000 (  0.508425)
string[/\Z/]:    0.500000   0.000000   0.500000 (  0.507316)
index():         1.060000   0.000000   1.060000 (  1.055157)
include?():      1.030000   0.000000   1.030000 (  1.037060)
match($):        0.630000   0.000000   0.630000 (  0.623627)
match(\Z):       0.620000   0.000000   0.620000 (  0.624737)
match():         0.620000   0.000000   0.620000 (  0.623049)

I added additional tests to give some ideas of the effects of using only unanchored and anchored regex:

require 'fruity'

LOREM = %{Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut et convallis purus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Cras interdum nibh et nunc pellentesque vestibulum. Donec elementum felis malesuada urna vehicula consectetur commodo urna accumsan. Phasellus bibendum euismod tincidunt. Sed pellentesque cursus faucibus. Etiam bibendum tincidunt nibh eget ultrices. Fusce imperdiet, felis id consequat imperdiet, justo est ultrices elit, sed vestibulum dui nibh vel felis. Sed feugiat, libero quis consequat semper, magna tellus facilisis enim, rutrum adipiscing eros mauris commodo metus. Sed lobortis aliquet augue ac sodales. Quisque pharetra odio vel augue tempus porttitor.}

compare do
  str_slice_regex  { LOREM[/porttitor\./]         } # unanchored regex
  str_slice_dollar { LOREM[/porttitor\.$/]        } # anchored regex
  str_slice_ctrlZ  { LOREM[/porttitor\.\Z/]       } # anchored regex
  str_slice_ctrlz  { LOREM[/porttitor\.\z/]       } # anchored regex
end

# >> Running each test 8192 times. Test will take about 1 second.
# >> str_slice_ctrlz is similar to str_slice_ctrlZ
# >> str_slice_ctrlZ is faster than str_slice_regex by 2x ± 0.1
# >> str_slice_regex is similar to str_slice_dollar

This is using Fruity, so the results aren't directly correlated with the information above, but it's still useful.

Here's some updated information:

# >> Running on Ruby v.2.7.0
# >> word in string
# >>                       user     system      total        real
# >> string[""]:       0.368283   0.000147   0.368430 (  0.368468)
# >> string[//]:       0.329253   0.000080   0.329333 (  0.329466)
# >> string[/$/]:      0.330270   0.000172   0.330442 (  0.330594)
# >> string[/\Z/]:     0.183119   0.000048   0.183167 (  0.183209)
# >> index():          0.358397   0.000289   0.358686 (  0.360185)
# >> include?():       0.352700   0.000196   0.352896 (  0.353056)
# >> match($):         0.761605   0.001502   0.763107 (  0.763297)
# >> match(\Z):        0.631132   0.000507   0.631639 (  0.631767)
# >> match():          0.765219   0.000634   0.765853 (  0.766199)
# >> match2($):        0.394938   0.000128   0.395066 (  0.395173)
# >> match2():         0.391687   0.000080   0.391767 (  0.391879)
# >> match2(\Z):       0.389440   0.000089   0.389529 (  0.389678)
# >> 
# >> word not in string
# >>                       user     system      total        real
# >> string[""]:       0.365097   0.000117   0.365214 (  0.365262)
# >> string[//]:       0.388117   0.000361   0.388478 (  0.389008)
# >> string[/$/]:      0.381933   0.000091   0.382024 (  0.382061)
# >> string[/\Z/]:     0.236101   0.000086   0.236187 (  0.236307)
# >> index():          0.369898   0.000131   0.370029 (  0.370267)
# >> include?():       0.361057   0.000065   0.361122 (  0.361202)
# >> match($):         0.409296   0.000390   0.409686 (  0.410103)
# >> match(\Z):        0.258903   0.000074   0.258977 (  0.259095)
# >> match():          0.399220   0.000070   0.399290 (  0.399386)
# >> --------------------
# >> Running each test 16384 times. Test will take about 1 second.
# >> str_slice_ctrlz is similar to str_slice_ctrlZ
# >> str_slice_ctrlZ is faster than str_slice_dollar by 2x ± 0.1
# >> str_slice_dollar is similar to str_slice_regex

"Finding if a sentence contains a specific phrase in Ruby" is related.

Nice job! But it remains be strange for me, that the str[substr] is slower than regexps. In theory, str[substr] is the easiest method to implement in the way of good perfomance. I hope, once japans will optimize it. — Nakilon, Commented Oct 7, 2010 at 17:38
I haven't looked at the source, but I suspect that str[substr] really treats every substring inside the brackets as a regex, forcing a recompile of the pattern, or skipping to some other code that does the fixed-string search. Doing a fixed-string search in C/C++ is easy so there must be some other processing going on. Or, I did the benchmark wrong but I don't think so. — the Tin Man, Commented Oct 7, 2010 at 18:07

the Tin Man · Accepted Answer · 2020-05-10 06:25:02Z

5

For an overview of "other ways", you can start with the "String-searching algorithm" article on Wikipedia.

Indexing strings using "Substring index" is one very obvious way of speeding things up, as mentioned by Martin, which only is appropriate if you are doing several searches over the same string:

edited May 10, 2020 at 6:25

the Tin Man

160k44 gold badges218 silver badges306 bronze badges

answered Oct 6, 2010 at 19:47

HostileFork says dont trust SE

33.4k12 gold badges100 silver badges170 bronze badges

Add a comment |

the Tin Man · Accepted Answer · 2020-02-14 19:01:13Z

2

If you want only check if a substring is in string, you can use: str[substr].

It returns substring or nil.

edited Feb 14, 2020 at 19:01

the Tin Man

160k44 gold badges218 silver badges306 bronze badges

answered Oct 6, 2010 at 19:46

Nakilon

35.4k15 gold badges111 silver badges146 bronze badges

Add a comment |

Martin Törnwall · Accepted Answer · 2010-10-06 19:46:42Z

0

To my knowledge there's no "magic" way to search for substrings really fast, unless you're ready to build some kind of search metadata (think index) beforehand. Doing so will most likely waste more time than it saves you unless you search through the same string a lot.

As the search pattern is simple, I would steer clear of regex.

answered Oct 6, 2010 at 19:46

Martin Törnwall

9,5192 gold badges30 silver badges36 bronze badges

Add a comment |

O. Jones · Accepted Answer · 2012-08-20 12:40:02Z

0

If you're convinced that the facilities in your runtime system (which offer string-in-string search and the like) are not fast enough for your application, try implementing the KMP algorithm.

But, the implementers of modern runtime systems probably have already done this for you.

edited Aug 20, 2012 at 12:40

O. Jones

107k17 gold badges127 silver badges178 bronze badges

answered Oct 6, 2010 at 20:16

Stan Kurilin

15.7k22 gold badges83 silver badges132 bronze badges

Add a comment |

sirmak · Accepted Answer · 2010-10-06 19:41:47Z

-2

the best way is indexof, regex is slower

answered Oct 6, 2010 at 19:41

sirmak

3,7144 gold badges31 silver badges34 bronze badges

Show data for your assertions.
– the Tin Man
Commented Feb 14, 2020 at 19:02

Add a comment |

Collectives™ on Stack Overflow

Best way to find a substring in a string

7 Answers 7

Not the answer you're looking for? Browse other questions tagged
java
ruby
string
or ask your own question.

Linked

Hot Network Questions

Collectives™ on Stack Overflow

7 Answers 7

Not the answer you're looking for? Browse other questions tagged javarubystring or ask your own question.

Linked

Related

Not the answer you're looking for? Browse other questions tagged
java
ruby
string
or ask your own question.