Preg_Replace and UTF8

Question

I'm enhancing our video search page to highlight the search term(s) in the results. Because a user can enter judas priest while a video has Judas Priest in its text, I have to use regular expressions to preserve the case of the original text.

My code works, but I have problems with special characters like š, č and ž. It seems that Preg_Replace() will only match if the case is the same (despite the /ui modifiers). My code:

$Content = Preg_Replace ( '/\b(' . $term . '?)\b/iu', '<span class="HighlightTerm">$1</span>', $Content );

I also tried this:

$Content = Mb_Eregi_Replace ( '\b(' . $term . '?)\b', '<span class="HighlightTerm">\\1</span>', $Content );

But that doesn't work either. It will match "SREČA" if the search term is "SREČA", but if the search term is "sreča" it will not match (and vice versa).

So how do I make this work?

Update: I set the locale and internal encoding:

Mb_Internal_Encoding ( 'UTF-8' );
$loc = "UTF-8";
putenv("LANG=$loc");
$loc = setlocale(LC_ALL, $loc);

Have you considered what will happen if the user enters a special character such as a / or * in the search query? — Mark Byers, Commented Jan 14, 2010 at 9:30
Search term is sanitized before I do anything with it. Thanks for the comment though. — Jan Hančič, Commented Jan 14, 2010 at 9:31

Jan Hančič · Accepted Answer · 2010-01-17 14:04:55Z

6

I feel really stupid right about now, but the problem wasn't with the Preg_* functions at all. For some reason I first checked whether the given term was in the string at all using StriPos. Since that function is not multi-byte safe, it returned false whenever the case of the text differed from the search term, so Preg_Replace was never even called.

So the lesson here is: always use the multi-byte versions of functions when you are working with UTF-8 strings.
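
As a rough illustration of that lesson, here is a minimal sketch that swaps the StriPos pre-check for mb_stripos before running the replacement. The function name highlightTerm and the sample sentence are made up for this example, the term is passed through preg_quote, and the trailing ? from the pattern in the question is left out for simplicity.

```php
<?php
mb_internal_encoding('UTF-8');

// Hypothetical helper: wraps every case-insensitive match of $term in a span.
function highlightTerm(string $content, string $term): string
{
    // mb_stripos() folds case per character, so "sreča" is found inside "SREČA".
    // The byte-oriented stripos() would typically return false here.
    if ($term === '' || mb_stripos($content, $term, 0, 'UTF-8') === false) {
        return $content;
    }

    // preg_quote() escapes regex metacharacters and the "/" delimiter in the term.
    $pattern = '/\b(' . preg_quote($term, '/') . ')\b/iu';

    // preg_replace() returns null on error (e.g. invalid UTF-8); fall back to the input.
    return preg_replace($pattern, '<span class="HighlightTerm">$1</span>', $content) ?? $content;
}

$content = 'Danes je SREČA na naši strani.';
$result  = highlightTerm($content, 'sreča');
echo $result, "\n";
```

The original casing in the text is preserved because the replacement reuses the captured group rather than the search term.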

Amen, brother. Amen.
– Anthony Rutledge
Commented Jun 21, 2016 at 15:01

gnarf · Accepted Answer · 2010-01-14 10:40:00Z

3

Not sure what your problem is stemming from, but I just put together this little test case:

<?php

$uc = "SREČA";

mb_internal_encoding('utf-8');
echo $uc."\n";
$lc = mb_strtolower($uc);
echo $lc."\n";

echo preg_replace("/\b(".preg_quote($uc).")\b/ui", "<span class='test'>$1</span>", "test:".$lc." end test");

Its output on my machine:

SREČA
sreča
test:<span class='test'>sreča</span> end test

Seems to be working properly?

edited Jan 14, 2010 at 10:40

answered Jan 14, 2010 at 10:23

Adding mb_regex_encoding does not solve the issue (I already have the other two) :\
– Jan Hančič
Commented Jan 14, 2010 at 10:26

troelskn · Accepted Answer · 2010-01-14 10:00:47Z

2

If I'm not mistaken, preg_match uses the current locale. Try setting the locale to the language these characters belong to. You probably need a UTF-8-based locale too. If you have mixed languages in your page, you may be able to find a generic international locale that works.

See also: http://www.phpwact.org/php/i18n/utf-8

UTF-8 on its own is probably not a valid locale name on any system. Try running locale -a in a shell to list the supported locales. You probably want one that looks like en_GB.utf8.
– troelskn
Commented Jan 14, 2010 at 10:16
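
A small sketch of how that could look in PHP, assuming one of the listed names is installed. The candidate names below are guesses and vary by system, so check them against the output of locale -a. setlocale() accepts an array of candidates and returns the first one it could set, or false if none worked.

```php
<?php
// Assumed candidate names; replace them with entries from `locale -a` on your system.
$candidates = ['en_GB.utf8', 'en_GB.UTF-8', 'en_US.utf8', 'en_US.UTF-8', 'C.UTF-8'];

// setlocale() tries each name in order and returns the one that was applied, or false.
$chosen = setlocale(LC_ALL, $candidates);

if ($chosen === false) {
    echo "None of the candidate locales is available on this system.\n";
} else {
    echo "Locale set to: $chosen\n";
}
```

Checking the return value matters here: a failed setlocale() call is silent otherwise, which is easy to miss when a bare "UTF-8" is passed as the locale name.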
