
I have some UTF-8 text in Burmese that I am processing with PHP. At some point along the way, some zero-width spaces (ZWSPs) have crept in and I would like to remove them. I have tried two different ways of removing the characters, and neither seems to work.

First I tried:

  $newBody = str_replace("&#8203;", "", $newBody);

to search for the HTML entity and remove it, as this is how it appears under Web Inspector. The spaces don't get removed. I have also tried it as:

  $newBody = str_replace("&#8203", "", $newBody);

and got the same result: nothing was removed.

The second method I tried came from this question: "Remove ZERO WIDTH NON-JOINER character from a string in PHP",

which looked like this:

 $newBody = str_replace("\xE2\x80\x8C", "", $newBody);

but I also got no result. The ZWSP was not removed.

An example word in the text ($newBody) looks like this : ယူ​​က​​ရိန်
And I want to make it look like this : ယူကရိန်း

Any ideas? Would a preg_replace work better somehow?

So I did try

$newBody = preg_replace("/\xE2\x80\x8B/", "", $newBody);

and it appears to be working, but now there is another issue.

<a class="defined" title="Ukraine">ယူ&#8203;က&#8203;ရိန်း</a>

gets transformed into

<a class="defined _tt_t_" title="Ukraine" style="font-family: 'Masterpiece Uni Sans', TharLon, Myanmar3, Yunghkio, Padauk, Parabaik, 'WinUni Innwa', 'Win Uni Innwa', 'MyMyanmar Unicode', Panglong, 'Myanmar Sangam MN', 'Myanmar MN';">ယူကရိန်း</a>

I don't want it to add all that extra stuff. Any ideas why this is happening? Apart from coming up with some way to target only the text in between the tags, is there another way to prevent the preg_replace from adding all this extra stuff? By the way, I am using Google Chrome on a Mac. It seems to act a bit differently in Firefox...

  • Can you provide a short example of what $newBody might contain, and what you would like it to contain instead? The best way to remove nuisance characters is to understand how they got there in the first place.
    – codebeard
    Commented Mar 24, 2014 at 2:48
  • That last part is strange. Did you view the output using "inspect element" or "view source"? Commented Mar 24, 2014 at 7:21

2 Answers

This:

$newBody = str_replace("&#8203;", "", $newBody);

presumes the text is HTML entity encoded. This:

$newBody = str_replace("\xE2\x80\x8C", "", $newBody);

should work if the offending characters are not encoded, but it matches the wrong character: the bytes 0xE2808C are U+200C (ZERO WIDTH NON-JOINER). To match the same character as &#8203; (U+200B, ZERO WIDTH SPACE) you need 0xE2808B:

$newBody = str_replace("\xE2\x80\x8B", "", $newBody);
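
Since the pasted sample appears to contain both the raw character and its HTML entity, one option is to strip both forms in a single call. The following is only a minimal sketch: the function name `removeZwsp` is hypothetical, and the hexadecimal entity spellings are included on the assumption that they might also occur in the input.

```php
<?php
// Hypothetical helper (not part of the original answer): removes U+200B
// ZERO WIDTH SPACE whether it is stored as raw UTF-8 bytes or as an HTML entity.
function removeZwsp(string $text): string
{
    $targets = [
        "\xE2\x80\x8B", // raw UTF-8 bytes of U+200B
        "&#8203;",      // decimal HTML entity
        "&#x200B;",     // hexadecimal HTML entity, upper case (assumed possible)
        "&#x200b;",     // hexadecimal HTML entity, lower case (assumed possible)
    ];

    // str_replace accepts an array of search strings and removes each of them.
    return str_replace($targets, "", $text);
}

// One raw ZWSP and one entity-encoded ZWSP inside the sample word.
$newBody = "ယူ\xE2\x80\x8Bက&#8203;ရိန်း";
echo removeZwsp($newBody), "\n";
```

Note that the escape sequences only work inside double-quoted PHP strings; in single quotes, `'\xE2\x80\x8B'` would be taken literally and never match.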
  • so, that doesn't appear to be working, but I did try $newBody = preg_replace("/\xE2\x80\x8B/", "", $newBody); and it did work.
    – Jimmy Long
    Commented Mar 24, 2014 at 3:16
  • Great! FYI, when I copy/pasted your sample I found it has both encoded and unencoded examples of <200b>, so I presume you are also still doing a replace on '&#8203;'
    – Jef
    Commented Mar 24, 2014 at 3:20
  • I did try replacing '&#8203;' before, but I am not now. So maybe it's leftover. How did you see the encoded unencoded examples? I'm still a bit green regarding all this unicode stuff... And any ideas regarding the last part of the question I just added?
    – Jimmy Long
    Commented Mar 24, 2014 at 3:23
  • The html entity encoded strings (&#8203;) I could see in my browser. The unencoded example I could only see when I pasted your example into vi (a text editor commonly used in Linux/Unix), where it shows up as "<200b>" (the UTF-16 representation in hex) - though this is dependent on how my vi is set up. A good starting point for getting your head around character set issues is: joelonsoftware.com/articles/Unicode.html
    – Jef
    Commented Mar 25, 2014 at 15:32
  • str_replace("\xE2\x80\x8C", "", $content); worked for me
    – Gavin
    Commented Apr 28, 2022 at 17:32
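
The comments above disagree on which byte sequence works, which suggests that different texts contain different invisible characters. A rough way to check from PHP itself, rather than from an editor, is to list every character in the Unicode "format" category (Cf) together with its UTF-8 bytes. This is only a sketch: `listFormatChars` is a hypothetical name, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical diagnostic helper: counts the invisible "format" characters
// (Unicode general category Cf, which includes U+200B, U+200C, U+200D and
// U+FEFF) found in a UTF-8 string, keyed by their UTF-8 bytes in hex.
function listFormatChars(string $text): array
{
    $found = [];

    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match_all returns false for invalid UTF-8, which leaves $found empty.
    if (preg_match_all('/\p{Cf}/u', $text, $matches)) {
        foreach ($matches[0] as $char) {
            $hex = strtoupper(bin2hex($char));
            $found[$hex] = ($found[$hex] ?? 0) + 1;
        }
    }

    return $found;
}

// Two ZWSPs (E2808B) and one ZWNJ (E2808C) hidden between ASCII letters.
print_r(listFormatChars("a\xE2\x80\x8Bb\xE2\x80\x8Cc\xE2\x80\x8Bd"));
```

A key of E2808B means the text contains U+200B (ZWSP), while E2808C means U+200C (ZWNJ); the search string passed to str_replace has to match whichever one is actually present.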

If you want to remove zero width space characters from a UTF-8 string:

$string = preg_replace('/[\x{200B}-\x{200D}\x{FEFF}]/u', '', $string);
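
The character class in that pattern covers more than U+200B: the range U+200B to U+200D also includes ZERO WIDTH NON-JOINER (U+200C) and ZERO WIDTH JOINER (U+200D), and U+FEFF is the byte order mark. The joiners can be meaningful in some scripts, so a variant that keeps them may be safer in some cases. The sketch below wraps both behaviours; the function name `stripZeroWidth` and the `$keepJoiners` parameter are hypothetical and only for illustration.

```php
<?php
// Hypothetical wrapper around the pattern above. With $keepJoiners = true it
// removes only U+200B and U+FEFF and leaves U+200C / U+200D in place.
function stripZeroWidth(string $text, bool $keepJoiners = false): string
{
    $pattern = $keepJoiners
        ? '/[\x{200B}\x{FEFF}]/u'
        : '/[\x{200B}-\x{200D}\x{FEFF}]/u';

    $result = preg_replace($pattern, '', $text);

    // preg_replace returns null on failure (for example, when the subject is
    // not valid UTF-8 and /u is used); fall back to the unchanged input.
    return $result ?? $text;
}

$string = "ယူ\u{200B}က\u{200B}ရိန်း";
echo stripZeroWidth($string), "\n";
```

The `\u{...}` escape used in the sample string requires PHP 7 or later; on older versions the raw bytes `\xE2\x80\x8B` can be used instead.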

