How do I remove emoji characters from a string?

Question

I've got a text input from a mobile device. It contains emoji. In C#, I have the text as

Text 🍫🌐 text

Simply put, I want the output text to be

Text text

I'm trying to remove all such emoji from the text with a regex, except I'm not sure how to convert an emoji into its Unicode sequence. How do I do that?

edit:

I'm trying to save the user input into MySQL. It looks like MySQL's utf8 character set doesn't really support all Unicode characters, and the right way to fix that would be to change the schema, but I don't think that is an option for me. So I'm trying to remove all the emoji characters before saving the text in the database.

This is my schema for the relevant column:

[image: schema of the relevant column]

I'm using Nhibernate as my ORM and the insert query generated looks like this:

Insert into `Content` (ContentTypeId, Comments, DateCreated) 
values (?p0, ?p1, ?p2);
?p0 = 4 [Type: Int32 (0)]. ?p1 = 'Text 🍫🌐 text' [Type: String (20)], ?p2 = 19/01/2015 10:38:23 [Type: DateTime (0)]

When I copy this query from logs and run it on mysql directly, I get this error:

1 warning(s): 1366 Incorrect string value: '\xF0\x9F\x98\x80 t...' for column 'Comments' at row 1   0.000 sec

Also, I've tried to convert it into encoded bytes, and it doesn't really work:

[image: attempt to convert the text to encoded bytes]

UTF-8 really should be fine here. Can you post the details of how you're currently trying to save the data, along with your schema information? — Jon Skeet, Commented Jan 19, 2015 at 11:41
@LocustHorde Which version of MySQL are you running on? Seemingly the character set utf8mb4 should make everything tikitiboo... have a read of the answer here stackoverflow.com/questions/24253985/… "It seems that MySQL supports two forms of unicode ucs2 which is 16-bits per character and utf8 up to 3 bytes per character. The bad news is that neither form is going to support plane 1 characters which require at 17 bits. (mainly emoji). It looks like MySQL 5.5.3 and up also support utf8mb4, utf16, and utf32 and supplementary characters (read emoji)" — BLoB, Commented Jan 19, 2015 at 12:00
Something to be aware of from stackoverflow.com/questions/10992921/… "However, note that there are other characters in the Basic Multilingual Plane that are used as emoji by phones but which long predate emoji. For example U+2665 is the traditional Heart Suit character ♥, but it may be rendered as an emoji graphic on some devices. It's up to you whether you treat this as emoji and try to remove it." — BLoB, Commented Jan 19, 2015 at 12:32
Octopoid's gist doesn't convert them, it removes them. If you want to just remove any characters not in the BMP, that's reasonably easy. — Jon Skeet, Commented Jan 19, 2015 at 12:46

Jon Skeet · Accepted Answer · 2023-04-28 06:17:19Z

59

Assuming you just want to remove all non-BMP characters, i.e. anything with a Unicode code point of U+10000 and higher, you can use a regex to remove any UTF-16 surrogate code units from the string. For example:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
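        string text = "x\U0001F310y"; // 'x', U+1F310 (a surrogate pair in UTF-16), 'y'
        Console.WriteLine(text.Length); // 4
        // Remove every UTF-16 surrogate code unit (Unicode category Cs).
        string result = Regex.Replace(text, @"\p{Cs}", "");
        Console.WriteLine(result.Length); // 2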
    }
}

Here "Cs" is the Unicode category for "surrogate".

It appears that Regex works based on UTF-16 code units rather than Unicode code points, otherwise you'd need a different approach.
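
As a small check of that behaviour, the sketch below counts how many separate matches `\p{Cs}` produces for a string containing a single non-BMP character. The class and method names (`SurrogateMatchDemo`, `CountSurrogateMatches`) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

static class SurrogateMatchDemo
{
    // Hypothetical helper: counts how many separate matches \p{Cs} finds.
    public static int CountSurrogateMatches(string text) =>
        Regex.Matches(text, @"\p{Cs}").Count;

    static void Main()
    {
        // One non-BMP character (U+1F310) between two ASCII letters.
        string text = "x\U0001F310y";

        // Prints 2: the high and the low surrogate are matched one at a time,
        // which is what you would expect if Regex sees UTF-16 code units.
        Console.WriteLine(CountSurrogateMatches(text));
    }
}
```

If Regex worked on whole code points instead, U+1F310 would presumably be seen as a single character in its own category rather than as two surrogates.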

Note that there are non-BMP characters other than emoji, but I suspect you'll find they'll have the same problem when you try to store them.

Additionally, note that this won't remove emoji in the BMP, such as U+2764 (red heart). You can use the above as an example of how to remove characters in specific Unicode categories: the category for U+2764 is "other symbol", for example. Whether you want to remove all "other symbols" is a different matter.
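
To illustrate the category-based variant, here is a minimal sketch that removes everything in the "other symbol" category (`So`). The names `OtherSymbolDemo` and `RemoveOtherSymbols` are hypothetical. It strips more than emoji, so treat it as a demonstration of the trade-off rather than a recommendation.

```csharp
using System;
using System.Text.RegularExpressions;

static class OtherSymbolDemo
{
    // Hypothetical helper: removes every character in Unicode category So.
    public static string RemoveOtherSymbols(string text) =>
        Regex.Replace(text, @"\p{So}", "");

    static void Main()
    {
        // U+2764 (heart) and U+00A9 (copyright sign) are both "other symbol".
        string text = "a\u2764b\u00A9c";
        Console.WriteLine(RemoveOtherSymbols(text)); // abc
    }
}
```

Because the surrogates that make up a non-BMP emoji are in category `Cs`, not `So`, this pattern on its own leaves characters such as U+1F310 in place; a character class like `[\p{Cs}\p{So}]` would cover both cases.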

But if you're really just interested in removing surrogate pairs because they can't be stored properly, the above should be fine.
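
If you would rather not use a regex for this, the same removal can be sketched with `char.IsSurrogate` and a `StringBuilder`. The class and method names here (`SurrogateStripper`, `StripSurrogates`) are invented for the example, and the input mirrors the text from the question.

```csharp
using System;
using System.Text;

static class SurrogateStripper
{
    // Hypothetical helper: copies every UTF-16 code unit that is not a
    // surrogate, which drops all non-BMP characters (and any lone surrogates).
    public static string StripSurrogates(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (!char.IsSurrogate(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        // "Text 🍫🌐 text" written with escapes: U+1F36B and U+1F310.
        string input = "Text \U0001F36B\U0001F310 text";
        string result = StripSurrogates(input);
        Console.WriteLine(input.Length);  // 14
        Console.WriteLine(result.Length); // 10
    }
}
```

Note that the spaces on either side of the removed emoji are both kept, so the result contains two consecutive spaces; collapsing whitespace would be a separate step.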

edited Apr 28, 2023 at 6:17

answered Jan 19, 2015 at 13:36


Hi, I wrote the question to describe what I thought was my problem, but I tried out your answer and it turns out I don't actually need to convert them. So I have edited the question now! i.imgur.com/NoQfxud.png Thank you!
– LocustHorde
Commented Jan 19, 2015 at 14:48
@LocustHorde: So long as you're aware that you're just throwing away bits of the user's input...
– Jon Skeet
Commented Jan 19, 2015 at 14:54
@GilSand: Well, did you look at what Unicode categories those characters are in? It's probably best to ask a new question with a complete example, rather than "one or two of them" (leaving us guessing which). We can then look at what's going on much more easily.
– Jon Skeet
Commented Oct 24, 2017 at 7:49
@JonSkeet You're right. Here's a link to the new question for you or future travelers : stackoverflow.com/questions/46905176/detecting-all-emojis
– Gil Sand
Commented Oct 24, 2017 at 8:02
@Clement: Yes, but it will also remove "other symbols" that aren't emojis... e.g. the copyright sign ©. If I were only trying to remove emoji, I wouldn't expect the copyright sign to be removed.
– Jon Skeet
Commented May 17, 2023 at 6:11
