I've got a text input from a mobile device. It contains emoji. In C#, I have the text as

Text 🍫🌐 text

Simply put, I want the output text to be

Text text

I'm trying to remove all such emoji from the text with a regex, but I'm not sure how to convert an emoji into its Unicode sequence. How do I do that?

edit:

I'm trying to save the user input into MySQL. It looks like MySQL's utf8 character set doesn't support all Unicode characters, and the right way to fix that would be to change the schema, but I don't think that is an option for me. So I'm trying to remove all the emoji characters before saving the text in the database.

This is my schema for the relevant column:

[screenshot of the column schema, not preserved]

I'm using Nhibernate as my ORM and the insert query generated looks like this:

Insert into `Content` (ContentTypeId, Comments, DateCreated) 
values (?p0, ?p1, ?p2);
?p0 = 4 [Type: Int32 (0)]. ?p1 = 'Text 🍫🌐 text' [Type: String (20)], ?p2 = 19/01/2015 10:38:23 [Type: DateTime (0)]

When I copy this query from logs and run it on mysql directly, I get this error:

1 warning(s): 1366 Incorrect string value: '\xF0\x9F\x98\x80 t...' for column 'Comments' at row 1   0.000 sec

Also, I've tried converting the text into encoded bytes, and that doesn't really work either.

[screenshot of the attempt, not preserved]

  • UTF-8 really should be fine here. Can you post the details of how you're currently trying to save the data, along with your schema information?
    – Jon Skeet
    Commented Jan 19, 2015 at 11:41
  • See here: gist.github.com/adamlwatson/9623703
    – Octopoid
    Commented Jan 19, 2015 at 11:41
  • @LocustHorde Which version of MySQL are you running? Seemingly the character set utf8mb4 should make everything tikitiboo... have a read of the answer here stackoverflow.com/questions/24253985/… "It seems that MySQL supports two forms of unicode ucs2 which is 16-bits per character and utf8 up to 3 bytes per character. The bad news is that neither form is going to support plane 1 characters which require at 17 bits. (mainly emoji). It looks like MySQL 5.5.3 and up also support utf8mb4, utf16, and utf32 and supplementary characters (read emoji)"
    – BLoB
    Commented Jan 19, 2015 at 12:00
  • Something to be aware of from stackoverflow.com/questions/10992921/… "However, note that there are other characters in the Basic Multilingual Plane that are used as emoji by phones but which long predate emoji. For example U+2665 is the traditional Heart Suit character ♥, but it may be rendered as an emoji graphic on some devices. It's up to you whether you treat this as emoji and try to remove it."
    – BLoB
    Commented Jan 19, 2015 at 12:32
  • Octopoid's gist doesn't convert them, it removes them. If you want to just remove any characters not in the BMP, that's reasonably easy.
    – Jon Skeet
    Commented Jan 19, 2015 at 12:46

1 Answer

Assuming you just want to remove all non-BMP characters, i.e. anything with a Unicode code point of U+10000 and higher, you can use a regex to remove any UTF-16 surrogate code units from the string. For example:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
        string text = "x\U0001F310y";
        Console.WriteLine(text.Length); // 4
        string result = Regex.Replace(text, @"\p{Cs}", "");
        Console.WriteLine(result.Length); // 2
    }
}

Here "Cs" is the Unicode category for "surrogate".

It appears that Regex works based on UTF-16 code units rather than Unicode code points, otherwise you'd need a different approach.
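Because a .NET string is a sequence of UTF-16 code units, the same filtering can be sketched without a regex by skipping every `char` that is a surrogate. This is only an illustration, not part of the original answer: the `SurrogateFilter` class and `RemoveSurrogates` method are hypothetical names, and the sample string reuses the text from the question.

```csharp
using System;
using System.Text;

static class SurrogateFilter
{
    // Hypothetical helper: copies every UTF-16 code unit except surrogates,
    // which drops all characters outside the BMP (U+10000 and above).
    public static string RemoveSurrogates(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (!char.IsSurrogate(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}

class SurrogateFilterDemo
{
    static void Main()
    {
        // U+1F36B (chocolate bar) and U+1F310 (globe with meridians)
        string text = "Text \U0001F36B\U0001F310 text";
        Console.WriteLine(text.Length); // 14: each emoji is two UTF-16 code units

        string cleaned = SurrogateFilter.RemoveSurrogates(text);
        Console.WriteLine(cleaned.Length); // 10: both spaces around the emoji remain
    }
}
```

Note that the spaces on either side of the removed emoji are both kept, so collapsing the resulting double space would be a separate step.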

Note that there are non-BMP characters other than emoji, but I suspect you'll find they'll have the same problem when you try to store them.

Additionally, note that this won't remove emojis in the BMP, such as U+2764 (red heart). You can use the above as an example of how to remove characters in specific Unicode categories - the category for U+2764 is "other symbol" for example. Now whether you want to remove all "other symbols" is a different matter.
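To illustrate why removing a whole category is a judgment call, here is a minimal sketch that strips everything in the "other symbol" category (`\p{So}`). The `OtherSymbolFilter` class and `RemoveOtherSymbols` method are hypothetical names chosen for this example. It removes U+2764, but it also removes the copyright sign U+00A9, which is in the same category and which most people would not call an emoji.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class OtherSymbolFilter
{
    // Hypothetical helper: removes every character in Unicode category So.
    public static string RemoveOtherSymbols(string input)
    {
        return Regex.Replace(input, @"\p{So}", "");
    }
}

class OtherSymbolDemo
{
    static void Main()
    {
        // Both characters report the same Unicode category.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u2764')); // OtherSymbol
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u00A9')); // OtherSymbol

        string text = "I \u2764 C# \u00A9 2015";
        string result = OtherSymbolFilter.RemoveOtherSymbols(text);
        Console.WriteLine(result); // "I  C#  2015": heart and copyright sign both gone
    }
}
```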

But if you're really interested in just removing surrogate pairs because they can't be stored properly, the above should be fine.

  • Hi, I made the question to describe what I thought was my problem.. but I tried out your answer and it turns out I don't actually need to convert them.. So I have edited the question now! i.imgur.com/NoQfxud.png Thank you! Commented Jan 19, 2015 at 14:48
  • @LocustHorde: So long as you're aware that you're just throwing away bits of the user's input...
    – Jon Skeet
    Commented Jan 19, 2015 at 14:54
  • @GilSand: Well, did you look at what Unicode categories those characters are in? It's probably best to ask a new question with a complete example, rather than "one or two of them" (leaving us guessing which). We can then look at what's going on much more easily.
    – Jon Skeet
    Commented Oct 24, 2017 at 7:49
  • @JonSkeet You're right. Here's a link to the new question for you or future travelers: stackoverflow.com/questions/46905176/detecting-all-emojis
    – Gil Sand
    Commented Oct 24, 2017 at 8:02
  • @Clement: Yes, but it will also remove "other symbols" that aren't emojis... e.g. the copyright sign ©. If I were only trying to remove emoji, I wouldn't expect the copyright sign to be removed.
    – Jon Skeet
    Commented May 17, 2023 at 6:11