Currently I'm using a HashSet of Tuples called Emoji to replace each emoji with a string representation, so that, for example, the bomb emoji becomes U0001F4A3. The conversion is done via

Emoji.Aggregate(input, (current, pair) => current.Replace(pair.Item1, pair.Item2));

Works as expected.

However, I'm trying to achieve the same thing without relying on a predefined list of 2600+ items. Has anyone already managed to replace the emoji in a string with their hex counterpart, without the leading \?

For example:

"This string contains the unicode character bomb (💣)"

becomes

"This string contains the unicode character bomb (U0001F4A3)"
  • Would you want to replace "every character above U+FFFF" with its hex representation? If so, that's relatively straightforward. If it's not that simple (either because there are characters in the BMP you want to replace, or there are characters not in the BMP that you don't want to replace), that's harder.
    – Jon Skeet
    Commented Jul 12, 2018 at 16:12
  • That would work yes. Commented Jul 12, 2018 at 16:18
  • Okay, will write an answer.
    – Jon Skeet
    Commented Jul 12, 2018 at 16:19

2 Answers

It sounds like you're happy to replace any character outside the Basic Multilingual Plane (BMP) with its hex representation. The code to do that is slightly long-winded, but it's pretty simple:

using System;
using System.Text;

class Test
{
    static void Main()
    {
        string text = "This string contains the unicode character bomb (\U0001F4A3)";
        Console.WriteLine(ReplaceNonBmpWithHex(text));
    }

    static string ReplaceNonBmpWithHex(string input)
    {
        // TODO: If most strings don't have any non-BMP characters, consider
        // an optimization of checking for high/low surrogate characters first,
        // and return input if there aren't any.
        StringBuilder builder = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            // A surrogate pair is a high surrogate followed by a low surrogate
            if (char.IsHighSurrogate(c))
            {
                if (i == input.Length - 1)
                {
                    throw new ArgumentException("High surrogate at end of string");
                }
                // Fetch the low surrogate, advancing our counter
                i++;
                char d = input[i];
                if (!char.IsLowSurrogate(d))
                {
                    // The high surrogate at i - 1 was not followed by a low surrogate
                    throw new ArgumentException($"Unmatched high surrogate at index {i - 1}");
                }
                uint highTranslated = (uint) ((c - 0xd800) * 0x400);
                uint lowTranslated = (uint) (d - 0xdc00);
                uint utf32 = (uint) (highTranslated + lowTranslated + 0x10000);
                builder.AppendFormat("U{0:X8}", utf32);
            }
            // We should never see a low surrogate on its own
            else if (char.IsLowSurrogate(c))
            {
                throw new ArgumentException($"Unmatched low surrogate at index {i}");
            }
            // Most common case: BMP character; just append it.
            else
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}

Note that this does not attempt to handle the situation where multiple characters are used together, as per Yury's answer. It would replace each modifier/emoji/secondary-char as a separate UXXXXXXXX part.
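
To make the surrogate arithmetic in the code above concrete, here is a small worked sketch. The names SurrogateMath, Combine and Format are made up for illustration. The framework's char.ConvertToUtf32(char, char) performs the same conversion, so it can serve as a cross-check or as a replacement for the manual math.

```csharp
using System;

static class SurrogateMath
{
    // Combines a UTF-16 surrogate pair into a code point by hand,
    // using the same arithmetic as ReplaceNonBmpWithHex.
    public static int Combine(char high, char low)
    {
        int highTranslated = (high - 0xD800) * 0x400; // top 10 bits
        int lowTranslated = low - 0xDC00;             // bottom 10 bits
        return highTranslated + lowTranslated + 0x10000;
    }

    // Formats a code point in the "U0001F4A3" style used in the question.
    public static string Format(int codePoint)
    {
        return "U" + codePoint.ToString("X8");
    }
}
```

For the bomb, whose UTF-16 form is D83D DCA3: (0xD83D - 0xD800) * 0x400 = 0xF400, 0xDCA3 - 0xDC00 = 0xA3, and 0xF400 + 0xA3 + 0x10000 = 0x1F4A3, so SurrogateMath.Format(SurrogateMath.Combine('\uD83D', '\uDCA3')) gives U0001F4A3.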

  • Thanks for providing this piece of code already Daisy. Much appreciated. I'll use it as a base for my coding. I noticed that not all emoji are being translated. For example the frowning face. Commented Jul 13, 2018 at 6:08
  • @KrisvanderMast: Do you mean U+2639? (fileformat.info/info/unicode/char/2639/index.htm) If so, that's in the BMP, which would be why it's not being replaced. You may need a list of "extra" characters (or possibly code blocks) to include. I think this may come down to what you count as an emoji.
    – Jon Skeet
    Commented Jul 13, 2018 at 6:28
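
A rough sketch of the "extra code blocks" idea from the comment above, assuming you'd want the Miscellaneous Symbols (U+2600 to U+26FF) and Dingbats (U+2700 to U+27BF) blocks replaced as well. That choice of blocks is only an assumption for illustration, and it will also replace symbols in those blocks that you may not consider emoji. EmojiHex, ExtraBmpRanges and ShouldReplace are hypothetical names. char.ConvertToUtf32(string, int) is used here in place of the manual surrogate math.

```csharp
using System;
using System.Text;

static class EmojiHex
{
    // Hypothetical list of BMP ranges to treat like non-BMP characters.
    // Adjust to whatever you count as an emoji.
    static readonly (int First, int Last)[] ExtraBmpRanges =
    {
        (0x2600, 0x26FF), // Miscellaneous Symbols block
        (0x2700, 0x27BF), // Dingbats block
    };

    static bool ShouldReplace(int codePoint)
    {
        if (codePoint > 0xFFFF)
        {
            return true; // everything outside the BMP
        }
        foreach (var range in ExtraBmpRanges)
        {
            if (codePoint >= range.First && codePoint <= range.Last)
            {
                return true;
            }
        }
        return false;
    }

    public static string Replace(string input)
    {
        var builder = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            // Throws ArgumentException on an unmatched surrogate.
            int codePoint = char.ConvertToUtf32(input, i);
            if (ShouldReplace(codePoint))
            {
                builder.Append('U').Append(codePoint.ToString("X8"));
            }
            else
            {
                builder.Append(input[i]);
            }
            if (codePoint > 0xFFFF)
            {
                i++; // skip the low surrogate we just consumed
            }
        }
        return builder.ToString();
    }
}
```

With these ranges, EmojiHex.Replace("frown \u2639 bomb \U0001F4A3") gives "frown U00002639 bomb U0001F4A3", while a string without any matching characters comes back unchanged.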

I'm afraid you're making one false assumption here. An emoji is not just a single "special Unicode char". The actual length of a particular emoji can be 4 or more chars in a row. For instance:

  • emoji itself
  • zero-width joiner
  • secondary char (like graduation cap or microphone)
  • gender modifier (man or woman)
  • skin tone modifier (Fitzpatrick Scale)

So you should definitely take that variable length into account.
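
As a minimal sketch of what that means in practice, the helper below splits a string into code points so you can see how many parts one visible emoji is made of. EmojiSequences and CodePoints are hypothetical names, and the "man student" sequence (U+1F468, U+200D, U+1F393) is just one example I picked.

```csharp
using System;
using System.Collections.Generic;

static class EmojiSequences
{
    // Returns the Unicode code points of the input, combining each
    // surrogate pair into a single value.
    public static List<int> CodePoints(string input)
    {
        var result = new List<int>();
        for (int i = 0; i < input.Length; i++)
        {
            // Throws ArgumentException on an unmatched surrogate.
            int codePoint = char.ConvertToUtf32(input, i);
            result.Add(codePoint);
            if (codePoint > 0xFFFF)
            {
                i++; // skip the low surrogate we just consumed
            }
        }
        return result;
    }
}
```

For "\U0001F468\u200D\U0001F393" the string has a Length of 5 chars but 3 code points: the man, the zero-width joiner and the graduation cap. Adding a medium skin tone modifier (U+1F3FD) right after the man makes it 7 chars and 4 code points, even though it still renders as one emoji where the font supports it.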

Examples:

  • Yes, I was aware of that. Before taking on this endeavor I wrote 2600+ unit tests, one for each emoji, and found that (wo)men wrestling with a skin tone produced several characters instead of one, while the flag of Wales is considered a single emoji but is actually made up of a large number of Unicode characters. The main purpose is to "translate" them so we can use them in an NLP engine which by default doesn't understand emoji (at least the service we're using at the moment). Commented Jul 13, 2018 at 6:05
