
I have a byte array which I believe correctly stores a UTF-16-encoded surrogate pair for the Unicode character 𐎑.

Running that byte array through .NET's System.Text.Encoding.Unicode.GetString() returns unexpected results.

Actual results: ��

Expected results: 𐎑

Code example:

byte[] inputByteArray = new byte[4];
inputByteArray[0] = 0x91;
inputByteArray[1] = 0xDF;
inputByteArray[2] = 0x00;
inputByteArray[3] = 0xD8;

// System.Text.Encoding.Unicode accepts little-endian UTF-16
// Least significant byte first within the byte array: LSByte in [0], MSByte in [3]
string str = System.Text.Encoding.Unicode.GetString(inputByteArray);

// This returns �� rather than the expected symbol: 𐎑
Console.WriteLine(str);

Detail on how I got to that particular byte array from the character 𐎑:

This character is in the Supplementary Multilingual Plane; its Unicode code point is U+10391. Encoded as a UTF-16 surrogate pair, it should be:

Subtract 0x10000 from the code point: val = 0x00391 = (0x10391 - 0x10000)

High surrogate: 0xD800 = 0xD800 + (0x00391 >> 10) (top 10 bits)

Low surrogate: 0xDF91 = 0xDC00 + (0x00391 & 0b_0011_1111_1111) (bottom 10 bits)
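
For completeness, here is a minimal C# sketch of that same arithmetic as a self-check (the method name ToSurrogatePair is my own, for illustration only):

using System;

class SurrogateDemo
{
    // Splits a supplementary-plane code point into its UTF-16 surrogate pair.
    static (char high, char low) ToSurrogatePair(int codePoint)
    {
        int val = codePoint - 0x10000;              // 0x10391 - 0x10000 = 0x00391
        char high = (char)(0xD800 + (val >> 10));   // top 10 bits    -> 0xD800
        char low = (char)(0xDC00 + (val & 0x3FF));  // bottom 10 bits -> 0xDF91
        return (high, low);
    }

    static void Main()
    {
        var (high, low) = ToSurrogatePair(0x10391);
        Console.WriteLine($"0x{(int)high:X4} 0x{(int)low:X4}"); // 0xD800 0xDF91
    }
}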

1 Answer


Encoding.Unicode is little-endian on a per-UTF-16 code unit basis. You still need to put the high surrogate code unit before the low surrogate code unit. Here's sample code that works:

using System;
using System.Text;

class Test
{
    static void Main()
    {
        byte[] data =
        {
            0x00, 0xD8, // High surrogate
            0x91, 0xDF  // Low surrogate
        };
        string text = Encoding.Unicode.GetString(data);
        Console.WriteLine(char.ConvertToUtf32(text, 0)); // 66449 (0x10391)
    }
}
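
As a cross-check (a sketch using framework helpers rather than hand-built bytes), you can let .NET produce both the surrogate pair and the little-endian byte layout for you:

using System;
using System.Text;

class RoundTrip
{
    static void Main()
    {
        // char.ConvertFromUtf32 emits the high surrogate before the low one;
        // Encoding.Unicode.GetBytes then shows the expected little-endian layout.
        byte[] bytes = Encoding.Unicode.GetBytes(char.ConvertFromUtf32(0x10391));
        Console.WriteLine(BitConverter.ToString(bytes)); // 00-D8-91-DF
    }
}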
