I am reading raw data from a source. This raw data is a sequence of bytes. I store this sequence of bytes in an array of bytes that I define as follows in VB.NET:

Dim frame() As Byte

so each element in the above array is in the range [0-255].

I want to see each of these bytes decoded as ASCII, UTF-8 and Unicode, so I iterate over the byte array (frame) and run the snippet below, depending on the case:

ASCII:

For idxByte As Integer = 0 To Me.frame.Length - 1
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.ASCII.GetString(frame, idxByte, 1))
Next

Note: txtRefs is an array of text boxes, and it has the same length as frame.

And similarly for the other two encodings:

UTF-8:

For idxByte As Integer = 0 To Me.frame.Length - 1
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.UTF8.GetString(frame, idxByte, 1))
Next

Unicode:

For idxByte As Integer = 0 To Me.frame.Length - 1
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.Unicode.GetString(frame, idxByte, 1))
Next

ASCII and UTF-8 encoding seem OK, but Unicode encoding does not seem to work, or maybe I am not understanding Unicode encoding at all...

For Unicode, executing the above loop gives results that do not look right. Is this correct?

2 Answers


Encoding.Unicode is UTF-16 LE, so it needs two bytes per code unit to give correct results, e.g.:

Dim input() As Byte = { 65, 0 }
Dim x = Encoding.Unicode.GetString(input, 0, 1) ' one byte: only half a UTF-16 code unit
Dim y = Encoding.Unicode.GetString(input, 0, 2) ' two bytes: the complete code unit for "A"
Console.WriteLine("x={0}, y={1}", x, y)

x=�, y=A

However, if your input is one byte per character, you probably don't want to just pass two bytes from your input array. You may want to create a new input array with a zero second byte:

Dim input() As Byte = { 65, 0 }
Dim x = Encoding.Unicode.GetString(input, 0, 1)
Dim y = Encoding.Unicode.GetString(input, 0, 2)
Dim z = Encoding.Unicode.GetString(New Byte() { input(0), 0 }) ' pad the single byte with a zero high byte
Console.WriteLine("x={0}, y={1}, z={2}", x, y, z)

x=�, y=A, z=A

Hard to tell without knowing your input and desired output.
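
If each byte really does stand for one character, a minimal sketch applying this to the loop from the question (reusing the frame and txtRefs arrays defined there) could look like:

For idxByte As Integer = 0 To Me.frame.Length - 1
    ' Pad each byte with a zero high byte so it forms one complete UTF-16 LE code unit.
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.Unicode.GetString(New Byte() {Me.frame(idxByte), 0}))
Next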

  • Yes, my input is one byte per character, so in the case of Unicode, creating a new input array with a zero second byte works. Now I am wondering if I need to do the same for UTF-8, because Encoding.UTF8.IsSingleByte returns False. So in the case of UTF-8, is it correct to follow the same approach as Unicode, that is, build a new array with a zero for the second byte?
    – Willy
    Commented Mar 9, 2016 at 2:05
  • @user1624552 UTF-8 is more complicated, since it uses a variable number of bytes - 1 to 4 - depending on the code point; for values <= 127 a single byte will work. I'm not sure I understand why you want to do this, since you are really just treating the value as ASCII or some single-byte OEM encoding anyway. If you want to know how those characters are encoded in different encodings, you can just use Encoding.Unicode.GetBytes(Encoding.ASCII.GetString(frame)) with whatever source/target encodings you want.
    – Mark
    Commented Mar 9, 2016 at 2:47
  • What I am trying to do is, for example: for a single byte, obtain its character representation in different encodings (ASCII, UTF-8, Unicode, ...). For example, byte 97 corresponds to character 'a' in ASCII, byte 97 corresponds to character X in Unicode, byte 97 corresponds to character X in UTF-8, and so on for the bytes in the range [0-255]. (A sketch of this appears after these comments.)
    – Willy
    Commented Mar 9, 2016 at 3:08
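
Putting these comments together, a minimal standalone sketch (the module name and output format are illustrative, not from the thread) that prints how each byte value in [0-255] decodes under the three encodings:

Imports System.Text

Module ByteToCharDemo
    Sub Main()
        For value As Integer = 0 To 255
            Dim b() As Byte = {CByte(value)}
            ' ASCII: bytes > 127 decode to the fallback character "?".
            Dim ascii As String = Encoding.ASCII.GetString(b)
            ' UTF-8: a lone byte > 127 is an incomplete sequence and decodes to U+FFFD.
            Dim utf8 As String = Encoding.UTF8.GetString(b)
            ' UTF-16 LE: pad with a zero high byte to form one complete code unit.
            Dim utf16 As String = Encoding.Unicode.GetString(New Byte() {CByte(value), 0})
            Console.WriteLine("{0,3}: ASCII=<{1}> UTF-8=<{2}> UTF-16=<{3}>", value, ascii, utf8, utf16)
        Next
    End Sub
End Module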

For ASCII, each byte is a code unit, is a codepoint, is a character, is a glyph.

For UTF-8, each byte is a code unit; one or more code units form a codepoint, and one or more codepoints form a glyph.

For UTF-16, each two bytes is a code unit; one or more code units form a codepoint, and one or more codepoints form a glyph.

To convert a sequence of bytes, just use one call to GetString on the appropriate Encoding instance. Then you'll be dealing with a String, which is a counted sequence of UTF-16 code units.
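
For example, a minimal sketch (the byte values here are just an assumed UTF-8 input):

Dim frame() As Byte = {72, 101, 108, 108, 111, 32, 226, 130, 172} ' "Hello " plus the three UTF-8 bytes of the euro sign
Dim text As String = Encoding.UTF8.GetString(frame) ' one call decodes the whole sequence
Console.WriteLine(text) ' Hello €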

The built-in Encoding classes use a substitution character ("?" or U+FFFD) when the bytes don't make sense for the encoding. If you prefer, you can create an instance with a DecoderExceptionFallback so you'll be able to handle those cases yourself. For example, 0xFF is never a valid ASCII code unit; 0xCD is a valid code unit in UTF-8, but the sequence 0xCD 0x20 is not valid.
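
For example, a minimal sketch that makes invalid bytes throw instead of being substituted ("us-ascii" is the registered name for the ASCII encoding):

Dim strict As Encoding = Encoding.GetEncoding("us-ascii",
    New EncoderExceptionFallback(), New DecoderExceptionFallback())
Try
    ' &HFF is never a valid ASCII code unit, so GetString throws.
    Dim s As String = strict.GetString(New Byte() {65, &HFF})
Catch ex As DecoderFallbackException
    Console.WriteLine("Invalid byte at index {0}", ex.Index)
End Try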

Presumably, you want to separate glyphs for display purposes. See TextElementEnumerator.
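
A minimal sketch (the string is an assumed example: "e" plus a combining acute accent, then "x"):

Dim s As String = "e" & ChrW(&H301) & "x"
Dim it = System.Globalization.StringInfo.GetTextElementEnumerator(s)
While it.MoveNext()
    ' Each text element is one full glyph, even when it spans several Char values.
    Console.WriteLine("<{0}>", it.GetTextElement())
End While

This prints <é> followed by <x>: the base letter and its combining accent come back as a single text element.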
