I am reading raw data from a source. This raw data is a sequence of bytes. I store this sequence of bytes in an array of bytes that I define as follows in VB.NET:

Dim frame() As Byte

so each element in the above array is in the range [0-255].

I want to see each of these bytes decoded as ASCII, UTF-8 and Unicode, so I iterate over the byte array (frame) and run the snippet below, depending on the case:

ASCII:

For idxByte As Integer = 0 To Me.frame.Length - 1
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.ASCII.GetString(frame, idxByte, 1))
Next

Note: txtRefs is an array of text boxes, and it has the same length as frame.

And similarly for the other two encodings:

UTF-8:

For idxByte As Integer = 0 To Me.frame.Length - 1
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.UTF8.GetString(frame, idxByte, 1))
Next

Unicode:

For idxByte As Integer = 0 To Me.frame.Length - 1
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.Unicode.GetString(frame, idxByte, 1))
Next

ASCII and UTF-8 encoding seem OK, but Unicode encoding does not seem to work, or maybe I am not understanding Unicode encoding at all...

For Unicode, executing the above loop gives results that do not look right. Is this correct?

2 Answers


Encoding.Unicode is UTF-16 LE, so it needs two bytes per code unit to give correct results, e.g.:

Dim input() As Byte = { 65, 0 }
Dim x = Encoding.Unicode.GetString(input, 0, 1) ' one byte: only half a UTF-16 code unit
Dim y = Encoding.Unicode.GetString(input, 0, 2) ' two bytes: the complete code unit for "A"
Console.WriteLine("x={0}, y={1}", x, y)

x=�, y=A

However, if your input is one byte per character, you probably don't want to just pass two bytes from your input array. You may want to create a new input array with a zero second byte:

Dim input() As Byte = { 65, 0 }
Dim x = Encoding.Unicode.GetString(input, 0, 1)
Dim y = Encoding.Unicode.GetString(input, 0, 2)
Dim z = Encoding.Unicode.GetString(New Byte() { input(0), 0 }) ' pad the single byte with a zero high byte
Console.WriteLine("x={0}, y={1}, z={2}", x, y, z)

x=�, y=A, z=A

Hard to tell without knowing your input and desired output.
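
If each byte really does stand for one character, a minimal sketch applying this to the loop from the question (reusing the frame and txtRefs arrays defined there) could look like:

For idxByte As Integer = 0 To Me.frame.Length - 1
    ' Pad each byte with a zero high byte so it forms one complete UTF-16 LE code unit.
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.Unicode.GetString(New Byte() {Me.frame(idxByte), 0}))
Next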

  • Yes, my input is one byte per character, so in the case of Unicode, creating a new input array with a zero second byte works. Now I am wondering if I need to do the same for UTF-8, because Encoding.UTF8.IsSingleByte returns False. So in the case of UTF-8, is it correct to follow the same approach as Unicode, that is, build a new array with a zero for the second byte?
    – Willy
    Commented Mar 9, 2016 at 2:05
  • @user1624552 UTF-8 is more complicated, since it uses a variable number of bytes - 1 to 4 - depending on the code point; for values <= 127 a single byte will work. I'm not sure I understand why you want to do this, since you are really just treating the value as ASCII or some single-byte OEM encoding anyway. If you want to know how those characters are encoded in different encodings, you can just use Encoding.Unicode.GetBytes(Encoding.ASCII.GetString(frame)) with whatever source/target encodings you want.
    – Mark
    Commented Mar 9, 2016 at 2:47
  • What I am trying to do is, for example: for a single byte, obtain its character representation in different encodings (ASCII, UTF-8, Unicode, ...). For example, byte 97 corresponds to character 'a' in ASCII, byte 97 corresponds to character X in Unicode, byte 97 corresponds to character X in UTF-8, and so on for the bytes in the range [0-255]. (A sketch of this appears after these comments.)
    – Willy
    Commented Mar 9, 2016 at 3:08
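
Putting these comments together, a minimal standalone sketch (the module name and output format are illustrative, not from the thread) that prints how each byte value in [0-255] decodes under the three encodings:

Imports System.Text

Module ByteToCharDemo
    Sub Main()
        For value As Integer = 0 To 255
            Dim b() As Byte = {CByte(value)}
            ' ASCII: bytes > 127 decode to the fallback character "?".
            Dim ascii As String = Encoding.ASCII.GetString(b)
            ' UTF-8: a lone byte > 127 is an incomplete sequence and decodes to U+FFFD.
            Dim utf8 As String = Encoding.UTF8.GetString(b)
            ' UTF-16 LE: pad with a zero high byte to form one complete code unit.
            Dim utf16 As String = Encoding.Unicode.GetString(New Byte() {CByte(value), 0})
            Console.WriteLine("{0,3}: ASCII=<{1}> UTF-8=<{2}> UTF-16=<{3}>", value, ascii, utf8, utf16)
        Next
    End Sub
End Module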

For ASCII, each byte is a code unit, is a codepoint, is a character, is a glyph.

For UTF-8, each byte is a code unit; one or more code units form a codepoint, and one or more codepoints form a glyph.

For UTF-16, each two bytes is a code unit; one or more code units form a codepoint, and one or more codepoints form a glyph.

To convert a sequence of bytes, just use one call to GetString on the appropriate Encoding instance. Then you'll be dealing with a String, which is a counted sequence of UTF-16 code units.
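
For example, a minimal sketch (the byte values here are just an assumed UTF-8 input):

Dim frame() As Byte = {72, 101, 108, 108, 111, 32, 226, 130, 172} ' "Hello " plus the three UTF-8 bytes of the euro sign
Dim text As String = Encoding.UTF8.GetString(frame) ' one call decodes the whole sequence
Console.WriteLine(text) ' Hello €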

The built-in Encoding classes use a substitution character ("?" or U+FFFD) when the bytes don't make sense for the encoding. If you prefer, you can create an instance with a DecoderExceptionFallback so you'll be able to handle those cases yourself. For example, 0xFF is never a valid ASCII code unit; 0xCD is a valid code unit in UTF-8, but the sequence 0xCD 0x20 is not valid.
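
For example, a minimal sketch that makes invalid bytes throw instead of being substituted ("us-ascii" is the registered name for the ASCII encoding):

Dim strict As Encoding = Encoding.GetEncoding("us-ascii",
    New EncoderExceptionFallback(), New DecoderExceptionFallback())
Try
    ' &HFF is never a valid ASCII code unit, so GetString throws.
    Dim s As String = strict.GetString(New Byte() {65, &HFF})
Catch ex As DecoderFallbackException
    Console.WriteLine("Invalid byte at index {0}", ex.Index)
End Try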

Presumably, you want to separate glyphs for display purposes. See TextElementEnumerator.
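
A minimal sketch (the string is an assumed example: "e" plus a combining acute accent, then "x"):

Dim s As String = "e" & ChrW(&H301) & "x"
Dim it = System.Globalization.StringInfo.GetTextElementEnumerator(s)
While it.MoveNext()
    ' Each text element is one full glyph, even when it spans several Char values.
    Console.WriteLine("<{0}>", it.GetTextElement())
End While

This prints <é> followed by <x>: the base letter and its combining accent come back as a single text element.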
