
I have a situation where a server may arbitrarily break up transmitted UTF-8 string data, including in the middle of a UTF-8 sequence. In the websocket proxy that is receiving this data before it goes to the client, I want to detect that case and have the proxy wait for the next packet from the server and concatenate it with the prior one before sending to the client.

Assuming I am seeing the data from the server as a simple array of bytes, what is the simplest logic I can use to reliably detect the case where those bytes end in the middle of a UTF-8 sequence?

  • Take a look at the UTF-8 definition. Starting-bytes are distinct from continuation-bytes, and encode the number of continuation-bytes which follow. Thus, you can easily determine whether the last codepoint is complete. It gets more complicated if you want to consider graphemes, words or sentences. Commented Dec 21, 2014 at 5:14
  • Why would you want to detect possible UTF-8 decoding issues when you have the (websocket) frame length at your disposal? You should be able to simply wait until you see the whole frame and then forward it to the client. Commented Dec 22, 2014 at 2:19
  • @PavelBucek: I am communicating to the server with a socket, not a websocket. The purpose of the websocket proxy is to provide a websocket to the client.
    – chaos
    Commented Dec 22, 2014 at 4:08

2 Answers

This is the logic I wound up using (in JavaScript):

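// Returns true if buf (an array of bytes) ends partway through a UTF-8 sequence.
function incompleteUTF8(buf) {
    // Only the last 6 bytes matter: no sequence is longer than that, even
    // counting the obsolete 5- and 6-byte forms.
    for(var ix = Math.max(buf.length - 6, 0); ix < buf.length; ix++) {
        var ch = buf[ix];
        if(ch < 0x80)
            continue;       // ASCII: a complete single-byte character
        // A lead byte announces how many continuation bytes follow;
        // jump ahead to where the last of them should be.
        if((ch & 0xe0) === 0xc0)
            ix++;
        else if((ch & 0xf0) === 0xe0)
            ix += 2;
        else if((ch & 0xf8) === 0xf0)
            ix += 3;
        else if((ch & 0xfc) === 0xf8)
            ix += 4;
        else if((ch & 0xfe) === 0xfc)
            ix += 5;
        else
            continue;       // continuation (or invalid) byte: skip it
        // The sequence's final byte would lie past the end of the buffer.
        if(ix >= buf.length)
            return true;
    }
    return false;
}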
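
For comparison, here is a minimal sketch of an alternative that walks backwards from the end of the buffer and reports how many trailing bytes belong to an unfinished sequence, so the proxy can hold back only those bytes. It assumes Node.js `Buffer` objects and sequences of at most 4 bytes (RFC 3629). The names `trailingIncompleteBytes` and `makeChunkSplitter` are hypothetical, and no validation of malformed input is attempted.

```javascript
// Hypothetical helper: number of trailing bytes of `buf` that belong to an
// unfinished UTF-8 sequence (0 means the buffer ends on a character boundary).
function trailingIncompleteBytes(buf) {
    // A lead byte is followed by at most 3 continuation bytes, so only the
    // last 3 bytes can start a sequence that is still incomplete.
    for (var back = 1; back <= 3 && back <= buf.length; back++) {
        var ch = buf[buf.length - back];
        if ((ch & 0xc0) === 0x80)
            continue;                   // continuation byte: keep walking back
        var need = 1;                   // total length announced by this byte
        if ((ch & 0xe0) === 0xc0) need = 2;
        else if ((ch & 0xf0) === 0xe0) need = 3;
        else if ((ch & 0xf8) === 0xf0) need = 4;
        return need > back ? back : 0;
    }
    return 0;   // empty buffer, or the last three bytes are all continuations
}

// Hypothetical proxy-side splitter: forwards only whole characters and keeps
// the unfinished tail to prepend to the next chunk from the server.
function makeChunkSplitter(forward) {
    var pending = Buffer.alloc(0);
    return function onData(chunk) {
        var data = Buffer.concat([pending, chunk]);
        var tail = trailingIncompleteBytes(data);
        pending = data.subarray(data.length - tail);
        if (data.length > tail)
            forward(data.subarray(0, data.length - tail));
    };
}
```

In a proxy, `onData` would be attached to the server socket's data event and `forward` would send to the websocket client. Unlike a yes/no check, this forwards the complete part of each chunk immediately instead of delaying the whole chunk.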
0

All you need to do is to process the bytes you receive using a UTF-8 scanner that handles pushing of bytes to it, rather than trying to read (pull) bytes. You push each received byte in turn to the scanner. Each time it completes processing of an encoded character it pushes the character downstream. It maintains a small buffer of bytes that are not yet part of a completely encoded character, if necessary.

If you do that, the proxy knows it has to wait for the next packet whenever the scanner's buffer is still non-empty after the last received byte has been pushed.
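
A push-style scanner along these lines might look like the following minimal sketch. It assumes Node.js (`Buffer` is used to decode each completed character) and sequences of at most 4 bytes. The name `makeUtf8Scanner` and its `push`/`pending` methods are hypothetical, and the sketch does not validate that continuation bytes are well-formed.

```javascript
// Hypothetical push-style scanner: feed it one byte at a time via push();
// it calls emit(str) each time a full character has been assembled.
function makeUtf8Scanner(emit) {
    var held = [];      // bytes of the character currently being assembled
    var needed = 0;     // continuation bytes still expected

    return {
        push: function (b) {
            if (needed === 0) {
                // Expecting a lead byte; its high bits give the sequence length.
                if ((b & 0xe0) === 0xc0) needed = 1;
                else if ((b & 0xf0) === 0xe0) needed = 2;
                else if ((b & 0xf8) === 0xf0) needed = 3;
                // ASCII (and, in this sketch, any invalid byte) stands alone.
            } else {
                needed--;   // assumed to be a continuation byte (not checked)
            }
            held.push(b);
            if (needed === 0) {
                // Character complete: push it downstream and reset the buffer.
                emit(Buffer.from(held).toString('utf8'));
                held = [];
            }
        },
        // True while a partial character is buffered, i.e. the wait state.
        pending: function () { return held.length > 0; }
    };
}
```

After pushing every byte of a packet, the proxy can check `pending()`. If it returns true, the packet ended in the middle of a character and the remaining bytes are expected in the next packet.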
