5

I need to convert a byte array to a UTF8 string, and preserve the characters in the array.

I'm uploading an image using a multipart POST. The image is sent along as a UTF8 string. I've compared the headers from my app and the web browser, and the data is the same apart from one thing.

When it is sent by the browser, the content contains lots of [] characters, whereas my app replaces [] with ?, which means it's not preserving the characters as it should. Everything else is the same.

Here's the code I have at the moment:

Byte[] fileOpen = File.ReadAllBytes("C:/pic.jpeg");
postData.AppendLine(System.Text.Encoding.UTF8.GetString(fileOpen));

Any advice?

5
  • 4
    A jpeg file doesn't contain UTF8 encoded text. What are you trying to do?
    – Mark Byers
    Commented Feb 21, 2010 at 11:31
  • You can't send a JPG file as UTF8 text. You have to send it as a JPG file, i.e. image/jpeg. Commented Feb 21, 2010 at 11:32
  • 2
    If I am not mistaken, the best way to pass binary data with strings is to convert it to base64 first. Commented Feb 21, 2010 at 11:33
  • @Tim I am sending the data in a multipart POST request. As I said, the data from my app and the browser is the same, except that the browser request displays [] in the content where my app's request displays ? marks. The content between the [] is exactly the same in both requests. Commented Feb 21, 2010 at 11:42
  • Try attaching a communication log produced using an application like Wireshark that shows a working upload from the web browser. We should then be able to figure out how to reproduce the same behavior in C# and .NET. Commented Feb 21, 2010 at 12:49

3 Answers

7

The image is sent along as a UTF8 string.

Why? UTF-8 is a text encoding. Raw binary data should not be encoded but rather sent directly as bytes.

If your transfer protocol doesn't allow byte transfer, then the usual way is to encode the bytes in Base64.
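
As a minimal sketch of the Base64 route mentioned above (only relevant if the receiving server actually expects Base64, which the question does not confirm), the round trip below uses `Convert.ToBase64String` and `Convert.FromBase64String`. The sample bytes and the class name are made up for illustration.

```csharp
using System;

class Base64RoundTrip
{
    static void Main()
    {
        // Hypothetical sample: a few bytes as they might appear at the start of a JPEG,
        // including values above 0x7F that a text decoder would mangle.
        byte[] original = { 0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10, 0x88 };

        // Base64 turns every 3 bytes into 4 ASCII characters, so the result is safe as text.
        string encoded = Convert.ToBase64String(original);

        // Decoding restores exactly the bytes that went in.
        byte[] decoded = Convert.FromBase64String(encoded);

        Console.WriteLine(encoded);
        Console.WriteLine(BitConverter.ToString(decoded) == BitConverter.ToString(original)); // prints True
    }
}
```

The cost is size: the encoded text is about a third larger than the raw bytes, which is why sending the bytes directly is preferable when the protocol allows it.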

3
  • Well this is strange. I have no control over the server I am sending to. But when examining the header content/body data, my app and the browser request look similar in fiddler, apart from my app is replacing [] with ? but the numbers and characters in between the [] are exactly the same. If you get what I mean. Commented Feb 21, 2010 at 11:39
  • @James: those are not [] characters. They are undisplayable Unicode characters (a side-effect of reading a binary file like a text file). Most likely, you should be setting the content-encoding of your POST to 8-bit-binary or similar. Commented Feb 21, 2010 at 11:43
  • @John, but the headers in the browser request do not set a content-encoding header, well it does, it's gzip, deflate. Commented Feb 21, 2010 at 11:46
2

Don't try to send the data using anything approaching a text API. You haven't said what postData is, but try to find some part of its API which deals with streams of binary data instead of text data. Look for methods along the lines of AppendBytes, or GetStream to retrieve a stream you can write your data to.

Pretending that arbitrary binary data is text is a bad idea - you will lose data.

EDIT: One way which tends not to lose data (but is still a bad idea) is to treat binary data as an ISO-8859-1-encoded document. IIRC there is some debate about exactly what ISO-8859-1 contains in positions 128-159, but most implementations at least map those bytes to Unicode code points 128-159 as well, so every byte value survives the round trip.

Your "UTF-8 decoding" of the binary data may look like the correct data because for values 0-127, they're the same - it's only above that that you'll have problems. However, you should still avoid treating this binary data as text. It's not text, and treating it as text is simply a recipe for disaster.
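
A small sketch of the difference described above, using made-up sample bytes: it pushes the same byte array through a decode/encode round trip with UTF-8 and with ISO-8859-1 and prints the results as hex. It assumes the `ISO-8859-1` encoding is available by name through `Encoding.GetEncoding`, and it illustrates the data loss only; it is not a recommendation to send images as text.

```csharp
using System;
using System.Text;

class EncodingRoundTrip
{
    static void Main()
    {
        // Hypothetical sample bytes: 0x41 and 0x42 are plain ASCII,
        // while 0x88 and 0xFF are not valid UTF-8 sequences on their own.
        byte[] original = { 0x41, 0x88, 0xFF, 0x42 };

        // UTF-8: bytes that do not form valid sequences are replaced while decoding,
        // so encoding the string again cannot restore them.
        byte[] viaUtf8 = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(original));

        // ISO-8859-1: each byte maps to the code point with the same value,
        // so decoding and re-encoding gives back the same bytes.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] viaLatin1 = latin1.GetBytes(latin1.GetString(original));

        Console.WriteLine("original:   " + BitConverter.ToString(original));
        Console.WriteLine("via UTF-8:  " + BitConverter.ToString(viaUtf8));
        Console.WriteLine("via Latin1: " + BitConverter.ToString(viaLatin1));
    }
}
```

The ASCII bytes come through both round trips, which is why the mangled request can look almost identical to the browser's in a tool like Fiddler; only the bytes above 127 differ in the UTF-8 line.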

If you could post the headers sent by your browser (including the headers of the part of the multipart that correspond to the image), we can hopefully help you slightly further - but the bottom line is that you should find a way of handing whatever API you're using (that would be useful information too) the raw binary data without going via text.

6
  • Jon Skeet, I really wish there was a better way of doing it, but the problem is I am sending to a server out of my control. I know 100% the server is using UTF8 to send its image and decode it on their end. I wish I could show you the data from the headers; you would see what I mean. Commented Feb 21, 2010 at 12:21
  • @James: I'm afraid I don't believe you - I'm not saying you're lying, just misinterpreting the data. Most images will simply not be valid UTF-8. If you really want to show us some header data, why not put it in the question? It may be sending a base64 string encoded as UTF-8, but that's a different matter - and should be reasonably obvious if you look at the data.
    – Jon Skeet
    Commented Feb 21, 2010 at 12:26
  • Just as a further point, if you'd said ISO-Latin-1 (aka ISO-8859-1) that would be slightly more believable - see my edit.
    – Jon Skeet
    Commented Feb 21, 2010 at 12:31
  • Jon is right. Most images would be destroyed by treating them as if they were UTF-8 string. Just try taking any byte array with 0x88 in it, call UTF8.GetString followed by UTF8.GetBytes and see what you get - 0x88 won't be there any longer. Commented Feb 21, 2010 at 12:39
  • John and the rest who are saying they don't believe me, see my answer below. I have solved it, and it WAS UTF8. Commented Feb 21, 2010 at 13:33
1

To John and the other guys saying they don't believe me: I've solved it. Converting it to a string caused problems, but writing it directly to the request stream worked.

public string solveCaptcha(String username, String password)
    {
        String boundry = "---------------------------" + DateTime.Now.Ticks.ToString("x");

        StringBuilder postData = new StringBuilder();
        postData.AppendLine("--" + boundry);
        postData.AppendLine("Content-Disposition: form-data; name=\"function\"");
        postData.AppendLine("");
        postData.AppendLine("picture2");
        postData.AppendLine("--" + boundry);
        postData.AppendLine("Content-Disposition: form-data; name=\"username\"");
        postData.AppendLine("");
        postData.AppendLine(username);
        postData.AppendLine("--" + boundry);
        postData.AppendLine("Content-Disposition: form-data; name=\"password\"");
        postData.AppendLine("");
        postData.AppendLine(password);
        postData.AppendLine("--" + boundry);
        postData.AppendLine("Content-Disposition: form-data; name=\"pict\"; filename=\"pic.jpeg\"");
        postData.AppendLine("Content-Type: image/pjpeg");
        postData.AppendLine("");

        StringBuilder postData2 = new StringBuilder();
        // The boundary that follows the image bytes must be preceded by CRLF, not a bare LF.
        postData2.AppendLine("\r\n--" + boundry);
        postData2.AppendLine("Content-Disposition: form-data; name=\"pict_to\"");
        postData2.AppendLine("");
        postData2.AppendLine("0");
        postData2.AppendLine("--" + boundry);
        postData2.AppendLine("Content-Disposition: form-data; name=\"pict_type\"");
        postData2.AppendLine("");
        postData2.AppendLine("0");
        postData2.AppendLine("--" + boundry + "--");

        Byte[] fileOpen = File.ReadAllBytes("C:/pic.jpeg");
        byte[] buffer = Encoding.ASCII.GetBytes(postData.ToString());
        byte[] buffer2 = Encoding.ASCII.GetBytes(postData2.ToString());

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://poster.decaptcher.com/");

        request.ContentType = "multipart/form-data; boundary=" + boundry;
        request.ContentLength = buffer.Length + buffer2.Length + fileOpen.Length;
        request.Method = "POST";

        String source = "";

        // Write the text parts and the raw image bytes straight to the request body.
        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(buffer, 0, buffer.Length);
            requestStream.Write(fileOpen, 0, fileOpen.Length); // image bytes, never converted to a string
            requestStream.Write(buffer2, 0, buffer2.Length);
        }

        // Read the response only after the request stream has been closed.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream resStream = response.GetResponseStream())
        {
            Byte[] rBuf = new Byte[8192];
            int count;
            while ((count = resStream.Read(rBuf, 0, rBuf.Length)) > 0)
            {
                source += Encoding.ASCII.GetString(rBuf, 0, count);
            }
        }
        MessageBox.Show(source);
        // Do something with the source
        return source;
    }

If you have a deCaptcher account, test it yourself. If need be I will post a video of it working, just to prove my point.

4
  • Encoding.ASCII will give you back your bytes unchanged. Which kind of proves our point: your data is not encoded in UTF-8 – in fact, UTF-8 encoding is used nowhere in your code. Furthermore, you are now reading the actual image data directly as bytes, without ever converting them (which is good). So where is that alleged UTF-8 encoded picture? Commented Feb 21, 2010 at 13:42
  • Sorry lol. I just realized when I posted it. I had to edit it. Yeah you were right, the UTF8 was messing it up. Commented Feb 21, 2010 at 13:44
  • By the way, nobody called you stupid – we merely pointed out the fact that you seem to not know about the proper encoding of data, and that your approach was therefore wrong. Commented Feb 21, 2010 at 13:44
  • "Converting it to a string caused problems, but writing it directly to the request stream worked." - you mean exactly what I had in my answer? "... to retrieve a stream you can write your data to." In your working code you are writing the data directly from the byte array to the request, exactly as I said to.
    – Jon Skeet
    Commented Feb 21, 2010 at 15:30
