
To my understanding, when I write a file using Notepad++ I can put the symbols ’ and & in a text file without a problem. Both are valid ASCII symbols, and they are not that exotic either: & = 38 decimal, 26 in hex; ’ = 44 decimal, 2C in hex.

I am trying to write both out with a StreamWriter (.NET Core). I have tried various text encodings, but somehow it fails: depending on the encoding, one of them gets broken into \uxxxx. Is there an encoding that works for both?

My code:


filedata = "& Test ’   ";
// filedata = filedata.Replace("\\u0026", "&"); // extra, should not be needed with Test
// filedata = filedata.Replace("\\u2019", "’"); // extra, should not be needed with Test

// Write the updated content back to the file with exclusive access
using (var fileStream = new FileStream(filePath, FileMode.Truncate, FileAccess.Write, FileShare.None))
{
    // I tried various encodings here: Encoding.ASCII, the UTF variants, etc.
    using (var writer = new StreamWriter(fileStream, new UTF8Encoding(true)))
    {
        await writer.WriteAsync(filedata);
    }
}

The other application needs it. I think it's valid UTF-8 with a BOM, which should not translate these symbols. But maybe it's something else: Notepad++ shows me the \uxxxx in clear text when I write the file with C#, while if I type the symbols in Notepad++ and open the file, I don't see the \uxxxx.
I trust Notepad++ a bit more.

Notably over the wire it all goes fine. It's the file saving causing me a headache literally.

  • Wait...what is the question?
    – Narish
    Commented Mar 7 at 15:59
  • Have you tried using one of the constructor overloads that doesn't take an Encoding at all? I've never personally seen \u escaping happen when doing simple file I/O unless I was using JSON-related APIs. Commented Mar 7 at 15:59
  • I can't reproduce this. If I copy & paste your code, the first byte after the BOM is indeed 38 (&), see dotnetfiddle.net/ikBSxO. I suspect the escaping you are seeing occurs because your viewer is somehow escaping the character, or because you are transmitting the file over HTML which introduces the escaping, or because you are working on some unusual .NET platform or version. Please edit your question to share a minimal reproducible example.
    – dbc
    Commented Mar 7 at 16:09
  • ’ most certainly is not a simple character; in UTF-8 that's 3 bytes: E2-80-99; 44/0x2C is , (i.e. comma). Commented Mar 7 at 16:11
  • "One of them depending of the encoding gets broken to \uxxxx" - no, it doesn't - you're writing string data as UTF-8 bytes - that doesn't perform any \u escaping - it just writes the bytes; I think that \u is coming from whatever you're using to view the data. Honestly: when inspecting payloads, you need to look at the bytes with a hex viewer - nothing else will be useful. Commented Mar 7 at 16:15
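
One comment above mentions that \u escaping usually comes from JSON-related APIs rather than from file I/O. The question does not say whether a JSON serializer is involved, so the following is only a sketch under that assumption: with its default settings, System.Text.Json escapes both & and ’, and the relaxed encoder leaves them alone. The class and method names are made up for illustration.

    using System;
    using System.Text.Encodings.Web;
    using System.Text.Json;

    public static class JsonEscapingDemo
    {
        // Default settings: HTML-sensitive and non-ASCII characters are \u-escaped.
        public static string SerializeDefault(string value) =>
            JsonSerializer.Serialize(value);

        // Relaxed encoder: & and ’ are written as-is. Only use this when the
        // JSON is not embedded in HTML or script content.
        public static string SerializeRelaxed(string value)
        {
            var options = new JsonSerializerOptions
            {
                Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping
            };
            return JsonSerializer.Serialize(value, options);
        }

        public static void Main()
        {
            Console.WriteLine(SerializeDefault("& Test \u2019")); // "\u0026 Test \u2019"
            Console.WriteLine(SerializeRelaxed("& Test \u2019")); // "& Test ’"
        }
    }

If the string was already escaped by a serializer before it reached the StreamWriter, no choice of Encoding will undo that, because the writer only ever sees the backslash, the u and the digits.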

1 Answer


I believe this is simply a misreading of the ASCII table; 44/0x2C is not ’, it is , (comma). The character ’ (code point 8217, U+2019, sometimes written &rsquo; in HTML etc.) is a non-ASCII character, and in UTF-8 it will be written with 3 bytes: E2-80-99. I suspect your text viewer is configured to display non-ASCII (or possibly just "not valid in the selected encoding") characters with \u escaping, but that's just the tool trying to help you; those aren't the actual bytes. To comment on the bytes, only a raw hex viewer will suffice.
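
If no hex viewer is at hand, a few lines of C# can dump the bytes instead. This is a minimal sketch, not the asker's exact code: it uses FileMode.Create rather than FileMode.Truncate so that it runs even when the file does not exist yet, and the class name, method names and temp file name are hypothetical.

    using System;
    using System.IO;
    using System.Text;

    public static class ByteInspection
    {
        // Writes the text as UTF-8 with a BOM (as in the question) and
        // returns the raw bytes that ended up on disk.
        public static byte[] WriteAndReadBack(string path, string text)
        {
            using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None))
            using (var writer = new StreamWriter(stream, new UTF8Encoding(encoderShouldEmitUTF8Identifier: true)))
            {
                writer.Write(text);
            }
            return File.ReadAllBytes(path);
        }

        // Formats bytes as dash-separated hex, e.g. "EF-BB-BF-26".
        public static string ToHex(byte[] bytes) => BitConverter.ToString(bytes);

        public static void Main()
        {
            string path = Path.Combine(Path.GetTempPath(), "utf8-bom-demo.txt"); // hypothetical file name
            byte[] bytes = WriteAndReadBack(path, "& \u2019");
            Console.WriteLine(ToHex(bytes)); // EF-BB-BF-26-20-E2-80-99
        }
    }

Reading the output left to right: EF-BB-BF is the BOM, 26 is &, 20 is the space, and E2-80-99 is ’. If a dump of the real file shows 5C-75 (a backslash followed by u) instead, the escaping happened before the string reached the writer.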

  • I was matching visually and was mistaken, though my problem still stands. I think it should be bell (7) or 39... Frustrating, as IDEs have their own formats as well: HTML, JSON and files. Tomorrow is another day, another go at it.
    – Peter
    Commented Mar 7 at 22:52
  • @Peter single quote (39, ') is a simple character: should have no problem; bell (7, BEL) is a non-printable ASCII character, so it wouldn't surprise me if a text editor did something to highlight that it exists, such as writing \u07 - however once again: the only meaningful question is what are the bytes?, and for that: we need to use a hex viewer Commented Mar 8 at 7:07
