Strip the byte order mark from string in C#

Question

In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.

So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using

if (xml.StartsWith(ByteOrderMarkUtf8))
{
    xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}

but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?

Peter Mortensen · Accepted Answer · 2022-02-18 20:38:15Z

75

I recently had issues with the .NET 4 upgrade. Up to and including .NET 3.5, the simple answer is that

String.Trim()

removes the BOM.

However, in .NET 4 you need to change it slightly:

String.Trim(new char[]{'\uFEFF'});

That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):

String.Trim(new char[]{'\uFEFF','\u200B'});

You could use the same approach to remove other unwanted characters.

Some further information, from the documentation for the String.Trim method:

The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).
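To make the version difference concrete, here is a minimal sketch, assuming .NET Framework 4 or later (or .NET Core / .NET 5+), where U+FEFF is no longer treated as white space by Trim(). The names BomTrimExample and StripLeadingBom are made up for illustration.

```csharp
using System;

public static class BomTrimExample
{
    // Removes any leading U+FEFF characters and leaves the rest of the string untouched.
    // Unlike a parameterless Trim(), this does not depend on the framework's white-space list.
    public static string StripLeadingBom(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text.TrimStart('\uFEFF');
    }
}
```

Calling BomTrimExample.StripLeadingBom("\uFEFF<xml/>") gives "<xml/>". TrimStart is used rather than Trim so that only the start of the string is touched, which is where a BOM would be.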

edited Feb 18, 2022 at 20:38

Peter Mortensen

answered Feb 4, 2011 at 16:59

PJUK

1

Sorry, your example does not appear to work. Try it with string "\x00EF\x00BB\x00BF<xml/>" under .NET 4.
– TrueWill
Commented Feb 4, 2011 at 18:14
Didn't completely understand the question. I've had trouble with the standard BOM and didn't even recognise the \x00EF\x00BB\x00BF madness you had to deal with.
– PJUK
Commented Dec 14, 2011 at 13:34
3

Isn't '\uFEFF' the BOM for UTF16, rather than UTF8?
– Cocowalla
Commented May 18, 2013 at 19:06
1

You know, you're right there. I've never had trouble with the UTF-8 BOM (which, on reflection, is what the question asked about); the UTF-16 BOM is what I was having trouble with at the time.
– PJUK
Commented Jul 2, 2013 at 12:04
1

@Cocowalla The corresponding bytes are FEFF in big-endian UTF16, yes, but the preamble character is the same in all encodings.
– Nyerguds
Commented Jan 2, 2017 at 9:20

Community · Accepted Answer · 2017-05-23 12:18:29Z

56

I had some incorrect test data, which caused me some confusion. Based on How to avoid tripping over UTF-8 BOM when reading files I found that this worked:

private readonly string _byteOrderMarkUtf8 =
    Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

public string GetXmlResponse(Uri resource)
{
    string xml;

    using (var client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        xml = client.DownloadString(resource);
    }

    if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
    {
        xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
    }

    return xml;
}

Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.
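The "single character" versus three-character behaviour can be illustrated with a short sketch. It assumes the three-character form in the question came from decoding the UTF-8 preamble with a single-byte encoding; ISO-8859-1 is used here purely as an example of such an encoding, and the class name BomDecodingExample is hypothetical.

```csharp
using System;
using System.Text;

public static class BomDecodingExample
{
    // The three bytes of the UTF-8 preamble: EF BB BF.
    private static readonly byte[] Utf8Bom = { 0xEF, 0xBB, 0xBF };

    // Decoded as UTF-8, the preamble becomes the single character U+FEFF.
    public static string DecodeAsUtf8()
    {
        return Encoding.UTF8.GetString(Utf8Bom);
    }

    // Decoded with a single-byte encoding (ISO-8859-1, as an assumption),
    // each byte becomes its own character: U+00EF, U+00BB, U+00BF.
    public static string DecodeAsLatin1()
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(Utf8Bom);
    }
}
```

DecodeAsUtf8() returns a string of length 1 and DecodeAsLatin1() a string of length 3, which is why the check against _byteOrderMarkUtf8 only matches once the client decodes the response as UTF-8.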

edited May 23, 2017 at 12:18

CommunityBot

answered Aug 23, 2009 at 18:38

TrueWill

3

Does not seem to work for me. Even "".StartsWith(_byteOrderMarkUtf8) returns true
– pingo
Commented May 21, 2015 at 8:16
1

@pingo Just tried your code in LINQPad 4 and it returned False.
– TrueWill
Commented May 25, 2015 at 15:02
2

Surprisingly, there's an implementation difference in the StartsWith method that produces different results on different operating systems. See stackoverflow.com/questions/19495318/…
– Rami A.
Commented Apr 14, 2017 at 18:56
3

@TrueWill, yes. Otherwise, the results are different when run on Windows 7 vs. Windows 8 or Windows Server 2012 for example.
– Rami A.
Commented Apr 17, 2017 at 4:17
3

This is the only approach that worked for me. I used string.Replace() to replace the BOM. Thanks
– Daniel Leiszen
Commented Nov 25, 2022 at 16:12

Vivek Ayer · Accepted Answer · 2010-07-19 16:22:54Z

33

This works as well

int index = xmlResponse.IndexOf('<');
if (index > 0)
{
    xmlResponse = xmlResponse.Substring(index, xmlResponse.Length - index);
}

answered Jul 19, 2010 at 16:22

Vivek Ayer

1

Looks simple to me, solved my problem and I think it will solve for other encodings too
– Davi Fiamenghi
Commented Jan 12, 2012 at 16:16
Hi Vivek, could you visit the Tridion StackExchange proposal when you have a minute please? area51.stackexchange.com/proposals/38335/tridion We believe the commitment score requires visits from time to time and so is not including you in "users with > 200 rep" figure. Thanks!
– Rob Stevenson-Leggett
Commented Apr 11, 2012 at 7:17
3

this code deserves to be put in a frame, WTF! typical from my consulting days... Please rather use @PJUK solution
– knocte
Commented Nov 7, 2012 at 18:02
I had an invisible crap character at the beginning of my string and end, so I had to do the code presented here as well as something similar to the end of the string: int closingBracket = result.LastIndexOf('>'); if (result.Length > closingBracket + 1) result = result.Remove(closingBracket + 1);
– John Gilmer
Commented Mar 20, 2019 at 18:30

ProgrammingLlama · Accepted Answer · 2023-06-20 02:14:44Z

27

A quick and simple method to remove it directly from a string:

private static string RemoveBom(string p)
{
     string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
     if (p.StartsWith(BOMMarkUtf8, StringComparison.Ordinal))
         p = p.Remove(0, BOMMarkUtf8.Length);
     return p.Replace("\0", "");
}

How to use it:

string yourCleanString=RemoveBom(yourBOMString);

Note that StringComparison.Ordinal is important as, depending on the culture the thread is running under, the BOM can be interpreted as an empty string by StartsWith and will always return true. Ordinal will compare the string using binary sort rules.
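A small sketch of the ordinal check on its own, so the behaviour for the empty string and for clean input is explicit. The name OrdinalBomCheck is hypothetical, and the BOM is written as the literal "\uFEFF" rather than derived from GetPreamble().

```csharp
using System;

public static class OrdinalBomCheck
{
    // Ordinal comparison looks at the raw UTF-16 code units, so the result
    // does not depend on the current thread culture or the operating system.
    public static bool StartsWithBom(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text.StartsWith("\uFEFF", StringComparison.Ordinal);
    }
}
```

With the ordinal overload, an empty string and a string without a BOM both report false. A culture-sensitive StartsWith may treat U+FEFF as an ignorable character, which would explain the reports in the comments above of "".StartsWith(bom) returning true on some systems.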

edited Jun 20, 2023 at 2:14

ProgrammingLlama

answered Mar 25, 2013 at 13:21

Tiago Gouvêa

1

In my case I needed to strip a UTF-16 BOM. Changing 'Encoding.UTF8' to 'Encoding.Unicode' in the method worked for me.
– Brad J
Commented Jan 27, 2015 at 16:12
This is effectively the same as @TrueWill 's answer.
– Matthew Dresser
Commented Sep 19, 2016 at 13:14
1

It's not @MatthewDresser. It's smaller, simpler and clean. :)
– Tiago Gouvêa
Commented Sep 19, 2016 at 13:25

Peter Mortensen · Accepted Answer · 2022-02-21 01:41:47Z

22

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.

Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).

edited Feb 21, 2022 at 1:41

Peter Mortensen

answered Aug 23, 2009 at 4:48

Martin v. Löwis

13

XDocument.Parse does not have an overload that accepts a byte array. I find the statement "you did something wrong" condescending. I would have expected DownloadString to detect the BOM and select the correct encoding.
– TrueWill
Commented Aug 23, 2009 at 16:42
4

I think you can get the XDocument also through .Load, passing an XmlReader, which you can get by passing a Stream, for which you can use a MemoryStream. I didn't mean to be condescending; I only tried to point out that the intermediate result that you got is seemingly incorrect, so that the real problem is not that you have to strip those characters, but that they are present in the first place. Perhaps it is the case that there is a flaw in DownloadString, in which case you shouldn't be using it. Perhaps the flaw is in the web server reporting the wrong charset.
– Martin v. Löwis
Commented Aug 23, 2009 at 21:38
OK, thanks. I did find I didn't have the client Encoding set correctly for DownloadString, which gave me a single code point (as you mentioned). It's somewhat moot at this point, as the company providing the "REST" service decided to remove the redundant (for XML in utf-8) BOM.
– TrueWill
Commented Aug 24, 2009 at 18:14
1

good call. Using XDocument.Load worked out quite well for me. It's not necessary to use the XmlReader, though, as XDocument.Load takes a stream for an argument.
– Steven Oxley
Commented Oct 27, 2010 at 22:10

Peter Mortensen · Accepted Answer · 2022-02-18 20:34:54Z

12

I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte order mark at the beginning of it. Then, this would be the code to solve the problem:

var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);

It's that simple.

If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):

var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);

edited Feb 18, 2022 at 20:34

Peter Mortensen

answered Oct 27, 2010 at 22:15

Steven Oxley

2

This worked great for me but I had to add an intermediary StreamReader
– ScottB
Commented Nov 23, 2010 at 22:14
ie. var doc = XDocument.Load(new StreamReader(new MemoryStream(batchfile)));
– ScottB
Commented Nov 23, 2010 at 22:15
Me too, Steven's code doesn't compile. There is no overload of XDocument.Load() that takes a Stream.
– Chris Wenham
Commented Jul 15, 2011 at 17:33
2

Here is the documentation for the XDocument.Load(Stream) overload: msdn.microsoft.com/en-us/library/cc838349.aspx. I guess it's specific to .NET 4, so you must be using .NET 3.5. In that case you would have to use a different overload.
– Steven Oxley
Commented Jul 19, 2011 at 15:29

Andrew Thompson · Accepted Answer · 2011-02-20 21:02:24Z

8

I wrote the following post after coming across this issue.

Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve.
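Since the linked post is no longer available, here is a minimal sketch of the idea. The answer does not say which constructor was used; this assumes the StreamReader(Stream, Encoding, bool) overload with BOM detection enabled, and it reads from an in-memory byte array rather than a file. The names StreamReaderBomExample and ReadText are made up for illustration.

```csharp
using System.IO;
using System.Text;

public static class StreamReaderBomExample
{
    // Decodes the bytes as text via StreamReader. The reader consumes a leading
    // byte order mark itself, so it does not appear in the returned string.
    public static string ReadText(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Passing the bytes EF BB BF followed by the UTF-8 bytes of "<a/>" returns "<a/>" with no leading U+FEFF. For a file on disk, the same reader can be constructed over a FileStream.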

answered Feb 20, 2011 at 21:02

Andrew Thompson

That link is dead. Please avoid writing answers that only link to external resources. Include the link and the relevant sections
– Raniz
Commented Dec 28, 2023 at 8:55

Nicholas Petersen · Accepted Answer · 2019-04-10 23:25:43Z

It's of course best if you can strip it out while still on the byte array level to avoid unwanted substrings / allocs. But if you already have a string, this is perhaps the easiest and most performant way to handle this.

Usage:

string feed = ""; // input
bool hadBOM = FixBOMIfNeeded(ref feed);

var xElem = XElement.Parse(feed); // now does not fail

/// <summary>
/// The BOM as a single C# char (U+FEFF). The three UTF-8 bytes [239, 187, 191] decode to this one char;
/// you can verify it with Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble())[0].
/// Keeping it as a constant avoids recomputing it on every call.
/// </summary>
public const char BOMChar = (char)65279;

public static bool FixBOMIfNeeded(ref string str)
{
    if (string.IsNullOrEmpty(str))
        return false;

    bool hasBom = str[0] == BOMChar;
    if (hasBom)
        str = str.Substring(1);

    return hasBom;
}

Worked as expected.
– Jitendra Pancholi
Commented Mar 22, 2019 at 7:13

Peter Mortensen · Accepted Answer · 2022-02-18 20:32:12Z

5

Download the raw bytes with DownloadData and pass the buffer to Encoding.UTF8.GetString(byte[]) to get the string, rather than downloading the response directly as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.

Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.

edited Feb 18, 2022 at 20:32

Peter Mortensen

answered Aug 23, 2009 at 4:49

Andrew Arnott

1

Thank you for your response; unfortunately this did not work. I used DownloadData and that worked; however, Encoding.UTF8.GetString(byte[]) did not strip the BOM. I tried variants with new UTF8Encoding(false) and (true) without success. Please note that this is UTF-8 data - encoding="utf-8" is specified in the XML header, and it parses correctly once the BOM is removed.
– TrueWill
Commented Aug 23, 2009 at 16:47
Interesting. I was going to mark this down because I'd been using UTF8Encoding.ASCII.GetString(bytes) which leaves the BOM in but Encoding.UTF8.GetString(bytes) removes it. Upvoted instead
– Carl Onager
Commented Oct 22, 2012 at 14:04
In my tests, both Encoding.UTF8.GetString(byte[] s) and new UTF8Encoding(encoderShouldEmitUTF8Identifier: false).GetString(byte[] s) do not trim BOM.
– Yan F.
Commented Dec 2, 2019 at 2:56

ProgrammingLlama · Accepted Answer · 2023-06-20 02:08:09Z

I ran into this when I had a Base64 encoded file to transform into the string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based lightly on TrueWill's answer):

public static string GetUTF8String(byte[] data)
{
    byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
    if (data.StartsWith(utf8Preamble))
    {
        return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
    }
    else
    {
        return Encoding.UTF8.GetString(data);
    }
}

Where StartsWith(byte[]) is the logical extension:

public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
   // Handle invalid/unexpected input (nulls, thisArray shorter than otherArray)
   if (thisArray == null || otherArray == null || thisArray.Length < otherArray.Length)
   {
       return false;
   }

   for (int i = 0; i < otherArray.Length; ++i)
   {
       if (thisArray[i] != otherArray[i])
       {
           return false;
       }
   }

   return true;
}

I don't see anything restricting the concept here to UTF-8. Since GetPreamble() belongs to Encoding, it should be possible to genericize to take in the Encoding as a parameter. — Timothy, Commented Mar 20, 2015 at 21:43
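Following that comment, a possible generalization might look like the sketch below. It is not part of the original answer; PreambleDecoder and GetStringWithoutPreamble are hypothetical names, and the method assumes the caller already knows which encoding the bytes are in.

```csharp
using System;
using System.Text;

public static class PreambleDecoder
{
    // Decodes data with the given encoding, skipping that encoding's preamble
    // (byte order mark) when the data starts with it.
    public static string GetStringWithoutPreamble(byte[] data, Encoding encoding)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (encoding == null) throw new ArgumentNullException(nameof(encoding));

        byte[] preamble = encoding.GetPreamble();
        int offset = 0;

        // Some encodings (for example ASCII) have an empty preamble.
        if (preamble.Length > 0 && data.Length >= preamble.Length)
        {
            bool match = true;
            for (int i = 0; i < preamble.Length; i++)
            {
                if (data[i] != preamble[i])
                {
                    match = false;
                    break;
                }
            }

            if (match)
            {
                offset = preamble.Length;
            }
        }

        return encoding.GetString(data, offset, data.Length - offset);
    }
}
```

For example, GetStringWithoutPreamble(bytes, Encoding.UTF8) skips EF BB BF, while passing Encoding.Unicode skips the UTF-16 little-endian preamble FF FE instead.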

siva.k · Accepted Answer · 2014-08-28 13:48:43Z

2

StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);

edited Aug 28, 2014 at 13:48

siva.k

answered Aug 28, 2014 at 13:42

lucasjam

4

How does this solve the problem? Can you expand upon it at all?
– siva.k
Commented Aug 28, 2014 at 13:48
StreamReader() will handle the BOM.
– Mike S
Commented Dec 30, 2015 at 0:02

Vinicius · Accepted Answer · 2019-08-28 19:07:12Z

1

Yet another generic variation to get rid of the UTF-8 BOM preamble:

var preamble = Encoding.UTF8.GetPreamble();
if (!functionBytes.Take(preamble.Length).SequenceEqual(preamble))
    preamble = Array.Empty<Byte>();
return Encoding.UTF8.GetString(functionBytes, preamble.Length, functionBytes.Length - preamble.Length);

answered Aug 28, 2019 at 19:07

Vinicius

Peter Mortensen · Accepted Answer · 2022-02-18 20:46:26Z

0

Use a regex replace to filter out any characters other than the alphanumeric characters and spaces that are contained in a normal certificate thumbprint value:

certficateThumbprint = Regex.Replace(certficateThumbprint, @"[^a-zA-Z0-9\-\s*]", "");

And there you go. Voila!! It worked for me.

edited Feb 18, 2022 at 20:46

Peter Mortensen

answered Jul 24, 2020 at 15:16

Alexander Immanuel D

Oleg Polezky · Accepted Answer · 2019-11-09 09:46:08Z

-1

I solved the issue with the following code:

using System.IO;
using System.Xml.Linq;

void method()
{
    byte[] bytes = GetXmlBytes();
    XDocument doc;
    // XDocument.Load reads the stream directly and handles the BOM itself
    using (var stream = new MemoryStream(bytes))
    {
        doc = XDocument.Load(stream);
    }
}

answered Nov 9, 2019 at 9:46

Oleg Polezky
