57

In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.

So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using

if (xml.StartsWith(ByteOrderMarkUtf8))
{
    xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}

but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?

14 Answers

75

I recently ran into this after upgrading to .NET 4. Up to and including .NET 3.5, the simple answer is

String.Trim()

which removes the BOM.

However, in .NET 4 you need to change it slightly:

String.Trim(new char[]{'\uFEFF'});

That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):

String.Trim(new char[]{'\uFEFF','\u200B'});

You could use the same approach to remove other unwanted characters.

The documentation for the String.Trim Method explains the change:

The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).
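As a minimal sketch of the explicit-character approach above: the helper below trims only leading U+FEFF characters, so it does not depend on which characters a given framework version treats as white space. The class and method names (BomTrimExample, StripLeadingBom) are hypothetical and chosen for illustration.

```csharp
using System;

public static class BomTrimExample
{
    // U+FEFF is how a decoded byte order mark appears inside a .NET string.
    private const char Bom = '\uFEFF';

    // Removes any leading U+FEFF characters and leaves the rest of the text untouched.
    // Hypothetical helper name, not part of the framework.
    public static string StripLeadingBom(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text.TrimStart(Bom);
    }
}
```

Calling BomTrimExample.StripLeadingBom("\uFEFF<xml/>") returns "<xml/>". Using TrimStart rather than Trim leaves a trailing U+FEFF alone, which may or may not be what you want.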

5
  • 1
    Sorry, your example does not appear to work. Try it with string "\x00EF\x00BB\x00BF<xml/>" under .NET 4.
    – TrueWill
    Commented Feb 4, 2011 at 18:14
  • I didn't completely understand the question. I've had trouble with the standard BOM and didn't even recognise the \x00EF\x00BB\x00BF madness you had to deal with.
    – PJUK
    Commented Dec 14, 2011 at 13:34
  • 3
    Isn't '\uFEFF' the BOM for UTF16, rather than UTF8?
    – Cocowalla
    Commented May 18, 2013 at 19:06
  • 1
    You know, you're right there. I've never had trouble with the UTF8 BOM (which, on reflection, is what the question asked about). The UTF16 BOM is what I was having trouble with at the time.
    – PJUK
    Commented Jul 2, 2013 at 12:04
  • 1
    @Cocowalla The corresponding bytes are FEFF in big-endian UTF16, yes, but the preamble character is the same in all encodings.
    – Nyerguds
    Commented Jan 2, 2017 at 9:20
56

I had some incorrect test data, which caused me some confusion. Based on "How to avoid tripping over UTF-8 BOM when reading files", I found that this worked:

private readonly string _byteOrderMarkUtf8 =
    Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

public string GetXmlResponse(Uri resource)
{
    string xml;

    using (var client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        xml = client.DownloadString(resource);
    }

    if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
    {
        xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
    }

    return xml;
}

Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.

8
  • 3
    Does not seem to work for me. Even "".StartsWith(_byteOrderMarkUtf8) returns true
    – pingo
    Commented May 21, 2015 at 8:16
  • 1
    @pingo Just tried your code in LINQPad 4 and it returned False.
    – TrueWill
    Commented May 25, 2015 at 15:02
  • 2
    Surprisingly, there's an implementation difference in the StartsWith method that produces different results on different operating systems. See stackoverflow.com/questions/19495318/…
    – Rami A.
    Commented Apr 14, 2017 at 18:56
  • 3
    @TrueWill, yes. Otherwise, the results are different when run on Windows 7 vs. Windows 8 or Windows Server 2012 for example.
    – Rami A.
    Commented Apr 17, 2017 at 4:17
  • 3
    This is the only approach that worked for me. I used string.Replace() to replace the BOM. Thanks
    Commented Nov 25, 2022 at 16:12
33

This works as well

int index = xmlResponse.IndexOf('<');
if (index > 0)
{
    xmlResponse = xmlResponse.Substring(index);
}
4
  • 1
    Looks simple to me, solved my problem and I think it will solve for other encodings too
    Commented Jan 12, 2012 at 16:16
  • Hi Vivek, could you visit the Tridion StackExchange proposal when you have a minute please? area51.stackexchange.com/proposals/38335/tridion We believe the commitment score requires visits from time to time and so is not including you in "users with > 200 rep" figure. Thanks!
    Commented Apr 11, 2012 at 7:17
  • 3
    this code deserves to be put in a frame, WTF! typical from my consulting days... Please rather use @PJUK solution
    – knocte
    Commented Nov 7, 2012 at 18:02
  • I had an invisible crap character at the beginning of my string and end, so I had to do the code presented here as well as something similar to the end of the string: int closingBracket = result.LastIndexOf('>'); if (result.Length > closingBracket + 1) result = result.Remove(closingBracket + 1);
    Commented Mar 20, 2019 at 18:30
27

A quick and simple method to remove it directly from a string:

private static string RemoveBom(string p)
{
     string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
     if (p.StartsWith(BOMMarkUtf8, StringComparison.Ordinal))
         p = p.Remove(0, BOMMarkUtf8.Length);
     return p.Replace("\0", "");
}

How to use it:

string yourCleanString = RemoveBom(yourBOMString);

Note that StringComparison.Ordinal is important: depending on the culture the thread is running under, a culture-sensitive StartsWith can treat the BOM as an empty string, in which case it always returns true. Ordinal compares the strings using binary sort rules.
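To illustrate the point about ordinal comparison, here is a small sketch of a BOM check that does not depend on the current culture. The class and method names (OrdinalBomCheck, StartsWithBom) are hypothetical. What a culture-sensitive comparison returns varies by platform, so only the ordinal behaviour is shown.

```csharp
using System;
using System.Text;

public static class OrdinalBomCheck
{
    // Decoding the UTF-8 preamble (bytes EF BB BF) yields the single character U+FEFF.
    private static readonly string Bom =
        Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

    // Ordinal comparison checks the UTF-16 code units directly, so an empty
    // or BOM-less string is never reported as starting with the BOM.
    public static bool StartsWithBom(string text)
    {
        return text != null && text.StartsWith(Bom, StringComparison.Ordinal);
    }
}
```

With this check, OrdinalBomCheck.StartsWithBom("") is false and OrdinalBomCheck.StartsWithBom("\uFEFF<xml/>") is true, regardless of the thread's culture.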

3
  • 1
    In my case I needed to strip a UTF-16 BOM. Changing 'Encoding.UTF8' to 'Encoding.Unicode' in the method worked for me.
    – Brad J
    Commented Jan 27, 2015 at 16:12
  • This is effectively the same as @TrueWill's answer.
    Commented Sep 19, 2016 at 13:14
  • 1
    It's not @MatthewDresser. It's smaller, simpler and clean. :)
    Commented Sep 19, 2016 at 13:25
22

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.

Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).

4
  • 13
    XDocument.Parse does not have an overload that accepts a byte array. I find the statement "you did something wrong" condescending. I would have expected DownloadString to detect the BOM and select the correct encoding.
    – TrueWill
    Commented Aug 23, 2009 at 16:42
  • 4
    I think you can get the XDocument also through .Load, passing an XmlReader, which you can get by passing a Stream, for which you can use a MemoryStream. I didn't mean to be condescending; I only tried to point out that the intermediate result that you got is seemingly incorrect, so that the real problem is not that you have to strip those characters, but that they are present in the first place. Perhaps it is the case that there is a flaw in DownloadString, in which case you shouldn't be using it. Perhaps the flaw is in the web server reporting the wrong charset.
    Commented Aug 23, 2009 at 21:38
  • OK, thanks. I did find I didn't have the client Encoding set correctly for DownloadString, which gave me a single code point (as you mentioned). It's somewhat moot at this point, as the company providing the "REST" service decided to remove the redundant (for XML in utf-8) BOM.
    – TrueWill
    Commented Aug 24, 2009 at 18:14
  • 1
    Good call. Using XDocument.Load worked out quite well for me. It's not necessary to use the XmlReader, though, as XDocument.Load takes a stream for an argument.
    Commented Oct 27, 2010 at 22:10
12

I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte order mark at the beginning of it. Then, this would be the code to solve the problem:

var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);

It's that simple.

If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):

var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);
4
  • 2
    This worked great for me but I had to add an intermediary StreamReader
    – ScottB
    Commented Nov 23, 2010 at 22:14
  • ie. var doc = XDocument.Load(new StreamReader(new MemoryStream(batchfile)));
    – ScottB
    Commented Nov 23, 2010 at 22:15
  • Me too, Steven's code doesn't compile. There is no overload of XDocument.Load() that takes a Stream.
    Commented Jul 15, 2011 at 17:33
  • 2
    Here is the documentation for the XDocument.Load(Stream) overload: msdn.microsoft.com/en-us/library/cc838349.aspx. I guess it's specific to .NET 4, so you must be using .NET 3.5. In that case you would have to use a different overload.
    Commented Jul 19, 2011 at 15:29
8

I wrote the following post after coming across this issue.

Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve.
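Since the linked post is not reproduced here, the following is only a sketch of the general StreamReader technique, not the original author's code. It assumes the data is already in a byte array; the names StreamReaderBomExample and ReadText are hypothetical. The StreamReader constructor used takes a stream, a fallback encoding, and a flag that enables byte order mark detection.

```csharp
using System.IO;
using System.Text;

public static class StreamReaderBomExample
{
    // Reads text from raw bytes. StreamReader consumes a leading byte order mark
    // while detecting the encoding, so it does not appear in the returned string.
    // Encoding.UTF8 is the fallback used when no byte order mark is present.
    public static string ReadText(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.UTF8,
                                             detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For a file on disk, the same reader can be built from a path instead of a MemoryStream.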

1
  • That link is dead. Please avoid writing answers that only link to external resources. Include the link and the relevant sections
    – Raniz
    Commented Dec 28, 2023 at 8:55
5

It's of course best if you can strip it out while still on the byte array level to avoid unwanted substrings / allocs. But if you already have a string, this is perhaps the easiest and most performant way to handle this.

Usage:

string feed = ""; // input
bool hadBOM = FixBOMIfNeeded(ref feed);

var xElem = XElement.Parse(feed); // now does not fail

Implementation:

/// <summary>
/// The three UTF-8 BOM bytes `[239, 187, 191]` decode to this single C# char.
/// You can obtain or verify it with: Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble())[0];
/// Declaring it directly lets us keep it as a constant.
/// </summary>
public const char BOMChar = (char)65279;

public static bool FixBOMIfNeeded(ref string str)
{
    if (string.IsNullOrEmpty(str))
        return false;

    bool hasBom = str[0] == BOMChar;
    if (hasBom)
        str = str.Substring(1);

    return hasBom;
}
1
5

Pass the byte buffer (via DownloadData) to string Encoding.UTF8.GetString(byte[]) to get the string rather than download the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.

Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.

3
  • 1
    Thank you for your response; unfortunately this did not work. I used DownloadData and that worked; however, Encoding.UTF8.GetString(byte[]) did not strip the BOM. I tried variants with new UTF8Encoding(false) and (true) without success. Please note that this is UTF-8 data - encoding="utf-8" is specified in the XML header, and it parses correctly once the BOM is removed.
    – TrueWill
    Commented Aug 23, 2009 at 16:47
  • Interesting. I was going to mark this down because I'd been using UTF8Encoding.ASCII.GetString(bytes) which leaves the BOM in but Encoding.UTF8.GetString(bytes) removes it. Upvoted instead
    Commented Oct 22, 2012 at 14:04
  • In my tests, both Encoding.UTF8.GetString(byte[] s) and new UTF8Encoding(encoderShouldEmitUTF8Identifier: false).GetString(byte[] s) do not trim BOM.
    – Yan F.
    Commented Dec 2, 2019 at 2:56
3

I ran into this when I had a Base64 encoded file to transform into the string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based lightly on TrueWill's answer):

public static string GetUTF8String(byte[] data)
{
    byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
    if (data.StartsWith(utf8Preamble))
    {
        return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
    }
    else
    {
        return Encoding.UTF8.GetString(data);
    }
}

Where StartsWith(byte[]) is the logical extension:

public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
    // Handle invalid/unexpected input: nulls, or a prefix longer than the array.
    if (thisArray == null || otherArray == null || thisArray.Length < otherArray.Length)
    {
        return false;
    }

    for (int i = 0; i < otherArray.Length; ++i)
    {
        if (thisArray[i] != otherArray[i])
        {
            return false;
        }
    }

    return true;
}
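The same idea is not tied to UTF-8, because GetPreamble() is defined on Encoding. The sketch below takes the encoding as a parameter; the names PreambleDecoder, DecodeWithoutPreamble, and HasPrefix are hypothetical. Note that an encoding constructed without a byte order mark, such as new UTF8Encoding(false), reports an empty preamble, so nothing would be skipped in that case.

```csharp
using System;
using System.Text;

public static class PreambleDecoder
{
    // Decodes the bytes with the given encoding, skipping that encoding's
    // preamble (byte order mark) when the data starts with it.
    public static string DecodeWithoutPreamble(byte[] data, Encoding encoding)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (encoding == null) throw new ArgumentNullException(nameof(encoding));

        byte[] preamble = encoding.GetPreamble();
        int offset = HasPrefix(data, preamble) ? preamble.Length : 0;
        return encoding.GetString(data, offset, data.Length - offset);
    }

    // True when data begins with every byte of a non-empty prefix.
    private static bool HasPrefix(byte[] data, byte[] prefix)
    {
        if (prefix.Length == 0 || data.Length < prefix.Length) return false;

        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }

        return true;
    }
}
```

For example, PreambleDecoder.DecodeWithoutPreamble(bytes, Encoding.Unicode) would skip a leading FF FE pair before decoding UTF-16 little-endian data.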
1
  • I don't see anything restricting the concept here to UTF-8. Since GetPreamble() belongs to Encoding, it should be possible to genericize to take in the Encoding as a parameter.
    – Timothy
    Commented Mar 20, 2015 at 21:43
2
StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);
2
  • 4
    How does this solve the problem? Can you expand upon it at all?
    – siva.k
    Commented Aug 28, 2014 at 13:48
  • StreamReader() will handle the BOM.
    – Mike S
    Commented Dec 30, 2015 at 0:02
1

Yet another generic variation to get rid of the UTF-8 BOM preamble:

var preamble = Encoding.UTF8.GetPreamble();
if (!functionBytes.Take(preamble.Length).SequenceEqual(preamble))
    preamble = Array.Empty<Byte>();
return Encoding.UTF8.GetString(functionBytes, preamble.Length, functionBytes.Length - preamble.Length);
0

Use a regex replace to filter out any other characters other than the alphanumeric characters and spaces that are contained in a normal certificate thumbprint value:

certficateThumbprint = Regex.Replace(certficateThumbprint, @"[^a-zA-Z0-9\-\s*]", "");

And there you go. Voila!! It worked for me.

-1

I solved the issue with the following code

using System.IO;
using System.Xml.Linq;

void method()
{
    byte[] bytes = GetXmlBytes();
    XDocument doc;
    // XDocument.Load(Stream) detects and skips the BOM itself.
    using (var stream = new MemoryStream(bytes))
    {
        doc = XDocument.Load(stream);
    }
}