XmlWriter, Strings and Byte Order Marks

February 04, 2007 • from Maui, Hawaii • 17 comments

Writing UTF-8 output with an XmlWriter (or even an HtmlTextWriter, for that matter) can be tricky if you’re sending the output anywhere but to a file. If you write the data to a string, or to a stream that is immediately fed into the output stream of a Web application or the POST buffer of an HTTP request, you may find that whatever consumes the generated XML chokes on it.

Typically you have code like this:

    MemoryStream ms = new MemoryStream();
    XmlTextWriter writer = new XmlTextWriter(ms, Encoding.UTF8);
    writer.Formatting = Formatting.Indented;
    writer.WriteStartDocument();
    writer.WriteStartElement("OFX");
    this.CreateSignOnMessage(writer);
    this.CreateInvestmentMessage(writer);
    writer.WriteEndElement(); // OFX
    writer.Close();

If you now want this XML as a string, you can for example do:

this.RequestXml = Encoding.Default.GetString(ms.ToArray());

to get a string that holds the raw UTF-8 bytes decoded through the system's ANSI code page (i.e. any character above 127 shows up as two or more funky characters). But the generated string has a problem that you might easily miss: it contains a Byte Order Mark (BOM) at the beginning:

ï»¿<?xml version="1.0" encoding="utf-8"?>

Byte order marks are usually used for UTF-8 encoded files that are stored on disk, but if you send an XML response back from a Web request or store an XML document as text somewhere, you typically don’t want this byte order mark at the front. The same issue applies if you use the stream directly to write into the HTTP output stream in ASP.NET or as the POST buffer of a WebRequest. If the output goes anywhere but to a file, you typically don't want that BOM at the beginning of the output.
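To see the BOM for yourself, it helps to look at the raw bytes rather than at a decoded string. The following is a minimal sketch that reuses the XmlTextWriter setup from above. The BomDemo class and the WriteSampleXml method are names made up for this illustration, and the message-building calls are left out.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class BomDemo
{
    // Hypothetical helper: writes a tiny document with the given encoding
    // and returns the raw bytes that ended up in the MemoryStream.
    public static byte[] WriteSampleXml(Encoding encoding)
    {
        MemoryStream ms = new MemoryStream();
        XmlTextWriter writer = new XmlTextWriter(ms, encoding);
        writer.WriteStartDocument();
        writer.WriteStartElement("OFX");
        writer.WriteEndElement(); // OFX
        writer.Close();

        // ToArray() still works after the stream has been closed
        return ms.ToArray();
    }

    public static void Main()
    {
        byte[] bytes = WriteSampleXml(Encoding.UTF8);

        // Dump the first three bytes as hex
        Console.WriteLine(BitConverter.ToString(bytes, 0, 3)); // EF-BB-BF
    }
}
```

EF BB BF is the UTF-8 encoding of U+FEFF. Decoded through a single-byte Western code page, those three bytes show up as the ï»¿ characters at the front of the string.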

It’s not really obvious how to get rid of the BOM either. You'd figure the XmlWriter would have an option for this, but whether a byte order mark is written is determined by the Encoding instance.

The default Encoding.UTF8 instance has the byte order mark enabled and you can’t turn it off. Instead, if you want to generate XML without the BOM, you have to create a new encoding and pass it into the XmlTextWriter like this:

    // *** Create encoding manually in order not to
    // *** create leading Byte order marks
    Encoding Utf8 = new UTF8Encoding(false);
    MemoryStream ms = new MemoryStream();
    XmlTextWriter writer = new XmlTextWriter(ms, Utf8);

Whether the BOM is emitted can only be specified in the UTF8Encoding constructor, and the default Encoding.UTF8 instance is set to include it, so your only option is to create a new encoding instance of your own.
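As a quick check that a hand-built encoding behaves differently from Encoding.UTF8, here is a small sketch that compares the preambles of the two encodings and looks at the first byte of the generated XML. NoBomXml and WriteSampleXml are hypothetical names used only for this example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class NoBomXml
{
    // Hypothetical helper: writes a minimal document with a BOM-less UTF-8 encoding.
    public static byte[] WriteSampleXml()
    {
        // false = do not emit the UTF-8 identifier (the BOM)
        Encoding utf8NoBom = new UTF8Encoding(false);

        MemoryStream ms = new MemoryStream();
        XmlTextWriter writer = new XmlTextWriter(ms, utf8NoBom);
        writer.WriteStartDocument();
        writer.WriteStartElement("OFX");
        writer.WriteEndElement(); // OFX
        writer.Close();

        return ms.ToArray();
    }

    public static void Main()
    {
        // The preamble is the byte sequence the encoding asks writers to emit first
        Console.WriteLine(Encoding.UTF8.GetPreamble().Length);           // 3
        Console.WriteLine(new UTF8Encoding(false).GetPreamble().Length); // 0

        byte[] bytes = WriteSampleXml();
        Console.WriteLine((char)bytes[0]);                               // <
    }
}
```

If you use XmlWriter.Create() instead of XmlTextWriter, assigning the same UTF8Encoding(false) instance to XmlWriterSettings.Encoding should have the same effect.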

Now, converting XML to a string is usually not a good idea and should be avoided whenever possible. Keeping XML in stream or byte format and then loading it back into an XmlReader or XmlDocument is preferable, but sometimes string storage is required, as in this older application I’m working on.

The problem with strings is the encoding, of course. XML is usually UTF-8 encoded, so I have to decide how I want to retrieve the data: as a real Unicode string (decode with Encoding.UTF8 to get the original text back, which effectively turns the XML document into UTF-16) or as an 'encoded' string that pretends to be UTF-8 (decode with Encoding.Default to keep the funky UTF-8 byte characters). It gets confusing quickly, even without the byte order marks involved. String encodings are no fun to deal with, so if you can help it, avoid encoding and re-encoding and pretzeling your brain <s>.
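The difference between the two ways of decoding is easier to see on a small sample. This is only a sketch: Encoding.Default varies by machine and runtime, so the code uses ISO-8859-1 as a stand-in for a single-byte Western code page, and DecodeDemo and its methods are hypothetical names.

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    // Assumption: ISO-8859-1 stands in for Encoding.Default, which depends
    // on the machine's code page and on the runtime in use.
    private static readonly Encoding SingleByte = Encoding.GetEncoding("iso-8859-1");

    // Real Unicode text: multi-byte sequences collapse back into one character
    public static string AsUnicode(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    // 'Encoded' string: one character per UTF-8 byte
    public static string AsByteString(byte[] utf8Bytes)
    {
        return SingleByte.GetString(utf8Bytes);
    }

    // Turns the 'encoded' string back into the original UTF-8 bytes
    public static byte[] BackToBytes(string byteString)
    {
        return SingleByte.GetBytes(byteString);
    }

    public static void Main()
    {
        // \u00E9 is e-acute, which takes two bytes (C3 A9) in UTF-8
        byte[] utf8Bytes = new UTF8Encoding(false).GetBytes("<name>Jos\u00E9</name>");

        string unicode = AsUnicode(utf8Bytes);
        string byteString = AsByteString(utf8Bytes);

        Console.WriteLine(unicode.Length);    // 17
        Console.WriteLine(byteString.Length); // 18
        Console.WriteLine(BackToBytes(byteString).Length == utf8Bytes.Length); // True
    }
}
```

The 'encoded' string is only safe as long as it is converted back to bytes with the same single-byte encoding before it goes out over the wire.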

Looking at how the data in the existing application’s database is structured, the stored content already includes the UTF-8 encoding. The app takes that data and fires it off via HTTP to a background service application that processes it at a later point in time. <shrug> I’m stuck with this, but ideally this should probably be stored as binary and then later just sent off into the WebRequest POST input stream. Using the string with this UTF-8 encoding works as well, although it feels wrong <s>… so it goes with legacy code…
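For comparison, the binary route could look roughly like the sketch below: the XML stays a byte array and is written straight into the request stream without ever becoming a string. The XmlPoster class, the URL and the text/xml content type are assumptions made for illustration, not what the application actually uses.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class XmlPoster
{
    // Builds the POST request without touching the network.
    public static HttpWebRequest PrepareXmlPost(string url, byte[] xmlBytes)
    {
#pragma warning disable SYSLIB0014 // WebRequest is flagged obsolete on newer runtimes
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
#pragma warning restore SYSLIB0014
        request.Method = "POST";
        request.ContentType = "text/xml; charset=utf-8"; // assumed content type
        request.ContentLength = xmlBytes.Length;
        return request;
    }

    // Writes the bytes as-is into the POST body and waits for the response.
    public static void Send(HttpWebRequest request, byte[] xmlBytes)
    {
        using (Stream body = request.GetRequestStream())
        {
            body.Write(xmlBytes, 0, xmlBytes.Length);
        }

        using (WebResponse response = request.GetResponse())
        {
            // Response handling is omitted in this sketch
        }
    }

    public static void Main()
    {
        byte[] xmlBytes = new UTF8Encoding(false).GetBytes(
            "<?xml version=\"1.0\" encoding=\"utf-8\"?><OFX />");

        // Placeholder URL; Send() is not called here so nothing goes over the wire
        HttpWebRequest request = PrepareXmlPost("http://localhost/ofx", xmlBytes);
        Console.WriteLine(request.Method + " " + request.ContentLength);
    }
}
```

Because the bytes come from a BOM-less encoding and are never decoded, there is no place for a BOM or a double encoding to sneak in.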

Incidentally, it took me a while to figure out why the server I was eventually POSTing the data to was failing. It kept erroring out with Bad Request errors. When I picked up the log data, the data looked fine. I even went as far as using Beyond Compare to check two responses and they were identical. Not until I hooked up Fiddler to look at the raw HTTP response did I notice the damn Byte Order Marks. <s>
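A text comparison won't show the BOM, but a hex dump of the first few bytes will. The sketch below shows one way to detect and strip a UTF-8 BOM from a byte buffer, and how the two common decoding routes treat it. BomInspector and its methods are hypothetical helpers, not part of the framework.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomInspector
{
    // True if the buffer starts with the UTF-8 preamble EF BB BF
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        if (data.Length < bom.Length) return false;

        for (int i = 0; i < bom.Length; i++)
        {
            if (data[i] != bom[i]) return false;
        }
        return true;
    }

    // Returns a copy without the leading BOM, or the input itself if there is none
    public static byte[] StripUtf8Bom(byte[] data)
    {
        if (!StartsWithUtf8Bom(data)) return data;

        byte[] result = new byte[data.Length - 3];
        Array.Copy(data, 3, result, 0, result.Length);
        return result;
    }

    // A StreamReader skips the preamble while decoding
    public static string ReadText(byte[] data)
    {
        using (StreamReader reader = new StreamReader(new MemoryStream(data), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, (byte)'<', (byte)'a', (byte)'/', (byte)'>' };

        Console.WriteLine(BitConverter.ToString(withBom, 0, 4)); // EF-BB-BF-3C
        Console.WriteLine(StartsWithUtf8Bom(withBom));           // True
        Console.WriteLine(StripUtf8Bom(withBom).Length);         // 4
        Console.WriteLine(ReadText(withBom));                    // <a/>

        // Encoding.GetString() keeps the BOM as a leading U+FEFF character
        Console.WriteLine(Encoding.UTF8.GetString(withBom)[0] == '\uFEFF'); // True
    }
}
```

The same check works for XML that arrives from some other source: inspect or strip the bytes before decoding, or let a StreamReader consume the BOM for you.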

The Voices of Reason

Peter Bromberg
February 04, 2007

# re: XmlWriter, Strings and Byte Order Marks

Funny, I just recently had to deal with this and since I was working with the string representation from WebClient.DownloadString before I loaded it into an XmlDocument for further processing, I just decided to do something like
if(strXml.IndexOf("<") >0) and chopped off the byte order mark if the opening tag index was > 0. Surgical Precision!
Cheers

alextansc
February 04, 2007

# re: XmlWriter, Strings and Byte Order Marks

I happened to run into the same kind of funky requirement where I need UTF-8 XML in a string. I found a different way of handling the MemoryStream though:

//
//Pretty much the same code up to your example up here
//

//Here's where it's different.
ms.Position = 0;

StreamReader sr = new StreamReader(ms);

string strOutput = sr.ReadToEnd();

As far as I can tell, I've not encountered the BOM issue in my string.

Alan
February 05, 2007

# re: XmlWriter, Strings and Byte Order Marks

The documentation for the UTF8Encoding class claims that the default constructor for UTF8Encoding does NOT include the byte order mark. I wonder if the Encoding.UTF8 shared property does not use the default constructor....

http://msdn2.microsoft.com/en-us/library/s756abs9.aspx

Alan

Milan Negovan
February 05, 2007

# re: XmlWriter, Strings and Byte Order Marks

Thanks for the tip, Rick! Been bitten by this before, too.

Hemant
May 15, 2007

# re: XmlWriter, Strings and Byte Order Marks

Rick,

I am using HttpWebRequest to call a web service method. Since I want to use HTTP Digest authentication, I cannot use the proxy for it. I am sending an XML string to the method and I am getting the same 400 Bad Request error. I have tried the solution that you mentioned and checked the index of "<"; it is 0, but I am still getting the same error.

Could you please help me in this?

Thanks in advance.
Hemant

Rick Strahl
May 16, 2007

# re: XmlWriter, Strings and Byte Order Marks

Hemant, most likely your string is badly formatted. Digest Authentication should be supported by the .NET Web Service Proxy simply by passing the Windows client credentials or creating custom credentials.

This should help:
http://www.15seconds.com/issue/020312.htm

Keith
August 03, 2007

# re: XmlWriter, Strings and Byte Order Marks

I really appreciate your help on this. I've been having this problem for a couple hours and you just made my day.

Kunal
December 26, 2007

# re: XmlWriter, Strings and Byte Order Marks

What if you are getting the XML from some other source? How would you identify whether a BOM exists in the XML, and how would you handle that?

Nick
June 12, 2008

# re: XmlWriter, Strings and Byte Order Marks

Thanks! I can't tell you how many times I've been burned by the BOM! You just saved me another hour of pounding my head into the desk.

David
September 29, 2008

# re: XmlWriter, Strings and Byte Order Marks

Exactly what I was looking for...
Thanks!!

emmett
January 14, 2010

# re: XmlWriter, Strings and Byte Order Marks

this was a huge help, thanks a lot!

Thys
March 26, 2010

# re: XmlWriter, Strings and Byte Order Marks

As always, lifesaver!

Bob Peterson
June 15, 2010

# re: XmlWriter, Strings and Byte Order Marks

This helped, so thanks. Of course there is a MS bug under all this. For my unit tests I write XML to a MemoryStream using Encoding.UTF8. Later I convert the resulting byte[] to a string using Encoding.UTF8.GetString(). But that PRESERVES the BOM when it produces the string, which is entirely incorrect as the string is an internal _character_ representation already (modelled) in Unicode! So no encoding is needed, so no BOM is needed. Had Microsoft not made this blunder I would never have noticed (or cared) there was a BOM in the MemoryStream.

John Doe
June 16, 2010

# re: XmlWriter, Strings and Byte Order Marks

Thanks for the tip. Worked great.

Chris Bohling
July 30, 2010

# re: XmlWriter, Strings and Byte Order Marks

Thanks! This was a very helpful post, and the only one I've found so far that directly addresses the BOM issue.

Spam
June 14, 2016

# re: XmlWriter, Strings and Byte Order Marks

Cheers Rick. You would probably believe the number of blind alleys I've been led down to sort this. Thought I was on a Close Out. Happy Daze.

Rick Strahl's Weblog