Rick Strahl's Web Log


XmlWriter, Strings and Byte Order Marks


UTF-8 output with an XmlWriter (or even an HtmlTextWriter, for that matter) can be tricky if you're sending the output anywhere but to a file. If you write to a string, or to a stream that immediately gets fed into the output stream of a Web application or the POST buffer of an HTTP request, you may find that the generated XML blows up on the receiving end.

 

Typically you have code like this:

 

    MemoryStream ms = new MemoryStream();
    XmlTextWriter writer = new XmlTextWriter(ms, Encoding.UTF8);

    writer.Formatting = Formatting.Indented;
    writer.WriteStartDocument();
    writer.WriteStartElement("OFX");

    this.CreateSignOnMessage(writer);
    this.CreateInvestmentMessage(writer);

    writer.WriteEndElement(); // OFX
    writer.Close();

 

If you now take this XML and write it out to a string, for example, you can do:

 

this.RequestXml = Encoding.Default.GetString(ms.ToArray());           

 

to get a string that contains the UTF-8 encoded text (i.e. it has funky characters for any extended characters with values above 127). But the generated string has a problem, and one that you might easily miss: it contains a Byte Order Mark (BOM) at the beginning:

 

ï»¿<?xml version="1.0" encoding="utf-8"?>

 

Byte order marks are usually used for UTF-8 encoded files that are stored on disk, but if you send an XML response back from a Web request or store an XML document as text somewhere, you typically don't want this byte order mark at the front. The same issue applies if you use the stream directly to fire into the HTTP output stream in ASP.NET or as a POST buffer in a WebRequest POST request. If output goes anywhere but to a file, you typically don't want that BOM at the beginning of the output.
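
If you want to confirm what's going on at the byte level, here's a minimal sketch (not code from the application above). The BomCheck class and both of its method names are made up for illustration. It writes a stripped-down version of the document and checks whether the buffer starts with the three UTF-8 BOM bytes (EF BB BF).

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper for illustration only
public static class BomCheck
{
    // Writes a tiny document the same way as the snippet above
    // and returns the raw bytes that ended up in the stream.
    public static byte[] WriteXml(Encoding encoding)
    {
        MemoryStream ms = new MemoryStream();
        XmlTextWriter writer = new XmlTextWriter(ms, encoding);

        writer.WriteStartDocument();
        writer.WriteStartElement("OFX");
        writer.WriteEndElement(); // OFX
        writer.Close();

        // ToArray() still works after the writer has closed the stream
        return ms.ToArray();
    }

    // True if the buffer begins with the UTF-8 BOM bytes EF BB BF
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        return data.Length >= 3
            && data[0] == 0xEF
            && data[1] == 0xBB
            && data[2] == 0xBF;
    }
}
```

Calling BomCheck.StartsWithUtf8Bom(BomCheck.WriteXml(Encoding.UTF8)) should come back true, since the writer emits the encoding's preamble at the start of the stream.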

 

It's not really obvious how to get rid of the BOM either. You'd figure the XmlWriter would have an option for this, but byte order mark usage is determined by the Encoding instance.

 

The default Encoding.UTF8 encoding has the byte order mark enabled and you can't turn it off. Instead, if you want to generate XML without the BOM, you have to create a new encoding and pass it into the XmlTextWriter like this:

 

    // *** Create encoding manually in order not to
    // *** create leading Byte order marks
    Encoding Utf8 = new UTF8Encoding(false);

    MemoryStream ms = new MemoryStream();
    XmlTextWriter writer = new XmlTextWriter(ms, Utf8);

 

The BOM behavior can only be specified in the UTF8Encoding constructor, and the default Encoding.UTF8 instance is set to include the BOM, so your only option is to create a new encoding instance of your own.
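
Put together, a complete version might look something like the following sketch. NoBomXml, Build and the NOTE element are hypothetical names used only for illustration; the real document is built by the CreateSignOnMessage and CreateInvestmentMessage helpers shown earlier.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper for illustration only
public static class NoBomXml
{
    // Builds a small XML document in memory and returns it
    // as UTF-8 bytes without a leading byte order mark.
    public static byte[] Build()
    {
        // false = do not emit the UTF-8 identifier (the BOM)
        Encoding utf8NoBom = new UTF8Encoding(false);

        using (MemoryStream ms = new MemoryStream())
        {
            using (XmlTextWriter writer = new XmlTextWriter(ms, utf8NoBom))
            {
                writer.Formatting = Formatting.Indented;
                writer.WriteStartDocument();
                writer.WriteStartElement("OFX");

                // Made-up element with a non-ASCII value (e with acute accent)
                writer.WriteElementString("NOTE", "caf\u00e9");

                writer.WriteEndElement(); // OFX
            }

            // The writer has flushed and closed the stream by now,
            // but ToArray() can still be called on a closed MemoryStream.
            return ms.ToArray();
        }
    }
}
```

The byte array returned by NoBomXml.Build() should start directly with the '<' of the XML declaration, and the declaration still reports encoding="utf-8".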

 

Now, converting XML to a string is usually not a good idea and should be avoided whenever possible. Keeping XML in stream or byte format and then loading it back into an XmlReader or XmlDocument is preferable, but sometimes string storage is required, such as in this older application I'm working on.

 

The problem with strings is the encoding, of course. XML is usually UTF-8 encoded, so I have to decide whether I want to retrieve the data as Unicode (use Encoding.UTF8 to decode and get the original characters back, which effectively turns the XML document into UTF-16 in memory) or as an 'encoded' string that pretends to be UTF-8 (use Encoding.Default to retrieve the funky UTF-8 markup characters). It gets confusing quickly, even without byte order marks involved. String encodings are no fun to deal with, so if you can help it, avoid encoding and recoding and pretzeling your brain <s>.
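
To make the two choices concrete, here's a small sketch with made-up names (DecodeDemo, AsUnicode, AsSingleByte). One caveat: Encoding.Default depends on the runtime. On the .NET Framework it's the system ANSI code page, while on .NET Core and later it's UTF-8, so the sketch uses ISO-8859-1 as a stand-in for a single-byte Western code page. That substitution is my assumption, not what the original application does.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only
public static class DecodeDemo
{
    // Stand-in for a single-byte Western code page (assumption, see lead-in)
    private static readonly Encoding SingleByte = Encoding.GetEncoding("iso-8859-1");

    // Decodes UTF-8 bytes back into the original characters.
    // The resulting .NET string is UTF-16 in memory.
    public static string AsUnicode(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    // Maps every byte to exactly one char, so multi-byte UTF-8
    // sequences show up as two or more "funky" characters.
    public static string AsSingleByte(byte[] utf8Bytes)
    {
        return SingleByte.GetString(utf8Bytes);
    }

    // Reverses AsSingleByte to recover the original UTF-8 bytes.
    public static byte[] BackToBytes(string singleByteText)
    {
        return SingleByte.GetBytes(singleByteText);
    }
}
```

For the UTF-8 bytes of "café", AsUnicode gives back the four original characters, while AsSingleByte produces five characters ending in "Ã©". BackToBytes turns that five character string into the original UTF-8 bytes again, which is why the 'encoded' string can still be sent on as UTF-8 later.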

 

Looking at how the data in the existing application's database is already structured, the stored content includes the UTF-8 encoding: the app takes that data and fires it off via HTTP to a background service application that processes it at a later point in time. <shrug> I'm stuck with this, but ideally this should probably be stored as binary and then later just sent off into the WebRequest POST input stream. Using the string with this UTF-8 encoding works as well, although it feels wrong <s>… so it goes with legacy code…

 

Incidentally, it took me a while to figure out why the server I was eventually POSTing the data to was failing. It kept erroring out with Bad Request errors. When I picked up the log data, the data looked fine. I even went as far as using Beyond Compare to check two responses, and they were identical. Not until I hooked up Fiddler to look at the raw HTTP response did I notice the damn Byte Order Marks. <s>

Posted in .NET  XML  


Feedback for this Post

 
# re: XmlWriter, Strings and Byte Order Marks
by Peter Bromberg February 04, 2007 @ 5:21pm
Funny, I just recently had to deal with this and since I was working with the string representation from WebClient.DownloadString before I loaded it into an XmlDocument for further processing, I just decided to do something like
if(strXml.IndexOf("<") >0) and chopped off the byte order mark if the opening tag index was > 0. Surgical Precision!
Cheers
# re: XmlWriter, Strings and Byte Order Marks
by alextansc February 04, 2007 @ 11:41pm
I happened to run into the same kind of funky requirement where I need UTF-8 XML in a string. I found a different way of handling the MemoryStream though:
//
//Pretty much the same code up to your example up here
//

//Here's where it's different.
ms.Position = 0;

StreamReader sr = new StreamReader(ms);

string strOutput = sr.ReadToEnd(); 


As far as I can tell, I've not encountered the BOM issue in my string.
# re: XmlWriter, Strings and Byte Order Marks
by Alan February 05, 2007 @ 7:20am
The documentation for the UTF8Encoding class claims that the default constructor for UTF8Encoding does NOT include the byte order mark. I wonder if the Encoding.UTF8 shared property does not use the default constructor....

http://msdn2.microsoft.com/en-us/library/s756abs9.aspx

Alan
# re: XmlWriter, Strings and Byte Order Marks
by Milan Negovan February 05, 2007 @ 7:55am
Thanks for the tip, Rick! Been bitten by this before, too.
# re: XmlWriter, Strings and Byte Order Marks
by Hemant May 15, 2007 @ 11:21pm
Rick,

I am using HttpWebRequest to call web service method since i want to use HttpDigest authentication i can not use proxy of the same. i am sending a XML string to the method. i am getting the same 400 Bad Request error. I have tried the solution that u mentioned and checked the index of "<" it is 0. but still getting the same error.

Could you please help me in this?

Thanks in advance.
Hemant
# re: XmlWriter, Strings and Byte Order Marks
by Rick Strahl May 16, 2007 @ 1:06am
Hemant, most likely your string is badly formatted. Digest Authentication should be supported by the .NET Web Service Proxy simply by passing the Windows client credentials or creating custom credentials.

This should help:
http://www.15seconds.com/issue/020312.htm
# re: XmlWriter, Strings and Byte Order Marks
by Keith August 03, 2007 @ 9:07pm
I really appreciate your help on this. I've been having this problem for a couple hours and you just made my day.
# re: XmlWriter, Strings and Byte Order Marks
by Kunal December 26, 2007 @ 4:34am
What if you are getting the xml from some other source? then how will you identify if there is BOM exists in the xml and how will you handle that?
# re: XmlWriter, Strings and Byte Order Marks
by Nick June 12, 2008 @ 12:19pm
Thanks! I can't tell you how many times I've been burned by the BOM! You just saved me another hour of pounding my head into the desk.
# re: XmlWriter, Strings and Byte Order Marks
by David September 29, 2008 @ 11:59pm
Exactly that I was looking for...
Thanks!!
# re: XmlWriter, Strings and Byte Order Marks
by emmett January 14, 2010 @ 3:55pm
this was a huge help, thanks a lot!
# re: XmlWriter, Strings and Byte Order Marks
by Thys March 26, 2010 @ 9:36am
As always, lifesaver!
# re: XmlWriter, Strings and Byte Order Marks
by Bob Peterson June 15, 2010 @ 7:15am
This helped, so thanks. Of course there is a MS bug under all this. For my unit tests I write XML to a MemoryStream using Encoding.UTF8. Later I convert the resulting byte[] to a string using Encoding.UTF8.GetString(). But that PRESERVES the BOM when it produces the string, which is entirely incorrect as the string is an internal _character_ representation already (modelled) in Unicode! So no encoding is needed, so no BOM is needed. Had Microsoft not made this blunder I would never have noticed (or cared) there was a BOM in the MemoryStream.
# re: XmlWriter, Strings and Byte Order Marks
by John Doe June 16, 2010 @ 7:52am
Thanks for the tip. Worked great.
# re: XmlWriter, Strings and Byte Order Marks
by Chris Bohling July 30, 2010 @ 10:49am
Thanks! This was a very helpful post, and the only one I've found so far that directly addresses the BOM issue.
 


West Wind  © Rick Strahl, West Wind Technologies, 2005 - 2014