Rick Strahl's Weblog  


XmlWriter and lower ASCII characters



Ran into an interesting problem today on my CodePaste.net site: The main RSS and ATOM feeds on the site were broken because one code snippet on the site contained a lower ASCII character (CHR(3)). I don't think this was done on purpose but it was enough to make the feeds fail.

After quite a bit of debugging, and after throwing a custom error handler into my feed generation code that just spit out the raw error instead of running it through ASP.NET MVC and my own error pipeline, I found the actual error.

The lovely base exception and error trace I got looked like this:

Error: '', hexadecimal value 0x03, is an invalid character.


at System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
at System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
at System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
at System.Xml.XmlWellFormedWriter.WriteString(String text)
at System.Xml.XmlWriter.WriteElementString(String localName, String ns, String value)
at System.ServiceModel.Syndication.Rss20FeedFormatter.WriteItemContents(XmlWriter writer, SyndicationItem item, Uri feedBaseUri)
at System.ServiceModel.Syndication.Rss20FeedFormatter.WriteItem(XmlWriter writer, SyndicationItem item, Uri feedBaseUri)
at System.ServiceModel.Syndication.Rss20FeedFormatter.WriteItems(XmlWriter writer, IEnumerable`1 items, Uri feedBaseUri)
at System.ServiceModel.Syndication.Rss20FeedFormatter.WriteFeed(XmlWriter writer)
at System.ServiceModel.Syndication.Rss20FeedFormatter.WriteTo(XmlWriter writer)
at CodePasteMvc.Controllers.ApiControllerBase.GetFeed(Object instance) in C:\Projects2010\CodePaste\CodePasteMvc\Controllers\ApiControllerBase.cs:line 131

XML doesn't like lower ASCII Characters

It turns out the issue is that XML in general does not deal well with lower ASCII characters. According to the XML spec, any characters below 0x20 other than 0x09 (tab), 0x0A (LF) and 0x0D (CR) are invalid. If you generate an XML document in .NET with an embedded 0x03 character entity (as mine did to create the error above), you get an XML document error when displaying it in a viewer. For example, here's what my feed output with the invalid character embedded looks like in Chrome, which displays RSS feeds as raw XML by default:

[Screenshot: ChromeError]

Other browsers show similar error messages. The nice thing about Chrome is that you can actually view source and jump down to see the line that causes the error which allowed me to track down the actual message that failed.

If you create an XML document that contains a 0x03 character the XML writer fails outright with the error:

'', hexadecimal value 0x03, is an invalid character.
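For reference, the failure is easy to reproduce outside of the feed code. The following is a minimal sketch, not the actual CodePaste.net code: the XmlWriterDefaultDemo class, the TryWriteElement method and the "description" element name are made up for illustration. It writes a single element with default XmlWriterSettings, where CheckCharacters is true, and reports whether the writer threw:

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlWriterDefaultDemo
{
    // Hypothetical helper: writes <description>text</description> using the
    // default writer settings. Returns false and the exception message if
    // the writer rejects the text.
    public static bool TryWriteElement(string text, out string error)
    {
        error = "";
        try
        {
            using (var ms = new MemoryStream())
            using (XmlWriter writer = XmlWriter.Create(ms)) // CheckCharacters defaults to true
            {
                writer.WriteElementString("description", text);
            }
            return true;
        }
        catch (ArgumentException ex)
        {
            // The invalid character error surfaces as an ArgumentException
            error = ex.Message;
            return false;
        }
    }
}
```

Calling TryWriteElement("hello", out error) should succeed, while TryWriteElement("bad\u0003text", out error) should return false with the invalid character message in error.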

The good news is that this behavior can be overridden, so the XML output can at least be created, by passing an XmlWriterSettings object when configuring the XmlWriter instance. In my RSS configuration code this looks something like this:

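// Requires: using System.IO; using System.Xml;
MemoryStream ms = new MemoryStream();
var settings = new XmlWriterSettings()
{
    // Turn off the character range check so 0x03 no longer throws
    CheckCharacters = false
};
XmlWriter writer = XmlWriter.Create(ms, settings);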

and voila the feed now generates.

Now generally this is probably NOT a good idea, because as mentioned above these characters are illegal and if you view the raw XML document you'll get validation errors. Luckily, most RSS feed readers don't care and happily accept and display the feed correctly, which is good because it got me over an embarrassing hump until I figured out a better solution.
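To make the trade-off concrete, here's a small sketch that writes a document with CheckCharacters turned off and then tries to read it back with a default XmlReader, which still checks characters. The UncheckedXmlDemo class, its two methods and the "description" element name are hypothetical and only there for illustration; this isn't the feed code itself:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class UncheckedXmlDemo
{
    // Writes <description>text</description> with character checking off,
    // so control characters like 0x03 make it into the output.
    public static string WriteUnchecked(string text)
    {
        var settings = new XmlWriterSettings
        {
            CheckCharacters = false,
            OmitXmlDeclaration = true
        };
        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteElementString("description", text);
        }
        return sb.ToString();
    }

    // Returns true if a default XmlReader (CheckCharacters = true)
    // can read the document from start to end.
    public static bool CanParse(string xml)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With this in place, CanParse(WriteUnchecked("a\u0003b")) should come back false even though the write succeeded, which is the same kind of complaint a browser or any other strict XML parser raises about the feed.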

How to handle lower ASCII Characters?

I was glad to get the feed fixed for the time being, but now I was still stuck with an interesting dilemma. CodePaste.net accepts user input for code snippets and those code snippets can contain just about anything. This means that ASP.NET's standard request filtering cannot be applied to this content. The code content displayed is encoded before display so for the HTML end the CHR(3) input is not really an issue.

While invisible characters are hardly useful in user input, it's not uncommon for odd characters to show up in code snippets. You know, the old fat fingering that happens in the middle of a coding session: those invisible characters sometimes end up in code editors and then get pasted into the HTML textbox as a CodePaste.net snippet.

The question is how to filter this text? Looking back at the XML Charset Spec it looks like all characters below 0x20 (space) except for 0x09 (tab), 0x0A (LF), 0x0D (CR) are illegal. So applying the following filter with a RegEx should work to remove invalid characters:

string code = Regex.Replace(item.Code, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");

Applying this RegEx to the code snippet (and title) eliminates the problems and the feed renders cleanly.
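If the filter is needed in more than one place (snippet code, title and so on), it can be wrapped in a small helper. This is just a sketch: XmlTextSanitizer and StripInvalidXmlChars are names I made up here. The character class is written without comma separators, since a comma inside [...] is matched literally and would be stripped from the text too. Note that this only covers the control characters below 0x20 and doesn't attempt to deal with other characters the XML spec disallows, such as unpaired surrogates:

```csharp
using System.Text.RegularExpressions;

public static class XmlTextSanitizer
{
    // Control characters below 0x20 that XML 1.0 disallows: everything
    // except tab (0x09), line feed (0x0A) and carriage return (0x0D).
    private static readonly Regex InvalidControlChars =
        new Regex(@"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", RegexOptions.Compiled);

    // Hypothetical helper: returns the text with those characters removed.
    public static string StripInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        return InvalidControlChars.Replace(text, "");
    }
}
```

For example, StripInvalidXmlChars("a,b\u0003c\t\r\n") returns "a,bc\t\r\n": the 0x03 is gone while the comma, tab, CR and LF are left alone.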

Posted in .NET  XML  

The Voices of Reason


 

rich
January 02, 2012

# re: XmlWriter and lower ASCII characters

Rick - you're beyond smart, but I can tell you weren't coding in the 1970s!!!

0x0A = LF
0x0C = FF (form feed - move to the start of the next form in the printer)
0x0D = CR

A very happy new year to you

Dan
January 02, 2012

# re: XmlWriter and lower ASCII characters

I'm astonished at this -- I thought that UTF-8 was specifically meant to be a superset of ASCII so that it was backwards compatible. http://en.wikipedia.org/wiki/Utf-8 You didn't include the entire XML document being served, so I can't tell: Did you include a document type declaration with a UTF-8 encoding attribute at the top? (Although it appears Microsoft assumes UTF-8 if it can't find a declaration: http://msdn.microsoft.com/en-us/library/aa468560.aspx )

Rick Strahl
January 02, 2012

# re: XmlWriter and lower ASCII characters

@rich - ha ha. Fat fingered. Yes of course 0x0D is the CR. Got it backwards :-) Luckily though the code is correct (skipping around 0x0D). Fixed in post. Thanks rich!

Janosch
August 20, 2012

# re: XmlWriter and lower ASCII characters

I think your regex is slightly wrong. I used it and found out that all commas have been deleted, too. So I think the regex should rather be:

[\u0000-\u0008\u000B\u000C\u000E-\u001F]

Michael B
May 18, 2014

# re: XmlWriter and lower ASCII characters

Rich - you are funny. I think Rick belongs in a category that is a few grades above 'beyond smart'. Rick is my friggin hero and you sir are beyond smart to pick up on what you did. This community never ceases to amaze me.

Matt
November 14, 2016

# re: XmlWriter and lower ASCII characters

Janosch is correct. The supplied regular expression removes commas. Use the updated one instead! [\u0000-\u0008\u000B\u000C\u000E-\u001F]

I don't know if Rick reads these comments on such an old article, but updating the original text would be great.

Thanks

FWIW: here is my code to strip out all invalid characters before writing a dataset to XML

foreach (DataRow dr in rsFiltered.Tables[0].Rows)
{
    foreach (DataColumn c in dr.Table.Columns)
    {
        if (c.DataType == typeof(String))
        {
            dr[c] = ValidationHelper.ClearInvalidXML(dr[c].ToString());
        }
    }
}

West Wind  © Rick Strahl, West Wind Technologies, 2005 - 2024