
Rick Strahl's Web Log

Wind, waves, code and everything in between...

Wishful Thinking: Why can't HTML fix Script Attacks at the Source?


The Web can be an evil place, especially if you're a Web Developer blissfully unaware of Cross Site Scripting (XSS) attacks. Even if you are aware of XSS in all of its insidious forms, it's extremely complex to deal with all the issues if you're taking user input and actually allowing users to post raw HTML into an application. I'm dealing with this again today in a Web application where legacy data contains raw HTML that has to be displayed, and where users are asking for the ability to use raw HTML as input for listings.

The first line of defense of course is: just say no to HTML input from users. If you don't allow HTML input directly and use HTML encoding (HttpUtility.HtmlEncode() in .NET, or the standard ASP.NET MVC output expression @Model.Content) you're fairly safe, at least from the HTML input provided.

Both WebForms and Razor support HtmlEncoded content, although Razor makes it the default.

In Razor the default @ expression syntax:

@Model.UserContent

automatically produces HTML encoded content - it's safe by default, and you actually have to go out of your way to produce raw HTML output using @Html.Raw() or the HtmlString class.

In Web Forms (V4) you can use:

<%: Model.UserContent %>    

or if you're using a version prior to 4.0:

<%= HttpUtility.HtmlEncode(Model.UserContent) %>


This works great as a hedge against embedded <script> tags and markup: any HTML is encoded into text that displays the tags but doesn't render them. The flip side is that all embedded HTML markup is turned into plain text, so if you need to display user input as raw HTML with the markup tags actually rendering, this approach is worthless.
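As a quick illustration of what encoding does with a malicious snippet (a sketch only - stealCookies() is just a made-up function name, and the exact entity output can vary slightly between framework versions):

string userInput = "<div onclick=\"stealCookies()\"><script>alert(1)</script></div>";

// HttpUtility.HtmlEncode (System.Web) turns the markup characters into entities,
// so the browser displays the tags as text instead of rendering or executing them.
// The encoded result looks roughly like this:
// &lt;div onclick=&quot;stealCookies()&quot;&gt;&lt;script&gt;alert(1)&lt;/script&gt;&lt;/div&gt;
string encoded = HttpUtility.HtmlEncode(userInput);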

If you do accept HTML input and need to echo the rendered HTML back, cleaning up that HTML is a complex task.

In the projects I work on, customers frequently ask for the ability to post raw HTML. Almost every app I've built where there's document content from users starts out with text-only input - possibly using something like Markdown - but inevitably users want to just post plain old HTML they created in some other rich editing application. I see this a lot with realtors especially, who often want to reuse their postings easily in multiple places.

In my work this is a common problem to deal with, and I've tried dozens of different approaches - from sanitizing and simple rejection of input to custom markup schemes - none of which have ever felt comfortable to me. They work in a half-assed, hacked-together sort of way, but I always live in fear of missing something vital, which is *really easy to do*.

My Wishlist Item: A <restricted> tag in HTML

Let me dream here for a second about how to address this problem. It seems to me the easiest place where this can be fixed is in the browser. Browsers are the ones actually executing script code, so they have a lot of control over the script that resides in a page. What if there was a way to specify that you want to turn off script code for a block of HTML?

The main issue when dealing with raw HTML input isn't that we as developers are unaware of the implications of user input, but the fact that we sometimes have to display the raw HTML the user provides. So the problem markup is usually isolated to a very specific part of the document.

So, what if we had a way to specify that in any given HTML block no script code could execute, by wrapping it in a tag that disables all script functionality in the browser? This would include <script> tags and any document script attributes like onclick, onfocus etc., and potentially also disallow things like iFrames, which can be scripted from within the iFrame's content.

I'd like to see something along these lines:

<article>    
    <restricted allowscripts="no" allowiframes="no">
        <div>Some content</div>
        <script>alert('go ahead make my day, punk!');</script>
        <div onfocus="$.getJSON('http://evilsite.com/')">more content</div>
    </restricted>
</article>

A tag like this would basically disallow all script code from firing from any HTML that's rendered within it. You'd wrap it only around content that you render from your own data, and only when that data can contain user-provided HTML. So something like this:

<article>    
    <restricted>
        @Html.Raw(Model.UserContent)
    </restricted>
</article>

For browsers this would actually be easy to intercept. They render the DOM and control the loading and execution of scripts loaded through it. All the browser would have to do is suspend execution of <script> tags and not hook up any event handlers defined via markup in this block. Given all the crazy XSS attacks that exist and the prevalence of this problem, this would go a long way toward preventing at least coded script attacks in the DOM. And it seems like a totally doable solution that wouldn't be very difficult for vendors to implement.

There would also need to be some logic in the parser to not allow a </restricted> or <restricted> tag inside the content, so the restricted section can't be short-circuited (per James Hart's comment). I'm sure there are other issues to consider as well that I didn't think of in my off-the-back-of-a-napkin concept here, but the idea overall seems worth consideration I think.

Without code running in a user-supplied HTML block it'd be pretty hard to compromise a local HTML document and pass information like cookies to a server - or even send data to a server, period. Short of an iFrame that can access the parent frame (which is another restriction that should be available on this <restricted> tag) and could potentially communicate back, there's not a lot a malicious site could do.

The HTML could still 'phone home' via image and href links and basically say the site was accessed, but without the ability to run script code it would be pretty tough to pass along critical information to the server beyond that.

Ahhhh… one can dream…

Not holding my breath of course. The design-by-committee that is the W3C can't agree on anything in timeframes measured in less than decades, but maybe this is one place where browser vendors can actually step up the pressure. It's in their best interest to significantly reduce the attack surface for vulnerabilities on their browser platforms.

Several people commented on Twitter today that there isn't enough discussion on issues like this that address serious needs in the web browser space. Realistically, security has to be a number one concern with Web applications in general - there isn't a Web app out there that is not vulnerable. And yet nothing has been done to address these security issues, even though there might be relatively easy solutions to make that happen.

It'll take time, and it's probably not going to happen in our lifetime, but maybe this rambling thought sparks some ideas on how this sort of restriction can get into browsers in some way in the future.

Posted in ASP.NET  HTML5  HTML  Security  


Feedback for this Post

 
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Bertrand Le Roy April 14, 2012 @ 6:23pm
Well, you'd have to pick a different tag: noscript already exists and has different semantics.
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Rowen April 14, 2012 @ 6:27pm
I agree, great idea which could really simplify things as you explain. However we'll need to think of another name... the noscript tag is taken ;) Maybe something more generic like <protected> where you could have attributes that specify different levels of protection (dealing with script, iframes, content from different origin etc.)
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Rick Strahl April 14, 2012 @ 6:47pm
Duh! Right. Forgot about the existing <noscript> tag. Updated to <restricted> instead with attributes for what's allowed.
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by James Hart April 14, 2012 @ 6:50pm
Well, for one, HTML already HAS a <noscript> tag, so you'll need to come up with a different name for it. And second, what is to stop your malicious user from injecting a </noscript> tag into the page to prematurely terminate your script-injection-free-zone?

All in favor of some sort of mechanism which makes it possible for websites to take more control over which parts of their content are executable, and which aren't, but I think it has to be more sophisticated than this, sadly.
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Rick Strahl April 14, 2012 @ 7:18pm
@James - Good point :-) But the DOM parser would also have to take the <restricted> tag into account and allow neither the start nor the end tag to be interpreted from within the content. Again, the browser could fairly easily deal with this as it's creating the DOM model.

Updated post with a note in that regard. Thanks James.
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Colin April 16, 2012 @ 7:40am
I completely agree with this idea. Being able to restrict script tags in elements would be a useful tool, but instead of having a separate tag, would it be a better idea to set it up as an attribute on any tag? e.g.

<div id="mydiv" allowscript="false" allowiframe="false">
 
</div>


and in that way it could be applied to any element.
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Rick Strahl April 16, 2012 @ 12:37pm
@Colin - yeah that works for me too, but I suspect it'd be easier to add a new element than add functionality to all elements. In the end it'll be a parser behavior more than an element feature.
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Harry M April 16, 2012 @ 5:23pm
For now, our best hope is the Google Caja project. It parses and rewrites HTML+JS into a safe format, but unfortunately it's not compatible with every single JS library.
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Dave Reed April 16, 2012 @ 9:50pm
Sure, it's a shame we don't have something like this. To me though it's only part of the problem. Sometimes you actually do want to allow script, you just want to have some control over what that script can do (e.g. in mashups). Currently if you want to run some script in a sandbox you're pretty much limited to using an iframe, and one with a different domain to boot. It'd be great to see this as part of a more overarching sandboxing/container feature.
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Rick Strahl April 16, 2012 @ 10:36pm
@Dave - agreed. We still need something to wrap around specific sections in a page rather than the full DOM level sandboxing that's happening now. Other things like totally disallowing access to Cookies and headers on the client side would also be helpful.

It wouldn't be needed on all pages only those where potentially dangerous content might be uploaded.
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Rohland April 17, 2012 @ 1:49am
Hey Rick. My feeling is that it would be better to introduce a helper method such as SafeRaw that uses something like the GetSafeHtmlFragment from the Microsoft Anti-XSS library. @Html.SafeRaw could strip out any scripts included by a suspect user which aren't in the supported whitelist. The problem with this mechanism and the one you describe is that both still rely on the developer to actively think about security aspects when developing. IMO @Html.Raw should do this by default with an overload which allows the user to include "dangerous" tags if they really trust the source.

Here's a link to a blog post with some detail on the Anti-XSS library: http://blog.aggregatedintelligence.com/2010/03/microsofts-anti-xss-library.html
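To illustrate the idea, here's a minimal sketch of what such an @Html.SafeRaw helper could look like (the helper and its name are hypothetical, not an existing MVC API; it assumes the AntiXSS library's Sanitizer.GetSafeHtmlFragment method is available):

using System.Web;
using System.Web.Mvc;
using Microsoft.Security.Application;   // AntiXSS library

public static class SafeHtmlExtensions
{
    // Hypothetical helper: sanitize the fragment first, then emit it as raw HTML
    public static IHtmlString SafeRaw(this HtmlHelper helper, string html)
    {
        // GetSafeHtmlFragment strips scripts and other malicious markup
        // while leaving benign tags intact
        string safe = Sanitizer.GetSafeHtmlFragment(html ?? string.Empty);
        return new HtmlString(safe);
    }
}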

PS. I couldn't comment on this last night on my iPad - the bot validation seems to fail consistently :/
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Amit April 23, 2012 @ 9:49am
Rick, won't the attackers be able to craft the scripts like this?

</restricted>
    <script>alert('go ahead make my day, punk!');</script>
<restricted allowscripts="no" allowiframes="no">
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Rick Strahl April 23, 2012 @ 1:37pm
@Rohland - maybe I misunderstand the MS Anti-XSS library, but it's about HTML encoding, which removes HTML's ability to render. That works, but it's not the use case I'm after - I want to allow embeddable HTML, but allow only safe tags (basically anything but <script>, <iframe> and any attribute events).

In the end I ended up using the Html Agility Pack (http://htmlagilitypack.codeplex.com) to write a small routine that does exactly that by parsing the DOM tree. I suspect this is overly simplistic, but after a day of trying to find something that actually fit what I needed, this is what I ended up with.

using System.Collections.Generic;
using HtmlAgilityPack;

public class HtmlSanitizer
{
        
    public HashSet<string> BlackList =  new HashSet<string>() 
    {
            { "script" },
            { "iframe" },
            { "form" },
            { "object" },
            { "embed" },
            { "link" }
    };
        
    /// <summary>
    /// Cleans up an HTML string by removing blacklisted tags and script attributes
    /// </summary>
    /// <param name="html"></param>
    /// <returns></returns>
    public static string SanitizeHtml(string html)
    {
        var sanitizer = new HtmlSanitizer();
        return sanitizer.Sanitize(html);
    }
 
    /// <summary>
    /// Cleans up an HTML string by removing elements
    /// on the blacklist and all attributes that start
    /// with onXXX.
    /// </summary>
    /// <param name="html"></param>
    /// <returns></returns>
    public string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        SanitizeNode(doc.DocumentNode);
        return doc.DocumentNode.WriteTo();
    }
 
    private void SanitizeNode(HtmlNode node)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            if (BlackList.Contains(node.Name))
            {
                node.ParentNode.RemoveChild(node);
                return;
            }
 
            // remove onXXX script attributes (onclick, onfocus etc.)
            if (node.HasAttributes)
            {
                for (int i = node.Attributes.Count - 1; i >= 0; i--)
                {
                    HtmlAttribute currentAttribute = node.Attributes[i];
                    if (currentAttribute.Name.ToLower().StartsWith("on"))
                    {                        
                        node.Attributes.Remove(currentAttribute);
                    }
                }
            }
        }
 
        // Look through child nodes recursively
        if (node.HasChildNodes)
        {
            for (int i = node.ChildNodes.Count - 1; i >= 0; i--)
            {
                SanitizeNode(node.ChildNodes[i]);
            } 
        }
    }
}
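
For reference, using the class above is a one-liner (the sample input below is made up for illustration):

// Strips the blacklisted <script> element and the onclick attribute,
// leaving the rest of the markup intact
string dirty = "<div onclick=\"alert(1)\">Hello<script>alert(1)</script></div>";
string clean = HtmlSanitizer.SanitizeHtml(dirty);
// clean now reads roughly: <div>Hello</div>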


I realize this still allows for 'phone home' type attacks via images, styles and href links, but as far as I can tell it does remove the ability to run code.

In fact I think I'll post this in a separate blog entry and see what we can come up with.
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Mario Marinero April 30, 2012 @ 2:04pm
We'll be able to use something similar with the sandbox attribute but with the limitation that the content has to be placed in an iframe.
http://blogs.msdn.com/b/ie/archive/2011/07/14/defense-in-depth-locking-down-mash-ups-with-html5-sandbox.aspx

However I think the problem with embedding arbitrary HTML has no solution. Being able to protect your users from the most obvious attacks is convenient but not enough.
# re: Wishful Thinking: Why can't HTML fix Script Attacks at the Source?
by Rohland May 02, 2012 @ 3:27am
@Rick - I don't think it's simply encoding. The description of GetSafeHtmlFragment says the following: "Returns a safe version of HTML fragment by either sanitizing or removing all malicious scripts". So, in essence, if a user has maliciously included a script or iFrame tag in their post (let's say it's a comment on a blog), the XSS library will automatically ensure these tags are removed when you render the content on the front end (via @Html.SafeRaw for example). Does this align with your use case?
 

