Expanding Urls with RegEx in .NET

December 21, 2006 • from Maui, Hawaii • 8 comments

One of the things I frequently need to do is take user-entered text and expand any URLs in it into hyperlinks. My library handles this with an ExpandUrlsParser class:

using System;
using System.Text.RegularExpressions;

public class ExpandUrlsParser
{
    public string Target = "";

    /// <summary>
    /// Expands links into HTML hyperlinks inside of text or HTML.
    /// </summary>
    /// <param name="Text">The text to expand</param>
    /// <returns>The text with URLs wrapped in anchor tags</returns>
    public string ExpandUrls(string Text)
    {
        string pattern = @"[""'=]?(http://|ftp://|https://|www\.|ftp\.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])";

        // *** Expand embedded hyperlinks
        RegexOptions options =
            RegexOptions.IgnorePatternWhitespace |
            RegexOptions.Multiline |
            RegexOptions.IgnoreCase;

        Regex reg = new Regex(pattern, options);
        MatchEvaluator MatchEval = new MatchEvaluator(this.ExpandUrlsRegExEvaluator);

        // *** Call Replace on the instance so the options are actually applied
        return reg.Replace(Text, MatchEval);
    }

    /// <summary>
    /// Internal RegExEvaluator callback. Expands the URL
    /// </summary>
    /// <param name="M">The match for a single URL</param>
    /// <returns>The URL as a hyperlink, or the match unchanged if it is already inside an HREF</returns>
    private string ExpandUrlsRegExEvaluator(Match M)
    {
        string Href = M.Value; // M.Groups[0].Value;

        // *** if string starts within an HREF don't expand it
        if (Href.StartsWith("=") ||
            Href.StartsWith("'") ||
            Href.StartsWith("\""))
            return Href;

        string Text = Href;

        if (Href.IndexOf("://") < 0)
        {
            // *** Compare case-insensitively because the match itself ignores case
            if (Href.StartsWith("www.", StringComparison.OrdinalIgnoreCase))
                Href = "http://" + Href;
            else if (Href.StartsWith("ftp", StringComparison.OrdinalIgnoreCase))
                Href = "ftp://" + Href;
            else if (Href.IndexOf("@") > -1)
                Href = "mailto:" + Href;
        }

        string Targ = !string.IsNullOrEmpty(this.Target) ? " target='" + this.Target + "'" : "";

        return "<a href='" + Href + "'" + Targ +
               ">" + Text + "</a>";
    }
}

This class basically takes a string as input and then spits out a formatted document that expands all URLs in the document skipping over any already expanded URLs etc. It can be used like this:

protected void Page_Load(object sender, EventArgs e)
{
    ExpandUrlsParser Parser = new ExpandUrlsParser();
    Parser.Target = "_blank"; // example value: open expanded links in a new window

    string Result = Parser.ExpandUrls("this is a test for www.asp.net expressions. http://www.west-wind.com. www.west-wind.NET");

    Response.Write(Result);
}

Or a little easier with a wrapper function in my utility class:

/// <summary>
/// Expands links into HTML hyperlinks inside of text or HTML.
/// </summary>
/// <param name="Text">The text to expand</param>
/// <param name="Target">Target frame where output is displayed</param>
/// <returns></returns>
public static string ExpandUrls(string Text, string Target)
{
    ExpandUrlsParser Parser = new ExpandUrlsParser();
    Parser.Target = Target;
    return Parser.ExpandUrls(Text);
}

I’ve been using this parser for quite some time, but there is one bug that’s been eluding me that bonks on .net extensions. As it turns out I was missing a closing bracket and that fixed trapping the .net extension properly, but now I’ve run into another issue...

So as I was mucking around with the expression I ran into another odd bug where the parser for whatever reason was not picking up the IgnoreCase option and failing on upper case extensions. I went back and forth checking my code and checking the expressions in various RegEx tools, including my favorite RegexBuddy and Roy Osherove's Regulator (which is built in .NET so it uses the same parser your code is using). Nothing could get it to run...

As it turns out in my over eagerness I missed actually APPLYING the damn options to the Regex.Replace() call... Talk about feeling like a dolt and losing an hour that I’ll never get back <g>...
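
To make the slip concrete, here is a minimal sketch of the difference. The pattern is a simplified stand-in (an assumption for illustration, not the full URL expression), and the class and method names are made up for this example. The static Regex.Replace() overload without an options argument knows nothing about a Regex instance created next to it, so the match stays case-sensitive:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexOptionsDemo
{
    // Simplified stand-in pattern (an assumption for this sketch)
    const string Pattern = @"www\.[\w\-]+\.net";

    public static string WithoutOptions(string text)
    {
        // The instance carries the options, but the static call below never uses it
        Regex reg = new Regex(Pattern, RegexOptions.IgnoreCase);
        return Regex.Replace(text, Pattern, m => "[" + m.Value + "]");
    }

    public static string WithOptions(string text)
    {
        // Passing the options to the static overload applies them
        return Regex.Replace(text, Pattern, m => "[" + m.Value + "]", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string input = "Visit www.west-wind.NET today";
        Console.WriteLine(WithoutOptions(input)); // Visit www.west-wind.NET today
        Console.WriteLine(WithOptions(input));    // Visit [www.west-wind.NET] today
    }
}
```

Calling Replace() on the Regex instance itself (reg.Replace(text, evaluator)) would work just as well, since the instance already holds the options.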

For me that’s exactly the problem with RegEx expressions. If something – even the simplest thing – goes wrong with a RegEx expression I’ll be some time trying to re-figure out how the damn expression works and by then my eyes are so cross-eyed I can’t see the forest for the trees anymore <g>...

Actually it looks like the code I had previously didn’t apply these flags either so at least that particular bug has been fixed... Not all wasted time at least...

The Voices of Reason

Bryan Peters
December 21, 2006

# re: Expanding Urls with RegEx in .NET

Thanks for sharing!

I've been linking all http/www in a string, but this limited people's ability to put in custom links of their own.

I've been meaning to go back through my code and tweak my auto-linking function, and this seems to be exactly what I need to get back in there and clean out my code. Thanks again!

Jeff Atwood
December 21, 2006

# re: Expanding Urls with RegEx in .NET

You can make this code simpler by using a negative lookbehind to automatically *not* match URLs preceded by http://

(?<!http://)www\.[A-Z\-]+\.\w{2,3}

Try it out in regexbuddy with this sample text:

"some site www.asp.com is here to stay. www.west-wind.NET here's more in http://www.west-wind.com"

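For anyone without RegexBuddy at hand, a minimal C# sketch of the same experiment follows. The class and method names are made up for illustration, and RegexOptions.IgnoreCase is assumed so that [A-Z] also matches lower-case host names:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Pattern from the comment above; the lookbehind rejects "www." when "http://" precedes it
    const string Pattern = @"(?<!http://)www\.[A-Z\-]+\.\w{2,3}";

    public static List<string> FindBareWwwHosts(string text)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, Pattern, RegexOptions.IgnoreCase))
            found.Add(m.Value);
        return found;
    }

    public static void Main()
    {
        string sample = "some site www.asp.com is here to stay. www.west-wind.NET here's more in http://www.west-wind.com";
        Console.WriteLine(string.Join(" | ", FindBareWwwHosts(sample)));
        // prints: www.asp.com | www.west-wind.NET
    }
}
```

The third URL is skipped because it is already prefixed with http://, which is what the lookbehind is there for.
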
Jeff Atwood
December 21, 2006

# re: Expanding Urls with RegEx in .NET

Also, I'd suggest doing mailto: in a second pass. Mixing email addresses and URLs is too complicated for a single regex.
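
A rough sketch of what such a second pass might look like, assuming a deliberately simple e-mail pattern and made-up names (MailtoPassDemo, ExpandEmails); real-world address matching is messier than this:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MailtoPassDemo
{
    // Deliberately simple e-mail pattern (an assumption for this sketch)
    const string EmailPattern = @"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+";

    // Second pass: intended to run on text after the URL expansion pass
    public static string ExpandEmails(string text)
    {
        return Regex.Replace(text, EmailPattern,
            m => "<a href='mailto:" + m.Value + "'>" + m.Value + "</a>",
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(ExpandEmails("Questions? Mail rick@example.com."));
        // prints: Questions? Mail <a href='mailto:rick@example.com'>rick@example.com</a>.
    }
}
```

One caveat with running the passes back to back: if the first pass produced a link whose URL contains an @, this simple second pass would match inside it as well, so it would need its own already-expanded check.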

Rick Strahl
December 22, 2006

# re: Expanding Urls with RegEx in .NET

Jeff. Hey look - it's working <g>... BTW, Jeff you were the reason I actually started getting into RegExBuddy <g> from a reference you made in an earlier comment on a post of mine.

Uhm, stripping off the http isn't really the idea. It just needs to be sure that the URL isn't already expanded inside of a quoted href=. That's what the ['"=] check is for which is stripped on the match function and just returned as is.

It should catch things like this: href="www.west-wind.com" and not expand that because that URL is already an HREF. I tried using the negative lookbehind (briefly) on the quote expression at the beginning but I can't get that to work. Don't have my Regex ref handy, but I guess it doesn't really matter. I can just capture the first character and the check in the match handler. Since I'll need that anyway to add additional processing for the URL creation this shouldn't be a problem.

I just realized the expression is missing support for query strings and anchors in all situations. This should do it though:

(["'=])?(http://|ftp://|https://|www\.|ftp\.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])

or

string pattern = @"[""'=]?(http://|ftp://|https://|www\.|ftp\.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])";

Rick Strahl
December 22, 2006

# re: Expanding Urls with RegEx in .NET

One more tweak - when I do URL expansion I tend to FIRST HtmlEncode the input. So if I have content like this from user input:

<a href="http://www.west-wind">West Wind</a>

it will turn into:

&lt;a href=&quot;http://www.west-wind&quot;&gt;West Wind&lt;/a&gt;

So the expression needs to also check for the &quot; prefix:

string pattern = @"([""'=]|&quot;)?(http://|ftp://|https://|www\.|ftp\.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])";

You can see this work on the first <a> tag (but it still fails on the second because it's actually &amp;quot; instead of &quot; - I suppose I could fix that too but screw it - that's a heck of an unlikely scenario <s>).
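
As a minimal sketch of the encode-first flow, assuming WebUtility.HtmlEncode for the encoding step and a cut-down evaluator (the class and method names are made up for illustration, and the pattern is written with a plain & in the character classes):

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EncodedExpandDemo
{
    const string Pattern =
        @"([""'=]|&quot;)?(http://|ftp://|https://|www\.|ftp\.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])";

    public static string EncodeAndExpand(string userInput)
    {
        // Encode first, so user-supplied markup shows up as text
        string encoded = WebUtility.HtmlEncode(userInput);

        return Regex.Replace(encoded, Pattern, m =>
        {
            // Group 1 captured a quote, '=' or &quot; prefix: the URL sits inside an href, leave it alone
            if (m.Groups[1].Success)
                return m.Value;

            string href = m.Value;
            if (href.IndexOf("://") < 0)
                href = (href.StartsWith("ftp", StringComparison.OrdinalIgnoreCase) ? "ftp://" : "http://") + href;

            return "<a href='" + href + "'>" + m.Value + "</a>";
        }, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(EncodeAndExpand("<a href=\"http://www.west-wind.com\">West Wind</a> or www.asp.net"));
    }
}
```

With this input the encoded href is left untouched because of its &quot; prefix, while the bare www.asp.net is expanded. Note that & is part of the URL character class, so a match on encoded text can run into a trailing &quot entity; that is harmless for skipped matches but worth keeping in mind.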

Bindesh Agrawal
May 23, 2008

# re: Expanding Urls with RegEx in .NET

This is a great post. I have a problem finding the anchor text of a link. For example, if the URL is
<a href="http://www.yahoo.com">Link Text</a> then I want to find "Link Text". I already have the URL, so please give a solution or suggest a technique to solve it...

Thanks

Josh
July 08, 2014

# re: Expanding Urls with RegEx in .NET

In order for the options to work, the last line of the ExpandUrls function would need to be modified as follows:

    Public Function ExpandUrls(Text As String) As String
 
        Dim pattern As String = "([""'=])?(http://|ftp://|https://|www\.|ftp\.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])"
 
        ' *** Expand embedded hyperlinks
        Dim options As System.Text.RegularExpressions.RegexOptions = RegexOptions.IgnorePatternWhitespace Or RegexOptions.Multiline Or RegexOptions.IgnoreCase
 
        Dim reg As New System.Text.RegularExpressions.Regex(pattern, options)
 
        Dim MatchEval As New MatchEvaluator(AddressOf Me.ExpandUrlsRegExEvaluator)
        Return Regex.Replace(Text, pattern, MatchEval, options)
    End Function
