I’m a RegEx Weenie

August 22, 2005 • 13 comments


No really. RegEx is nice when you can use it, and even nicer if you can find a ready-made expression for your needs. Which of course will promptly turn out to fall just short of doing exactly what you really need, at which point you're flung into parsing hell trying to figure out a 100 character string of gobbledygook.

I understand the power of RegEx, but heck, I dread the times when I need to use it to accomplish some text parsing. I’m not cut out for cryptic code like RegEx, Perl or XSLT for that matter. I think my eyes constantly want to do a Word Jumble and rearrange things into real words. I like big, nice, drawn-out logical structures. But RegEx of course does things that are difficult to accomplish in other ways, at least without reams of code, even if it probably is no faster to arrive at the solution <g>. Those 10 characters of mystic significance seem to take just as long to write as the 200 lines of parsing code they might replace.

Anyway, today I’m building a page parser for Web Connection that parses an ASPX-like script template page into source code.

Maybe some of you who are a little more RegEx savvy can help me out here. What I need to do initially is retrieve all tags in the HTML document.

The following expression works well for this:

<(.|\s)+?>

This returns a set of matches for each of the tags in the document, including multi-line tags, which is great for the parser I'm building. I can basically run through the list, parse the tags, extract the text, and replace it with the generated output.
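As a rough illustration of that first step, here is a minimal sketch in Python. Python's `re` module stands in for the .NET engine used in the post, and the sample document and the `find_tags` helper are made up for the example.

```python
import re

# The expression from the post: "<", then lazily any character
# (newlines are picked up by the \s alternative), up to the first ">".
TAG_PATTERN = re.compile(r"<(.|\s)+?>")

# Hypothetical sample document; the ww:TextBox tag spans two lines.
SAMPLE = """<html>
<body>
<ww:TextBox id="value"
            runat="server" />
</body>
</html>"""


def find_tags(text):
    """Return the full text of every tag-like match, in document order."""
    return [match.group(0) for match in TAG_PATTERN.finditer(text)]


if __name__ == "__main__":
    for tag in find_tags(SAMPLE):
        print(repr(tag))
```

Each match object also carries its start and end offsets, which is what a parser would need in order to replace the tag text with generated output.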

But – and there’s always a ‘but’ isn’t there – I run into problems if there’s HTML embedded inside one of the attributes.

<% Page

id="test"

value=""

<html>

<body>

<ww:TextBox id="value" runat="server" text="<b>this is good</b>" custom="test" />

The problem is that the above expression will match the > of the <b> inside the attribute string and treat it as the end of the tag. So I basically need to exclude any > characters that are contained inside of string delimiters.
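The failure is easy to reproduce. This sketch runs the same expression over the ww:TextBox line from the sample, again using Python's `re` as a stand-in for the .NET engine; the variable names are only for illustration.

```python
import re

# Same expression as above: lazy match from "<" to the first ">".
TAG_PATTERN = re.compile(r"<(.|\s)+?>")

# The problematic line from the sample: HTML inside the text attribute.
LINE = '<ww:TextBox id="value" runat="server" text="<b>this is good</b>" custom="test" />'

matches = [match.group(0) for match in TAG_PATTERN.finditer(LINE)]

if __name__ == "__main__":
    for found in matches:
        print(repr(found))
```

The first match stops at the > of the embedded <b>, so the tag is cut off in the middle of the text attribute, and the embedded </b> is then reported as a tag of its own.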

I don’t even know where to start on how to do this… Any hints?

Incidentally I’m glad for Roy Osherove’s Regulator, which has been an immense help in the handful of expressions I had to build today. It’s a great tool for checking out RegEx expressions, with built-in Intellisense for most expression characters and structures, which can be a great help if you’re like me and very slowly poking away at expressions.

Helpful as that is, I probably should look into a better reference. I’ve been using my dog-eared copy of C# Text Manipulation Handbook, but the RegEx portion of this book is not very strong. Any suggestions?

The Voices of Reason

Dave A
August 22, 2005

# re: I’m a RegEx Weenie

I'm with you on this one. RegEx is as forgettable as XSLT or MDX.

This is probably not what you are after considering the amount of work that you have done but I am faced with a similar problem that I am scheduled to solve shortly.

Rather than the RegEx solution, I am considering throwing the text (HTML in my case) at the HTML to SGML converter found at http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC

I don't know how it will go with the <% ... %> but I would imagine that it would go quite well.

Once it is in XML you can use the DOM to readily get out all of the tags.

Regards
Dave A

Rick Strahl
August 22, 2005

# re: I’m a RegEx Weenie

I actually had considered using XML to parse document fragments. For example, I need to parse attributes, which would be easy with the XMLDOM. Unfortunately tags can have namespaces (<asp:TextBox ...>), which doesn't work easily unless you assign a namespace etc., and then you need to first parse the tags anyway, which kind of defeats the purpose. FWIW, a simple attribute parsing expression, assuming you only allow double-quoted attribute values, is this:

\s\w+=".+?"

I originally allowed single quotes as well, but then I'm back to the same problem as above of differentiating between the delimiters (', ")...
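A small sketch of how that attribute expression might be used, in Python rather than .NET. The two capture groups and the `parse_attributes` helper are additions for the example and are not part of the expression above.

```python
import re

# The attribute expression from the comment, with capture groups added
# (an assumption for this sketch) so name and value can be pulled apart.
ATTRIBUTE_PATTERN = re.compile(r'\s(\w+)="(.+?)"')

TAG = '<ww:TextBox id="value" runat="server" text="<b>this is good</b>" custom="test" />'


def parse_attributes(tag_text):
    """Return a dict of attribute name -> value for double-quoted attributes."""
    return {name: value for name, value in ATTRIBUTE_PATTERN.findall(tag_text)}


if __name__ == "__main__":
    for name, value in parse_attributes(TAG).items():
        print(name, "=", value)
```

One thing to watch: `.+?` needs at least one character, so an empty attribute such as value="" would not be captured as intended. Using `.*?` instead is one way to allow empty values.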

Tom Pester
August 23, 2005

# re: I’m a RegEx Weenie

The problem you need to solve is to match the quoted strings as a unit and let the matching continue after them. An extra alternation in the regex does the trick for your sample data:

<(".*?"|.|\s)+?>

To match also single quoted strings the regex becomes :

<(".*?"|'.*?'|.|\s)+?>

I love regexes, so if there are special cases where it doesn't match, just let me know. (After reading some of your excellent posts, I'm glad to help you.)
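For reference, a quick check of the quote-aware expression against the sample line, sketched in Python (a stand-in for the .NET engine; the surrounding document text and the helper name are invented for the example):

```python
import re

# The expression from this comment: quoted strings are tried first and
# consumed whole, so a ">" inside quotes can no longer end the tag.
QUOTE_AWARE_TAG = re.compile(r"""<(".*?"|'.*?'|.|\s)+?>""")

# Hypothetical document wrapped around the TextBox line from the post.
DOC = (
    "<body>\n"
    '<ww:TextBox id="value" runat="server" text="<b>this is good</b>" custom="test" />\n'
    "</body>"
)


def find_tags(text):
    """Return the full text of every tag match, in document order."""
    return [match.group(0) for match in QUOTE_AWARE_TAG.finditer(text)]


if __name__ == "__main__":
    for tag in find_tags(DOC):
        print(repr(tag))
```

With this input the ww:TextBox tag comes back as a single match, embedded <b> markup included.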

Tom Pester
August 23, 2005

# re: I’m a RegEx Weenie

There are 2 sources that ended my regex hell.
Before that I was also hacking around and constructing regexes through trial and error.

Read this book and become a regex master :
http://regex.info/

I really recommend it. You will speak and dream in regex afterwards ;)

I find this tool to be very powerful :
http://www.regexbuddy.com/

Check out PowerGREP for a second too, because it's one of the best programs I've seen purely from a GUI perspective.

Bob Archer
August 23, 2005

# re: I’m a RegEx Weenie

Rick,

I feel your pain. I cringe whenever I have to do regex, but struggle with it knowing it is the best way to do most string parsing.

However, I don't use it enough to remember it, or to want to read a whole book on the subject. I do have a "Tips" card from VisiBone which is a nice handy reference. It would probably be more handy if I knew what half of the stuff meant.

http://www.visibone.com/javascript/foldouts.html

Regex is one of the three foldouts in the javascript collections.

BOb

Bertrand Le Roy
August 23, 2005

# re: I’m a RegEx Weenie

Regexes are really useful and efficient in many cases, and the actual ASPX parser does use them (and it's a big, fat, ugly RegEx, believe me). Now, most parsers actually don't use them. In your case, maybe it would be just simpler to go without them.
Don't get me wrong, if you plan to modify your parser's logic a lot, RegEx gives you flexibility that a custom parser will have difficulties achieving. But if your goal is just to extract the tags or their contents, something like this will probably be just as efficient and a lot easier to code:
- look at each char in the string
- Each time you see < and are not in quote mode, enter tag mode
- If in tag mode and see " or ', toggle quote mode
- If in tag mode and not in quote mode and see >, exit tag mode
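
The steps above can be sketched as a small scanner. This is only one possible reading of the list, written in Python; the function name is made up, and remembering which quote character opened the string (rather than a plain toggle) is a small refinement added here so that an apostrophe inside a double-quoted value does not close it.

```python
def find_tags(text):
    """Scan text one character at a time and return every complete tag."""
    tags = []
    tag_start = None   # index of "<" while in tag mode, otherwise None
    quote_char = None  # the quote that opened quote mode, otherwise None

    for index, char in enumerate(text):
        if tag_start is None:
            # Outside any tag: only "<" is interesting.
            if char == "<":
                tag_start = index
        elif quote_char is not None:
            # In quote mode: ignore everything until the matching quote.
            if char == quote_char:
                quote_char = None
        elif char in "\"'":
            # In tag mode and a quote opens: enter quote mode.
            quote_char = char
        elif char == ">":
            # In tag mode, not in quote mode: the tag ends here.
            tags.append(text[tag_start:index + 1])
            tag_start = None

    return tags


if __name__ == "__main__":
    line = '<ww:TextBox id="value" runat="server" text="<b>this is good</b>" custom="test" />'
    print(find_tags(line))
```

Because the scanner keeps explicit state, it is also a natural place to add error handling later, for example reporting a tag that is still open when the input ends.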

Malcolm Greene
August 24, 2005

# re: I’m a RegEx Weenie

Rick,

I agree with Bertrand's approach. I've built lots of parsers in my day and the best technique I've found so far is to use a finite state machine type of approach vs. an approach based purely on regular expressions.

IMO, regular expressions are too brittle for real-life parsing needs. They work perfectly if your source is perfect, but quickly fall apart as soon as your source has errors.

Malcolm

Simon Ferguson
August 28, 2005

# re: I’m a RegEx Weenie

I've found a very useful regular expression for the case where you need to exclude matches that are enclosed by certain delimiters:

(?!((?!(?:start)).)*?(?:finish))

Replace start and finish with your delimiters of choice. The negative lookahead expression - ?! - tells the regex parser to "match on what isn't next". So to ignore double quotes, put the following regex after your expression:

(?!((?!(?:")).)*?(?:"))

To also handle single quotes, you could use (?!((?!(?:"|')).)*?(?:"|'))
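
To see what the lookahead does in practice, here is a sketch in Python that appends the double-quote version to a literal > and lists the positions that still match. The helper name and the second test string are invented for the example, and Python's `re` is again a stand-in for the .NET engine.

```python
import re

# A literal ">" followed by the negative lookahead from this comment.
# The ">" only matches when no double quote follows it later on the
# same line ("." does not cross newlines by default).
UNQUOTED_GT = re.compile(r'>(?!((?!(?:")).)*?(?:"))')

LINE = '<ww:TextBox id="value" runat="server" text="<b>this is good</b>" custom="test" />'


def unquoted_gt_positions(text):
    """Return the index of every ">" accepted by the lookahead."""
    return [match.start() for match in UNQUOTED_GT.finditer(text)]


if __name__ == "__main__":
    print(unquoted_gt_positions(LINE))
    print(unquoted_gt_positions('<a href="x"><b class="y">'))
```

On the sample line only the final > is accepted, which is the desired result. Worth noting, though: the lookahead rejects a > whenever any quote appears later on the line, so in the second string the > that closes the first tag is rejected too. It seems best suited to input where each tag sits on its own line.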

BTW, for matching HTML tags

<[^<>]+>

might be a better choice than <(.|\s)+?>. Using the . token indiscriminately matches any character whereas the negated [^<>] expression is more specific.
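
A side-by-side sketch of the two tag expressions, in Python (standing in for .NET; the sample strings and names are only for illustration):

```python
import re

LAZY_DOT = re.compile(r"<(.|\s)+?>")     # the expression from the post
NEGATED_CLASS = re.compile(r"<[^<>]+>")  # the alternative suggested here

PLAIN = '<p class="note">hello</p>'
EMBEDDED = '<ww:TextBox id="value" runat="server" text="<b>this is good</b>" custom="test" />'


def find_all(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [match.group(0) for match in pattern.finditer(text)]


if __name__ == "__main__":
    for label, text in (("plain", PLAIN), ("embedded", EMBEDDED)):
        print(label, "lazy dot     :", find_all(LAZY_DOT, text))
        print(label, "negated class:", find_all(NEGATED_CLASS, text))
```

On plain markup both expressions agree. On the line with HTML inside an attribute, the negated class cannot match the outer tag at all, because it refuses to cross the embedded <, and it reports only the inner <b> and </b>. So it is the tighter pattern, but it does not solve the quoted-attribute case by itself.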

Rick Strahl
September 19, 2005

# re: I’m a RegEx Weenie

Bob,

The Visibone cards are awesome. Highly recommended. For RegEx it's not the best but the Javascript, CSS and HTML DOM parts are great. I've been able to relegate several books to the bookshelf. Very cool... and it's small enough to take on the road with you. I do recommend you get the booklet format though - the booklet's dense as hell. I can't imagine what the cards look like.

Rick Strahl
September 19, 2005

# re: I’m a RegEx Weenie

Just as an update, I ended up doing the HTML parsing in code, and while it's a bit of code it's much more flexible. For one, I can deal much better with control nesting than before, and with all of the quirks of embedded expressions inside of expressions (like <%= %> tags inside of server tags).

Thanks for all the comments. The parser is looking good for my needs at this point.

Andrew MacNeill - AKSEL Solutions
October 13, 2006

# Andrew MacNeill - AKSEL Solutions: August 2005

Garrett Fitzgerald
October 13, 2006

# re: I’m a RegEx Weenie

There are lots of helpful people over at http://regexadvice.com/, next time you need to figure out something like this. :-)

Andrew MacNeill - AKSEL Solutions
November 08, 2006

# Andrew MacNeill - AKSEL Solutions: Rick Strahl: I'm a RegEx Weenie