No really. RegEx is nice when you can use it, and even nicer if you can find a ready-made expression for your needs. Which, of course, will promptly turn out to fall just short of doing exactly what you really need, at which point you're flung into parsing hell trying to figure out a 100 character string of gobbledygook.
I understand the power of RegEx, but heck, I dread the times when I need to use it to accomplish some text parsing. I’m not cut out for cryptic code like RegEx, Perl or XSLT for that matter. I think my eyes constantly want to do a Word Jumble and rearrange things into real words. I like big, nice, drawn-out logical structures. But RegEx of course does things that are difficult to accomplish in other ways – at least without reams of code – even if it probably is no faster to arrive at the solution <g>. Those 10 characters of mystic significance seem to take just as long to write as the 200 lines of parsing code they might replace.
Anyway, today I’m in the process of building a page parser for Web Connection that is capable of parsing an ASPX-like template page into source code.
Maybe some of you that are a little more RegEx savvy can help me out here. I’m working on some parsing code that essentially parses a script template similar to an ASPX page. What I need to do initially is retrieve all tags in the HTML document.
The following expression works well for this:
<(.|\s)+?>
This returns a set of matches, one for each of the tags in the document, including multi-line tags, which is great for the parser I'm building. I can basically run through the list, parse the tags, extract the text and replace it with the generated output.
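To make that concrete, here is a minimal sketch of the expression at work. It uses Python's re module purely for illustration (the actual parser is not written in Python), and the sample markup is made up for the example.

```python
import re

# The tag-matching expression from the post. The lazy +? quantifier stops at
# the first '>' it can reach, and the \s alternative lets a match span lines.
TAG_PATTERN = re.compile(r"<(.|\s)+?>")

# Hypothetical sample markup, including one tag that spans two lines.
sample = """<html>
<body>
<ww:TextBox id="value"
            runat="server" />
</body>
</html>"""

# Use finditer and group(0) to get the full text of each match; findall would
# only return the contents of the capture group (the last character matched).
tags = [m.group(0) for m in TAG_PATTERN.finditer(sample)]

for tag in tags:
    print(repr(tag))
```

Each entry in tags is the complete text of one tag, so the multi-line ww:TextBox tag comes back as a single match.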
But – and there’s always a ‘but’ isn’t there – I run into problems if there’s HTML embedded inside one of the attributes.
<% Page
id="test"
value=""
%>
<html>
<body>
<ww:TextBox id="value" runat="server" text="<b>this is good</b>" custom="test" />
The problem is that the match stops at the first > it finds, so the above expression treats the > of the <b> inside the attribute string as the end of the tag. So I basically need to ignore any > characters that are contained inside of string delimiters.
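The failure is easy to reproduce in isolation. The sketch below again uses Python's re module just for illustration and runs the same expression against the ww:TextBox line from the sample above.

```python
import re

# Same expression as before.
TAG_PATTERN = re.compile(r"<(.|\s)+?>")

# The problem line from the sample: the text attribute contains HTML.
line = '<ww:TextBox id="value" runat="server" text="<b>this is good</b>" custom="test" />'

matches = [m.group(0) for m in TAG_PATTERN.finditer(line)]

for match in matches:
    print(repr(match))
# prints:
# '<ww:TextBox id="value" runat="server" text="<b>'
# '</b>'
```

The first match is cut off at the > that closes the embedded <b>, the </b> is picked up as a tag of its own, and the rest of the real tag (custom="test" />) is never matched at all.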
I don’t even know where to start on how to do this… Any hints?
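One direction that might be worth trying, offered as a sketch rather than a tested answer: instead of accepting any character up to the first >, treat a quoted string as a single unit, so that a > inside quotes can never end the match. The pattern and its name below are assumptions for illustration, shown with Python's re module, and checked only against the sample from this post.

```python
import re

# Hypothetical quote-aware variant. Inside a tag, accept either:
#   - any single character that is not '>' or a quote, or
#   - a complete double-quoted string, or
#   - a complete single-quoted string.
# Negated character classes match newlines too, so multi-line tags still work.
QUOTE_AWARE_TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')+>""")

# The sample from the post.
sample = '''<% Page
id="test"
value=""
%>
<html>
<body>
<ww:TextBox id="value" runat="server" text="<b>this is good</b>" custom="test" />'''

tags = [m.group(0) for m in QUOTE_AWARE_TAG.finditer(sample)]

for tag in tags:
    print(repr(tag))
```

With this sample the ww:TextBox tag comes back whole, embedded <b> and all. It is not a general solution: it assumes quotes inside a tag are always balanced, so a script block containing a bare > comparison or a lone apostrophe would still trip it up.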
Incidentally, I’m glad for Roy Osherove’s Regulator, which has been an immense help with the handful of expressions I had to build today. It’s a great tool for checking out RegEx expressions, with built-in Intellisense for most expression characters and structures, which can be a great help if you’re like me and very slowly poking away at expressions.
Helpful as that is, I probably should look into a better reference. I’ve been using my dog-eared copy of C# Text Manipulation Handbook, but the RegEx portion of this book is not very strong. Any suggestions?