Rick Strahl's Weblog  

Wind, waves, code and everything in between...

Lookbehind in Regex searches



I’ve said it before – Regex is not one of my strengths, and although I use Regex expressions quite frequently in code, I’m in fear (literally) of using longer ones, fully expecting to look at my own Regex code and not remember what it does 10 minutes later :-}. Ok, not quite so bad, but not completely off the mark.

Today I had what is a fairly simple problem – I needed to match a JSON string and check for invalid quote characters in the string. For example:

JSON String: \"that has legal nested quotes\" and "illegal nested quotes" embedded in it

(leading and ending JSON quote marks are trimmed out prior to matching)

Basically what I needed to do is match the quotes around the illegal string to determine if the string is invalid JSON. This is somewhat tricky because the rule is to find double-quote characters that are NOT preceded by a backslash.

The solution to this is quite easy – once you know about a feature called Lookbehind – which lets you require that some text does or does not appear immediately before the match, without including that text in the match itself. Lookbehind is written as (?<=...) for the positive form or (?<!...) for the negative form, with the literal text or Regex expression to check for inside the parentheses. The negative Lookbehind matches only if the lookbehind expression is NOT found. Exactly what I need above for my JSON string – I need to match all quotes that are NOT preceded by a backslash.

Using Lookbehind the solution to matching the illegal quotes is as simple as:

(?<!\\)"

Ah the beauty of terseness. Nice and self-describing, n’est-ce pas? NOT. Here’s a more visual view in RegexBuddy which is my preferred tool for Regex testing:

[Image: RegexBuddy]

As you can see, the legal quotes preceded by backslashes are not matched, which is as it should be.
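To make the match concrete, here is a minimal C# sketch that runs the negative lookbehind, along with its positive counterpart, against the sample string from above. The class and method names (LookbehindDemo, CountUnescapedQuotes, CountEscapedQuotes) are made up for illustration and are not part of the parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Negative lookbehind: a double quote NOT preceded by a backslash
    private static readonly Regex UnescapedQuote = new Regex(@"(?<!\\)""");

    // Positive lookbehind: a double quote that IS preceded by a backslash
    private static readonly Regex EscapedQuote = new Regex(@"(?<=\\)""");

    // The sample from the post, outer JSON quotes already trimmed
    public const string Sample =
        @"JSON String: \""that has legal nested quotes\"" and ""illegal nested quotes"" embedded in it";

    public static int CountUnescapedQuotes(string input)
    {
        return UnescapedQuote.Matches(input).Count;
    }

    public static int CountEscapedQuotes(string input)
    {
        return EscapedQuote.Matches(input).Count;
    }
}
```

Against the sample, CountUnescapedQuotes finds the two quotes around "illegal nested quotes" and CountEscapedQuotes finds the two legal ones. In both cases the match itself is only the quote character; the backslash is checked but never becomes part of the match.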

In my .NET code that parses a JSON string (as part of a full JSON parser) here’s the routine that uses the Regex expression:

private static Regex FindUnquotedStringRegEx = new Regex(@"(?<!\\)""");

/// <summary>
/// Parses a JSON string into a string value
/// </summary>
/// <param name="value"></param>
/// <returns></returns>
public string ParseString(string value)
{                                    
    // an actual null value is not valid for a string
    if (value == null)
        throw new ArgumentException(Resources.ERROR_INVALID_JSON_STRING);

    // null as a string is a valid value for a string
    if (value == "null")
        return null; 
    
    // Has to be at least 2 chars long and bracketed in quotes
    if (value.Length < 2 || (!value.StartsWith("\"") || !value.EndsWith("\"")))
        throw new ArgumentException(Resources.ERROR_INVALID_JSON_STRING);
    
    if (value == "\"\"")
        return string.Empty;
    
    // strip off leading and trailing quote chars
    value = value.Substring(1, value.Length - 2);
    
    // Check for quotes NOT preceded by a backslash - invalid
    if (FindUnquotedStringRegEx.IsMatch(value))
        throw new ArgumentException(Resources.ERROR_INVALID_JSON_STRING);

    // Escape the double escape characters in json ('real' backslash)  temporarily to alternate chars
    const string ESCAPE_ESCAPECHARS = @"^#^#";

    value = value.Replace(@"\\", ESCAPE_ESCAPECHARS);

    value = value.Replace(@"\r", "\r");
    value = value.Replace(@"\n", "\n");
    value = value.Replace(@"\""", "\"");            
    value = value.Replace(@"\t", "\t");
    value = value.Replace(@"\b", "\b");
    value = value.Replace(@"\f", "\f");

    if (value.Contains("\\u"))
        value = Regex.Replace(value, @"\\u....",
                              new MatchEvaluator(this.UnicodeEscapeMatchEvaluator));

    // Convert escaped characters back to the actual backslash char 
    value = value.Replace(ESCAPE_ESCAPECHARS, "\\");

    return value;
}

And it works.
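One edge case worth keeping in mind: the pattern only looks at the single character before the quote, so a JSON-escaped backslash followed by a bare quote (\\") passes the check even though that quote is not escaped. The following is a sketch of one possible refinement, not part of the routine above, and the name StrictQuoteCheck is hypothetical. It treats a quote as unescaped when it is preceded by an even number of backslashes (including none).

```csharp
using System;
using System.Text.RegularExpressions;

public static class StrictQuoteCheck
{
    // Start at a position that is not preceded by a backslash, skip any number
    // of complete \\ pairs, then require a quote. That only succeeds when the
    // quote has an even number of backslashes (possibly zero) in front of it.
    private static readonly Regex UnescapedQuote = new Regex(@"(?<!\\)(?:\\\\)*""");

    public static bool HasUnescapedQuote(string value)
    {
        return UnescapedQuote.IsMatch(value);
    }
}
```

With this version a\"b is still accepted, while a\\"b is flagged because the two backslashes escape each other and leave the quote bare.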

I’ve never really used Lookbehind before (gulp) and when I popped out a question regarding the matches I was looking for earlier today on Twitter several folks were very helpful in pointing me in the right direction. Thanks to lumbarius who provided the expression and a link to find out more.

I dug out my trusty copy of O’Reilly’s Mastering Regular Expressions today and did a little more reading on the topic.

Mastering Regular Expressions
by Jeffrey Friedl
O'Reilly Media, Inc. (August 8, 2006)


I’ve certainly read through that section on Lookahead and Lookbehind before but at the time it really didn’t mean much and certainly didn’t sink in, but now with a little context behind it I think… I think… I can maybe retain this for more than 10 minutes. And if not I have this blog post to help me remember. If I can remember the blog post – Arrgh… where does it stop?

Posted in CSharp  RegEx  

The Voices of Reason


 

The Luddite Developer
October 03, 2009

# re: Lookbehind in Regex searches

I totally agree with you about the terseness of regex.

Code should be people friendly, that is why we have programming languages and compilers.

Regex and vi are not, and never were, people friendly.

To many, C# also falls a little short in terms of being people friendly, with its use of shortcuts and various combinations of <>[]{}()%'"#^&¦¦==; although C# programmers will undoubtedly defend the language vigorously for its power and productivity.

VB also has its faults, but reading C# is like reading without glasses (if you need to wear glasses you will know what I mean) whereas reading VB everything appears to be much more in focus (but as you will have guessed I am not a C or a C# person).

I mention this because Microsoft are struggling to win the hearts and minds of many programmers, both experienced and new, to their new technologies. I think that MS would have more success if they promoted their new technologies using VB.

Would the C# community be offended or honoured if we referred to them as the 'alpha geeks'?

MS technology evangelists (and these alphas) are very smart, work very hard, are tenacious in finding out how things work and produce great work. That is why they get paid the big bucks. However, they are a minority.

Steve Smith
October 03, 2009

# re: Lookbehind in Regex searches

I gave up on trying to remember the regex syntax since I only used it infrequently and had to re-learn it each time. I just use http://RegExLib.com for my regular expressions 90% of the time now. If you need to match a particular pattern, odds are someone else has already figured out how to do it and listed it in the library (or one close enough that you can modify it).

Rick Strahl
October 03, 2009

# re: Lookbehind in Regex searches

@Luddite - I think it's all a matter of preference. I absolutely HATE looking at VB code and to me C# flows just fine. There are a few constructs (like the immediate if ?: syntax for example) that are a bit terse but overall I think the language hits the right notes between readability and verbosity. I also think that it's a better fit than VB for .NET simply because VB carries around legacy baggage that is better handled by the framework.

But all that aside - it's a preference. Mine to be specific. I'm certainly not going to say that VB is an inferior language (or vice versa). It's not, but some people will be more comfortable with one or the other and that's great. Choices are good.

I also think it's quite a difference in terms of readability between any language and Regex. Regex is never clean and readable IMHO, but it does do some things much better than any other technology out there. Personally I think it'd be nice if there was some sort of fluent interface language that accomplishes what Regex does but with a more verbose and maintainable API front end. Ah - one can dream I guess. :-}
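Short of a fluent API, .NET's RegexOptions.IgnorePatternWhitespace option at least lets a pattern carry its own comments. A small sketch of the lookbehind expression written that way (the class name CommentedRegex is hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentedRegex
{
    // With IgnorePatternWhitespace, unescaped whitespace in the pattern is
    // ignored and everything from # to the end of the line is a comment.
    public static readonly Regex UnescapedQuote = new Regex(@"
        (?<!\\)   # only match if the previous character is NOT a backslash
        ""        # the double quote itself
        ", RegexOptions.IgnorePatternWhitespace);
}
```

It behaves the same as the one-line version; the only difference is that the explanation lives next to each piece of the expression.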

Ben Amada
October 03, 2009

# re: Lookbehind in Regex searches

Only partially tested, but I think you can use a non-lookbehind regex too:

[^\\]"

Rick Strahl
October 04, 2009

# re: Lookbehind in Regex searches

Thanks Ben, (removed previous discussion since it was confusing). Yes that works too except it'll miss a quote if it's the first character in the string.

This looks like it'll work though:
(^")|[^\\]"
Thanks. Like I said before - my Regex can use a little help :-}.
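For comparison, here is a small C# sketch that puts the three expressions from this thread side by side. The class name QuotePatternComparison and the Count helper are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotePatternComparison
{
    // The lookbehind version from the post
    public static readonly Regex Lookbehind = new Regex(@"(?<!\\)""");

    // Ben's suggestion: any character except a backslash, then a quote
    public static readonly Regex CharClass = new Regex(@"[^\\]""");

    // The follow-up: also accept a quote at the very start of the string
    public static readonly Regex Alternation = new Regex(@"(^"")|[^\\]""");

    public static int Count(Regex regex, string input)
    {
        return regex.Matches(input).Count;
    }
}
```

For a string that starts with a quote, CharClass.IsMatch is false while Lookbehind and Alternation both report a match. The character class versions also consume the character in front of the quote, so for a""b they find one match where the lookbehind finds two. For a plain yes/no check like the one in the parser, the alternation and the lookbehind should agree.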

Dennis Bailey
October 05, 2009

# re: Lookbehind in Regex searches

@Rick,

Although not specifically related to REGEX, I'm glad to see I'm not the only one that is in fear of these things. Although I come from a background of HP/UX, RS6000's and AS/400's, it always seems that things like this are still daunting. You should read the grep statements that we use on a regular basis; they would make regex look simple. Lately I have come to love the KISS principle and that applies here. Why make something so complicated that you can't read it or understand it later? An old programmer who taught me COBOL, RPG, and C once said - the fancier you get, the harder it is for the next person who has to work on the program to understand what you have done. It is a lesson to be learned.

@Luddite - I did not understand your statement "Microsoft is struggling to win the hearts and minds of many programmers" (That's not meant to offend). I don't think that Microsoft has anything to do with it. As an Instructor at a college, students ask my opinion on where they should begin in regards to programming. I always point them to learning logic, then move on to basic and go from there. If you understand logic and have a good grasp of programming in general the language used becomes a non-issue. As a consultant, I deal with programmers from every walk of life, and we don't ever discuss things in terms of Microsoft. For my more 'alpha geek' cohorts we always have a laugh because we remember the time when Borland's compiler produced much tighter construct. Microsoft's C/C++ compiler creates bloat and continues to create bloat; although they have gotten better, it continues to be a struggle. What I like, and this is my opinion, about .NET is (as Rick correctly states) that you have a choice. With the Java framework we have only C++ style construct, with .NET you can choose your language and no matter what you choose, it is all interoperable. I can write part of the program in VB.NET and some in C# but they both work together fine. Which gets to my final point, Interoperability. It is the key and that is what is important especially when dealing with existing code. Like Rick dealing with FOXPRO, I maintain code in PHP, VB 6, C, C++, Java, and C#... After long deliberations with the 'Alpha Geeks', who wanted JAVA, we all came to the mutual consensus that there was no way for us to convert all of this to .NET or JAVA. We chose .NET as our platform since all of us could read both VB.NET and C# code easily. Throw in the fact that we now have WCF we are able to leave the front end in its original language and write the backend in .NET ... It's a compromise by far, but what we have found is that development time has decreased and that is what was important.
When thinking in terms of interoperability we no longer care about the language. That leaves us time to work on performance. But this is all my opinion, rambling, and observation..

The Luddite Developer
October 05, 2009

# re: Lookbehind in Regex searches

@Rick and Dennis

Both of you state that it really is a matter of choice and personal preference, and with that I do agree. In fact there really is not too much to separate VB and C# in terms of what can be achieved with each.

What I see at the moment is that the choice is being taken away. Many Microsoft products and tutorials only show examples in C#. This is forcing VB programmers to change their preference (no longer do they have a choice). The manual and all samples for Expression Blend 3 are shown in C# only. Most books (especially for the new technologies like WPF and Silverlight) are now C# only. Where is the choice?

I am quite comfortable with using multiple languages having started with Assembly Language and proceeded through Algol, Fortran, Cobol, Basic, Pascal, OPL, PL/SQL, SQL, VB, VB.NET, PHP, Javascript, C, C++ and C#. Microsoft did promise us that we would have a choice, but if they were being really honest they (and I am sure that both of you) would actually tell new users that if you want to learn just one then learn C#. That was NOT the case when .NET was first released. Microsoft encouraged VB programmers telling them that VB would be fully supported. Currently VB has the half hearted support of Microsoft and the .NET Community.

OK that is my little rant and nothing to do with regex.

@Rick thanks for the tip about RegExBuddy.

Peter Coles
March 05, 2010

# re: Lookbehind in Regex searches

Nice use case of a lookbehind, it tends to be a subject that people gloss over when learning about regex’s. If you do forget the syntax, at least you now know the concept and how to use it :)

West Wind  © Rick Strahl, West Wind Technologies, 2005 - 2025