I’ve said it before – Regex is not one of my strengths, and although I use Regex expressions quite frequently in code, I’m in fear (literally) of using longer ones, fully expecting to look at my own Regex code and not remember what it does 10 minutes later :-}. OK, not quite so bad, but not completely off the mark.
Today I had what is a fairly simple problem – I needed to match a JSON string and check for invalid quote characters in the string. For example:
JSON String: \"that has legal nested quotes\" and "illegal nested quotes" embedded in it
(leading and ending JSON quote marks are trimmed out prior to matching)
Basically what I needed to do is match the quotes around the illegal string to determine if the string is invalid JSON. This is somewhat tricky because the rule is to find double-quote characters that are NOT preceded by a backslash.
The solution to this is quite easy – once you know about a feature called Lookbehind, which lets you require (or forbid) some text immediately before the match position without including that text in the match itself. Lookbehind uses (?<=...) for a positive check and (?<!...) for a negative check, with the literal text or Regex expression to look for placed inside the parentheses. The negative Lookbehind matches only if the lookbehind expression is NOT found immediately before the match position. Exactly what I need above for my JSON string – I need to match all quotes that are NOT preceded by a backslash.
Using Lookbehind the solution to matching the illegal quotes is as simple as:
(?<!\\)"
Ah the beauty of terseness. Nice and self-describing, n’est-ce pas? NOT. Here’s a more visual view in RegexBuddy which is my preferred tool for Regex testing:
As you can see, the legal quotes preceded by backslashes are not matched, which is as it should be.
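To make the difference between the two Lookbehind forms concrete, here is a minimal sketch that runs both against the sample string from above. The class and field names (LookbehindDemo, UnescapedQuote, EscapedQuote) are made up for illustration and are not part of the parser shown below.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Negative Lookbehind: a double quote NOT preceded by a backslash
    public static readonly Regex UnescapedQuote = new Regex(@"(?<!\\)""");

    // Positive Lookbehind: a double quote that IS preceded by a backslash
    public static readonly Regex EscapedQuote = new Regex(@"(?<=\\)""");

    public static void Main()
    {
        // The sample string, with the outer JSON quotes already trimmed off
        string json = @"JSON String: \""that has legal nested quotes\"" and ""illegal nested quotes"" embedded in it";

        // prints "Unescaped quotes: 2"
        Console.WriteLine("Unescaped quotes: " + UnescapedQuote.Matches(json).Count);

        // prints "Escaped quotes: 2"
        Console.WriteLine("Escaped quotes: " + EscapedQuote.Matches(json).Count);

        // The backslash is only checked, never consumed: each match is just the quote.
        // prints "Match length: 1"
        Console.WriteLine("Match length: " + EscapedQuote.Match(json).Value.Length);
    }
}
```

Either way the lookbehind text itself never becomes part of the match, which is why a plain IsMatch() call is enough to flag the invalid string.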
In my .NET code that parses a JSON string (as part of a full JSON parser) here’s the routine that uses the Regex expression:
private static Regex FindUnquotedStringRegEx = new Regex(@"(?<!\\)""");

/// <summary>
/// Parses a JSON string into a string value
/// </summary>
/// <param name="value"></param>
/// <returns></returns>
public string ParseString(string value)
{
    // actual value of null is not valid for a string
    if (value == null)
        throw new ArgumentException(Resources.ERROR_INVALID_JSON_STRING);

    // null as a string is a valid value for a string
    if (value == "null")
        return null;

    // Has to be at least 2 chars long and bracketed in quotes
    if (value.Length < 2 || (!value.StartsWith("\"") || !value.EndsWith("\"")))
        throw new ArgumentException(Resources.ERROR_INVALID_JSON_STRING);

    if (value == "\"\"")
        return string.Empty;

    // strip off leading and trailing quote chars
    value = value.Substring(1, value.Length - 2);

    // Check for quotes NOT preceded by a backslash - invalid
    if (FindUnquotedStringRegEx.IsMatch(value))
        throw new ArgumentException(Resources.ERROR_INVALID_JSON_STRING);

    // Escape the double escape characters in json ('real' backslash) temporarily to alternate chars
    const string ESCAPE_ESCAPECHARS = @"^#^#";
    value = value.Replace(@"\\", ESCAPE_ESCAPECHARS);

    value = value.Replace(@"\r", "\r");
    value = value.Replace(@"\n", "\n");
    value = value.Replace(@"\""", "\"");
    value = value.Replace(@"\t", "\t");
    value = value.Replace(@"\b", "\b");
    value = value.Replace(@"\f", "\f");

    if (value.Contains("\\u"))
        value = Regex.Replace(value, @"\\u....",
                              new MatchEvaluator(this.UnicodeEscapeMatchEvaluator));

    // Convert escaped characters back to the actual backslash char
    value = value.Replace(ESCAPE_ESCAPECHARS, "\\");

    return value;
}
And it works.
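One edge case worth keeping in mind: the simple expression treats any quote that follows a backslash as legal, so an escaped backslash followed by a bare quote (\\") slips through, even though that quote is not really escaped. The sketch below shows one possible way to tighten this up by also consuming pairs of backslashes before the quote. The Strict pattern and the QuoteCheck class are my own illustration, not something the parser above uses.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteCheck
{
    // The expression from this post: a quote not preceded by a backslash
    public static readonly Regex Simple = new Regex(@"(?<!\\)""");

    // Hypothetical stricter variant: zero or more backslash PAIRS, not
    // themselves preceded by a backslash, followed by a quote. That means
    // the quote has an even number of backslashes in front of it.
    public static readonly Regex Strict = new Regex(@"(?<!\\)(?:\\\\)*""");

    public static void Main()
    {
        string escapedQuote = @"a\""b";            // a\"b  - quote is escaped
        string escapedSlashThenQuote = @"a\\""b";  // a\\"b - backslash is escaped, quote is not

        // prints "False" - both agree the escaped quote is legal
        Console.WriteLine(Simple.IsMatch(escapedQuote));
        // prints "False"
        Console.WriteLine(Strict.IsMatch(escapedQuote));

        // prints "False" - the simple expression misses the bare quote
        Console.WriteLine(Simple.IsMatch(escapedSlashThenQuote));
        // prints "True" - the stricter one flags it
        Console.WriteLine(Strict.IsMatch(escapedSlashThenQuote));
    }
}
```

Whether this matters depends on the input you expect. For the strings I was dealing with the simple expression did the job.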
I’ve never really used Lookbehind before (gulp), and when I popped out a question on Twitter earlier today regarding the matches I was looking for, several folks were very helpful in pointing me in the right direction. Thanks to lumbarius who got the expression and a link to find out more.
I dug out my trusty copy of O’Reilly’s Mastering Regular Expressions today and did a little more reading on the topic.
Mastering Regular Expressions
by Jeffrey Friedl
O'Reilly Media, Inc. (August 8, 2006)
I’ve certainly read through that section on Lookahead and Lookbehind before but at the time it really didn’t mean much and certainly didn’t sink in, but now with a little context behind it I think… I think… I can maybe retain this for more than 10 minutes. And if not I have this blog post to help me remember. If I can remember the blog post – Arrgh… where does it stop?