Some thoughts on Xml parsing with XPathNavigator

March 01, 2007 • from Maui, Hawaii • 2 comments


I’ve been working on a project that’s required a good amount of XML parsing recently. This is nothing new: dealing with data in XML one way or another has been a way of life for me for close to 10 years now. .NET makes working with XML pretty painless. In recent years, however, I haven’t looked much at low-level XML at all, because the XML is usually wrapped up inside higher-level frameworks. Web Services or XSD-generated schema wrapper classes make it all too easy to forget the angle brackets underneath. In fact, I've kind of gotten out of practice when it comes to dealing with schema referencing and manually fiddling with the more advanced XML techniques.

One of the projects I’m working on parses OFX financial investment data as part of a larger system that uses this data as base input. As it turned out, the OFX schema doesn’t lend itself to being shoehorned into something you can use in .NET at a higher level of abstraction. Schema imports create data structures that are horrible and in effect less navigable (<s>) than the raw XML file. The other solutions I looked at also fell short of the desired result, so in the end I decided that we’d have to parse this data on our own. It seems crazy that there's no decent canned solution for OFX data around.

Anyway, it’s been a while since I've mucked with raw XML, and I was happy to find that one of the improvements in .NET 2.0 is a much enhanced XPathNavigator, which provides a powerful and easy way to navigate an XmlDocument through code.

XPath of course is very powerful, and I’ve been using it happily for many years in combination with XmlDom even in pre-.NET days, but in .NET 2.0 the updated XPathNavigator adds a bunch of extremely nice features to the XPathability of your code.

I don’t want to get into all of the details of XPathNavigator. You can find a few articles online that cover it in some detail, and Brian Noyes' article is a good introduction.

When you’re using the XmlDocument class in .NET, behind the scenes it actually uses XPathNavigator for any XPath navigation with SelectNodes and SelectSingleNode operations. The XPathNavigator is essentially an abstraction that exposes a high-level interface to parse through the document. It’s like a node pointer into the XML document that moves every time you navigate.

This concept takes a little getting used to. As you’ll quickly realize, a forward-only pointer can be a problem if you’re reading data repeatedly from a single base node. It's easily fixed, but a little weird if you haven't used the XPathNavigator before. XPathNavigators are basically sophisticated node pointers, sort of like a row pointer in a database, and although it may feel a little unnatural it’s OK to create or clone navigators so you can navigate each of them individually, or more commonly keep one navigator fixed at a given node while you navigate the other.

Here's a simple example:

<OFX>
    <SIGNONMSGSRSV1>
        <SONRS>
            <STATUS>
                <CODE>0</CODE>
                <MESSAGE>Success</MESSAGE>
            </STATUS>
            <FI>
                <ORG>amerivestinc.com</ORG>
            </FI>
        </SONRS>
    </SIGNONMSGSRSV1>
</OFX>

Take this fairly simple XML and notice how easily you can pull data out of it with an XPathNavigator.

XmlDocument Dom = new XmlDocument();
Dom.Load(Server.MapPath("FiResponse.xml"));

XPathNavigator nav = Dom.CreateNavigator();

if (nav.MoveToFollowing("CODE", ""))
    Response.Write(nav.ValueAsInt);

if (nav.MoveToNext("MESSAGE", ""))
    Response.Write(nav.Value);

This is a pretty painless way to get through a document and even easier than using SelectNodes or SelectSingleNode on the DOM directly.
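The snippet above assumes it runs inside an ASP.NET page (Server.MapPath, Response.Write). As a rough sketch of the same walk outside of a page, here is a small console version. The class name StatusReader and the inlined XML string are my own stand-ins for illustration, not part of the project code.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class StatusReader
{
    // Sample response inlined so the sketch runs without a file on disk.
    const string Xml =
        "<OFX><SIGNONMSGSRSV1><SONRS>" +
        "<STATUS><CODE>0</CODE><MESSAGE>Success</MESSAGE></STATUS>" +
        "<FI><ORG>amerivestinc.com</ORG></FI>" +
        "</SONRS></SIGNONMSGSRSV1></OFX>";

    public static string Read()
    {
        XmlDocument dom = new XmlDocument();
        dom.LoadXml(Xml);

        XPathNavigator nav = dom.CreateNavigator();
        string result = "";

        // Jump forward to the first CODE element, wherever it is nested.
        if (nav.MoveToFollowing("CODE", ""))
            result += nav.ValueAsInt;

        // MESSAGE is the next sibling of CODE, so MoveToNext is enough.
        if (nav.MoveToNext("MESSAGE", ""))
            result += " " + nav.Value;

        return result;
    }

    public static void Main()
    {
        Console.WriteLine(Read());
    }
}
```

Note that the second call only works because MESSAGE happens to be a later sibling of CODE in this document.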

Note that many of the MoveToXXX functions require a namespaceUri parameter, which seems a bit odd. Rather than allowing use of namespace prefixes, you have to pass the full namespaceUri as a parameter. If I change the above XML to:

<OFX xmlns:ww="...">
    <SIGNONMSGSRSV1>
        <SONRS>
            <STATUS>
                <ww:CODE>0</ww:CODE>
                <ww:MESSAGE>Success</ww:MESSAGE>
            </STATUS>
            <FI>
                <ORG>amerivestinc.com</ORG>
            </FI>
        </SONRS>
    </SIGNONMSGSRSV1>
</OFX>

The code needs to change a bit to:

XmlDocument Dom = new XmlDocument();
Dom.Load(Server.MapPath("Fi.xml"));

// *** Pick up the namespace URI that the ww: prefix maps to
string ns = Dom.DocumentElement.Attributes["xmlns:ww"].Value;
Response.Write(ns);

XPathNavigator nav = Dom.CreateNavigator();

if (nav.MoveToFollowing("CODE", ns))
    Response.Write(nav.ValueAsInt);

if (nav.MoveToFollowing("MESSAGE", ns))
    Response.Write(nav.Value);

Note that when referencing the element names the ww: prefix is NOT used – the explicit namespaceUri serves that purpose. If you have a complex schema keeping track of the namespaces will become a little more tricky. If you need to pick up the default namespace in a document this works well:

string ns = Dom.DocumentElement.NamespaceURI;

The namespace parameter cannot be null. Pass "" (an empty string) when the document has no default namespace.
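To make the effect of the namespace parameter concrete, here is a minimal sketch, assuming a document with a default namespace. The URI "urn:example:ofx" and the class name NamespaceDemo are made up for illustration. The same element is found when the URI is passed and missed when "" is passed.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class NamespaceDemo
{
    // "urn:example:ofx" is a made-up namespace URI for illustration only.
    const string Xml =
        "<OFX xmlns=\"urn:example:ofx\"><STATUS>" +
        "<CODE>0</CODE><MESSAGE>Success</MESSAGE>" +
        "</STATUS></OFX>";

    // Returns the MESSAGE text, or null if the element was not found.
    public static string ReadMessage(bool useNamespace)
    {
        XmlDocument dom = new XmlDocument();
        dom.LoadXml(Xml);

        // Default namespace of the root element ("" if there is none).
        string ns = useNamespace ? dom.DocumentElement.NamespaceURI : "";

        XPathNavigator nav = dom.CreateNavigator();
        return nav.MoveToFollowing("MESSAGE", ns) ? nav.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ReadMessage(true) ?? "(not found)");
        Console.WriteLine(ReadMessage(false) ?? "(not found)");
    }
}
```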

Forward Only

As I mentioned, the forward-only operation of the navigator takes a little getting used to. If you need to get node content in a document where nodes can appear in any order, walking through nodes with MoveToFollowing() or MoveToNext() may not get you the expected results. For example, in the XML above, if I switch around the two calls like this:

if (nav.MoveToFollowing("MESSAGE", ns))
    Response.Write(nav.Value);

if (nav.MoveToNext("CODE", ns))
    Response.Write(nav.ValueAsInt);

This finds only MESSAGE but misses CODE, because the navigator has already walked past the CODE node.

To avoid this sort of thing you need to make a copy of the navigator. A better way to code this might be:

XPathNavigator nav = Dom.CreateNavigator();
nav.MoveToFollowing("STATUS", ns);

XPathNavigator TNav = nav.Clone();

if (TNav.MoveToChild("MESSAGE", ns))
    Response.Write(TNav.Value);

TNav.MoveToParent();

if (TNav.MoveToChild("CODE", ns))
    Response.Write(TNav.ValueAsInt.ToString());

This way you basically get out-of-sequence access to the child nodes. If you need to get back to a node further up the tree, which is common if you use MoveToFollowing(), you might want to clone the original navigator again instead of calling MoveToParent().
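Here is a small self-contained sketch of both behaviors, assuming a stripped-down STATUS fragment without namespaces. The class name CloneDemo and its method names are mine, for illustration only. The first method shows the miss when the nodes are read in the wrong order. The second uses one clone per read so the base navigator never moves.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class CloneDemo
{
    const string Xml =
        "<STATUS><CODE>0</CODE><MESSAGE>Success</MESSAGE></STATUS>";

    // Moves one navigator to MESSAGE first, then looks for CODE.
    // CODE comes earlier in the document, so the second search fails.
    public static bool FindsCodeAfterMessage()
    {
        XmlDocument dom = new XmlDocument();
        dom.LoadXml(Xml);

        XPathNavigator nav = dom.CreateNavigator();
        nav.MoveToFollowing("MESSAGE", "");
        return nav.MoveToFollowing("CODE", "");
    }

    // Keeps 'status' fixed on STATUS and gives each read its own clone.
    public static string ReadOutOfOrder()
    {
        XmlDocument dom = new XmlDocument();
        dom.LoadXml(Xml);

        XPathNavigator status = dom.CreateNavigator();
        status.MoveToFollowing("STATUS", "");

        XPathNavigator msg = status.Clone();
        XPathNavigator code = status.Clone();

        string result = "";
        if (msg.MoveToChild("MESSAGE", "")) result += msg.Value;
        if (code.MoveToChild("CODE", "")) result += ":" + code.ValueAsInt;

        // 'status' is still positioned on the STATUS element.
        return result + " @" + status.Name;
    }

    public static void Main()
    {
        Console.WriteLine(FindsCodeAfterMessage());
        Console.WriteLine(ReadOutOfOrder());
    }
}
```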

MoveToFollowing and End ranges

As useful as the MoveToFollowing function is, you have to be careful with it, because by default it’ll run until the end of the document to find a matching node. If you have a document where subtrees contain varying sets of child nodes, it can easily happen that a node is missing from a particular subtree. In that case MoveToFollowing won’t find a match there and will happily go on to the next subtree to find one. That can, uhm, result in some interesting side effects <s>.

To avoid this you can pass an end navigator, which specifies a node at which the search ends. This is useful, but not as easy as it sounds, because in many cases you’ll want to limit the search to a given subtree. So the limiting navigator would be the next sibling node. Well, in some cases there’s no sibling, so it’d be the parent’s next sibling, and if there’s none of those, then that parent’s parent’s next sibling, and so on...
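The boundary search described above can be sketched roughly like this. The LIST/ITEM/TICKER document and the names EndRangeDemo, FindEnd and FindTicker are hypothetical and only serve the illustration. The first ITEM has no TICKER, so the unbounded search picks up the one from the second ITEM, while the bounded search stops at the boundary and reports nothing.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class EndRangeDemo
{
    // Two sibling subtrees; only the second one has a TICKER element.
    const string Xml =
        "<LIST>" +
        "<ITEM><NAME>First</NAME></ITEM>" +
        "<ITEM><NAME>Second</NAME><TICKER>ABC</TICKER></ITEM>" +
        "</LIST>";

    // Returns a navigator on the first node after nav's subtree: the next
    // sibling of nav, or of its nearest ancestor that has one.
    // Returns null when nav's subtree runs to the end of the document.
    public static XPathNavigator FindEnd(XPathNavigator nav)
    {
        XPathNavigator end = nav.Clone();
        while (!end.MoveToNext())
        {
            if (!end.MoveToParent())
                return null;
        }
        return end;
    }

    // Searches for TICKER starting at the first ITEM.
    public static string FindTicker(bool bounded)
    {
        XmlDocument dom = new XmlDocument();
        dom.LoadXml(Xml);

        XPathNavigator first = dom.CreateNavigator();
        first.MoveToFollowing("ITEM", "");   // now on the first ITEM

        XPathNavigator search = first.Clone();
        bool found = bounded
            ? search.MoveToFollowing("TICKER", "", FindEnd(first))
            : search.MoveToFollowing("TICKER", "");

        return found ? search.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(FindTicker(false) ?? "(not found)");
        Console.WriteLine(FindTicker(true) ?? "(not found)");
    }
}
```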

More generic MoveToFollowing

So in the application I’m working on there’s a lot of back and forth in the document nodes, or getting information out of various subtrees where only some of the available information is read. MoveToFollowing is a big help in this scenario, as I can jump directly to nodes without any intermediate navigation. For example, in the XML snippet tree above I might only be interested in the error code, message and FID. These nodes are all over the hierarchy, and using MoveToFollowing makes this dirt simple. But it could be easier yet by abstracting away the namespace and the cloning of navigators.

Unfortunately, XPathNavigator can’t be subclassed directly without a fair amount of new implementation code, and even if you did, CreateNavigator() still hands you a canned instance.

So the next best thing is subclassing XmlDocument and adding higher-level methods that provide easy XPath movement. There are a few abstractions in my subclass that make XPath access a bit easier, using wrapper methods that retrieve typed values directly. The wrapper methods also handle creating temporary navigators so the ‘master’ navigator stays at a fixed position. Using this wrapper makes parsing code super easy to write and, more importantly, deals with the error handling if nodes are not found by returning default values.

For example, here’s a parse routine for a fairly deeply nested node tree.

public bool Parse()
{
    XPathNavigator nav = this.Dom.Navigator;
    string ns = this.Dom.ns;

    // *** Now let's go get the information about each position
    nav.MoveToRoot();
    if (!nav.MoveToFollowing("SECLISTMSGSRSV1", ns))
    {
        this.SetError("Data doesn't contain Securities List");
        return false;
    }

    // *** We'll store our 4 properties in a PositionEntity
    Dictionary<string, PositionsEntity> PosList = new Dictionary<string, PositionsEntity>();

    nav.MoveToFollowing("SECLIST", ns);

    // *** Start by getting all the Security Names and base info
    XPathNodeIterator StockList = nav.SelectChildren(XPathNodeType.Element);
    foreach (XPathNavigator XPathPos in StockList)
    {
        string Id = this.Dom.GetFollowingString(XPathPos, "UNIQUEID");

        SecuritiesEntity Security = new SecuritiesEntity();
        Security.RequestId = this.Parser.RequestId;

        // *** STOCK,MF,DEBT,OTHER, OPT Format: STOCKINFO, MFINFO etc. strip off INFO
        Security.AccountType = XPathPos.Name.Replace("INFO", "");

        Security.UniqueId = this.Dom.GetFollowingString(XPathPos, "UNIQUEID");
        Security.UniqueIdType = this.Dom.GetFollowingString(XPathPos, "UNIQUEIDTYPE");
        Security.Name = this.Dom.GetFollowingString(XPathPos, "SECNAME");
        Security.Ticker = this.Dom.GetFollowingString(XPathPos, "TICKER");
        Security.Rating = this.Dom.GetFollowingString(XPathPos, "RATING");

        // … Additional keys omitted here

        this.Securities.Add(Security.UniqueId, Security);
    }

    return true;
}

The code uses standard XPathNavigator for base navigation, but for actual node retrieval the specialty wrapper functions are used. Again, what’s cool here is that various nodes live in various sublevels of the XML hierarchy, and the MoveToFollowing wrappers basically let me get those values out without any regard for the hierarchy and without intermediate navigation. Although intermediate navigation and direct child access are probably faster, the above code is easier to write and maintain.

Another nice thing about XPathNavigator is that it is damn fast. The test documents I’m parsing are over 150k in size and deeply nested (up to 20 levels deep in some places!). Yet the parsing time is barely measurable in a single pass.
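If you want to check this against your own documents, a rough timing harness might look like the sketch below. Everything in it is an assumption for illustration: the synthetic ROOT/ITEM/LEVEL/NAME document, the sizes, and the class name TimingSketch. It says nothing about the numbers you'll see with real OFX data.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Xml;
using System.Xml.XPath;

public static class TimingSketch
{
    // Builds a synthetic document: 'count' ITEM elements, each wrapping
    // a NAME element nested 'depth' levels deep.
    public static string BuildXml(int count, int depth)
    {
        StringBuilder sb = new StringBuilder("<ROOT>");
        for (int i = 0; i < count; i++)
        {
            sb.Append("<ITEM>");
            for (int d = 0; d < depth; d++) sb.Append("<LEVEL>");
            sb.Append("<NAME>n").Append(i).Append("</NAME>");
            for (int d = 0; d < depth; d++) sb.Append("</LEVEL>");
            sb.Append("</ITEM>");
        }
        return sb.Append("</ROOT>").ToString();
    }

    // Counts NAME elements by hopping through the document
    // with MoveToFollowing until no further match exists.
    public static int CountNames(XmlDocument dom)
    {
        XPathNavigator nav = dom.CreateNavigator();
        int found = 0;
        while (nav.MoveToFollowing("NAME", ""))
            found++;
        return found;
    }

    public static void Main()
    {
        XmlDocument dom = new XmlDocument();
        dom.LoadXml(BuildXml(1000, 20));

        Stopwatch watch = Stopwatch.StartNew();
        int found = CountNames(dom);
        watch.Stop();

        Console.WriteLine("{0} nodes in {1} ms", found, watch.ElapsedMilliseconds);
    }
}
```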

The XmlDocument wrapper might seem an odd choice for subclassing these XPath functions, but rather than create yet another class that had to be passed around with my parsers it seems to make sense to add this functionality on the Dom object which is already passed as part of the top level request parser. I don’t see a need to have XmlDocument, a navigator and a navigator helper. Ideally the methods should have landed on the XPathNavigator, but given the inheritance structure that’s not possible without a lot of work.

So XmlDocument it is and here’s what that subclass looks like:

public class wwXPathXmlDocument : XmlDocument
{
    /// <summary>
    /// Expose the base navigator persistent for parsers
    /// </summary>
    public XPathNavigator Navigator
    {
        get { return _Navigator; }
        set { _Navigator = value; }
    }
    private XPathNavigator _Navigator = null;

    /// <summary>
    /// Keep track of the root namespace that's used
    /// throughout parsing.
    /// </summary>
    public string ns
    {
        get { return _ns; }
        set { _ns = value; }
    }
    private string _ns = "";

    /// <summary>
    /// Override CreateNavigator so that we can hook in
    /// capturing the Namespace and Navigator instance
    /// </summary>
    /// <returns></returns>
    public new XPathNavigator CreateNavigator()
    {
        this.Navigator = base.CreateNavigator();
        if (this.Navigator == null)
            return null;

        this.ns = this.DocumentElement.NamespaceURI;
        return this.Navigator;
    }

    /// <summary>
    /// Retrieves an XPathNavigator for an immediate child node of the navigator
    /// passed specified by the element id
    /// </summary>
    /// <param name="nav"></param>
    /// <param name="ElementId"></param>
    /// <param name="NameSpaceUri"></param>
    /// <returns></returns>
    public XPathNavigator GetChild(XPathNavigator nav, string ElementId, string NameSpaceUri)
    {
        // *** Work on a copy so we don't navigate the passed navigator
        XPathNavigator TNav = nav.CreateNavigator();

        if (string.IsNullOrEmpty(NameSpaceUri))
            NameSpaceUri = this.ns;

        // *** Return null if the child doesn't exist
        if (!TNav.MoveToChild(ElementId, NameSpaceUri))
            return null;

        return TNav;
    }

    /// <summary>
    /// Retrieves an XPathNavigator for an immediate child node
    /// specified by the element id
    /// </summary>
    /// <param name="nav"></param>
    /// <param name="ElementId"></param>
    /// <returns></returns>
    public XPathNavigator GetChild(XPathNavigator nav, string ElementId)
    {
        return this.GetChild(nav, ElementId, this.ns);
    }

    /// <summary>
    /// Navigates to the following node with a given ElementId at any level in the
    /// hierarchy.
    /// </summary>
    /// <param name="nav"></param>
    /// <param name="ElementId"></param>
    /// <param name="NameSpaceUri"></param>
    /// <returns></returns>
    public XPathNavigator GetFollowing(XPathNavigator nav, string ElementId, string NameSpaceUri)
    {
        // *** Create copies so we don't navigate the passed navigator
        XPathNavigator TNav = nav.CreateNavigator();
        XPathNavigator TEnd = nav.CreateNavigator();

        if (string.IsNullOrEmpty(NameSpaceUri))
            NameSpaceUri = this.ns;

        // *** Find the next sibling node
        if (!TEnd.MoveToNext())
        {
            // *** Move up the tree until we find a parent's sibling
            while (true)
            {
                if (TEnd.MoveToParent())
                {
                    if (!TEnd.MoveToNext())
                        // *** move up one level higher
                        continue;
                    else
                        // *** Found our matching node
                        break;
                }
                else
                {
                    // *** No next siblings - must be at the end of the document
                    // *** Allow searching to end
                    TEnd = null;
                    break;
                }
            }
        }

        if (!TNav.MoveToFollowing(ElementId, NameSpaceUri, TEnd))
            return null;

        return TNav;
    }

    /// <summary>
    /// Navigates to the following node with a given ElementId at any level in the
    /// hierarchy.
    /// </summary>
    /// <param name="nav"></param>
    /// <param name="ElementId"></param>
    /// <returns></returns>
    public XPathNavigator GetFollowing(XPathNavigator nav, string ElementId)
    {
        return this.GetFollowing(nav, ElementId, null);
    }

    #region Wrappers for easy access to typed result values with error handling

    public string GetFollowingString(string ElementId)
    {
        return this.GetFollowingString(this.Navigator, ElementId);
    }

    public string GetFollowingString(XPathNavigator nav, string ElementId)
    {
        XPathNavigator TNav = GetFollowing(nav, ElementId);
        if (TNav == null)
            return string.Empty;

        return TNav.Value;
    }

    public int GetFollowingInt(string ElementId)
    {
        return this.GetFollowingInt(this.Navigator, ElementId);
    }

    public int GetFollowingInt(XPathNavigator nav, string ElementId)
    {
        XPathNavigator TNav = GetFollowing(nav, ElementId);
        if (TNav == null)
            return 0;

        return TNav.ValueAsInt;
    }

    public decimal GetFollowingDecimal(string ElementId)
    {
        return this.GetFollowingDecimal(this.Navigator, ElementId);
    }

    public decimal GetFollowingDecimal(XPathNavigator nav, string ElementId)
    {
        XPathNavigator TNav = GetFollowing(nav, ElementId);
        if (TNav == null)
            return 0.0M;

        return (decimal) TNav.ValueAsDouble;
    }

    public DateTime GetFollowingDateTime(string ElementId)
    {
        return this.GetFollowingDateTime(this.Navigator, ElementId);
    }

    public DateTime GetFollowingDateTime(XPathNavigator nav, string ElementId)
    {
        XPathNavigator TNav = GetFollowing(nav, ElementId);
        if (TNav == null)
            return App.APP_MINDATE;

        return OfxUtils.OfxDateToDate(TNav.Value);
    }

    public string GetChildString(XPathNavigator nav, string ElementId)
    {
        XPathNavigator TNav = GetChild(nav, ElementId);
        if (TNav == null)
            return string.Empty;

        return TNav.Value;
    }

    public decimal GetChildDecimal(XPathNavigator nav, string ElementId)
    {
        XPathNavigator TNav = GetChild(nav, ElementId);
        if (TNav == null)
            return 0.0M;

        return (decimal) TNav.ValueAsDouble;
    }

    public int GetChildInt(XPathNavigator nav, string ElementId)
    {
        XPathNavigator TNav = GetChild(nav, ElementId);
        if (TNav == null)
            return 0;

        return TNav.ValueAsInt;
    }

    public DateTime GetChildDateTime(XPathNavigator nav, string ElementId)
    {
        // *** Look at immediate children only, like the other GetChildXXX methods
        XPathNavigator TNav = GetChild(nav, ElementId);
        if (TNav == null)
            return App.APP_MINDATE;

        return OfxUtils.OfxDateToDate(TNav.Value);
    }

    #endregion
}

The key methods are GetChild and GetFollowing, which return an XPathNavigator, or null if the node couldn't be found. The GetFollowing method searches only the subtree below the current node, and it handles creating the temporary navigators for the navigation and end nodes. It also deals with finding the 'end node' of the search, which is a little more complex than it sounds because you have to handle the case where the node is the last one in a node list (and so has no next sibling).

The various GetXXX methods are then wrappers around GetChild and GetFollowing that return typed values, or default values if nodes are not found. In the application I'm working on, nodes are often missing for one FI or another, so having default values is very desirable.
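As a standalone illustration of the default-value idea, here is a minimal sketch that doesn't depend on the class above. The names ChildDefaults, GetChildInt and Read, and the caller-supplied default, are my own for this example. It also goes one step further than the wrappers shown: it uses int.TryParse so that non-numeric content falls back to the default too, rather than relying on ValueAsInt, which can throw when the content isn't a valid integer.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class ChildDefaults
{
    const string Xml =
        "<STATUS><CODE>15500</CODE><SEVERITY>ERROR</SEVERITY></STATUS>";

    // Reads a child element as int; falls back to defaultValue when the
    // child is missing or its text is not a valid integer.
    public static int GetChildInt(XPathNavigator nav, string name, string ns, int defaultValue)
    {
        XPathNavigator tmp = nav.Clone();   // leave the caller's navigator alone
        if (!tmp.MoveToChild(name, ns))
            return defaultValue;

        int value;
        return int.TryParse(tmp.Value, out value) ? value : defaultValue;
    }

    // Convenience for the demo: reads a child of STATUS with -1 as default.
    public static int Read(string name)
    {
        XmlDocument dom = new XmlDocument();
        dom.LoadXml(Xml);

        XPathNavigator status = dom.CreateNavigator();
        status.MoveToFirstChild();          // root node -> STATUS element

        return GetChildInt(status, name, "", -1);
    }

    public static void Main()
    {
        Console.WriteLine(Read("CODE"));      // numeric content
        Console.WriteLine(Read("SEVERITY"));  // present, but not a number
        Console.WriteLine(Read("MISSING"));   // not present at all
    }
}
```

Whether 0, -1 or something else is the right default depends on the data. The point is only that the calling code doesn't have to check for missing nodes itself.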

Obviously not rocket science, but this has really reduced the amount of code that needs to be written for the actual data parsing and this code will be handy in just about any manual parsing scenario I suspect.

The Voices of Reason

Kaerber
March 06, 2007

# re: Some thoughts on Xml parsing with XPathNavigator

If your data is read only, why not use XPathDocument instead of XmlDocument? It's faster, lighter and more optimized for working with XPath queries.

Wally Valters
March 13, 2007

# Not required to be valid XML in OFX

and oftentimes OFX is not valid XML in the first place...

<OFX>
 <SIGNONMSGSRSV1>
  <SONRS>
   <STATUS>
    <CODE>15500
    <SEVERITY>ERROR
    <MESSAGE>Signon Invalid
   </STATUS>
 ...
  </SONRS>
 </SIGNONMSGSRSV1>
</OFX>

Rick Strahl's Weblog