Thoughts from a NeXTStep Guy on Cocoa Development

libxml2 Push Parsing

Sep 03, 2009 by Bill Dudney

The fastest way to parse XML on iPhone OS is with libxml2. You can find out all about libxml2 from the official site here. There is a ton of functionality and we won't have time to go over all of it, so now you know where to go for the full details. In this article we are going to show how to use libxml2 in the real world of iPhone development instead of delving into its huge feature set.

Generally there are two approaches to parsing XML. The first is often referred to as DOM based parsing. In this approach the whole XML document is loaded into memory and an object-oriented hierarchical representation is built in memory. This is a great intuitive approach to using XML, but as you can imagine it can be hugely resource intensive. If you have a massive amount of XML it can sometimes be very difficult to use a DOM based parser because you can consume all your available memory. That is of course bad...

The second approach is called SAX parsing. This family of parsers uses an event model. Every time the parser finds something interesting in the XML (such as an element or a run of characters) an 'event' happens that invokes a callback into your code so you can do what you will with that event. The advantage of this approach is that you can parse tons of XML and not run out of memory, because the parser only keeps a fraction of the entire XML document in memory at once. The disadvantage is in writing the event callbacks. It can be difficult to build the data structures you need to get the information you want from the XML. For example, two elements in an RSS feed (channel and item) have a subelement called 'title'. In order to parse an RSS feed properly your event callback needs to know if it's processing the item's title or the channel's title. So you have to keep track of what elements have been found and in what order so you know which title you have.
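One way to do that bookkeeping is a small stack of element names that is pushed on every start-element event and popped on every end-element event. The sketch below is plain C and does not use libxml2; ElementStack and its functions are hypothetical names made up for illustration, not part of any library or of the sample project.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not libxml2, not the sample project): a fixed-depth
   stack of element names maintained from the start/end element callbacks. */
#define MAX_DEPTH 32

typedef struct {
  const char *names[MAX_DEPTH];
  int depth;
} ElementStack;

/* Call from the start-element callback. The name pointer must stay valid
   while it is on the stack (string literals here; copy it otherwise). */
static void element_push(ElementStack *stack, const char *name) {
  assert(stack->depth < MAX_DEPTH);
  stack->names[stack->depth++] = name;
}

/* Call from the end-element callback. */
static void element_pop(ElementStack *stack) {
  assert(stack->depth > 0);
  stack->depth--;
}

/* Name of the element enclosing the current one, or NULL at the root. */
static const char *element_parent(const ElementStack *stack) {
  return stack->depth >= 2 ? stack->names[stack->depth - 2] : NULL;
}

/* True when the element currently open is a <title> directly inside <item>. */
static int is_item_title(const ElementStack *stack) {
  const char *parent = element_parent(stack);
  return stack->depth > 0 &&
         0 == strcmp(stack->names[stack->depth - 1], "title") &&
         parent != NULL && 0 == strcmp(parent, "item");
}
```

With something like this in place, the characters callback can ask is_item_title() before deciding where the text belongs. A couple of boolean flags do the same job when only one or two elements matter, which is the approach taken later in this article.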

Which approach you take depends on your app, specifically how much XML you have to process. TouchXML is a great DOM based parser. The NSXMLParser class is the Foundation class for doing SAX based parsing. If you want to stay strictly in the ObjC world and not have to think about function pointers and the like, then TouchXML or NSXMLParser are for you. And finally you have libxml2 as an alternative SAX parser. libxml2 is a lot less memory and processor intensive than either of the alternatives. Both TouchXML and NSXMLParser are built on top of libxml2 and both are optimized for their particular uses, so please don't think I'm saying in this article not to use either of them. They are great for what they are meant to do. In this article I want you to see how to use libxml2 if your app uses tons of XML and needs to parse it faster or with less memory.

To demonstrate the use of libxml2 I'm going to modify the EarthquakeParser I built for my Map Kit Screencast. In the example I parse an XML feed from the USGS to find earthquakes in the vicinity of the user and then plot them on the map. The XML parsing is not the focus of the screencast so I used NSXMLParser. Even though the application does not need faster XML parsing, it makes a good example to add libxml2 to. I'll leave the map part of the example to the screencast and just put the earthquakes into a table view. You might recognize this as an example from the ADC, and while the display is similar the XML parsing will be totally different. You can look at the feed yourself here; you will likely have to 'view source' after navigating to that URL to see the XML.

Now that we have the intro out of the way let's get started by pulling the XML via an NSURLConnection. The NSURLConnection class is a beautiful expression of the simplicity of the Cocoa/Foundation APIs and design. Through a really simple API we get concurrent (i.e. threaded) downloading of files. It really is amazing how simple it is to get data from a URL with this class. The official docs are here. I will summarize though.

When you start an NSURLConnection it creates a background thread to pull the data from the URL you provided in the NSURLRequest. When something interesting happens (like some data being received) the connection packages up the data and pushes a call to the delegate onto the main event thread via code that looks more or less like this:

    SEL selector = @selector(connection:didReceiveData:);
    NSMethodSignature *sig = [(id)self.delegate methodSignatureForSelector:selector];
    if(nil != sig && [self.delegate respondsToSelector:selector]) {
      NSInvocation *invocation = [NSInvocation invocationWithMethodSignature:sig];
      [invocation retainArguments];
      [invocation setTarget:self.delegate];
      [invocation setSelector:selector];
      [invocation setArgument:&self atIndex:2];
      [invocation setArgument:&receivedData atIndex:3];
      [invocation performSelectorOnMainThread:@selector(invoke) withObject:NULL
                                                        waitUntilDone:NO];
    }

I don't have access to Apple's source code, this is just informed speculation...

In this way you don't have to manage any threading state or locks or whatever. The data comes to you via the main event loop so you don't even have to be aware of the alternate thread. Here is a graphical representation of what is going on for the visual learners out there;

For each delegate method the connection invokes, the same basic process happens. For example, when an authentication challenge is issued by the server, the connection packages that up as a method invocation on the main thread. Your response is then captured and acted on when you return from the delegate method. This pattern is worth studying in depth; it makes so many things in Cocoa/CocoaTouch much simpler than they would otherwise be.

For our example we are going to use the connection to pull our earthquake feed from the USGS, which in this case leads us to the libxml2 parsing code. Let's start with setting up the connection.

  NSURLRequest *request = [NSURLRequest requestWithURL:[NSURL URLWithString:feedURLString]];
  NSURLConnection *con = [[NSURLConnection alloc] 
                          initWithRequest:request
                          delegate:self];
  self.connection = con;
  [con release];
  ...
  if(self.connection != nil) {
    do {
      [[NSRunLoop currentRunLoop] runMode:NSDefaultRunLoopMode 
                                                         beforeDate:[NSDate distantFuture]];
    } while (!_done && !self.error);
  }

We have to run the current run loop so that the connection gets serviced; the underlying CoreFoundation machinery it uses to download the data delivers its messages through the run loop. There is a run loop for every thread in your process, which you can get by asking NSRunLoop for the currentRunLoop.

Next, here is the implementation of the delegate method that the connection calls when data is received.

- (void)connection:(NSURLConnection *)connection 
    didReceiveData:(NSData *)data {
  // Process the downloaded chunk of data.
  xmlParseChunk(_xmlParserContext, (const char *)[data bytes], (int)[data length], 0);
}

For each chunk of data that comes in we pass it off to the libxml2 function xmlParseChunk. We will look at creating the xmlParserContext in just a second; for now look at how simple this is. All we do is pass the bytes from the data off to the libxml2 parser context and we are done. Nice and easy, beautiful. When the connection finishes downloading all the XML it sends us this message:

- (void)connectionDidFinishLoading:(NSURLConnection *)connection {
  // Signal the context that parsing is complete by passing "1" as the last parameter.
  xmlParseChunk(_xmlParserContext, NULL, 0, 1);
  _done = YES;
}

Here we pass NULL and a length of 0 into the xmlParseChunk function and set the last argument (terminate) to 1 to let libxml2 know that there is no more XML.

Now that we have seen how the connection invokes the XML parser code, let's look at how we set up the XML parser.

  _xmlParserContext = xmlCreatePushParserCtxt(&simpleSAXHandlerStruct, self, NULL, 0, NULL);

Deceptively simple: all we do is call a single function to get libxml2 set up to parse our XML. The first argument is a pointer to a structure full of function pointers. This is how C based programs do callbacks. Remember that SAX parsers are event based, so each significant event (like an element being found) is sent to a callback function. The second argument is the user data; we pass in self as this 'context' pointer. The context is passed into each of the callback functions. We will see in just a minute why that is important. In our case the list of functions looks like this:

	static xmlSAXHandler simpleSAXHandlerStruct = {
	  NULL,                       /* internalSubset */
	  NULL,                       /* isStandalone   */
	  NULL,                       /* hasInternalSubset */
	  NULL,                       /* hasExternalSubset */
	  NULL,                       /* resolveEntity */
	  NULL,                       /* getEntity */
	  NULL,                       /* entityDecl */
	  NULL,                       /* notationDecl */
	  NULL,                       /* attributeDecl */
	  NULL,                       /* elementDecl */
	  NULL,                       /* unparsedEntityDecl */
	  NULL,                       /* setDocumentLocator */
	  NULL,                       /* startDocument */
	  endDocumentSAX,             /* endDocument */
	  NULL,                       /* startElement*/
	  NULL,                       /* endElement */
	  NULL,                       /* reference */
	  charactersFoundSAX,         /* characters */
	  NULL,                       /* ignorableWhitespace */
	  NULL,                       /* processingInstruction */
	  NULL,                       /* comment */
	  NULL,                       /* warning */
	  errorEncounteredSAX,        /* error */
	  NULL,                       /* fatalError //: unused error() get all the errors */
	  NULL,                       /* getParameterEntity */
	  NULL,                       /* cdataBlock */
	  NULL,                       /* externalSubset */
	  XML_SAX2_MAGIC,             /* initialized: marks this as a SAX2 handler so the ...Ns callbacks below are used */
	  NULL,                       // private
	  startElementSAX,            /* startElementNs */
	  endElementSAX,              /* endElementNs */
	  NULL,                       /* serror */
	};

That looks like a lot, but really it's just placing the function pointers (endDocumentSAX for example) into the correct slot in the xmlSAXHandler structure so that libxml2 knows which function to call for each event. So when an element is found the startElementSAX function will be invoked. The function declarations (which arguments each takes, etc.) are all defined and documented on the libxml2 website. I don't have time to look at each of these functions (the example code has all this laid out, look there for the rest of them) so we will just take a look at one, the startElementSAX function.

static void startElementSAX(void *ctx, const xmlChar *localname, const xmlChar *prefix,
                            const xmlChar *URI, int nb_namespaces, const xmlChar **namespaces,
                            int nb_attributes, int nb_defaulted, const xmlChar **attributes) {
  [((EarthquakeParser *)ctx) elementFound:localname prefix:prefix uri:URI 
                       namespaceCount:nb_namespaces namespaces:namespaces
                       attributeCount:nb_attributes defaultAttributeCount:nb_defaulted
                           attributes:(xmlSAX2Attributes*)attributes];
}

Here we see why having the context defined as an ObjC object is a 'good thing'. Since the context is our EarthquakeParser we can invoke methods on this object and thus move from the C world back into the comfortable ObjC world. The implementation of elementFound:prefix:uri:namespaceCount:namespaces:attributeCount:defaultAttributeCount:attributes: is too long to place here, but there are a couple of interesting things to note.

The xmlSAX2Attributes structure is a type of my own (that I found through googling) that makes it easier to process the attributes. The attributes are sent to the startElementSAX function as a flat array of pointers, five per attribute (local name, prefix, URI, start of the value, end of the value), so I made a struct with five matching fields. Casting the array to that struct type makes it easier to decode.

struct _xmlSAX2Attributes {
  const xmlChar* localname;
  const xmlChar* prefix;
  const xmlChar* uri;
  const xmlChar* value;
  const xmlChar* end;
};
typedef struct _xmlSAX2Attributes xmlSAX2Attributes;

The other interesting thing about the attributes is that the value is not NUL-terminated, which means you have to do some pointer arithmetic (end minus value) to figure out how long the value really is. Now that we know about the attributes, let's look at how to use the elementFound... method to get data out of our XML.
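As a rough illustration of that pointer arithmetic, here is a minimal sketch in plain C. It assumes the struct layout shown above; the xmlChar typedef is a stand-in so the snippet compiles without libxml2, and copy_attribute_value is a hypothetical helper, not something from the library or the sample project.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for libxml2's xmlChar so the sketch compiles without the library. */
typedef unsigned char xmlChar;

/* Same five-pointer layout as the xmlSAX2Attributes struct above. */
typedef struct {
  const xmlChar *localname;
  const xmlChar *prefix;
  const xmlChar *uri;
  const xmlChar *value; /* first byte of the value */
  const xmlChar *end;   /* one past the last byte of the value */
} xmlSAX2Attributes;

/* Hypothetical helper: copy an attribute value into a caller-supplied buffer
   and NUL-terminate it. Returns the value length, or -1 if it does not fit. */
static int copy_attribute_value(const xmlSAX2Attributes *attr,
                                char *buffer, size_t capacity) {
  size_t length = (size_t)(attr->end - attr->value);
  if (length + 1 > capacity) {
    return -1;
  }
  memcpy(buffer, attr->value, length);
  buffer[length] = '\0';
  return (int)length;
}
```

On the ObjC side the same length (end minus value) can be handed to initWithBytes:length:encoding: to build an NSString directly, without the intermediate buffer.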

First, for each element in the XML that I'm interested in, I define two constants like this.

	static const char *kEntryElementName = "entry";
	// strlen("entry") + 1, so the terminating NUL is part of the comparison
	static NSUInteger kEntryElementNameLength = 6;

These two constants give us a name and a length to compare the element name against using strncmp, which bounds the comparison to a known length. Because the length includes the terminating NUL, only an exact match succeeds. In this method I set up an if/else if/else block, with one if or else if for each element I want to process. The code looks like this.

if(0 == strncmp((const char *)localname, kEntryElementName, kEntryElementNameLength)) {
  self.currentEarthquake = [[[Earthquake alloc] init] autorelease];
} else if(0 == strncmp((const char *)localname, kLinkElementName, kLinkElementNameLength)) {
  for(int i = 0;i < attributeCount;i++) {
    ...
    if(0 == strncmp((const char*)attributes[i].localname, kRelAttributeName,
                    kRelAttributeNameLength)) {
      // process the rel attribute
    } else if(...) {
      // process a different attribute
    }
    ...
  }
} else if(0 == strncmp((const char *)localname, kTitleElementName, kTitleElementNameLength)) {
  _title = [[NSMutableString alloc] init];
  _parsingTitle = YES;
}
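The length constants used in these comparisons are one longer than the visible name. A small stand-alone C sketch of why that matters follows; is_entry_element and has_entry_prefix are hypothetical helpers written for this illustration, and only the "entry" name and its length of 6 come from the code above.

```c
#include <assert.h>
#include <string.h>

static const char *kEntryElementName = "entry";
/* sizeof on a string literal counts the terminating NUL: 5 characters + 1 = 6. */
static const size_t kEntryElementNameLength = sizeof("entry");

/* Returns 1 only when name is exactly "entry". */
static int is_entry_element(const char *name) {
  return 0 == strncmp(name, kEntryElementName, kEntryElementNameLength);
}

/* For contrast: comparing only strlen("entry") bytes also accepts longer
   names that merely start with "entry". */
static int has_entry_prefix(const char *name) {
  return 0 == strncmp(name, kEntryElementName, strlen(kEntryElementName));
}
```

With the shorter length an element called entryList would be treated as an entry, so if you compute the lengths yourself it is probably safest to derive them with sizeof rather than count by hand.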

In the if/else block above we process each interesting element and each interesting attribute of those elements. Of course there is a lot of detail in how these elements and attributes are processed, but that is mostly application specific. For this example there are a couple of elements (link for example) that have attributes that need to be processed. For other elements (like title) we only care about the text contained in the element. Your XML schema and the data you need from instances of that schema will dictate what kind of processing you need to do here. When you do care about the content of the element you can process it as we do here: turn on a flag that says you are parsing the title and make a mutable string to hold the found characters. Then in the libxml2 callback invoke another method on your parser like this.

static void charactersFoundSAX(void *ctx, const xmlChar *ch, int len) {
  [((EarthquakeParser *)ctx) charactersFound:ch length:len];
}

The parser then processes the characters like this.

- (void)charactersFound:(const xmlChar *)characters length:(int)length {
  if(_parsingTitle) {
    NSString *value = [[NSString alloc] initWithBytes:(const void *)characters
                                               length:length encoding:NSUTF8StringEncoding];
    [_title appendString:value];
    [value release];
  }
  ...
}
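The reason the method appends rather than assigns is that a SAX parser is free to deliver the text of a single element across several characters callbacks, especially when the document arrives in chunks. The plain C sketch below shows the same accumulate-then-read idea without Foundation; TextBuffer and its functions are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulator, a plain C analogue of the NSMutableString above. */
typedef struct {
  char *bytes;     /* NUL-terminated once anything has been appended */
  size_t length;   /* bytes stored, not counting the terminator */
  size_t capacity; /* bytes allocated */
} TextBuffer;

/* Append len bytes that are not NUL-terminated, as handed to a characters
   callback. Returns 0 on success, -1 on allocation failure. */
static int text_buffer_append(TextBuffer *buffer, const unsigned char *chars, int len) {
  assert(chars != NULL && len >= 0);
  size_t needed = buffer->length + (size_t)len + 1;
  if (needed > buffer->capacity) {
    size_t capacity = buffer->capacity ? buffer->capacity : 64;
    while (capacity < needed) {
      capacity *= 2;
    }
    char *grown = realloc(buffer->bytes, capacity);
    if (grown == NULL) {
      return -1;
    }
    buffer->bytes = grown;
    buffer->capacity = capacity;
  }
  memcpy(buffer->bytes + buffer->length, chars, (size_t)len);
  buffer->length += (size_t)len;
  buffer->bytes[buffer->length] = '\0';
  return 0;
}

/* Release the storage, e.g. from the end-element callback. */
static void text_buffer_reset(TextBuffer *buffer) {
  free(buffer->bytes);
  buffer->bytes = NULL;
  buffer->length = 0;
  buffer->capacity = 0;
}
```

Only the len bytes passed in are copied, since the pointer handed to the callback points into the parser's own buffer and is not terminated at the end of the text.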

Then you can capture the characters you found in your XML via the end of element function like this.

static void endElementSAX(void *ctx, const xmlChar *localname, const xmlChar *prefix,
                          const xmlChar *URI) {    
  [((EarthquakeParser *)ctx) endElement:localname prefix:prefix uri:URI];
}

The parser processes the end of the title element like this.

- (void)endElement:(const xmlChar *)localname
            prefix:(const xmlChar *)prefix
               uri:(const xmlChar *)URI {
  if(0 == strncmp((const char *)localname, kTitleElementName, kTitleElementNameLength)) {
    //M 5.8, Banda Sea
    NSArray *components = [_title componentsSeparatedByString:@","];
    if(components.count > 1) {
      // strip the M
      NSString *magString = [[[components objectAtIndex:0] 
                               componentsSeparatedByString:@" "] objectAtIndex:1];
      NSNumberFormatter *formatter = [[NSNumberFormatter alloc] init];
      self.currentEarthquake.magnitude = [formatter numberFromString:magString];
      self.currentEarthquake.place = [components objectAtIndex:1];
      [formatter release];
    }
    [_title release];
    _title = nil;
    _parsingTitle = NO;
  } else if(0 == strncmp((const char *)localname, kUpdatedElementName,
                         kUpdatedElementNameLength)) {
	...
  }
}
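The title handling above assumes every title looks like "M 5.8, Banda Sea"; a title whose first part has no space would make objectAtIndex:1 raise, and the place keeps its leading space. As a hedged illustration of a more defensive split, here is a plain C sketch. parse_quake_title is a hypothetical helper, not part of the sample project, and the expected title shape is taken only from the comment in the code above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split a title such as "M 5.8, Banda Sea".
   On success stores the magnitude, points *place at the text after the comma
   (leading spaces skipped, still inside title), and returns 1.
   Returns 0 when the title does not have the expected shape. */
static int parse_quake_title(const char *title, double *magnitude, const char **place) {
  assert(title != NULL && magnitude != NULL && place != NULL);
  const char *comma = strchr(title, ',');
  if (comma == NULL || title[0] != 'M' || title[1] != ' ') {
    return 0;
  }
  char *number_end = NULL;
  double value = strtod(title + 2, &number_end);
  /* The number must be non-empty and run right up to the comma. */
  if (number_end == title + 2 || number_end != comma) {
    return 0;
  }
  const char *rest = comma + 1;
  while (*rest == ' ') {
    rest++;
  }
  *magnitude = value;
  *place = rest;
  return 1;
}
```

The same checks translate directly to the ObjC version: test the component counts before calling objectAtIndex:, and trim whitespace from the place before storing it.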

So that pretty much wraps up how I use libxml2. I hope it's useful to you. Before we part though I wanted to show you an ObjectAlloc trace from Instruments.

As you can see from this graph the NSXMLParser based solution takes more memory and more time. Of course you get a lot in return for going that route: NSXMLParser protects you from C function pointers and xmlChar pointers to pointers. If you aren't quite ready to take the plunge into the intricacies of C, feel free to stay with NSXMLParser.

You can grab the sample project here.

Happy Parsing!

[UPDATE] There was a syntax error (thanks for pointing it out Shogo) in the uploaded code bundle, so if you've downloaded it and can't get it to work please grab the bundle again.



Comments:

Hi Bill,

In addition to wanting to use libxml2 for our iPhone App, we also have a C/Carbon daemon which uses the older CFXML APIs in CoreFoundation. The 10.6 release notes clearly state these calls will be obsoleted. I looked in Xcode 3.2 and discovered I can add libxml2.2.7.3, which appears to be newer than the link you posted. If you're a brave soul and developing on 10.6, Xcode 3.2 already includes libxml2.

Posted by Ned Hogan on September 03, 2009 at 04:43 PM MDT #

Nice write up, thanks! Just one addition: Usually there is one more common way of parsing XML and that is pull-parsing. Not sure about the implementations in the C/ObjC world though.

Posted by Torsten Curdt on September 03, 2009 at 04:43 PM MDT #

@Torsten - from reading what I could find on pull-parsing, it sounds like the same thing that libxml2 refers to as push parsing. I guess it's a matter of perspective (you are pulling incrementally from the server, and pushing incrementally into the parser). Thanks for the comment!

@Ned - Of course I'm on Snow Leopard :) All you have to do to use libxml2 is to include the library in your list of linked frameworks and it works like a charm. The example project has the lib added.

Posted by Bill Dudney on September 03, 2009 at 04:49 PM MDT #

If you're thinking about using TouchXML, you should also take a look at KissXML which was inspired by TouchXML but supports the full NSXML API as well as XML generation.

I'm currently using KissXML + ASIHTTPRequest for SOAP integration and it works a treat (although I'd much rather be using REST/JSON than SOAP/XML). At the moment I'm forgoing memory concerns and using DOM based parsing, because it lets me use XPath expressions to simplify the integration a little. I figure that if I hit memory issues I'll do the drudge work of converting my XML parsing to SAX-based parsing, but in the mean time XPath is much more expressive.

Posted by Nathan de Vries on September 04, 2009 at 08:27 AM MDT #

Hi Bill,

Thanks for a great tutorial. Not sure if you are aware but at the time of downloading (4th Sept 09) you have a syntax error in the sample code.

In particular, EarthquakeParser.m :-

1) There is a missing semi-colon at the end of the line ...
NSURLRequest *request = [NSURLRequest requestWithURL:[NSURL URLWithString:feedURLString]]
(Method is - (BOOL)parseWithLibXML2Parser)

Hope this helps
Shogo
Bone Desert Software

Posted by Shogo on September 04, 2009 at 08:27 AM MDT #

@Shogo - thanks for pointing it out, very irritating as I ran it again before uploading but that file must not have been on disk when I clicked the go button. grr anyway its fixed now.

thanks again.

Posted by Bill Dudney on September 04, 2009 at 08:30 AM MDT #

Thank you very much, this tutorial helped me a lot in getting started with libxml parsing

Posted by vikrant dwivedi on June 30, 2012 at 05:50 AM MDT #

DON'T DO THIS:

_xmlParserContext = xmlCreatePushParserCtxt(&simpleSAXHandlerStruct, self, NULL, 0, NULL)

...libxml is badly broken if you pass anything except "NULL" as the 2nd argument. Took me many hours, but I eventually found a reference on a mailing list where someone debugged this, but libxml hasn't so far been updated to fix the bug (writing now in 2013).

It will SEEM TO WORK, but internally, libxml breaks various things - particularly entity handling :(.

Posted by adam on January 08, 2013 at 01:22 PM MST #

thanks you so much...
your code is so simple to understand :)

Posted by Vaibhav Saran on March 26, 2013 at 05:39 AM MDT #
