Once upon a time there was a Ruby library called Hpricot. Well, it’s still here, in fact. This library is the de facto standard for parsing HTML in Ruby, and is often used to parse XML as well.
Hpricot is normally considered to be quite fast, as far as Ruby libraries go. Yet Nokogiri recently garnered some buzz thanks to a microbenchmark that showed it parsing XML faster than Hpricot. And I can’t stress the “micro” part enough, since this was the file that was tested:
<b>Singapore</b> is an island-state in Southeast
Asia, connected by bridges to Malaysia. Founded as a British trading colony
in 1819, since independence it has become one of the <b>world's
most prosperous countries</b> and sports the world's busiest
port. Combining the skyscrapers and subways of a <b>modern,
affluent city</b> with a medley of Chinese, Indian and Malay
influences and a <b>tropical climate</b>, with
tasty food, good shopping and a vibrant nightlife scene, this Garden City
makes a great stopover or springboard into the region.
Over the weekend, why made a few tweaks to his library et voilà, it was suddenly faster than Nokogiri in terms of parsing an XML document smaller than 18K. It’s nice to see them striving to improve the speed of these libraries. After all, for parsing HTML or even the occasional small XML document, those two Ruby libraries are fine and have their place.
DB2 users do not generally need them for XML though. In fact, DB2 offers a technology called pureXML to help in this area. In short, XML documents can be stored, indexed, cached, queried, updated, validated and compressed within the database in XML columns. This means that the data is securely stored, properly backed up, and easily restored. What’s more, there is no need to parse large strings to obtain an object representation of the XML document(s). Queries and updates require no parsing at all, since the XML is stored in a parsed hierarchical format. You simply ask for the data that you need (or need to update) directly, and DB2 will eagerly oblige. You can use XPath, XQuery, and also integrate SQL and XML queries to retrieve relational and XML data. It’s as easy as it gets, and of course, all this supports Unicode.
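For contrast, here is roughly what the application-side approach looks like: the whole string has to be parsed into an object tree before a single XPath expression can run against it. This is only a minimal sketch. It uses REXML, which ships with the standard Ruby distribution, rather than Hpricot or Nokogiri, and the `guide` wrapper element and the shortened text are assumptions made here so that the fragment is well-formed XML.

```ruby
require 'rexml/document'

# A shortened, well-formed stand-in for the benchmark file.
# The <guide> root element is an assumption added for this sketch.
xml = <<~XML
  <guide>
    <b>Singapore</b> is an island-state in Southeast Asia with a
    <b>tropical climate</b>.
  </guide>
XML

# Step 1: parse the entire string into an in-memory tree.
doc = REXML::Document.new(xml)

# Step 2: only now can an XPath expression be evaluated against it.
bold = REXML::XPath.match(doc, '//b').map(&:text)

puts bold.inspect
```

With the XML already stored in parsed form inside the database, step 1 is the work the application no longer has to repeat for every query.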
And DB2 pureXML is blazingly fast. How fast? Well, at IBM we like benchmarks too, only we don’t use 18K of XML data. With very little tweaking (pretty much letting DB2 automate and self-tune almost everything), DB2 was tested with the help of the good folks over at Intel, on the latest Xeon processors. The TPoX (Transaction Processing over XML Data) open source benchmark was used. This is a well-balanced and realistic benchmark that proposes a financial transaction processing scenario. The raw XML data was 1 terabyte and was stored in three very straightforward tables (with XML indexes). DB2 managed to store all the information, including the indexes, in just 440 GB. And the raw data was 1 TB without indexes.
Compression aside, we were pretty much blown away by how immensely fast DB2 is. Throughput with 200 concurrent users was stable throughout the 2-hour run, and under a mixed workload DB2 performed about 34 million queries, almost 7 million updates, and almost 4 million insertions and deletions (for a total of almost 48.5 million transactions). To be exact, the average was 6,763.42 transactions per second. The two tables upon which inserts were performed did 4,913 inserts per second (4 to 20 KB of data) and 11,904 inserts per second (1 to 2 KB), respectively. Not only is DB2 often benchmarked as the fastest database in the world when it comes to relational data, but it’s also undisputedly state of the art when it comes to XML handling.
As a side note, by switching from Intel Xeon 7300 processors (4 cores) to Intel Xeon 7400 CPUs (6 cores), the number of cores was increased by 50%. DB2 managed to increase its throughput by 48%. You know, we don’t waste anything in this neighborhood. 😉 For full details about the results of the benchmark, you can download the slides of the IOD presentation (PDF warning).
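To put those two percentages side by side, here is a quick back-of-the-envelope calculation. It is only a sketch: the variable names are made up for illustration, and "scaling efficiency" is defined here simply as the throughput gain divided by the core-count gain, which is not a figure taken from the benchmark report.

```ruby
# Core counts per processor, as quoted above.
cores_before = 4
cores_after  = 6

# Relative increase in cores: (6 - 4) / 4 = 0.5, i.e. 50%.
core_increase = (cores_after - cores_before).to_f / cores_before

# Throughput increase quoted above (48%).
throughput_increase = 0.48

# Assumed definition: share of the extra cores that shows up as extra throughput.
scaling_efficiency = throughput_increase / core_increase

puts "Core increase:      #{(core_increase * 100).round(1)}%"
puts "Scaling efficiency: #{(scaling_efficiency * 100).round(1)}%"
```

Under that definition, roughly 96% of the additional cores translated into additional throughput.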
Antonio, I hear you say, we don’t have that kind of hardware. That’s true, but you probably don’t need to process 18 million documents an hour either. What you do have is that kind of software, for free. In fact, DB2 Express-C is a free version of DB2 that doesn’t impose any restrictions on the size of the database, how many users can be connected or how many databases you can have. It has the same code that ran the test above. So you can have a lightning-fast XML engine that opens up a world of possibilities, free of charge.
If your startup or established company is serious about XML, DB2 Express-C 9.5 is a godsend. Now you can even try DB2 Express-C 9.5.2 Beta (currently available only on Windows). This version ships with both pureXML and a very fast Text Search technology, so that you don’t have to use fuzzy creatures like ferrets and sphinxes.
Disclaimer: The opinions expressed in this post, and any last-minute remarks about how Oracle’s license won’t allow me to publish comparative benchmarks, are mine and mine alone, and do not necessarily represent the opinion of my employer, IBM, or the aforementioned Intel.