RDFa worst case memory usage for SAX-based parsers the same as DOM-based parsers

Maybe the rest of you already knew this, but I just came to the
realization that SAX-based parsers for RDFa have no worst-case memory
advantage over DOM-based parsers.

The root of the issue lies with XML Literals and Plain Literals. Since
these need to be tracked as you go down and back up the XHTML tree, you
end up holding almost every character of the XHTML document in memory.

Take this example:

<body about="">
   <span property="[foo:bar]" />
   <!-- repeat the span above 1000 times -->
</body>

Since a SAX parser can't jump around in the document the way a DOM-based
parser can, it doesn't know if the <body> element has a parent element
that requires the XML Literal or the Plain Literal, so it must collect
both, which takes a relatively large amount of memory. The XML Literal
for the <body> element ends up being a direct copy of all 1001 <span>
elements.
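
To make the buffering concrete, here is a rough sketch in Python using the
standard xml.sax module. It is not taken from any real RDFa parser: the
class name LiteralBufferingHandler, the measure() helper, and the
peak_chars counter are all made up for illustration, and it tracks only the
XML Literal (a Plain Literal buffer would sit alongside it). It assumes the
parser keeps the serialized content of every open element until that
element closes.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class LiteralBufferingHandler(xml.sax.ContentHandler):
    """Hypothetical handler: keeps the XML Literal of every open element."""

    def __init__(self):
        super().__init__()
        self.stack = []           # one [start_tag, name, fragments] frame per open element
        self.buffered = 0         # characters currently held across all frames
        self.peak_chars = 0       # high-water mark of self.buffered
        self.root_literal = None  # XML Literal of the outermost element

    def _append(self, text):
        # Add serialized content to the innermost open element's buffer.
        self.stack[-1][2].append(text)
        self.buffered += len(text)
        self.peak_chars = max(self.peak_chars, self.buffered)

    def startElement(self, name, attrs):
        attr_text = "".join(
            " %s=%s" % (key, quoteattr(value)) for key, value in attrs.items()
        )
        self.stack.append(["<%s%s>" % (name, attr_text), name, []])

    def characters(self, content):
        if self.stack:
            self._append(escape(content))

    def endElement(self, name):
        start_tag, _, fragments = self.stack.pop()
        inner = "".join(fragments)
        self.buffered -= len(inner)
        if self.stack:
            # The closed element becomes part of its parent's XML Literal.
            self._append("%s%s</%s>" % (start_tag, inner, name))
        else:
            self.root_literal = inner


def measure(document):
    """Parse a document string and return the handler with its counters."""
    handler = LiteralBufferingHandler()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler


# The example from this message: one <span> plus 1000 repeats.
document = '<body about="">' + '<span property="[foo:bar]" />' * 1001 + "</body>"
handler = measure(document)
print(handler.peak_chars, len(document))
```

In this sketch the peak number of buffered characters comes out comparable
to the length of the document itself, because every closed <span> is
folded into the buffer held for <body>.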

Even in the best-case implementation of a SAX-based parser, you end up
storing almost the entire XHTML document in memory, making it no less
memory intensive than a DOM-based approach.

So much for a small memory footprint parser.

-- manu

-- 
Manu Sporny
President/CEO - Digital Bazaar, Inc.

Received on Wednesday, 11 June 2008 02:10:57 UTC