Binary XML (was: Re: Draft minutes of 15 March 2005 Telcon)

Robin Berjon writes:

> Some of the things people were spending time on
> were XML-related. For example UTF-8 to UTF-16
> conversion

I don't think I buy this as a rationale for a binary XML standard.  The 
line of reasoning I see in the above is: 

XML is text, often UTF-8.  As an industry we went and cooked up APIs that 
pass around all the strings as UTF-16, which to be fair is common on many 
platforms.  Not surprisingly, there are conversion overheads, and I agree 
they are very significant.

Why does this problem justify a binary XML standard?  Instead of making 
the platform or the API more efficient at dealing with UTF-8, which seems 
like a good investment on that platform, we're going to force the whole 
industry to accept interchange of a new form of XML?  That binary form's 
representation of strings may or may not go into your API with lower 
conversion overhead, but I do note that Java in particular uses UTF-16 
under the covers, and you can, if you wish, use UTF-16 for XML today.

We've done some work in this area in IBM.  I am not at all convinced that 
the answer to platforms and APIs that are bad at manipulating UTF-8 is to 
define a binary XML.  There's a lot you can do to avoid character 
conversions if you're careful and your API is suitably designed.  Indeed, 
it seems to me that things are just dandy in XML for use with platforms 
that do UTF-8 efficiently.  Will the binary form be faster or slower for 
them?
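
As a rough illustration of the point that XML can already be interchanged 
as UTF-16, here is a minimal Python sketch.  The <order> message and the 
helper name encode_document are made up for the example, and it says 
nothing about how any particular parser performs; it only shows that the 
same document is ordinary XML in either encoding and that moving between 
them is a plain transcoding step.

```python
import xml.etree.ElementTree as ET

# Hypothetical message body; ASCII-only so the byte counts are easy to compare.
BODY = "<order><quantity>123</quantity></order>"


def encode_document(body, encoding):
    """Serialize the document with an XML declaration naming its encoding."""
    declaration = '<?xml version="1.0" encoding="%s"?>' % encoding
    return (declaration + body).encode(encoding)


utf8_doc = encode_document(BODY, "UTF-8")
utf16_doc = encode_document(BODY, "UTF-16")  # Python prepends a byte order mark

# Both byte streams are ordinary XML and parse to the same content.
quantity_from_utf8 = ET.fromstring(utf8_doc).findtext("quantity")
quantity_from_utf16 = ET.fromstring(utf16_doc).findtext("quantity")

# Transcoding the body is the conversion step under discussion: for ASCII
# content the UTF-16 form is exactly twice as many bytes as the UTF-8 form.
body_utf8 = BODY.encode("utf-8")
body_utf16 = body_utf8.decode("utf-8").encode("utf-16-le")
```

An API that hands UTF-16 strings to the application could consume the 
second form without transcoding, and one that works on UTF-8 could 
consume the first, in both cases without any new interchange format.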

> or assigning data types with schema to make a
> PSVI. If a binary format already has the PSVI
> information

I think you need to be very careful heading down this path, depending on 
your use case.  The term PSVI in particular relates to schema validation. 
In many cases the reason you are doing schema validation is because you 
don't entirely trust the source of the data.  Once you're doing other 
aspects of validation to check the data, I would claim (having built such 
systems) that type assignment is nearly free in many cases.   The same is 
true for many deserialization use cases, even where you don't use XML 
Schema for validation:  if you know you're deserializing a "quantity" 
field then the deserializer very often has static knowledge that it's an 
int.  I don't see why there's overhead for that in the common use cases.
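
A minimal sketch of what I mean, in Python, assuming a hypothetical 
<order> message with a "quantity" child; the function name and message 
shape are invented for the example.  The deserializer knows statically 
that the field is an integer, so checking the characters and assigning 
the type happen in the same step.

```python
import re
import xml.etree.ElementTree as ET

# Lexical form of an integer: optional sign followed by decimal digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def read_quantity(message):
    """Return the quantity of a hypothetical <order> message as an int."""
    text = ET.fromstring(message).findtext("quantity")
    # Checking the untrusted characters is the work we must do anyway.
    if text is None or not INTEGER_LEXICAL.fullmatch(text.strip()):
        raise ValueError("quantity is missing or is not an integer")
    # Having checked them, producing the typed value costs almost nothing.
    return int(text.strip())
```

The check on the characters is needed whenever the sender is not fully 
trusted; once it has been done, the typed value comes along with it.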

Maybe what you're hinting at is that for an integer you're going to send 
the binary "int" and not the character string.  If so, then that's not XML in 
a deeper sense, and the fact that you know the "PSVI" type is incidental 
to the fact that you've moved from characters to abstract numbers.   With 
the binary "int", you can't distinguish "123" from "00123", and that's a 
huge difference.  For example, an XML DSIG over the two would be 
different.  In any case, now you're into sending something closer to a 
subset of the XQuery 1.0 and XPath 2.0 data model than optimized XML.  An 
interesting thing to consider, but it has all sorts of deep implications. 
SOAP, in particular, uses infosets.  In a SOAP message, "123" is different 
from "00123", even if the schema or xsi:type claims you've got an integer. 
 DSIGs on the two will be different.
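
To make the "123" versus "00123" point concrete, here is a small Python 
sketch.  It is not an XML DSIG implementation: a real signature 
canonicalizes the XML before digesting it, and the SHA-256 hash over the 
bare characters below is only a stand-in for that digest.  The helper 
names and the 32-bit big-endian encoding are assumptions chosen for the 
illustration.

```python
import hashlib
import struct


def as_binary_int(lexical):
    """Encode the value as a 32-bit big-endian integer, dropping the characters."""
    return struct.pack(">i", int(lexical))


def digest_of_characters(lexical):
    """Hash the characters themselves, as a stand-in for a signature digest."""
    return hashlib.sha256(lexical.encode("utf-8")).hexdigest()


# The two lexical forms collapse to the same binary value...
same_binary = as_binary_int("123") == as_binary_int("00123")

# ...but anything computed over the characters still tells them apart.
same_digest = digest_of_characters("123") == digest_of_characters("00123")
```

Here same_binary is true and same_digest is false, which is the 
information a receiver loses if only the binary "int" is sent.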

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------








Chris Lilley <chris@w3.org>
03/17/05 12:31 PM
Please respond to Chris Lilley

 
        To:     Robin Berjon <robin.berjon@expway.fr>
        cc:     noah_mendelsohn@us.ibm.com, www-tag@w3.org
        Subject:        Re: Draft minutes of 15 March 2005 Telcon


On Wednesday, March 16, 2005, 7:54:21 PM, Robin wrote:

RB> noah_mendelsohn@us.ibm.com wrote:


>> DO: I thought that one of the interesting presentations at the workshop
>> from Sun analyzed not just message size (and thus network overhead) but
>> also what was happening in the processor.
>> ... A lot of time was spent in the binding frameworks.
>> ... Even if you came along and doubled the network performance by 
>> halving the size, you might get only 1/3 of improvement

RB> Yes, if you're doing a lot of other things that aren't XML, then 
RB> speeding up XML won't help. But when you're rendering an SVG document
RB> and the vast majority of your time is spent waiting for the network 
RB> and parsing the XML, then you know there's going to be speedup.

Some of the things people were spending time on were XML-related. For
example UTF-8 to UTF-16 conversion (to create a DOM) or assigning data
types with schema to make a PSVI.

If a binary format already has the PSVI information and speeds up the
production of a DOM (or obviates the need to construct a separate data
structure to implement the DOM APIs efficiently, might be a better way of
putting it) that would result in a significant speedup.

It might not be measured in x times smaller or x times faster to parse,
though. But it would show up in transactions-per-second measurements.



-- 
 Chris Lilley                    mailto:chris@w3.org
 Chair, W3C SVG Working Group
 W3C Graphics Activity Lead

Received on Thursday, 17 March 2005 19:47:18 UTC