Re: question: Increasing factor for XML vs Binary from Stephen D. Williams on 2004-11-19 (public-xml-binary@w3.org from November 2004)

From: Stephen D. Williams <sdw@lig.net>
Date: Fri, 19 Nov 2004 18:32:20 -0500
To: "Cutler, Roger (RogerCutler)" <RogerCutler@chevrontexaco.com>
Cc: Mike Champion <mc@xegesis.org>, Silvia.De.Castro.Garcia@esa.int, public-xml-binary@w3.org
Message-ID: <419E8284.6070108@lig.net>
One person's metadata is another person's data.
One person's obvious is another's revelation.
Logic is a system whereby one may go wrong with confidence.
I'll add to the possibly-obvious thread by rambling a bit on the 
representation of structure, metadata, and data.  Please correct any 
mistakes you see:


A few abstract scenarios may cover the level of detail needed to specify 
when deltas are appropriate:

    * Intra-instance redundancy
          o In the structure: In cases where there is a lot of
            redundancy or verbosity in the structure, tokenization,
            dictionaries, schema-based factoring, and deltas can make a
            big difference.
          o Obvious in the data: When there is a lot of redundancy in
            the data but little structural overhead, tokenization and
            schema-based factoring don't help much (although typed
            binary data, especially range-restricted, might) while
            dictionaries and delta coding can make a giant difference.
          o Hidden in the data: When the structure isn't significantly
            costly (i.e. few tags, large blocks of data) and the data is
            too random to delta, neither structural optimization nor
            deltas will help.  In this case you need a better encoding
            of the scalar data.
    * Inter-instance redundancy
          o Pure schema-based factoring does not take advantage of
            inter-instance redundancy.  The only redundancy that can be
            removed is the 'compile-time' redundancy which is pretty
            much limited to redundancy in representing the structure. 
            Only fixed redundancy in data can be removed, such as a
            restricted range: "integers between 0-120".
          o Delta encoding could remove a wide variety of redundancy in
            both structure and data in a very flexible way.
          o Inter-instance redundancy that is at a complex level, such
            as the difference between two frames of video, should not be
            solved by a general data format.
          o Deltas are appropriate any time that a parent
            document/object could be available and where changes in a
            format are represented by one or more linear ranges of
            bytes, bits, or similar that are deleted, appended,
            inserted, or replaced.
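
To make the last bullet concrete, here is a minimal sketch in Python of a 
delta expressed as linear byte ranges against a parent document.  The 
function names (make_delta, apply_delta), the op tuples, and the sample 
point-of-sale record are all made up for illustration, and using difflib 
to find the ranges is just a convenient stand-in, not a proposed format:

```python
from difflib import SequenceMatcher


def make_delta(parent: bytes, child: bytes):
    """Describe child as ranges copied from parent plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, parent, child, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged range: refer back to the parent instead of resending.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Changed or new range: the literal bytes must be sent.
            ops.append(("data", child[j1:j2]))
        # "delete": the parent range is simply never copied.
    return ops


def apply_delta(parent: bytes, ops) -> bytes:
    """Rebuild the child from the parent and the delta ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += parent[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


# Hypothetical records: two consecutive messages that differ in one field.
parent = b'<sale store="17" item="widget" qty="1"/>'
child = b'<sale store="17" item="widget" qty="3"/>'
delta = make_delta(parent, child)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
print(delta, literal_bytes, apply_delta(parent, delta) == child)
```

Only the literal bytes and a few range markers would need to travel; how 
compactly the range markers themselves are encoded is a separate question 
that this sketch does not address.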

("Tokenization" is the representation of structure as efficient tokens, 
probably with length indication.  "Dictionaries" are tables of names or 
values that can be represented by multiple instances of a dictionary ID 
reference.  "Schema-based factoring" means using a schema specification 
to encode data without completely explicit structure+naming and/or with 
knowledge of metadata of types that supports more efficient encoding.)
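
As a rough illustration of the first two definitions, the following Python 
sketch combines a name dictionary with length-prefixed tokens.  The layout 
(one byte of dictionary ID, one byte of length, then the value) and the 
names encode/decode are assumptions chosen for brevity, not any existing 
format:

```python
def encode(pairs):
    """Encode (name, value) pairs as dictionary-ID + length + value tokens."""
    names = []   # dictionary table; a name's ID is its index in this list
    ids = {}
    body = bytearray()
    for name, value in pairs:
        if name not in ids:
            ids[name] = len(names)
            names.append(name)
        data = value.encode("utf-8")
        body.append(ids[name])   # assumes fewer than 256 distinct names
        body.append(len(data))   # assumes values shorter than 256 bytes
        body += data
    return names, bytes(body)


def decode(names, body):
    """Invert encode(): walk the tokens and look names up in the table."""
    pairs = []
    pos = 0
    while pos < len(body):
        name_id, length = body[pos], body[pos + 1]
        pos += 2
        pairs.append((names[name_id], body[pos:pos + length].decode("utf-8")))
        pos += length
    return pairs


# Hypothetical repetitive records: each name is stored once in the table.
records = [("item", "widget"), ("qty", "1"), ("item", "gadget"), ("qty", "4")]
table, tokens = encode(records)
print(table, len(tokens), decode(table, tokens) == records)
```

The explicit lengths let a reader skip a value without scanning it, which 
is the point of "probably with length indication" above.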

In my view, there are levels to the use of an XML-like method for 
encoding data.  It is debatable how many of these should be in a 
base-layer specification and standard instead of possibly-narrower 
layered specifications.  For instance, topic-specific encoding of data 
such as image and video is best left to groups such as JPEG, MPEG, and 
others.  For simple arrays or grids of scalars, you could argue either 
that the problem is topic-specific or that it is a general problem with 
a small number of common solutions.  An example might be an option to 
code consecutive scalars as differences with a variable-sized 
representation, as is done for certain simple lossless audio formats.
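
A minimal Python sketch of that last option, coding consecutive integer 
scalars by difference with a variable-sized representation.  The specific 
choices here (zigzag mapping of signed differences, 7-bit groups with a 
continuation bit) are my assumptions for illustration; other variable-size 
schemes would serve equally well:

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z: int) -> int:
    return (z >> 1) if (z & 1) == 0 else -((z + 1) >> 1)


def encode_deltas(samples) -> bytes:
    """Store each sample as the difference from the previous one, as a varint."""
    out = bytearray()
    prev = 0
    for s in samples:
        z = zigzag(s - prev)
        prev = s
        while z >= 0x80:
            out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            z >>= 7
        out.append(z)                       # final group, continuation bit clear
    return bytes(out)


def decode_deltas(blob: bytes):
    samples, prev, z, shift = [], 0, 0, 0
    for byte in blob:
        z |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
            continue
        prev += unzigzag(z)
        samples.append(prev)
        z, shift = 0, 0
    return samples


# Hypothetical slowly varying signal: most differences fit in one byte.
signal = [1000, 1002, 1001, 1005, 900]
packed = encode_deltas(signal)
print(len(packed), decode_deltas(packed) == signal)
```

This only pays off when neighbouring values are correlated; for data that 
is effectively random, the differences are no smaller than the values.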

Structure / metadata / data conceptual representation levels:

    * Binary Structure: Envelope, preamble, overall structure of tags,
      nesting.  Efficiency probably by lengths, tokenizing,
      dictionaries.  Possible redundancy removal via use of schemas
      and/or deltas.
          o Data in default, character-based encoding a la XML 1.x.
    * Binary Data:
          o Data in binary scalar form in either single universal
            formats or in multiple alternatives (such as endianness)
            that are required to be handled by any endpoint ("reader
            makes right").
          o Data in special-case compressed or condensed form.
          o Data in application-specific private encoding, normally
            handled by a plugin of some kind.
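
For the "reader makes right" case, a small Python sketch: the writer emits 
scalars in whatever byte order it prefers and tags the block, and the 
reader converts as needed.  The one-byte "L"/"B" tag and the function names 
are hypothetical, chosen only to show the idea:

```python
import struct
import sys


def write_ints(values, byteorder=sys.byteorder) -> bytes:
    """Write 32-bit signed ints in the writer's chosen order, with a tag byte."""
    tag = b"L" if byteorder == "little" else b"B"
    prefix = "<" if byteorder == "little" else ">"
    return tag + struct.pack(f"{prefix}{len(values)}i", *values)


def read_ints(blob: bytes):
    """Reader makes right: inspect the tag and decode in the writer's order."""
    prefix = "<" if blob[:1] == b"L" else ">"
    count = (len(blob) - 1) // 4
    return list(struct.unpack(f"{prefix}{count}i", blob[1:]))


values = [1, -2, 300]
little = write_ints(values, "little")
big = write_ints(values, "big")
print(read_ints(little) == values, read_ints(big) == values, little != big)
```

The writer never pays a conversion cost, and the reader pays one only 
when the orders differ, at the price of every endpoint having to handle 
both alternatives.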

Structure and data optimization are mostly separate areas, although it 
makes sense to allow for necessary metadata for flexible data encoding 
in the structure.  Problems like choosing a small set of ways to 
efficiently encode a list of scalars are not appropriate to tackle while 
solving the structure problems, although they should be allowed for.

I call my personal approach "Efficiency Structured XML" (formerly 
"Binary Structured XML") because when I started with "binary XML" or 
"efficient XML", many people thought mainly about having binary data and 
the problems related to that.  I feel that the structure must be 
conquered first and foremost as that has a broader impact while also 
being more difficult to get right.  For a while, I was persuaded by 
XML-Dev discussions that binary data might not be worth it, but enough 
use cases have a strong need that I now support the idea.

In the above, I elided some detail about metadata to avoid further 
muddling the discussion.

sdw

Cutler, Roger (RogerCutler) wrote:

>At the strong risk of stating the patently obvious -- it seems to me
>that how much good deltas are going to do depends a LOT on the usage
>scenario.  Of those with which I am personally familiar, I think it
>might be a huge winner in the one where you have Point of Sale
>information coming to backoffice systems from a large number of
>retailers, and would do absolutely no good whatsoever with seismic data.
>Seems to me that identifying which usage scenarios deltas are likely to
>be useful, and of course the potential impact of those use cases, would
>be a good idea. 
>
>-----Original Message-----
>From: public-xml-binary-request@w3.org
>[mailto:public-xml-binary-request@w3.org] On Behalf Of Stephen D.
>Williams
>Sent: Thursday, November 18, 2004 10:43 AM
>To: Mike Champion
>Cc: Silvia.De.Castro.Garcia@esa.int; public-xml-binary@w3.org
>Subject: Re: question: Increasing factor for XML vs Binary
>
>
>One thing that is missing from a lot of these analyses is what could be
>saved by being able to do deltas.  In a situation where there is any
>kind of repetition such as protocol messages (in XMPP), records of some
>kind in a stream or file, or a request/response, the ability to send
>only what's different efficiently may use less CPU and be more efficient
>than even schema-based solutions.
>
>I plan to benchmark and demonstrate this kind of solution soon.  There
>is a way to use the idea of a delta in a way that is very schema-like,
>but isn't so firmly tied to a schema.  Use in a 'header compression' 
>style is even more powerful although it is somewhat more entangled in
>the semantics of the application.
>
>sdw
>
>Mike Champion wrote:
>
>  
>
>>Sigh most of that was lost somewhere ... I'm on a handheld ...
>>
>>I'll interpret this as 'how much of a compression factor can be
>>achieved by using a binary vs XML encoding of the same data.'  The usual
>>answer, I'm afraid: it depends.  As best I recall from a literature
>>survey:
>>
>>larger docs compress better than small,
>>
>>you can get more compression if you use more CPU (and hence battery) 
>>power,
>>
>>you can get very good compression if you assume that the schema is
>>known to both sides and docs are valid instances.
>>
>>My recollection is that 5:1 compression is realistic for arbitrary XML
>>and 10:1 and higher is feasible with shared schemas.
>>
>>-----Original Message-----
>>From:  Silvia.De.Castro.Garcia@esa.int
>>Date:  11/4/04 8:56 am
>>To:  public-xml-binary@w3.org
>>Subj:  question: Increasing factor for XML vs Binary
>>
>>Hi all,
>>       I would like to know the estimation order of the increasing 
>>factor for the XML format respect to the equivalent binary product, I 
>>mean, which is the order of the overload that will supose using XML 
>>instead of binary format?
>>
>>Thank you very much,
>>Best regards,
>>
>>Silvia de Castro.
>>
>
>
>--
>swilliams@hpti.com http://www.hpti.com Per: sdw@lig.net http://sdw.st
>Stephen D. Williams 703-724-0118W 703-995-0407Fax 20147-4622 AIM: sdw


-- 
swilliams@hpti.com http://www.hpti.com Per: sdw@lig.net http://sdw.st
Stephen D. Williams 703-724-0118W 703-995-0407Fax 20147-4622 AIM: sdw
Received on Friday, 19 November 2004 23:34:15 UTC