- From: Stephen D. Williams <sdw@lig.net>
- Date: Fri, 19 Nov 2004 18:32:20 -0500
- To: "Cutler, Roger (RogerCutler)" <RogerCutler@chevrontexaco.com>
- Cc: Mike Champion <mc@xegesis.org>, Silvia.De.Castro.Garcia@esa.int, public-xml-binary@w3.org
- Message-ID: <419E8284.6070108@lig.net>
One person's metadata is another person's data.
One person's obvious is another's revelation.
Logic is a system whereby one may go wrong with confidence.
I'll add to the possibly-obvious thread by rambling a bit on the
representation of structure, metadata, and data. Please correct any
mistakes you see:
A few abstract scenarios may cover the level of detail needed to specify
when deltas are appropriate:
  * Intra-instance redundancy
      o In the structure: In cases where there is a lot of
        redundancy or verbosity in the structure, tokenization,
        dictionaries, schema-based factoring, and deltas can make a
        big difference.
      o Obvious in the data: When there is a lot of redundancy in
        the data but little structural overhead, tokenization and
        schema-based factoring don't help much (although typed
        binary data, especially range-restricted, might) while
        dictionaries and delta coding can make a giant difference.
      o Hidden in the data: When the structure isn't significantly
        costly (i.e. few tags, large blocks of data) and the data is
        too random to delta, neither will help. In this case you
        need a better encoding of the scalar data.
  * Inter-instance redundancy
      o Pure schema-based factoring does not take advantage of
        inter-instance redundancy. The only redundancy that can be
        removed is the 'compile-time' redundancy which is pretty
        much limited to redundancy in representing the structure.
        Only fixed redundancy in data can be removed, such as a
        restricted range: "integers between 0-120".
      o Delta encoding could remove a wide variety of redundancy in
        both structure and data in a very flexible way.
      o Inter-instance redundancy that is at a complex level, such
        as the difference between two frames of video, should not be
        solved by a general data format.
      o Deltas are appropriate any time that a parent
        document/object could be available and where changes in a
        format are represented by one or more linear ranges of
        bytes, bits, or similar that are deleted, appended,
        inserted, or replaced.
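As an illustration of the last point, here is a minimal sketch of a
delta expressed as linear byte ranges that are deleted, inserted, or
replaced relative to a parent document. It assumes Python and its
standard difflib module; the names make_delta and apply_delta and the
sample "sale" records are hypothetical, chosen only for this example,
and are not part of any format discussed in this thread.

```python
from difflib import SequenceMatcher

def make_delta(parent: bytes, child: bytes):
    """Return a list of (start, end, replacement) edits against parent."""
    # autojunk=False stops the matcher from ignoring frequently seen bytes.
    matcher = SequenceMatcher(None, parent, child, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # delete: replacement is empty; insert: i1 == i2; replace: both.
            delta.append((i1, i2, child[j1:j2]))
    return delta

def apply_delta(parent: bytes, delta) -> bytes:
    """Rebuild the child from the parent plus the recorded edits."""
    out = bytearray()
    pos = 0
    for start, end, replacement in delta:
        out += parent[pos:start]   # copy the unchanged range
        out += replacement         # then the new bytes (may be empty)
        pos = end                  # skip the parent bytes that were replaced
    out += parent[pos:]            # copy the unchanged tail
    return bytes(out)

# Hypothetical point-of-sale records that differ only in the quantity.
parent = b'<sale store="17" item="milk" qty="1" price="2.49"/>'
child = b'<sale store="17" item="milk" qty="3" price="2.49"/>'
delta = make_delta(parent, child)
assert apply_delta(parent, delta) == child
```

Only the delta needs to be sent when both sides already hold the
parent. How much that saves depends on how similar the instances are,
which is the point of the scenarios above.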
("Tokenization" is the representation of structure as efficient tokens,
probably with length indication. "Dictionaries" are tables of names or
values that can be represented by multiple instances of a dictionary ID
reference. "Schema-based factoring" means using a schema specification
to encode data without completely explicit structure+naming and/or with
knowledge of metadata of types that supports more efficient encoding.)
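To make the "dictionary" definition concrete, here is a minimal sketch,
assuming Python, in which each repeated name is stored once in a table
and every later occurrence becomes a small integer ID. The function
names and the sample element names are hypothetical.

```python
def encode_names(names):
    """Replace each name with the ID of its entry in a dictionary table."""
    dictionary = []   # ID -> name, in order of first appearance
    ids = {}          # name -> ID
    tokens = []
    for name in names:
        if name not in ids:
            ids[name] = len(dictionary)
            dictionary.append(name)
        tokens.append(ids[name])
    return dictionary, tokens

def decode_names(dictionary, tokens):
    """Look each ID up in the table to recover the original names."""
    return [dictionary[t] for t in tokens]

# Hypothetical element names as they might repeat in a stream of records.
names = ["sale", "item", "qty", "item", "qty", "sale", "item", "qty"]
dictionary, tokens = encode_names(names)
# dictionary == ["sale", "item", "qty"]
# tokens == [0, 1, 2, 1, 2, 0, 1, 2]
assert decode_names(dictionary, tokens) == names
```

The longer and more repetitive the names, the more a table of this kind
saves; for names that occur only once it adds overhead.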
In my view, there are levels to the use of an XML-like method for
encoding data. It is debatable how many of these should be in a
base-layer specification and standard instead of possibly-narrower
layered specifications. For instance, topic-specific encoding of data
such as image and video is best left to groups such as JPEG, MPEG, and
others. For simple arrays or grids of scalars, you could argue either
that the encoding is topic-specific or that it is a general problem with
a small number of common solutions. An example might be an option to
code consecutive scalars as differences with a variable-sized
representation, as is done for certain simple lossless audio formats.
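A minimal sketch of that option, assuming Python: each scalar is stored
as the difference from its predecessor, and each difference is written
in a variable number of bytes. The zigzag mapping and the 7-bits-per-byte
layout are assumptions borrowed from common practice, not something this
message specifies, and all function names are hypothetical.

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return 2 * n if n >= 0 else -2 * n - 1

def unzigzag(z: int) -> int:
    """Inverse of zigzag."""
    return z // 2 if z % 2 == 0 else -((z + 1) // 2)

def write_varint(z: int, out: bytearray) -> None:
    """Emit 7 bits per byte, with the high bit set on every byte but the last."""
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)

def encode_samples(samples) -> bytes:
    """Code each sample as a variable-sized difference from the previous one."""
    out = bytearray()
    previous = 0
    for s in samples:
        write_varint(zigzag(s - previous), out)
        previous = s
    return bytes(out)

def decode_samples(data: bytes):
    """Rebuild the samples by accumulating the decoded differences."""
    samples, previous, z, shift = [], 0, 0, 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7               # more bytes follow for this value
        else:
            previous += unzigzag(z)  # value complete: apply the difference
            samples.append(previous)
            z, shift = 0, 0
    return samples

samples = [1000, 1002, 1001, 1001, 998, 1500]
encoded = encode_samples(samples)
assert decode_samples(encoded) == samples
```

For these six slowly varying samples the encoding takes 8 bytes, against
24 bytes as fixed 32-bit integers. Data that jumps around randomly gains
little or nothing from this.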
Structure / metadata / data conceptual representation levels:
  * Binary Structure: Envelope, preamble, overall structure of tags,
    nesting. Efficiency probably by lengths, tokenizing,
    dictionaries. Possible redundancy removal via use of schemas
    and/or deltas.
      o Data in default, character-based encoding a la XML 1.x.
  * Binary Data:
      o Data in binary scalar form in either single universal
        formats or in multiple alternatives (such as endianness)
        that are required to be handled by any endpoint ("reader
        makes right").
      o Data in special-case compressed or condensed form.
      o Data in application-specific private encoding, normally
        handled by a plugin of some kind.
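The "reader makes right" idea can be sketched as follows, assuming
Python's standard struct module: the writer emits scalars in whichever
byte order suits it and labels the block, and every reader must be able
to convert. The one-byte "L"/"B" flag and the function names are
hypothetical, invented for this example.

```python
import struct

def write_floats(values, byte_order: str) -> bytes:
    """Writer picks a byte order and records it in a one-byte flag."""
    if byte_order not in ("<", ">"):   # "<" little-endian, ">" big-endian
        raise ValueError("byte_order must be '<' or '>'")
    flag = b"L" if byte_order == "<" else b"B"
    return flag + struct.pack(f"{byte_order}{len(values)}d", *values)

def read_floats(data: bytes):
    """Reader checks the flag and decodes accordingly, whatever its own order."""
    byte_order = "<" if data[:1] == b"L" else ">"
    count = (len(data) - 1) // 8       # each double occupies 8 bytes
    return list(struct.unpack(f"{byte_order}{count}d", data[1:]))

values = [1.5, -2.25, 1e10]
little = write_floats(values, "<")
big = write_floats(values, ">")
assert little != big                   # different bytes on the wire
assert read_floats(little) == values   # same values once read
assert read_floats(big) == values
```

The cost of conversion falls on the reader, and only when the two ends
actually disagree; the alternative, a single universal format, moves
that cost to whichever side does not natively match it.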
Structure and data optimization are mostly separate areas, although it
makes sense to allow for necessary metadata for flexible data encoding
in the structure. Problems like choosing a small set of ways to
efficiently encode a list of scalars should not be tackled as part of
the structure problems, although the structure should allow for them.
I call my personal approach "Efficiency Structured XML" (formerly
"Binary Structured XML") because when I started with "binary XML" or
"efficient XML", many people thought mainly about having binary data and
the problems related to that. I feel that the structure must be
conquered first and foremost as that has a broader impact while also
being more difficult to get right. For a while, I was persuaded by
XML-Dev discussions that binary data might not be worth it, but enough
use cases have a strong need that I now support the idea.
In the above, I elided some detail about metadata to avoid further
muddling the discussion.
sdw
Cutler, Roger (RogerCutler) wrote:
>At the strong risk of stating the patently obvious -- it seems to me
>that how much good deltas are going to do depends a LOT on the usage
>scenario. Of those with which I am personally familiar, I think it
>might be a huge winner in the one where you have Point of Sale
>information coming to backoffice systems from a large number of
>retailers, and would do absolutely no good whatsoever with seismic data.
>Seems to me that identifying the usage scenarios in which deltas are
>likely to be useful, and of course the potential impact of those use
>cases, would be a good idea.
>
>-----Original Message-----
>From: public-xml-binary-request@w3.org
>[mailto:public-xml-binary-request@w3.org] On Behalf Of Stephen D.
>Williams
>Sent: Thursday, November 18, 2004 10:43 AM
>To: Mike Champion
>Cc: Silvia.De.Castro.Garcia@esa.int; public-xml-binary@w3.org
>Subject: Re: question: Increasing factor for XML vs Binary
>
>
>One thing that is missing from a lot of these analyses is what could be
>saved by being able to do deltas. In a situation where there is any
>kind of repetition such as protocol messages (in XMPP), records of some
>kind in a stream or file, or a request/response, the ability to send
>only what's different efficiently may use less CPU and be more efficient
>than even schema-based solutions.
>
>I plan to benchmark and demonstrate this kind of solution soon. There
>is a way to use the idea of a delta in a way that is very schema-like,
>but isn't so firmly tied to a schema. Use in a 'header compression'
>style is even more powerful although it is somewhat more entangled in
>the semantics of the application.
>
>sdw
>
>Mike Champion wrote:
>
>>Sigh most of that was lost somewhere ... I'm on a handheld ...
>>
>>I'll interpret this as 'how much of a compression factor can be
>>achieved by using a binary vs XML encoding of the same data.' The usual
>>answer, I'm afraid: it depends. As best I recall from a literature
>>survey:
>>
>>larger docs compress better than small,
>>
>>you can get more compression if you use more CPU (and hence battery)
>>power,
>>
>>you can get very good compression if you assume that the schema is
>>known to both sides and docs are valid instances.
>>
>>My recollection is that 5:1 compression is realistic for arbitrary XML
>>and 10:1 and higher is feasible with shared schemas.
>>
>>-----Original Message-----
>>From: Silvia.De.Castro.Garcia@esa.int
>>Date: 11/4/04 8:56 am
>>To: public-xml-binary@w3.org
>>Subj: question: Increasing factor for XML vs Binary
>>
>>Hi all,
>> I would like to know the order of magnitude of the size increase
>>of the XML format with respect to the equivalent binary product; I
>>mean, what order of overhead does using XML instead of a binary
>>format entail?
>>
>>Thank you very much,
>>Best regards,
>>
>>Silvia de Castro.
>
>--
>swilliams@hpti.com http://www.hpti.com Per: sdw@lig.net http://sdw.st
>Stephen D. Williams 703-724-0118W 703-995-0407Fax 20147-4622 AIM: sdw
Received on Friday, 19 November 2004 23:34:15 UTC