- From: Stephen D. Williams <sdw@lig.net>
- Date: Fri, 19 Nov 2004 18:32:20 -0500
- To: "Cutler, Roger (RogerCutler)" <RogerCutler@chevrontexaco.com>
- Cc: Mike Champion <mc@xegesis.org>, Silvia.De.Castro.Garcia@esa.int, public-xml-binary@w3.org
- Message-ID: <419E8284.6070108@lig.net>
One person's metadata is another person's data. One person's obvious
is another's revelation. Logic is a system whereby one may go wrong
with confidence.

I'll add to the possibly-obvious thread by rambling a bit on the
representation of structure, metadata, and data. Please correct any
mistakes you see:

A few abstract scenarios may cover the level of detail needed to
specify when deltas are appropriate:

* Intra-instance redundancy
    o In the structure: In cases where there is a lot of redundancy or
      verbosity in the structure, tokenization, dictionaries,
      schema-based factoring, and deltas can make a big difference.
    o Obvious in the data: When there is a lot of redundancy in the
      data but little structural overhead, tokenization and
      schema-based factoring don't help much (although typed binary
      data, especially range-restricted, might) while dictionaries and
      delta coding can make a giant difference.
    o Hidden in the data: When the structure isn't significantly
      costly (i.e., few tags, large blocks of data) and the data is too
      random to delta, neither will help. In this case you need a
      better encoding of the scalar data.
* Inter-instance redundancy
    o Pure schema-based factoring does not take advantage of
      inter-instance redundancy. The only redundancy that can be
      removed is the 'compile-time' redundancy, which is pretty much
      limited to redundancy in representing the structure. Only fixed
      redundancy in data can be removed, such as a restricted range:
      "integers between 0-120".
    o Delta encoding could remove a wide variety of redundancy in both
      structure and data in a very flexible way.
    o Inter-instance redundancy that is at a complex level, such as
      the difference between two frames of video, should not be solved
      by a general data format.
    o Deltas are appropriate any time that a parent document/object
      could be available and where changes in a format are represented
      by one or more linear ranges of bytes, bits, or similar that are
      deleted, appended, inserted, or replaced.
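
The last bullet above, a delta as a set of linear byte ranges that are deleted, inserted, or replaced relative to a parent, can be sketched roughly as follows. This is only an illustration and not part of the original message: the names `make_delta` and `apply_delta` are hypothetical, the sample record is invented, and Python's standard `difflib.SequenceMatcher` stands in for whatever differencing a real format would use.

```python
from difflib import SequenceMatcher


def make_delta(parent: bytes, child: bytes):
    """Return edits (start, end, replacement) expressed against parent.

    Hypothetical helper: each edit says "replace parent[start:end] with
    these bytes". A deletion has an empty replacement, an insertion has
    start == end, and an append is an insertion at len(parent).
    """
    matcher = SequenceMatcher(None, parent, child, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # unchanged ranges cost nothing in the delta
            delta.append((i1, i2, child[j1:j2]))
    return delta


def apply_delta(parent: bytes, delta) -> bytes:
    """Rebuild the child from the parent plus the ordered range edits."""
    out = bytearray()
    pos = 0
    for start, end, replacement in delta:
        out += parent[pos:start]   # copy the untouched range
        out += replacement         # then the new bytes (may be empty)
        pos = end                  # skip the deleted/replaced range
    out += parent[pos:]
    return bytes(out)


# Invented sample data: two near-identical records in a stream.
parent = b'<sale store="17" item="4411" qty="1" price="2.99"/>'
child = b'<sale store="17" item="4412" qty="3" price="2.99"/>'
delta = make_delta(parent, child)
assert apply_delta(parent, delta) == child
```

Only the replacement bytes and their offsets would need to be sent when the receiver already holds the parent, which is where the saving for repetitive records comes from.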
("Tokenization" is the representation of structure as efficient
tokens, probably with length indication. "Dictionaries" are tables of
names or values that can be represented by multiple instances of a
dictionary ID reference. "Schema-based factoring" means using a schema
specification to encode data without completely explicit
structure+naming and/or with knowledge of metadata of types that
supports more efficient encoding.)

In my view, there are levels to the use of an XML-like method for
encoding data. It is debatable how many of these should be in a
base-layer specification and standard instead of possibly-narrower
layered specifications. For instance, topic-specific encoding of data
such as image and video is best left to groups such as JPEG, MPEG, and
others. For simple arrays or grids of scalars, you could make a case
for it being topic-specific or that it is a general problem that has a
small number of common solutions. An example might be to have an
option to code consecutive scalars by difference with variable sized
representation, as is done for certain simple lossless audio formats.

Structure / metadata / data conceptual representation levels:

* Binary Structure: Envelope, preamble, overall structure of tags,
  nesting. Efficiency probably by lengths, tokenizing, dictionaries.
  Possible redundancy removal via use of schemas and/or deltas.
    o Data in default, character-based encoding a la XML 1.x.
* Binary Data:
    o Data in binary scalar form in either single universal formats or
      in multiple alternatives (such as endianness) that are required
      to be handled by any endpoint ("reader makes right").
    o Data in special-case compressed or condensed form.
    o Data in application-specific private encoding, normally handled
      by a plugin of some kind.

Structure and data optimization are mostly separate areas, although it
makes sense to allow for necessary metadata for flexible data encoding
in the structure.
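
The option mentioned above, coding consecutive scalars by difference with a variable-sized representation, might look something like the sketch below. It is not from the original message: the function names are hypothetical, and the specific choices (zigzag mapping of signed differences, 7-bit variable-length integers) are assumptions picked for illustration rather than the scheme of any particular audio format.

```python
def zigzag(n: int) -> int:
    """Map signed ints to unsigned so small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag()."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def encode_varint(u: int, out: bytearray) -> None:
    """Append u as 7-bit groups, low bits first; high bit means 'more'."""
    while u >= 0x80:
        out.append((u & 0x7F) | 0x80)
        u >>= 7
    out.append(u)


def encode_samples(samples) -> bytes:
    """Encode each integer as the difference from its predecessor."""
    out = bytearray()
    prev = 0
    for s in samples:
        encode_varint(zigzag(s - prev), out)
        prev = s
    return bytes(out)


def decode_samples(data: bytes):
    """Reverse encode_samples(): read varints, then undo the differences."""
    samples = []
    prev = 0
    u = 0
    shift = 0
    for byte in data:
        u |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit set: more groups follow
            shift += 7
        else:                    # last group of this value
            prev += unzigzag(u)
            samples.append(prev)
            u = 0
            shift = 0
    return samples


# Invented sample data: slowly varying values with one jump.
samples = [1000, 1002, 1001, 1001, 990, 1500, -3]
assert decode_samples(encode_samples(samples)) == samples
```

Slowly varying values yield small differences that fit in one byte each, while occasional large jumps simply take more bytes.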
Problems like choosing a small set of ways to efficiently encode a
list of scalars are not appropriate when solving the structure
problems, although they should be allowed for.

I call my personal approach "Efficiency Structured XML" (formerly
"Binary Structured XML") because when I started with "binary XML" or
"efficient XML", many people thought mainly about having binary data
and the problems related to that. I feel that the structure must be
conquered first and foremost, as that has a broader impact while also
being more difficult to get right. For a while, I was persuaded by
XML-Dev discussions that binary data might not be worth it, but enough
use cases have a strong need that I now support the idea.

In the above, I elided some detail about metadata to avoid further
muddling the discussion.

sdw

Cutler, Roger (RogerCutler) wrote:

>At the strong risk of stating the patently obvious -- it seems to me
>that how much good deltas are going to do depends a LOT on the usage
>scenario. Of those with which I am personally familiar, I think it
>might be a huge winner in the one where you have Point of Sale
>information coming to backoffice systems from a large number of
>retailers, and would do absolutely no good whatsoever with seismic data.
>Seems to me that identifying which usage scenarios deltas are likely to
>be useful, and of course the potential impact of those use cases, would
>be a good idea.
>
>-----Original Message-----
>From: public-xml-binary-request@w3.org
>[mailto:public-xml-binary-request@w3.org] On Behalf Of Stephen D.
>Williams
>Sent: Thursday, November 18, 2004 10:43 AM
>To: Mike Champion
>Cc: Silvia.De.Castro.Garcia@esa.int; public-xml-binary@w3.org
>Subject: Re: question: Increasing factor for XML vs Binary
>
>
>One thing that is missing from a lot of these analyses is what could be
>saved by being able to do deltas. In a situation where there is any
>kind of repetition such as protocol messages (in XMPP), records of some
>kind in a stream or file, or a request/response, the ability to send
>only what's different efficiently may use less CPU and be more efficient
>than even schema-based solutions.
>
>I plan to benchmark and demonstrate this kind of solution soon. There
>is a way to use the idea of a delta in a way that is very schema-like,
>but isn't so firmly tied to a schema. Use in a 'header compression'
>style is even more powerful although it is somewhat more entangled in
>the semantics of the application.
>
>sdw
>
>Mike Champion wrote:
>
>>Sigh most of that was lost somewhere ... I'm on a handheld ...
>>
>>I'll interpret this as 'how much of a compression factor can be
>>achieved by using a binary vs XML encoding of the same data.' The usual
>>answer, I'm afraid: it depends. As best I recall from a literature
>>survey:
>>
>>larger docs compress better than small,
>>
>>you can get more compression if you use more CPU (and hence battery)
>>power,
>>
>>you can get very good compression if you assume that the schema is
>>known to both sides and docs are valid instances.
>>
>>My recollection is that 5:1 compression is realistic for arbitrary XML
>>and 10:1 and higher is feasible with shared schemas.
>>
>>-----Original Message-----
>>From: Silvia.De.Castro.Garcia@esa.int
>>Date: 11/4/04 8:56 am
>>To: public-xml-binary@w3.org
>>Subj: question: Increasing factor for XML vs Binary
>>
>>Hi all,
>>     I would like to know the estimation order of the increasing
>>factor for the XML format respect to the equivalent binary product, I
>>mean, which is the order of the overload that will suppose using XML
>>instead of binary format?
>>
>>Thank you very much,
>>Best regards,
>>
>>Silvia de Castro.
>
>--
>swilliams@hpti.com http://www.hpti.com Per: sdw@lig.net http://sdw.st
>Stephen D. Williams 703-724-0118W 703-995-0407Fax 20147-4622 AIM: sdw

--
swilliams@hpti.com http://www.hpti.com Per: sdw@lig.net http://sdw.st
Stephen D. Williams 703-724-0118W 703-995-0407Fax 20147-4622 AIM: sdw
Received on Friday, 19 November 2004 23:34:15 UTC