Re: TAG input to EXI WG on Efficient XML Interchange and EXI Measurements

Dear Henry, TAG, and everyone,

	Firstly, let me thank you, on behalf of the Efficient XML
Interchange Working Group, for your time and help during the TPAC
meeting, and for these comments. Your remarks have been very useful on
a number of fronts. In particular, we will be addressing the need for
both summarized and focused comparisons between EXI and alternatives,
cooperatively positioning EXI, and working out appropriate methods for
identifying EXI. We look forward very much to any help that can be
offered, both in oversight and in expertise.

With respect to point 1) "Making the Case for EXI"
==================================================
Yes, the measurement document is long and involved. It does not
contain a simple aggregation of the results to present the expected
performance of a future EXI format for broad groups of use cases; it
does not contain the simple "round-trip" statistic that many readers
might be looking for; the mobile use case was not measured directly
(though it can be inferred); and the summary and conclusions sections
are not as concise as they should be. Speaking for the editors, I can
list these without too much fear of retribution. The reason was simply
lack of time and resources. But the document does contain a great deal
of well-measured data, thanks to a good test framework and a broad
data set, from which compelling results for many use cases are
evident. It is difficult to be concise when every quoted figure must
be qualified by exactly what was being measured, but, for instance, it
is clear that decoding can be expected to be 5 times faster than Xals,
the high-performance processor against which the comparison was made
for processing efficiency [1]. That is an impressive statistic for
retrievals of static data. With specialized compression, a new format
could be expected to be close to twice as fast as Xals plus gzip; and
when compaction is not used, about 3 times as fast as Xals (184%) in
encoding and decoding combined [2]. Additionally, the figures above do
not take into account the emergent effect of compactness, and
therefore of transmission time, on interchange speed; as processor
speeds increase, compactness will increasingly dominate transaction
throughput. This is evidenced in our measurements: although the Xals
XML parser is generally 2-3 times faster than JAXP, it fails to exceed
JAXP in most cases over a 100 Mbps link, and in nearly all cases over
the 54 Mbps and 11 Mbps links.
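
To illustrate that last point, here is a minimal model of interchange
time as transmission plus decoding. Every number in it is
hypothetical, chosen only to show the trend; none is taken from the
measurements:

    # Illustrative only: interchange time = transmission + decode.
    # All figures are hypothetical, not measurement results.

    def interchange_ms(size_bytes, link_mbps, decode_mb_per_s):
        """Milliseconds to ship and decode one document."""
        tx_ms = size_bytes * 8 / (link_mbps * 1e6) * 1000.0
        decode_ms = size_bytes / (decode_mb_per_s * 1e6) * 1000.0
        return tx_ms + decode_ms

    DOC = 100_000  # bytes; a hypothetical XML document

    for cpu in (1, 2, 4, 8):  # stand-in for rising processor speeds
        # a fast parser on the full-size document, vs. a slower codec
        # on a document one quarter the size
        parser = interchange_ms(DOC, 11, 50 * cpu)
        compact = interchange_ms(DOC // 4, 11, 25 * cpu)
        print(f"cpu x{cpu}: parser {parser:5.1f} ms, compact {compact:5.1f} ms")

However fast the processor becomes, the full-size document never beats
its ~73 ms transmission floor on the 11 Mbps link, while the compact
document's floor is ~18 ms: the parser's speed advantage washes out
and compactness sets the limit.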

	Still, the tenor of the TAG's remarks is of course right, and we
will take these steps to address them in a new summary document:

	1) Make a simple comparison to GZIP, as suggested by the TAG.
	2) Add a round-trip measurement, or at least a summation of the
separate encoding and decoding network measurement results we already
have. Putting this in the context of a Web Service would add
systematics from the higher layers of the stack, so it should probably
be avoided.
	3) Add specific measurements of the Mobile use case (though not in
the framework). It would be good to accompany this with a per-use-case
report on the comparative benefit of binary formats, structured as in
the TAG comment.
	4) Add a bullet-point list of measurement findings, and the
consequent expected performance of an EXI format.

The above will be delivered individually to the TAG and XML Core, and  
to the public EXI mailing list, as they're completed, in addition to  
the summary paper.


TAG made the remark that:

> EXI is unlikely to be widely available with a W3C-approved REC  
> behind it for several years.

A number of EXI implementations are in development now, both inside
and outside the WG. One vendor has several implementations that are
spec-complete with respect to our second draft. Open EXI is
spec-complete with respect to our first draft, and is expected to be
public next month. It would also be reasonable to assume that some
development is going on that is not publicly acknowledged. Some of the
efforts we know of are far enough along that we believe they could be
finalized during the Last Call period (which is a requirement of our
Charter for exiting LC). Through the use of proxies and suitable
client libraries, we can easily see EXI-based solutions going live
within about 6 months of the publication of a Recommendation.


With respect to point 2) Positioning EXI going forward
======================================================
TAG makes the comment:

> The value of XML interoperability is enormous.  If EXI goes forward,
> it must do so in a way which does not compromise this.

The EXI working group agrees, absolutely.

Additionally, the TAG notes the evident need to communicate where EXI
might be considered useful, and where not:

> The WG needs
> to take every opportunity to get the message across that EXI is
> primarily aimed at a set of core use-cases, where one-off binary
> formats and/or lower-performance generic compression techniques are
> already in use.

The data also show the scope of applicability of an efficient encoding
format. From the regression tables, users of EXI in the Finance,
Military, Scientific, and Storage use cases could expect performance
improvements of a factor of 4 to 8 [3]. Others, like Broadcast and
Document, would not, and that should be made clear in positioning EXI.
Document size and the proportion of data to markup also predict
performance: low-data-density documents (except the very small ones)
can expect factor-of-3-to-7 increases in interchange performance. On
54 Mbps and 11 Mbps networks, if the documents are large (so fewer
than 100 transactions per second, TPS), a factor-of-10 performance
increase can be expected when specialized compression is used (though
the graphs show enormous variance around this center). On such
networks, as documents get so small that TPS is already above 100
using XML and gzip, EXI would give concomitantly less benefit.
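
For illustration only, those rules of thumb could be captured in a
small triage helper. The use-case names and factors paraphrase the
numbers above; the function itself and its thresholds are a sketch,
not part of the measurement framework:

    # A sketch only: triage helper encoding the rules of thumb above.
    # Use-case factors paraphrase the regression tables in [3];
    # everything else is hypothetical.

    HIGH_GAIN = {"finance", "military", "scientific", "storage"}  # ~4x-8x
    LOW_GAIN = {"broadcast", "document"}                          # little benefit

    def expected_gain(use_case, tps_with_xml_gzip=None):
        """Very rough expected interchange speedup from an EXI-like format."""
        if use_case.lower() in LOW_GAIN:
            return "little or no benefit"
        if tps_with_xml_gzip is not None and tps_with_xml_gzip > 100:
            return "documents already small and fast: diminishing benefit"
        if use_case.lower() in HIGH_GAIN:
            return "roughly 4x to 8x"
        return "roughly 3x to 7x for low-data-density documents (high variance)"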

Rules of thumb like those above should be added to the Best
Practices, and the new document to be prepared in response to the TAG
(see 4) above) should additionally include some hints to help
potential users of EXI determine whether they fit the limited profile
of those who might benefit.

It may be a pedantic point, because I'm not sure it was meant in an
absolute sense, but confining EXI to those "core use-cases where
one-off binary formats and/or lower-performance generic compression
techniques are already in use" is probably unachievable. This is
because *existing* users of "textual" XML are a self-selecting group:
XML has specifically selected out those cases where the encoding and
processing tools cannot accommodate the use case, even when the other
facets of XML would have made it attractive. Secondly, the existing
users of binary formats are those for whom the benefit was so great
that it overcame the high cost of developing the binary format
themselves, or of buying it from a bespoke vendor. It is therefore
reasonable to expect that if EXI lowers the cost of high performance,
it will become sufficiently attractive for some borderline existing
XML use-cases to adopt it, as alluded to in your comments above
regarding the mobile case. EXI would help those user communities for
which it is intended, and not the others.


With respect to public positioning of EXI, TAG cautioned that:

> _No_ aspect of the public presentation of EXI should
> suggest that generic XML tools will or ought to evolve to read or
> produce the EXI format.
>
> If the EXI WG agrees with this perspective on positioning, the TAG
> will be happy to assist in any way it can to promote it.  If the EXI
> WG does _not_ agree, further discussion is needed urgently to try to
> find a position which both parties can agree to.

I think the substance of this remark is not in dispute within the EXI
WG; it is just that a literal interpretation is problematic. It seems
more reasonable for the WG to acknowledge that if EXI becomes a
popular format, the maintainers of some tools that exist now will
probably be motivated to integrate support for it. So please do help
us decide on the best methods of EXI identification and integration,
and consequently how to promote it, continuing the discussion we
started so fruitfully at the Technical Plenary meeting.


Some Other Remarks
==================

Regarding the question of whether EXI's potential disruption to XML
is justified given the expectation of Moore's Law:
	Firstly, as alluded to by the TAG, bandwidth and other resources do
not follow Moore's Law. Therefore, as processor speeds increase,
bandwidth will become the Amdahl-like limiting factor, so in
interchange scenarios at least, it is compactness that will become
increasingly important for improving throughput.
	Furthermore, Moore's Law only characterizes the relationship between
time and processing speed (strictly, between time and the number of
transistors). That is, in economic terms it characterizes the observed
relationship between time and supply, not between time and demand. We
can be assured that the one commodity sure to outpace processing
speed, throughput, storage capacity, and power consumption is people's
demand for data.
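
To make the "Amdahl-like" point concrete, in our own notation (this
formula is illustrative, not from the measurements document): for a
document of size S sent over a link of bandwidth B and processed at
rate P, the interchange time is

    T(S) = \frac{S}{B} + \frac{S}{P}, \qquad
    \lim_{P \to \infty} T(S) = \frac{S}{B}

so no matter how fast processors become, interchange time is floored
by S/B, and only reducing S - that is, compactness - lowers the floor.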
	
Screamer gets close to the theoretical limit of a processor's ability
to process a bitstream. EXI gets close to the theoretical limit of a
bitstream's ability to describe information. The highest-performing
architecture would combine the two: processing-stack integration
together with an efficient encoding.

Take the point about confining EXI to the present scope of binary
formats, and the implicit assumption in the template the TAG gave for
answering propositions 1a and 1b: namely, that EXI addresses only
*existing* use cases, enabling an incremental expansion of the
possible uses of structured data. EXI additionally addresses the next
wave of possibility, especially if one considers EXI as a compact
Infoset data format: a Brazilian medical researcher downloads a PUBMED
file to their PDA before making a trip into the country to find active
compounds; a doctor making rounds is assisted by a future expert
system that can scan a binary-indexed RDF knowledge base; a physicist
downloads a week of data from an accelerator to a flash memory stick
and takes it home for analysis. The Open Grid Services Architecture
could include the data itself, and references to points within the
data, in its descriptions of distributed computing jobs, rather than
only the high-level descriptions of jobs used now. These are
unverified scenarios, but reasonably within the target range of the
combination of EXI and stack-integration technology like Screamer.

It is already a cliché that data is a bedrock economic resource. The
ability to ship and process it will only become more central to our
lives. The syntax and semantics of XML embody a near-ideal format for
symbolic data (as opposed to representational data like video, audio,
etc.), since they encode the ideas of hierarchy, metadata,
association, schema, and so on. It is reasonable to assume that these
"stones" are sufficient to describe the data of the majority of
problems well into the future.


Greg White,
for the EXI Working Group - an exceptional team having great success
at making something that could really touch many, many people.

[1] 9.1.3 Processing Efficiency Analysis Details, http://www.w3.org/TR/exi-measurements/#Ax-details-pe-summary
[2] 6.2.1 Processing Efficiency Summary, http://www.w3.org/TR/exi-measurements/#results-pe-summary
    When checking this, bear in mind that quoted figures use the
    nominal technology (XML.gz and Xals) as the baseline for
    percentage changes, and that percentages must be averaged
    geometrically.
[3] See the tabular results in 9.1.5.1, Network Processing Efficiency
Analysis Details, http://www.w3.org/TR/exi-measurements/#Ax-details-network-summary
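
As a minimal illustration of the geometric-averaging caveat under [2]
(the two ratios below are made up, not measurement data):

    # Per-document speedup ratios multiply, so their central tendency
    # is the geometric mean, not the arithmetic mean.
    import math

    ratios = [0.5, 2.0]                # one doc 2x slower, one 2x faster
    arith = sum(ratios) / len(ratios)  # 1.25 -- wrongly suggests a net gain
    geom = math.exp(sum(math.log(r) for r in ratios) / len(ratios))  # 1.0
    print(arith, geom)                 # the geometric mean is neutral, as it should be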

On Nov 29, 2007, at 11:34 AM, Henry S. Thompson wrote:

> Following on from the many useful discussions at group and individual
> level which took place during the recent TPAC week, the TAG has
> arrived at the following requests and recommendations for the
> Efficient XML Interchange Working Group:
>
> 1) Making the case for EXI
>
> On the face of it two propositions need to be confirmed in order for
> the proposed EXI Format to go ahead to REC:
>
>  1a) The proposed EXI format is the best available choice for the job;
>  1b) The performance of the proposed EXI format makes it worth going
>      ahead.
>
> The _Efficient XML Interchange Measurements Note_ has a huge amount of
> data in it, but it is _very_ difficult to digest.  On the face of it,
> it is difficult to determine the extent to which it confirms either
> (1a) or (1b).  A detailed set of comments on this document is provided
> at the end of this message.
>
> What is needed is a summary analysis of the measurement results which
> specifically addresses (1a) and (1b) above.  With respect to (1a),
> this would presumably have something like the following form:
>
> EXI is better than GZIP because, as shown in tables xxx and yyy, it
> produces comparable compression with zzz% less time to compress and
> ZZZ% less time to decompress and parse for documents in classes P, Q
> and R;
>
> EXI is better than XMill because. . .
>
> and so on.
>
> With respect to (1b), as the TAG suggested in its feedback [1] on the
> outputs of the XML Binary Characterization WG, "concrete targets
> should be set for the size and/or speed gains that would be needed to
> justify the disruption introduced by a new format".  It's too late to
> set those targets in advance -- what is required now is that a small
> number of key use cases be identified, the expected benefits of EXI
> for those use cases be quantified, and concrete evidence cited that
> those benefits are a) not liable to be seriously eroded by the onward
> rush of Moore's law and friends and b) sufficient to make adoption
> likely, bearing in mind that EXI is unlikely to be widely available
> with a W3C-approved REC behind it for several years.
>
> The use-case we have heard most commonly nominated to fill this role
> is delivery of data-intensive web-services to mobile devices, where
> computationally intensive processes place heavy demands on power
> supply, and battery technology is _not_ advancing at anything like
> Moore's law rates.  Documenting and quantifying this story, if true,
> would be a very good start.  Statements of requirements from potential
> deployers would be particularly useful, that is, getting a mobile
> device manufacturer to say "We will deploy service X on devices of
> class Y iff we can get Z amount of XML data off the wire and into
> memory in less than M milliseconds using an N MHz processor.  The best
> we project with current technologies is MM msec using an NN MHz
> processor", accompanied, of course, by a demonstration that EXI could
> actually meet that requirement.
>
> 2) Positioning EXI going forward
>
> The value of XML interoperability is enormous.  If EXI goes forward,
> it must do so in a way which does not compromise this.  The WG needs
> to take every opportunity to get the message across that EXI is
> primarily aimed at a set of core use-cases, where one-off binary
> formats and/or lower-performance generic compression techniques are
> already in use.  _No_ aspect of the public presentation of EXI should
> suggest that generic XML tools will or ought to evolve to read or
> produce the EXI format.
>
> If the EXI WG agrees with this perspective on positioning, the TAG
> will be happy to assist in any way it can to promote it.  If the EXI
> WG does _not_ agree, further discussion is needed urgently to try to
> find a position which both parties can agree to.
>
> -------------
>
> Detailed comments on the Measurement documents
>
> The 2005 TAG message called for testing to involve "best possible
> text-based XML 1.x implementations", and the XBC's call for
> implementations [2] specifically called for only XML parsers, not for
> the surrounding application or middleware stacks, JDKs or Java Virtual
> Machines.  The benchmarks have not been against the best possible
> text-based XML 1.x implementations.
>
> The measurements document acknowledges the issue of stack integration
> in "Stack integration considers the full XML processing system, not
> just the parser. By selectively combining the components of the
> processing stack through abstract APIs, the system can directly
> produce application data from the bytes that were read. Two prominent
> examples of this technique are [Screamer] [3] and [EngelenGSOAP]
> [4]. Both of these can also be called schema-derived as they compile
> a schema into code. However, neither simply generates a generic
> parser, but rather a full stack for converting between application
> data and serialized XML. This gives a significant improvement compared
> to just applying the pre-compilation to the parsing layer."  But
> neither of these prominent examples appears in the test data.
>
> Further, there were no "real-world end to end" use cases tested, such
> as a Web service application, a mobile application, etc.  Thus we do
> not know the overall effect of any particular technology on overall
> application performance.
>
> The measurements document states "To begin with, the XBC
> Characterization Measurement Methodologies Note [5] defines thresholds
> for whether a candidate format achieves sufficient compactness " [5].
> But in the XBC Characterization Measurement Methodologies Note [5]
> itself we find no thresholds: "Because XML documents exist with a
> wide variety of sizes, structures, schemas, and regularity, it is not
> possible to define a single size threshold or percentage compactness
> that an XML format must achieve to be considered sufficiently compact
> for a general purpose W3C standard."
>
> We attempted to determine the differences between Efficient XML and
> Gzip but found the methodology confusing.  The measurements document
> specifies that "In the Document [...] and Both [...]  classes,
> candidates are compared against gzipped XML, while in the Neither
> [...] and Schema [...]  cases, the comparison was to plain XML".
> Examining Document and Both compactness graphs, Gzip appears to offer
> improvements over XML that track the other implementations, with the
> noteworthy point that Efficient XML's improvements over Gzip are
> significant in a significant part of the Both chart but similar in the
> Document.  Examining Processing Efficiency graphs, it appears as
> though XML is clearly superior in Java Encoding in Document and Both.
> GZip appears further inferior, yet all solutions vary wildly around
> XML in Decoding Document and Both.  A worrying statement is "An
> interesting point to note in the decoding results is the shapes of the
> graphs for each individual candidate. Namely, these appear similar to
> each other, containing similar peaks and troughs. Even more
> interestingly, this is also the case with Xals, indicating that there
> is some feature of the JAXP parser that is implemented suboptimally
> and triggered by a subset of the test documents."  The measurements
> document states "For instance, preliminary measurements in the EXI
> framework indicate that the default parser shipped with Java improved
> noticeably from version 5.0 to version 6, showing 2-3-fold improvement
> for some cases.", and the measurements used JDK 1.5.0_05-b05 for Java
> based parsing and JDK 1.6.0_02-ea-b01 for native.  Perhaps an improved
> JDK, Java Virtual Machine, or virtualized JVM would further improve
> results.  This leads us to wonder whether a combination of GZip with
> improved technologies such as Parsers, JDKs, VMs, or even Stack
> Integration technology (that is Schema aware and hence covered under
> Both and Schema) would suffice for the community.
>
> Examining the data sets used, there are a number of military
> applications (ASMTF, AVCL, JTLM) and yet comparatively few generic
> "Web service" applications.  The Google test suite lists Web services
> for small devices and Web services routing; the Invoice test suite
> lists Intra/InterBusiness Communication which immediately limits its
> scope to "A large business communicates via XML with a number of
> remote businesses, some of which can be small business partners. These
> remote or small businesses often have access only to slow transmission
> lines and have limited hardware and technical expertise."; and there
> is a WSDL test suite.  This seems to avoid the "common" Web service
> case of the many Web APIs provided by hosted solutions like Google,
> Yahoo, Amazon, eBay, Salesforce, Facebook, MSN, etc.  Examining the
> test data shows that the Google test cases used 5 different test cases
> (0,7,15, 24,30) which includes 1 soap fault (case #24).  There are 2
> AVCL, 5 Invoice, 8 Location Sightings, 6 JTLM, 5 ASMTF, 2 WSDL test
> cases as well.  There appears to be broad based coverage of each,
> though the rationale for the various weightings isn't documented.
> For example, why 4 Google "success cases" and 2 WSDL cases?  Surely
> there are more than 2 times as many SOAP messages as WSDL messages
> being sent around the internet.
>
> Henry S. Thompson
> on behalf of the TAG
>
> [1] http://lists.w3.org/Archives/Public/public-xml-binary/2005May/0000.html
> [2] http://lists.w3.org/Archives/Public/public-exi/2006Mar/0004.html
> [3] http://www.w3.org/TR/2007/WD-exi-measurements-20070725/#ref-screamer
> [4] http://www.w3.org/TR/2007/WD-exi-measurements-20070725/#ref-Engelen-gsoap
> [5] http://www.w3.org/TR/xbc-measurement/
> --
> Henry S. Thompson, HCRC Language Technology Group, University of  
> Edinburgh
>                     Half-time member of W3C Team
>    2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
>            Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
>                   URL: http://www.ltg.ed.ac.uk/~ht/
> [mail really from me _always_ has this .sig -- mail without it is  
> forged spam]
