- From: Greg White <gwhite@stanford.edu>
- Date: Thu, 20 Dec 2007 00:05:43 -0800
- To: www-tag@w3.org
- Cc: public-exi@w3.org, EXI EXI <member-exi-wg@w3.org>
- Message-Id: <D538C783-AEED-4C86-B793-688C7281DEAF@stanford.edu>
Dear Henry, TAG, and everyone,

Firstly, let me thank you, on behalf of the Efficient XML Interchange working group, for your time and help during the TPAC meeting, and for these comments. Your remarks have been very useful on a number of fronts. In particular, we'll be addressing the need for both summarized and focused comparisons between EXI and alternatives, cooperatively positioning EXI, and working out appropriate methods for identifying EXI. We look forward very much to any help that can be offered, both in supervision and expertise.

With respect to point 1) "Making the Case for EXI"
==================================================

Yes, the measurement document is long and involved. It does not contain a simple aggregation of the results in order to present the expected performance of a future EXI format for broad groups of use cases, it does not contain the simple "round-trip" statistic that many readers might be looking for, the mobile use case was not measured (though it can be inferred), and the summary and conclusions sections are not as concise as they should be. Speaking for the editors, I can list these without too much fear of retribution. The reason was simply lack of time and resources.

But it does contain a lot of well-measured data, thanks to a good test framework and a broad data set, from which compelling results for many use cases are evident. It's difficult to be concise when all quoted figures must be qualified by what exactly was being measured, but for instance, it's clear that decoding can be expected to be 5 times faster than XALS, the high performance processor with which comparison was made for processing efficiency [1]. That is an impressive statistic for retrievals of static data. With specialized compression, a new format could be expected to be close to twice as fast as XALS plus gzip; and when compaction is not used, about 3 times as fast in encoding and decoding combined as XALS [2] (184%).
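As a rough aid to reading figures like these, the sketch below shows one way a percentage improvement over a baseline might be converted to a speed multiple, and how such ratios can be averaged geometrically, as footnote [2] says is necessary. It assumes that "184%" means 184% faster than the baseline; the function names are illustrative and do not come from the measurements note.

```python
import math


def percent_to_factor(percent_improvement):
    """Convert 'N% faster than baseline' into a multiple of baseline speed.

    Assumption: the percentage is measured relative to the baseline,
    so 0% means equal speed and 100% means twice as fast.
    """
    return 1.0 + percent_improvement / 100.0


def geometric_mean(factors):
    """Average ratios multiplicatively: exp of the mean of the logs."""
    return math.exp(sum(math.log(f) for f in factors) / len(factors))


# Under the stated assumption, 184% is a factor of about 2.84,
# consistent with "about 3 times as fast".
print(round(percent_to_factor(184), 2))

# Two hypothetical per-document speed-ups, averaged geometrically.
print(round(geometric_mean([2.0, 8.0]), 2))
```

The geometric mean is used because the quantities being averaged are ratios; an arithmetic mean of the same two hypothetical values would give 5 rather than 4 and overstate the typical gain.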
Additionally, the above do not take into account the emergent effect of compactness, and therefore transmission time, on interchange speed; as processor speeds increase, compaction will increasingly dominate transaction throughput. This is evidenced in our measurements, where although the XALS XML parser is generally 2-3 times faster than JAXP, it fails to exceed JAXP in most cases over a 100 Mbps link, and in nearly all cases over the 54 Mbps and 11 Mbps links.

Still, the tenor of the TAG's remarks is of course right, and we will take these steps to address them in a new summary document:

1) Make a simple comparison to GZIP, as suggested by TAG.

2) Add a round-trip measurement, or at least a summation of the separate encoding and decoding network measurement results we already have. Putting this in the context of a Web Service would add systematics from the higher layers of the stack, so should probably be avoided.

3) Add specific measurements of the Mobile use case (though not in the framework). It is a good idea to accompany this with a use-case-specific report on the comparative benefit of binary formats, as structured by the TAG comment.

4) A bullet point list of measurement findings, and consequent expected performance of an EXI format.

The above will be delivered individually to the TAG and XML Core, and to the public EXI mailing list, as they're completed, in addition to the summary paper.

TAG made the remark that:

> EXI is unlikely to be widely available with a W3C-approved REC
> behind it for several years.

A number of EXI implementations are in development now, both inside and outside the WG. One vendor has a number of implementations which are spec complete with respect to our second draft. Open EXI is spec complete with respect to our first draft, and expects to be public next month. It would also be reasonable to assume that there are some developments going on that are not publicly acknowledged.
Some of those we know of are advanced enough that we believe they could be finalized during the Last Call period (which is a requirement of our Charter for exiting LC). Through use of proxies and suitable client libraries we can easily see EXI based solutions going live within about 6 months of publishing a Recommendation.

With respect to point 2) Positioning EXI going forward
======================================================

TAG makes the comment:

> The value of XML interoperability is enormous. If EXI goes forward,
> it must do so in a way which does not compromise this.

The EXI working group agrees, absolutely. Additionally, the TAG notes the evident need to communicate where EXI might be considered useful, and where not:

> The WG needs
> to take every opportunity to get the message across that EXI is
> primarily aimed at a set of core use-cases, where one-off binary
> formats and/or lower-performance generic compression techniques are
> already in use.

The data also show the scope of applicability of an efficient encoding format. From the regression tables, users of EXI in the use cases Finance, Military, Scientific and Storage could expect to see performance improvements of between a factor of 4 and a factor of 8 [3]. Others, like Broadcast and Document, would not, and that should be made clear in positioning EXI. Document size, and also the proportion of data to markup, also predict performance: low data density documents (except the very small ones) can expect factor 3 to 7 increases in interchange performance. On 54Mbit and 11Mbit networks, if the document sizes are large (so <100 TPS), a factor 10 performance increase can be expected when specialized compression is used (though the graphs show enormous variance around this center). On such networks, as documents get so small that TPS is presently >100 using XML and gzip, EXI would give concomitantly less benefit.
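The interplay between compactness, link speed and processing speed described here can be illustrated with a very simple model in which the time per transaction is the wire transfer time plus the processing time. This is only a sketch under that assumption; the document sizes, processing times and function names below are hypothetical and are not taken from the measurements note.

```python
def interchange_time(size_bytes, link_bits_per_s, process_s):
    """Seconds per transaction: time on the wire plus processing time."""
    return size_bytes * 8 / link_bits_per_s + process_s


def tps(size_bytes, link_bits_per_s, process_s):
    """Transactions per second for back-to-back transactions."""
    return 1.0 / interchange_time(size_bytes, link_bits_per_s, process_s)


# Hypothetical inputs: an 11 Mbit/s link and a 100 kB text document.
LINK = 11e6

slow_parser = tps(100_000, LINK, process_s=0.030)
fast_parser = tps(100_000, LINK, process_s=0.010)  # 3x faster parsing, same bytes
compact = tps(20_000, LINK, process_s=0.010)       # same parsing, 5x fewer bytes

print(f"gain from faster parser:  {fast_parser / slow_parser:.2f}x")
print(f"gain from compact format: {compact / fast_parser:.2f}x")
```

With these made-up inputs, tripling parser speed raises throughput far less than reducing the byte count does, which is the qualitative effect described for the slower links. A real comparison would need the measured sizes and timings from the regression tables.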
Some rules of thumb like those above should be added to the Best Practices, and the new document to be prepared in response to TAG (see 4) above) should additionally include some hints for potential users of EXI to determine whether they fit into the limited profile of people who might benefit.

It may be a pedantic point, because I'm not sure it was meant in an absolute sense, but confining EXI to those "core use-cases, where one-off binary formats and/or lower-performance generic compression techniques are already in use," is probably unachievable. This is because *existing* users of "textual" XML are a self-selecting group; XML has specifically selected out those cases where the encoding and processing tools cannot accommodate the use case even when the other facets of XML would have made it attractive. Secondly, the existing users of binary formats are those for whom the benefit was so great that the high cost of developing the binary format themselves, or buying from a bespoke vendor, was overcome. It's therefore reasonable to expect that if EXI lowers the cost of high performance, it will become sufficiently attractive for some borderline existing XML use-cases to adopt it, as alluded to in your comments above regarding the mobile case. EXI would help those user communities for which it's intended, and not the others.

With respect to public positioning of EXI, TAG cautioned that:

> _No_ aspect of the public presentation of EXI should
> suggest that generic XML tools will or ought to evolve to read or
> produce the EXI format.
>
> If the EXI WG agrees with this perspective on positioning, that TAG
> will be happy to assist in any way it can to promote it. If the EXI
> WG does _not_ agree, further discussion is needed urgently to try to
> find a position which both parties can agree to.

I think the substance of this remark is not disputed by the EXI WG; it is just that a literal interpretation is problematic. It seems more reasonable for the WG to acknowledge that if EXI becomes a popular format, then some tools that exist in name now will probably be motivated to integrate support for it.

So, please do help us to decide on the best methods of EXI identification and integration, and consequently to promote it, continuing what we started so fruitfully at the Technical Plenary meeting.

Some other Remarks
==================

Regarding the question of whether EXI's potential disruption to XML should be justified given the expectation of Moore's Law:

Firstly, as alluded to by the TAG, bandwidth and other resources do not follow Moore's Law. Therefore, as processor speeds increase, bandwidth will become the Amdahl-like limiting factor, so in interchange scenarios at least, it is compactness that will become increasingly important for improving throughput.

Furthermore, Moore's Law only characterizes the relationship between time and processing speed (ok, literally between time and number of transistors). That is, in economic terms it characterizes the observed relationship between time and supply, not between time and demand. We can be assured that the commodity sure to outpace processing speed, throughput speed, storage capacity, or power consumption will be people's demand for data.

Screamer gets close to the theoretical limit of a processor to process a bitstream. EXI gets close to the theoretical limit of a bitstream to describe information. The highest performing architecture would combine the process phase integration and encoding approaches together.

Returning to the point about confining EXI to the present scope of binary formats: the template given by TAG for answering propositions 1a and 1b implicitly assumes that EXI addresses only *existing* use cases, enabling an incremental expansion of the possible uses of structured data.
But EXI additionally addresses the next wave of possibility, especially if one considers EXI as a compact Infoset data format: a Brazilian medical researcher downloads a PUBMED file to their PDA before making a trip into the country to find active compounds; a doctor is assisted by a future expert system that can scan a binary indexed RDF knowledge base while making rounds; a physicist can download a week of data from an accelerator to their flash memory stick and take it home with them for analysis. The Open Grid Service Architecture could include the data itself, and references to points within the data, in its descriptions of distributed computing jobs, rather than only high level descriptions of jobs as now. These are unverified scenarios, but reasonably within the target range of the combination of EXI and technology integration like Screamer.

It is already a cliché that data is a bedrock economic resource. The ability to ship it and process it will only become increasingly key to our lives. The syntax and semantics of XML embody a near-ideal format for symbolic data (as opposed to representational data like video, audio etc.), since it encodes the idea of hierarchy, meta-data, association, schema etc. It's reasonable to assume that these "stones" are sufficient to describe the data of the majority of problems well into the future.

Greg White, for the EXI Working Group - an exceptional team having great success at making something that could really touch many, many, people.

[1] 9.1.3 Processing Efficiency Analysis Details, http://www.w3.org/TR/exi-measurements/#Ax-details-pe-summary

[2] 6.2.1 Processing Efficiency Summary, http://www.w3.org/TR/exi-measurements/#results-pe-summary . When checking this, bear in mind that the quoted figures use the nominal technology (XML.gz and xals) as the baseline for percentage changes, and that percentages must be averaged geometrically.

[3] See tabular results in 9.1.5.1 of Network Processing Efficiency Analysis Details, http://www.w3.org/TR/exi-measurements/#Ax-details-network-summary .

On Nov 29, 2007, at 11:34 AM, Henry S. Thompson wrote:

>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Following on from the many useful discussions at group and individual
> level which took place during the recent TPAC week, the TAG has
> arrived at the following requests and recommendations for the
> Efficient XML Interchange Working Group:
>
> 1) Making the case for EXI
>
> On the face of it two propositions need to be confirmed in order for
> the proposed EXI Format to go ahead to REC:
>
> 1a) The proposed EXI format is the best available choice for the job;
> 1b) The performance of the proposed EXI format makes it worth going
> ahead.
>
> The _Efficient XML Interchange Measurements Note_ has a huge amount of
> data in it, but it is _very_ difficult to digest. On the face of it,
> it is difficult to determine the extent to which it confirms either
> (1a) or (1b). A detailed set of comments on this document is provided
> at the end of this message.
>
> What is needed is an summary analysis of the measurement results which
> specifically addresses (1a) and (1b) above. With respect to (1a),
> this would presumably have something like the following form:
>
> EXI is better than GZIP because, as shown in tables xxx and yyy, it
> produces comparable compression with zzz% less time to compress and
> ZZZ% less time to decompress and parse for documents in classes P, Q
> and R;
>
> EXI is better than XMill because. . .
>
> and so on.
>
> With respect to (1b), as the TAG suggested in its feedback [1] on the
> outputs of the XML Binary Characterization WG, "concrete targets
> should be set for the size and/or speed gains that would be needed to
> justify the disruption introduced by a new format". It's too late to
> set those targets in advance -- what is required now is that a small
> number of key use cases be identified, the expected benefits of EXI
> for those use cases be quantified, and concrete evidence cited that
> those benefits are a) not liable to be seriously eroded by the onward
> rush of Moore's law and friends and b) sufficient to make adoption
> likely, bearing in mind that EXI is unlikely to be widely available
> with a W3C-approved REC behind it for several years.
>
> The use-case we have heard most commonly nominated to fill this role
> is delivery of data-intensive web-services to mobile devices, where
> computationally intensive processes place heavy demands on power
> supply, and battery technology is _not_ advancing at anything like
> Moore's law rates. Documenting and quantifying this story, if true,
> would be a very good start. Statements of requirements from potential
> deployers would be particularly useful, that is, getting a mobile
> device manufacturer to say "We will deploy service X on devices of
> class Y iff we can get Z amount of XML data off the wire and into
> memory in less than M milliseconds using an N MHz processor. The best
> we project with current technologies is MM msec using an NN MHz
> processor", accompanied, of course, by a demonstration that EXI could
> actually meet that requirement.
>
> 2) Positioning EXI going forward
>
> The value of XML interoperability is enormous. If EXI goes forward,
> it must do so in a way which does not compromise this. The WG needs
> to take every opportunity to get the message across that EXI is
> primarily aimed at a set of core use-cases, where one-off binary
> formats and/or lower-performance generic compression techniques are
> already in use. _No_ aspect of the public presentation of EXI should
> suggest that generic XML tools will or ought to evolve to read or
> produce the EXI format.
>
> If the EXI WG agrees with this perspective on positioning, that TAG
> will be happy to assist in any way it can to promote it. If the EXI
> WG does _not_ agree, further discussion is needed urgently to try to
> find a position which both parties can agree to.
>
> - -------------
>
> Detailed comments on the Measurement documents
>
> The 2005 TAG message called for testing to involve "best possible
> text-based XML 1.x implementations", and the XBC's call for
> implementations [2] specifically called for only XML parsers, not for
> the surrounding application or middleware stacks, JDKs or Java Virtual
> Machines. The benchmarks have not been against the best possible
> text-based XML 1.x implementations.
>
> The measurements document acknowledges the issue of stack integration
> in "Stack integration considers the full XML processing system, not
> just the parser. By selectively combining the components of the
> processing stack through abstract APIs, the system can directly
> produce application data from the bytes that were read. Two prominent
> examples of this technique are [Screamer [3] and [EngelenGSOAP
> [4]]. Both of these can also be called schema-derived as they compile
> a schema into code. However, neither simply generates a generic
> parser, but rather a full stack for converting between application
> data and serialized XML. This gives a significant improvement compared
> to just applying the pre-compilation to the parsing layer." But
> neither of these prominent examples appears in the test data.
>
> Further, there were no "real-world end to end" use cases tested, such
> as a Web service application, a mobile application, etc. Thus we do
> not know the overall effect of any particular technology on overall
> application performance.
>
> The measurements document states "To begin with, the XBC
> Characterization Measurement Methodologies Note [5] defines thresholds
> for whether a candidate format achieves sufficient compactness " [5].
> But in the XBC Characterization Measurement Methodologies Note [5]
> itself we find no threshholds: "Because XML documents exist with a
> wide variety of sizes, structures, schemas, and regularity, it is not
> possible to define a single size threshold or percentage compactness
> that an XML format must achieve to be considered sufficiently compact
> for a general purpose W3C standard."
>
> We attempted to determine the differences between Efficient XML and
> Gzip but found the methodology confusing. The measurements document
> specifies that "In the Document [...] and Both [...] classes,
> candidates are compared against gzipped XML, while in the Neither
> [...] and Schema [...] cases, the comparison was to plain XML".
> Examining Document and Both compactness graphs, Gzip appears to offer
> improvements over XML that track the other implementations, with the
> noteworthy point that Efficient XML's improvements over Gzip are
> significant in a significant part of the Both chart but similar in the
> Document. Examining Processing Efficiency graphs, it appears as
> though XML is clearly superior in Java Encoding in Document and Both.
> GZip appears further inferior but yet all solutions vary wildly around
> XML in Decoding Document and Both. A worrying statement is "An
> interesting point to note in the decoding results is the shapes of the
> graphs for each individual candidate. Namely, these appear similar to
> each other, containing similar peaks and troughs. Even more
> interestingly, this is also the case with Xals, indicating that there
> is some feature of the JAXP parser that is implemented suboptimally
> and triggered by a subset of the test documents." The measurements
> document states "For instance, preliminary measurements in the EXI
> framework indicate that the default parser shipped with Java improved
> noticeably from version 5.0 to version 6, showing 2-3-fold improvement
> for some cases.", and the measurements used JDK 1.5.0_05-b05 for Java
> based parsing and JDK 1.6.0_02-ea-b01 for native. Perhaps an improved
> JDK, Java Virtual Machine, or virtualized JVM would further improve
> results. These leads us to wonder whether a combination GZip with
> improved technologies such as Parsers, JDKs, VMs, or even Stack
> Integration technology (that is Schema aware and hence covered under
> Both and Schema) would suffice for the community.
>
> Examining the data sets used, there are a number of military
> applications (ASMTF, AVCL, JTLM) and yet comparatively few generic
> "Web service" applications. The Google test suite lists Web services
> for small devices and Web services routing; the Invoice test suite
> lists Intra/InterBusiness Communication which immediately limits its
> scope to "A large business communicates via XML with a number of
> remote businesses, some of which can be small business partners. These
> remote or small businesses often have access only to slow transmission
> lines and have limited hardware and technical expertise."; and there
> is a WSDL test suite. This seems to avoid the "common" Web service
> case of the many Web APIs provided by hosted solutions like Google,
> Yahoo, Amazon, eBay, Salesforce, Facebook, MSN, etc. Examining the
> test data shows that the Google test cases used 5 different test cases
> (0,7,15, 24,30) which includes 1 soap fault (case #24). There are 2
> AVCL, 5 Invoice, 8 Location Sightings, 6 JTLM, 5 ASMTF, 2 WSDL test
> cases as well. There appears to be broad based coverage of each,
> though the rationale for the various weightings aren't documented.
> For example, why 4 Google "success cases" and 2 WSDL cases? Surely
> there are more than 2 times as many SOAP messages than WSDL messages
> being sent around the internet.
>
> Henry S. Thompson
> on behalf of the TAG
>
> [1] http://lists.w3.org/Archives/Public/public-xml-binary/2005May/0000.html
> [2] http://lists.w3.org/Archives/Public/public-exi/2006Mar/0004.html
> [3] http://www.w3.org/TR/2007/WD-exi-measurements-20070725/#ref-screamer
> [4] http://www.w3.org/TR/2007/WD-exi-measurements-20070725/#ref-Engelen-gsoap
> [5] http://www.w3.org/TR/xbc-measurement/
> - --
> Henry S. Thompson, HCRC Language Technology Group, University of
> Edinburgh
> Half-time member of W3C Team
> 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
> Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
> URL: http://www.ltg.ed.ac.uk/~ht/
> [mail really from me _always_ has this .sig -- mail without it is
> forged spam]
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.6 (GNU/Linux)
>
> iD8DBQFHTxRwkjnJixAXWBoRAl75AJ9CjNLK8JXvb7fqlZS0UwClszs6UQCfaryY
> eI9+DHa7jeMgRWR22O0Wjx0=
> =/JH5
> -----END PGP SIGNATURE-----
>
Received on Thursday, 20 December 2007 08:06:09 UTC