- From: Henry S. Thompson <ht@inf.ed.ac.uk>
- Date: Thu, 29 Nov 2007 19:34:59 +0000
- To: public-exi@w3.org
- Cc: www-tag@w3.org
Following on from the many useful discussions at group and individual level which took place during the recent TPAC week, the TAG has arrived at the following requests and recommendations for the Efficient XML Interchange Working Group:

1) Making the case for EXI

On the face of it, two propositions need to be confirmed in order for the proposed EXI Format to go ahead to REC:

1a) The proposed EXI format is the best available choice for the job;

1b) The performance of the proposed EXI format makes it worth going ahead.

The _Efficient XML Interchange Measurements Note_ contains a huge amount of data, but it is _very_ difficult to digest. On the face of it, it is difficult to determine the extent to which it confirms either (1a) or (1b). A detailed set of comments on this document is provided at the end of this message.

What is needed is a summary analysis of the measurement results which specifically addresses (1a) and (1b) above. With respect to (1a), this would presumably have something like the following form: EXI is better than GZIP because, as shown in tables xxx and yyy, it produces comparable compression with zzz% less time to compress and ZZZ% less time to decompress and parse for documents in classes P, Q and R; EXI is better than XMill because. . . and so on. (A sketch of the kind of per-document baseline measurement such an analysis would rest on appears below, before the detailed comments.)

With respect to (1b), as the TAG suggested in its feedback [1] on the outputs of the XML Binary Characterization WG, "concrete targets should be set for the size and/or speed gains that would be needed to justify the disruption introduced by a new format". It's too late to set those targets in advance -- what is required now is that a small number of key use cases be identified, the expected benefits of EXI for those use cases be quantified, and concrete evidence cited that those benefits are a) not liable to be seriously eroded by the onward rush of Moore's law and friends and b) sufficient to make adoption likely, bearing in mind that EXI is unlikely to be widely available with a W3C-approved REC behind it for several years.

The use case we have heard most commonly nominated to fill this role is delivery of data-intensive web services to mobile devices, where computationally intensive processes place heavy demands on the power supply, and battery technology is _not_ advancing at anything like Moore's-law rates. Documenting and quantifying this story, if true, would be a very good start. Statements of requirements from potential deployers would be particularly useful, that is, getting a mobile device manufacturer to say "We will deploy service X on devices of class Y iff we can get Z amount of XML data off the wire and into memory in less than M milliseconds using an N MHz processor. The best we project with current technologies is MM msec using an NN MHz processor", accompanied, of course, by a demonstration that EXI could actually meet that requirement.

2) Positioning EXI going forward

The value of XML interoperability is enormous. If EXI goes forward, it must do so in a way which does not compromise this. The WG needs to take every opportunity to get the message across that EXI is primarily aimed at a set of core use cases, where one-off binary formats and/or lower-performance generic compression techniques are already in use. _No_ aspect of the public presentation of EXI should suggest that generic XML tools will or ought to evolve to read or produce the EXI format.

If the EXI WG agrees with this perspective on positioning, the TAG will be happy to assist in any way it can to promote it. If the EXI WG does _not_ agree, further discussion is needed urgently to try to find a position which both parties can agree to.
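To make concrete the raw numbers a summary analysis of the kind requested under (1a) would be derived from, here is a minimal sketch, assuming a modern JDK and a test document named on the command line; the class name is invented for illustration and is not part of any WG framework. It measures gzip compactness and compress/decompress time for a single XML document:

    import java.io.*;
    import java.nio.file.*;
    import java.util.zip.*;

    public class GzipBaseline {
        public static void main(String[] args) throws IOException {
            byte[] xml = Files.readAllBytes(Paths.get(args[0]));

            // Compress, timing the operation.
            long t0 = System.nanoTime();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(xml);
            }
            long compressNs = System.nanoTime() - t0;
            byte[] packed = buf.toByteArray();

            // Decompress, timing the operation; the output is simply drained.
            t0 = System.nanoTime();
            try (GZIPInputStream in =
                    new GZIPInputStream(new ByteArrayInputStream(packed))) {
                byte[] scratch = new byte[8192];
                while (in.read(scratch) != -1) { /* drain */ }
            }
            long decompressNs = System.nanoTime() - t0;

            System.out.printf("original: %d B, gzipped: %d B (%.1f%% of original)%n",
                    xml.length, packed.length, 100.0 * packed.length / xml.length);
            System.out.printf("compress: %.2f ms, decompress: %.2f ms%n",
                    compressNs / 1e6, decompressNs / 1e6);
        }
    }

A single cold run like this is of course naive; credible figures would need warm-up iterations, repetition, and variance reporting before feeding percentage deltas into tables of the "xxx and yyy" sort.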
-------------

Detailed comments on the Measurement documents

The 2005 TAG message called for testing to involve "best possible text-based XML 1.x implementations", and the XBC's call for implementations [2] specifically called for only XML parsers, not for the surrounding application or middleware stacks, JDKs or Java Virtual Machines. The benchmarks have not been against the best possible text-based XML 1.x implementations. The measurements document acknowledges the issue of stack integration:

"Stack integration considers the full XML processing system, not just the parser. By selectively combining the components of the processing stack through abstract APIs, the system can directly produce application data from the bytes that were read. Two prominent examples of this technique are Screamer [3] and gSOAP [4]. Both of these can also be called schema-derived as they compile a schema into code. However, neither simply generates a generic parser, but rather a full stack for converting between application data and serialized XML. This gives a significant improvement compared to just applying the pre-compilation to the parsing layer."

But neither of these prominent examples appears in the test data. Further, there were no "real-world end-to-end" use cases tested, such as a Web service application, a mobile application, etc. Thus we do not know the overall effect of any particular technology on overall application performance.

The measurements document states "To begin with, the XBC Characterization Measurement Methodologies Note [5] defines thresholds for whether a candidate format achieves sufficient compactness". But in the XBC Characterization Measurement Methodologies Note [5] itself we find no thresholds: "Because XML documents exist with a wide variety of sizes, structures, schemas, and regularity, it is not possible to define a single size threshold or percentage compactness that an XML format must achieve to be considered sufficiently compact for a general purpose W3C standard."

We attempted to determine the differences between Efficient XML and Gzip but found the methodology confusing. The measurements document specifies that "In the Document [...] and Both [...] classes, candidates are compared against gzipped XML, while in the Neither [...] and Schema [...] cases, the comparison was to plain XML". Examining the Document and Both compactness graphs, Gzip appears to offer improvements over XML that track the other implementations, with the noteworthy point that Efficient XML's improvements over Gzip are substantial across much of the Both chart but merely comparable to Gzip in the Document chart. Examining the Processing Efficiency graphs, it appears as though plain XML is clearly superior in Java Encoding for Document and Both. GZip appears worse still, and yet all solutions vary wildly around XML in the Decoding results for Document and Both.

A worrying statement is "An interesting point to note in the decoding results is the shapes of the graphs for each individual candidate. Namely, these appear similar to each other, containing similar peaks and troughs. Even more interestingly, this is also the case with Xals, indicating that there is some feature of the JAXP parser that is implemented suboptimally and triggered by a subset of the test documents."
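That quoted observation could be probed directly. The following is a minimal sketch, not the WG's measurement framework (the class name is invented), which times the default JAXP SAX parser on each document named on the command line; running it over the test corpus under different JDK versions would show whether the peaks and troughs follow the documents or the parser:

    import java.io.File;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    public class ParseTiming {
        public static void main(String[] args) throws Exception {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            for (String name : args) {
                File doc = new File(name);
                // Warm-up parse so JIT compilation does not dominate the timing.
                factory.newSAXParser().parse(doc, new DefaultHandler());
                long t0 = System.nanoTime();
                // Fresh parser per run to avoid implementation-specific reuse issues.
                factory.newSAXParser().parse(doc, new DefaultHandler());
                System.out.printf("%s: %.2f ms%n",
                        name, (System.nanoTime() - t0) / 1e6);
            }
        }
    }

Invoked as, say, "java ParseTiming testdata/*.xml" once per JDK, this would yield exactly the per-document profile the quoted passage speculates about.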
The measurements document states "For instance, preliminary measurements in the EXI framework indicate that the default parser shipped with Java improved noticeably from version 5.0 to version 6, showing 2-3-fold improvement for some cases.", and the measurements used JDK 1.5.0_05-b05 for Java-based parsing and JDK 1.6.0_02-ea-b01 for native. Perhaps an improved JDK, Java Virtual Machine, or virtualized JVM would further improve results. This leads us to wonder whether a combination of GZip with improved technologies such as parsers, JDKs, VMs, or even stack integration technology (which is schema-aware and hence covered under Both and Schema) would suffice for the community.
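Whether GZip plus a better parsing stack would suffice is, in the end, a question about the combined cost of decompressing and parsing. A minimal sketch of that combined measurement, again hypothetical and assuming gzipped XML test files named on the command line:

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    public class GunzipParse {
        public static void main(String[] args) throws Exception {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            for (String name : args) {  // expects .xml.gz files
                long t0 = System.nanoTime();
                // Decompress and parse in one streaming pass, timing the whole pipeline.
                try (InputStream in = new BufferedInputStream(
                        new GZIPInputStream(new FileInputStream(name)))) {
                    factory.newSAXParser().parse(new InputSource(in), new DefaultHandler());
                }
                System.out.printf("%s: decompress+parse %.2f ms%n",
                        name, (System.nanoTime() - t0) / 1e6);
            }
        }
    }

Comparing such end-to-end numbers across JDK versions would help quantify how much of the gap the onward rush of software improvement closes on its own.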
Examining the data sets used, there are a number of military applications (ASMTF, AVCL, JTLM) and yet comparatively few generic "Web service" applications. The Google test suite lists Web services for small devices and Web services routing; the Invoice test suite lists Intra/InterBusiness Communication, which immediately limits its scope to "A large business communicates via XML with a number of remote businesses, some of which can be small business partners. These remote or small businesses often have access only to slow transmission lines and have limited hardware and technical expertise."; and there is a WSDL test suite. This seems to avoid the "common" Web service case of the many Web APIs provided by hosted solutions like Google, Yahoo, Amazon, eBay, Salesforce, Facebook, MSN, etc.

Examining the test data shows that the Google suite used 5 test cases (0, 7, 15, 24, 30), including 1 SOAP fault (case #24). There are 2 AVCL, 5 Invoice, 8 Location Sightings, 6 JTLM, 5 ASMTF, and 2 WSDL test cases as well. There appears to be broad-based coverage of each, though the rationale for the various weightings isn't documented. For example, why 4 Google "success cases" and 2 WSDL cases? Surely there are more than twice as many SOAP messages as WSDL messages being sent around the internet.

Henry S. Thompson on behalf of the TAG

[1] http://lists.w3.org/Archives/Public/public-xml-binary/2005May/0000.html
[2] http://lists.w3.org/Archives/Public/public-exi/2006Mar/0004.html
[3] http://www.w3.org/TR/2007/WD-exi-measurements-20070725/#ref-screamer
[4] http://www.w3.org/TR/2007/WD-exi-measurements-20070725/#ref-Engelen-gsoap
[5] http://www.w3.org/TR/xbc-measurement/

--
Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
Half-time member of W3C Team
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]

Received on Thursday, 29 November 2007 19:35:32 UTC