- From: Chris Lilley <chris@w3.org>
- Date: Mon, 17 Feb 2003 15:33:29 +0100
- To: www-tag@w3.org
Hello folks,

Action CL 2002/12/02: Write up problem statement about binary XML; send to www-tag.

My thanks to those who reviewed earlier drafts of this email and sent me comments; errors are however my own.

---------------------------------------------------------------------

Some people would say that binary XML, or perhaps more accurately 'binary infosets', is a problem all in itself. However, people still seem to want them, or indeed to be using them, or indeed making standards out of them. So it would be useful to consider why they might do this, and what requirements they see for this, rather than merely dismissing it or hoping it goes away.

The primary reason that people give for using Binary XML (binary representations of an XML Infoset) is size efficiency, both in network transmission and in storage on the receiving device. It is generally asserted, or assumed, that a binary form is more compact than a textual form (which it may well be, until its extension mechanism, if any, has been used a few times...).

Network Efficiency
------------------

In terms of network efficiency, compression is often used as a workaround. Put another way, the binary form is presented as a MIME content-encoding rather than as a new format. For example, gzip compression [1] is used in HTTP 1.1 [2]; SVG 1.0 required SVG implementations that use HTTP 1.1 to support gzipped SVG [3] and implementations routinely use this; the resulting gzipped files are considerably smaller than the raw XML versions, yet the XML can be readily obtained (provided one has space for the requisite 32k buffer for decoding, sufficient CPU power to decode, and space to store the result).

XMill [4] is another alternative compression method, which separates the structure and the content and uses different compression methods for these two different partitions.

There was a good paper at XML 2002 USA [5] regarding compression for XML messages (in a military, highly bandwidth-constrained environment).
However, the sender and receiver were desktop-class machines, not mobile-class machines. They were able to allocate significant processing on a per-instance basis to ensure small message size.

Robin Berjon suggested [6] that other, more XML-specific binary forms might also be registered as content encodings.

The argument that the Binary XML proponents make, however, is that bandwidth is not the primary problem. Sure, they want efficient transmission (unless they are selling bandwidth by the packet to end users) but the network performance problem they face is latency over satellite-based networks, not bandwidth as such; and the place where they really want efficiency is the memory footprint needed to display the document on the device.

Storage Efficiency
------------------

In terms of storage, I have often heard a desire to avoid having both the full-strings version (for example, to answer DOM queries about attribute values in their full glory, including leading and trailing spaces and carriage returns) and a 'working set'. However, this argument is sometimes overstated. As an example, consider this valid document

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ELEMENT foo EMPTY>
<!ATTLIST foo toto (yes | no) #REQUIRED>
]>
<foo toto="yes"/>

and this invalid one

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ELEMENT foo EMPTY>
<!ATTLIST foo toto (yes | no) #REQUIRED>
]>
<foo toto=" yes "/>

Certainly a binary XML representation could encode the value of the toto attribute in one bit, but equally the maximum of three characters (three bytes in UTF-8) for the attribute value is not excessive.

Typing and PSVIs
----------------

There is also sometimes a desire to have a concrete representation of a PSVI; this seems to be a factor in the MPEG-7 BiM [7] and Expway Bin-XML [8] forms. BiM appears to require a W3C XML Schema, which may make it less than universally suitable; Bin-XML, like XMill, can make use of extra information from a schema but does not require it.
This issue of generality also affects WAP Binary XML [9], a fixed encoding scheme for a particular set of elements and attributes.

Streaming and Random Access
---------------------------

Lastly, the need for streaming and random access is sometimes cited as a reason for preferring a binary representation (particularly for BiM; see [7.1]). This has a relationship with work on XML fragments and on packaging, both of which W3C has sort of been interested in...

A key paper [7.2] about BiM has a good description of why it exists, in the abstract:

"In the course of the work on the MPEG-7 standard a binary format with special features for the encoding of XML data was required. These required key features are a high data compression ratio, provision for streaming, dynamic update of the document structure and fast random access of data entities in the compressed stream. To support these features we propose a novel, schema-aware approach which exploits the knowledge of the standardized MPEG-7 syntax definition of the encoded XML document on the encoder and decoder side. The technique is part of the MPEG-7 standard. This paper gives an overview of the coding algorithm, including a comparison to standard (XML-) compression tools."

An overview paper [7.1] describes the relationship between the XML and BiM encodings:

"MPEG-7 description can be represented either in textual format (XML), in binary format (BiM), or a mixture of the two formats, depending on application usage. MPEG-7 defines a unique mapping between the binary format and the textual format. A bi-directional loss-less mapping between the textual representation and the binary representation is possible. Still, it shall not always be used: some applications may not want to transmit all the information contained in the textual representation and may prefer to use a lossy transmission that is more efficient in terms of bandwidth."
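
To make "schema-aware" concrete, here is a toy sketch in Python. It is emphatically not the BiM or Bin-XML algorithm; it only assumes that encoder and decoder share the declaration `<!ATTLIST foo toto (yes | no) #REQUIRED>` from the Storage Efficiency example, so that the attribute value can travel as one bit and be mapped back losslessly. The names `TOTO_VALUES`, `encode_toto` and `decode_toto` are hypothetical, invented for this illustration.

```python
# Toy sketch only: this is NOT the BiM or Bin-XML algorithm.
# Assumption: encoder and decoder both know the declaration
#   <!ATTLIST foo toto (yes | no) #REQUIRED>
# so the value can travel as an index into the enumeration.

TOTO_VALUES = ("yes", "no")  # hypothetical shared table, order taken from the DTD


def encode_toto(value: str) -> int:
    """Return the one-bit code (0 or 1) for a literal attribute value.

    Raises ValueError for any string outside the enumeration,
    including the un-normalized literal ' yes '.
    """
    return TOTO_VALUES.index(value)


def decode_toto(bit: int) -> str:
    """Map a one-bit code back to its textual attribute value."""
    if bit not in (0, 1):
        raise ValueError("code must be 0 or 1")
    return TOTO_VALUES[bit]


if __name__ == "__main__":
    for text in TOTO_VALUES:
        bit = encode_toto(text)
        # The mapping is bidirectional and lossless for enumerated values.
        assert decode_toto(bit) == text
        size = len(text.encode("utf-8"))
        print(f"{text!r} -> bit {bit} (textual form: {size} bytes)")
```

The saving per value is a few bytes against one bit, and it only works because both ends hold the same table; a receiver without the schema cannot decode anything.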

It seems like this streaming aspect might impact the Timed Text activity, and also SVG 1.2 if it tries to get better streaming behavior for large files.

Trust boundaries
----------------

Dan Connolly notes [10] that where the two parties communicating information are in a highly trusted relationship that does not involve interoperability with multiple parties (an example of such a relationship being the communication between a mobile phone and whatever proxy the phone vendor uses to give the appearance of wireless browsing of Web pages) then the precise mechanism used is outside the realm of standardization.

However, when interoperability is needed between multiple parties that do not necessarily trust one another (the usual case) then Binary XML of whatever form raises the same security issues that binary RPC or other such mechanisms are continually facing. Dan asserts that the overhead of doing this security checking is greater than that of doing the XML parsing.

That may be true; or it may not. Once some XML is parsed into a DOM, is that a security issue? If that binary data structure is sent to someone else, is that a security issue? When XML is seen as a transitory serialization medium used to move an infoset/a DOM/whatever between two computers, it is less clear that the form of that serialization affects security in any major way; what matters more is the access given to the host machine by any programmatic elements such as script.

Closing remarks
---------------

There has been previous discussion on this subject on www-tag in regard to a casual mention of binary infosets [11].

Additional examples of such binary infosets are Sharp NVA [12]; the CVG proposal made to the 3GPP for transmission of SVG Tiny files [13]; the Millau encoding format presented at the WWW9 conference [14]; and IBM XTalk [15]. The Cover Pages have an overview of this area [16].

It is entirely possible that I have omitted some major benefit that people see in a binary infoset representation, in which case I am sure they will be quick to tell me.

Please discuss, in a focused manner, and with a view to what the wording of a TAG finding should be.

[1] Gzip compression
    http://www.gzip.org/zlib/zlib.html
    http://www.gzip.org/zlib/zlib_docs.html

[2] Hypertext Transfer Protocol -- HTTP/1.1
    ftp://ftp.isi.edu/in-notes/rfc2616.txt

[3] G.7 Conforming SVG Viewers
    http://www.w3.org/TR/SVG/conform.html#ConformingSVGViewers

[4] XMill: An Efficient Compressor for XML
    http://www.research.att.com/sw/tools/xmill/
    Paper on this by Hartmut Liefke, Dan Suciu
    http://citeseer.nj.nec.com/261815.html

[5] XML Sizing and Compression Study for Military Wireless Data
    http://www.xmlconference.org/xmlusa/2002/friday.asp#17

[6] XML-specific content encodings
    http://lists.w3.org/Archives/Public/www-tag/2002Oct/0189.html

[7] MPEG-7 and MPEG-4 BiM

[7.1] Overview of MPEG-7 systems (including BiM)
    http://gps-tsc.upc.es/imatge/pub/ps/IEEE_CSVT2001_Avaro_Salembier.pdf

[7.2] AN MPEG-7 TOOL FOR COMPRESSION AND STREAMING OF XML DATA
    http://www.lnt.de/~kaup/paper/icme-2002.pdf

[8] Expway Bin-XML
    Bin-XML™ for encoding XML documents
    http://expway.tv/graph/Bin-XMLTechnical%20White%20Paper.pdf

[9] WAP Binary XML Content Format
    http://www.w3.org/TR/wbxml/
    Staff comment
    http://www.w3.org/Submission/1999/07/Comment

[10] binaryXML, marshalling, and and trust boundaries
    http://lists.w3.org/Archives/Public/www-tag/2002Dec/0022.html

[11] Discussion on www-tag regarding mention of Binary Infoset in the arch document
    http://lists.w3.org/Archives/Public/www-tag/2002Oct/0193.html

[12] Sharp NVA
    Overview of NVA in Sharp Motion Art/e-animator
    http://www.galamo.com/sharpmotionart/business/technology.pdf

[12.1] Sharp Motion Art site
    http://galamo.com/sharpmotionart/

[12.2] Example of use
    http://anime.galamo.com/eanime/JSP/index.jsp (in Japanese)

[12.3] JPhone support for NVA
    http://www.ktlink.jp/hlp/SupportModelJPhone.html (in Japanese)

[13] CVG
    Documents from 3GPP EMS meeting, Paris (zipped)
    Comparisons of Bijitec, iSketch and CVG file sizes for SVG Tiny content
    http://www.3gpp.org/ftp/tsg_t/WG2_Capability/SWG3/SWG3_EMS_03_Paris/Docs/

[14] Millau: an encoding format for efficient representation and exchange of XML over the Web
    http://www9.org/w9cdrom/154/154.html

[15] Discussion of IBM XTalk
    http://www.oreillynet.com/lpt/wlg/1858

[15.1] Vinci: A Service-Oriented Architecture for Rapid Development of Web Applications
    http://www10.org/cdrom/papers/506/

[16] XML and Compression
    http://xml.coverpages.org/xmlAndCompression.html

--
Chris mailto:chris@w3.org
Received on Monday, 17 February 2003 09:33:34 UTC