- From: Mike Taylor <mike@indexdata.com>
- Date: Thu, 21 Oct 2004 16:57:38 +0100
- To: www-zig@w3.org
I would like to request that the Z39.50 Maintenance Agency issue a new record syntax OID that is explicitly XML version 1.0, as opposed to the existing XML OID 1.2.840.10003.5.109.10, which is just XML and could be XML 1.0 or XML 1.1.

Here comes the rationale. Hold on.

Consider a text-and-structured-data repository such as our very own Zebra. In principle, it is a storage, indexing and retrieval facility for any structured data, including binary data and text containing control characters. In practice, it needs to use some kind of structured file format for getting the structured data in and out, and the overwhelmingly most popular choice for that is XML.

Now XML as we know it (XML 1.0) is actually a pretty poor choice, because it can't represent certain characters: nothing with a code below 32, except for the three special cases of tab, linefeed and carriage return. See:
	http://www.w3.org/TR/2004/REC-xml-20040204/#charsets
and note that you can't get around this problem by using character references instead: the reference "&#1;", for example, is ILLEGAL in XML 1.0. If you don't believe me (and I wouldn't blame you, it took a lot to persuade me that this brain-damage is real), just ask your favourite conformant XML 1.0 parser:

	$ echo "<x>&#1;</x>" | xmllint -
	-:1: error: xmlParseCharRef: invalid xmlChar value 1
	<x>&#1;</x>
	$

Now, consider what a system such as Zebra should do when person A wants to add to it a record containing a field with a control character in it, and person B wants to retrieve that record as XML. What should it do?

* It could refuse point blank to add the record, because the record is not good XML 1.0. But (A) that's rude, (B) the record may be perfectly good XML 1.1, (C) in practice people do have records like that, and should be able to store them in a general purpose structured data engine, without being limited by an arbitrary prohibition in what amounts to a transfer syntax. Finally, (D) a legitimate MARC record that contains control characters may be added, so the problem will still arise when that record is retrieved as XML.

* It could accept the record, but silently discard or transform the control character, either at the point where it stores and indexes it, or just before it returns it as XML. This is pragmatically appealing in an It Just Works way, but ethically horrifying, since a data repository has no business messing with the content of someone's record.

* It could just accept the record, and just give it out, without even looking at the content. This is clearly The Right Thing, but causes people's XML 1.0 parsers to blow up, so it's no good in practice. And it's no use telling people to use XML 1.1, since that is by no means universally implemented (nor, for that matter, universally liked).

So what we think we should do is this: we will continue to have our repository do The Right Thing, which is to accept XML containing control characters, and return it verbatim when XML records are requested (using the established record-syntax OID 1.2.840.10003.5.109.10). But if a client asks for the new "XML 1.0" record syntax -- the one we're requesting an OID for -- then we'll return the record stripped of its XML-unsafe control characters.

Then client programs that need to work with fussy XML 1.0 parsers can request the new record syntax and know that they'll get back a record which, though it may not be a perfect byte-for-byte representation of the data, is legal XML 1.0.

So that's why we need an "XML 1.0" record-syntax OID.

If you consider any of this text helpful, you are very welcome to use it on the Maintenance Agency site as a rationale for the new OID.

Thanks for listening.
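For anyone who wants to see what "stripped of its XML-unsafe control characters" might look like in practice, here is a minimal sketch in Python. It is not Zebra's code: the names strip_xml10_illegal and is_wellformed_xml10 are made up for illustration, and it assumes the stripping is applied to a field's decoded text just before the record is serialized. The character class follows the Char production in the XML 1.0 Recommendation linked above.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Anything NOT matched by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML10_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml10_illegal(text: str) -> str:
    """Hypothetical helper: drop characters that XML 1.0 cannot represent."""
    return _XML10_ILLEGAL.sub("", text)


def is_wellformed_xml10(document: str) -> bool:
    """Hypothetical helper: ask Python's (expat-based) parser for a verdict."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# A field value containing a control character (code 1) and a legal tab.
field = "Title\x01 with\ta control character"

# "Verbatim" output, as for the existing XML OID.
raw = "<x>" + escape(field) + "</x>"

# "XML 1.0" output, as for the requested new OID (assumed behaviour).
safe = "<x>" + escape(strip_xml10_illegal(field)) + "</x>"

print(is_wellformed_xml10(raw))
print(is_wellformed_xml10(safe))
print(is_wellformed_xml10("<x>&#1;</x>"))
```

The tab survives because it is one of the three permitted low characters; only the code-1 character is removed. Stripping at output time, rather than at indexing time, is what keeps the stored record untouched for clients that ask for the existing OID.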
 _/|_    _______________________________________________________________
/o ) \/  Mike Taylor  <mike@indexdata.com>  http://www.miketaylor.org.uk
)_v__/\  "Looks like it's time to over-technicalize this previously
         tame post" -- Mickey Mortimer on the dinosaur mailing list

--
Listen to free demos of soundtrack music for film, TV and radio
        http://www.pipedreaming.org.uk/soundtrack/
Received on Thursday, 21 October 2004 15:57:56 UTC