Request OID for "XML 1.0" Record Syntax

I originally sent this three weeks ago, but I don't think anyone
responded.  I am inclined to take this as tacit acceptance of my
proposal.  If anyone objects, please can they say so?  And if no-one
does, Ray, please can I have my OID?

--

Date: Thu Oct 21 16:57:38 +0100 2004
From: Mike Taylor <mike@minmi.miketaylor.org.uk>
To: www-zig@w3.org
Subject: Request OID for "XML 1.0" Record Syntax

I would like to request that the Z39.50 Maintenance Agency issue a new
record syntax OID that is explicitly XML version 1.0, as opposed to
the existing XML OID 1.2.840.10003.5.109.10 which is just XML, and
could be XML 1.0 or XML 1.1.

Here comes the rationale.  Hold on.

Consider a text-and-structured-data repository such as our very own
Zebra.  In principle, it is a storage, indexing and retrieval facility
for any structured data, including binary data and text containing
control characters.  In practice, it needs to use some kind of
structured file format for getting the structured data in and out, and
the overwhelmingly most popular choice for that is XML.

Now XML as we know it (XML 1.0) is actually a pretty poor choice,
because it can't represent certain characters: nothing with a code
below 32, except for the three special cases of tab, linefeed and
carriage return.  See:
	http://www.w3.org/TR/2004/REC-xml-20040204/#charsets
and note that you can't get around this problem by using entities
instead: the entity "&#1;", for example, is ILLEGAL in XML 1.0.  If
you don't believe me (and I wouldn't blame you, it took a lot to
persuade me that this brain-damage is real), just ask your favourite
conformant XML 1.0 parser:

	$ echo "<x>&#1;</x>" | xmllint -
	-:1: error: xmlParseCharRef: invalid xmlChar value 1
	<x>&#1;</x>
	$
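
For anyone who wants to check records without firing up a parser, here
is a minimal sketch of the XML 1.0 "Char" production from the section
linked above.  The function names are made up for illustration; only
the code point ranges come from the spec.

```python
from typing import List, Tuple


def is_xml10_char(codepoint: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        codepoint in (0x9, 0xA, 0xD)          # tab, linefeed, carriage return
        or 0x20 <= codepoint <= 0xD7FF        # everything from space up to the surrogates
        or 0xE000 <= codepoint <= 0xFFFD      # skips surrogates, U+FFFE and U+FFFF
        or 0x10000 <= codepoint <= 0x10FFFF   # supplementary planes
    )


def find_illegal_chars(text: str) -> List[Tuple[int, int]]:
    """Return (offset, code point) for each character XML 1.0 forbids."""
    return [
        (offset, ord(ch))
        for offset, ch in enumerate(text)
        if not is_xml10_char(ord(ch))
    ]
```

Calling find_illegal_chars("a\x01b\tc") reports the character at
offset 1 (code 1) and lets the tab through.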

Now, consider a system such as Zebra when person A wants to add to it
a record containing a field with a control character in it, and person
B wants to retrieve that record as XML.  What should it do?

* It could refuse point blank to add the record, because the record is
  not good XML 1.0.  But (A) that's rude, (B) the record may be
  perfectly good XML 1.1, (C) in practice people do have records like
  that, and should be able to store them in a general purpose
  structured data engine, without being limited by an arbitrary
  prohibition in what amounts to a transfer syntax.  Finally, (D) a
  legitimate MARC record may be added that contains control
  characters, so the problem will still arise when the record is
  retrieved as XML.

* It could accept the record, but silently discard or transform the
  control character, either at the point where it stores and indexes
  it, or just before it returns it as XML.  This is pragmatically
  appealing in an It Just Works way, but ethically horrifying, since a
  data repository has no business messing with the content of
  someone's record.

* It could just accept the record, and just give it out, without
  even looking at the content.  This is clearly The Right Thing, but
  causes people's XML 1.0 parsers to blow up, so it's no good in
  practice.  And it's no use telling people to use XML 1.1, since that
  is by no means universally implemented (nor, for that matter,
  universally liked).

So what we think we should do is this: we will continue to have our
repository do The Right Thing, which is accept XML containing control
characters, and return it verbatim when XML records are requested
(using the established record-syntax OID 1.2.840.10003.5.109.10).  But
if a client asks for the new "XML 1.0" record syntax -- the one we're
requesting an OID for -- then we'll return the record stripped of its
XML-unsafe control characters.  Then client programs that need to work
with fussy XML 1.0 parsers can request the new record syntax and know
that they'll get back a record which, though it may not be a perfect
byte-for-byte representation of the data, is legal XML 1.0.
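
To make the proposal concrete, here is a rough sketch of the dispatch
I have in mind.  It is not Zebra's actual code: render_record and the
helper names are hypothetical, and XML_10_OID is a placeholder, since
the real OID has not been assigned yet.  The sketch drops both literal
characters that XML 1.0 forbids and numeric character references (such
as "&#1;") that point at them.

```python
import re

XML_OID = "1.2.840.10003.5.109.10"            # existing "just XML" record syntax
XML_10_OID = "XML-1.0-OID-NOT-YET-ASSIGNED"   # hypothetical placeholder

# Any single character outside the XML 1.0 Char production.
_NOT_XML10 = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
# Decimal or hexadecimal numeric character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def _drop_illegal_ref(match):
    """Keep a character reference only if XML 1.0 allows its target."""
    hex_digits, dec_digits = match.groups()
    codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
    legal = codepoint <= 0x10FFFF and not _NOT_XML10.match(chr(codepoint))
    return match.group(0) if legal else ""


def render_record(stored_xml, syntax_oid):
    """Return the stored record in the requested record syntax."""
    if syntax_oid == XML_OID:
        return stored_xml                      # verbatim, control characters and all
    if syntax_oid == XML_10_OID:
        stripped = _NOT_XML10.sub("", stored_xml)
        return _CHAR_REF.sub(_drop_illegal_ref, stripped)
    raise ValueError("unsupported record syntax: " + syntax_oid)
```

Stripping happens only on the way out, so the stored record is never
altered and a client asking for the existing OID still gets exactly
what was put in.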

So that's why we need an "XML 1.0" record-syntax OID.

If you consider any of this text helpful, you are very welcome to use
it on the Maintenance Agency site as a rationale for the new OID.

Thanks for listening.

 _/|_	 _______________________________________________________________
/o ) \/  Mike Taylor  <mike@indexdata.com>  http://www.miketaylor.org.uk
)_v__/\  "Looks like it's time to over-technicalize this previously
	 tame post" -- Mickey Mortimer on the dinosaur mailing list

Received on Wednesday, 10 November 2004 12:27:01 UTC