Re: Z39.50 character encoding

----- Original Message -----
From: "Alan Kent" <ajk@mds.rmit.edu.au>
Sent: Thursday, February 28, 2002 7:35 PM


> On Thu, Feb 28, 2002 at 09:13:20AM -0500, Johan Zeeman wrote:
> > DC by itself is not a record syntax; it is a list of data elements.  To be a
> > record syntax, the data elements need to be encoded using some scheme.  The
> > one I know about is XML.  And XML explicitly uses UTF-8.
> >
> > j.
>
> Just to clarify, do you mean the XML record syntax in Z39.50 explicitly
> uses UTF-8? XML itself certainly *does not* explicitly use UTF-8.
> That is simply what is common. People do use other encodings with
> XML (UTF-16 for example is completely valid and in usage - for
> example when using Chinese or other scripts, UTF-16 encoded files
> are much smaller than the same UTF-8 encoded files).
>
> I was just curious (without re-reading the XML record syntax) whether
> it was a Z39.50 decree that the XML record syntax mandates UTF-8 encoding.

No, it's an XML thing.

I was speaking from memory and clearly I only half remembered.  And I
certainly did not mean to say that XML permits ONLY UTF-8.

From the XML 1.0 (2nd ed.) spec
( http://www.w3.org/TR/2000/REC-xml-20001006 ) :

2.2:  "... All XML processors must accept the UTF-8 and UTF-16 encodings of
10646; ..."  (betcha that's a surprise for a few people!)

4.3.3: "... In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is an error for an entity ... which begins
with neither a Byte Order Mark nor an encoding declaration to use an
encoding other than UTF-8"

I.e. unless you state otherwise, the character set is UTF-8, which is more
or less what I meant when I said that XML explicitly uses UTF-8.  Or is
that "implicitly"?

You flag the use of UTF-16 by simply including the Unicode byte order mark
at the beginning of the document.  You indicate other character sets by
using an "encoding" declaration with the name of the character set, or by an
external mechanism. The reference to "external character encoding
information" seems to me to be unfortunate, since there is no guarantee that
the XML processor has access to character set information carried in HTTP or
MIME headers.
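
To make the rules above concrete, here is a rough Python sketch of the
decision order as I read it.  The function name and the "external" parameter
are made up for illustration, and letting the external value win when present
is an assumption of the sketch.  It also ignores UTF-32 and EBCDIC, and it
will not find an encoding declaration inside a UTF-16 document that lacks a
byte order mark.

```python
import re
from typing import Optional

# Matches an encoding declaration at the very start of an ASCII-compatible entity.
_DECL = re.compile(
    rb'<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)


def sniff_xml_encoding(data: bytes, external: Optional[str] = None) -> str:
    """Guess the encoding of an XML entity from its first bytes.

    `external` stands for a charset supplied by a transport such as HTTP
    or MIME; giving it priority is an assumption of this sketch.
    """
    if external:
        return external
    # A byte order mark flags UTF-8 or UTF-16 without any declaration.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xff\xfe") or data.startswith(b"\xfe\xff"):
        return "utf-16"
    # Otherwise look for encoding="..." in the XML declaration.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # Neither a byte order mark nor a declaration: the entity must be UTF-8.
    return "utf-8"
```

For example, sniff_xml_encoding(b"<doc/>") falls through to the UTF-8
default, while the same document encoded with Python's "utf-16" codec (which
writes a byte order mark) is reported as UTF-16.  When the caller has no
access to the transport headers, "external" simply stays None, which is the
situation I was worried about.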

j

Received on Friday, 1 March 2002 09:34:58 UTC