
Re: B.1 and B.2 results

From: David G. Durand <dgd@cs.bu.edu>
Date: Thu, 17 Oct 1996 23:16:11 -0400
Message-Id: <v02130515ae8ca32fe37f@[]>
To: w3c-sgml-wg@w3.org
At 10:04 PM 10/17/96, Gavin Nicol wrote:
>>XML does not explicitly sanction the use of any other encodings.  It is
>>recognized, however, that many documents exist in other encodings.  To
>>support processors in dealing with this situation, an XML document may
>>contain at its beginning, before any other text, markup, PIs, or white
>>space, an Encoding Declaration PI matching
>>EncDecl ::=
>>  '<?XML' S 'encoding' Eq ("'" Encoding "'")|('"' Encoding '"') S? '>'
>>An XML processor may choose to read Encoding Declaration PIs and accept
>>nonstandard encodings so declared.  In validating processors such
>>behavior must be at user option.
>>An XML document which lacks both the Byte Order Mark and an Encoding
>>Declaration PI must be in the UTF-8 encoding.  It is an error for a
>>document to be in an encoding other than that declared in its Encoding
>>Declaration PI.
>This CANNOT be REQUIRED behaviour. This is a gross hack!!! I also cannot
>condone the clause "does not explicitly sanction".
>Seems to me like here is another mailing list that I've wasted a lot
>of time on...

Well, I hope you won't leave, until we've all decided to stop wasting time!

I agree with Gavin, however, that this is not good behavior: the most likely
transport method for XML will be HTTP or one of its replacements. HTTP
_has_ proper metadata facilities. If the MIME stuff isn't good enough, we
can define our own MIME header for character encodings.

  I don't have a problem with allowing hacks to determine the character set
(but we can't use character definitions; we must say that the initial bytes
must be read as Latin-1 characters, say, to determine the encoding). We also
need to talk about pad bytes, if needed for some encodings.
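
  To make the "read the initial bytes as Latin-1" idea concrete, here is a
rough Python sketch. The function name, the 128-byte limit, the exact form
of Eq, and the set of characters allowed in an encoding name are all my
assumptions, not anything from the draft. It does not deal with pad bytes
at all, so it would miss a declaration written in a 16-bit encoding.

```python
import re

# Hypothetical pattern for the draft's Encoding Declaration PI. The forms of
# Eq (here: '=' with optional whitespace) and Encoding (here: letters, digits,
# '.', '_' and '-') are assumptions. A '?' before the closing '>' is tolerated.
_ENC_DECL = re.compile(
    r"""<\?XML\s+encoding\s*=\s*(?:'([A-Za-z0-9._-]+)'|"([A-Za-z0-9._-]+)")\s*\??>"""
)


def sniff_declared_encoding(data: bytes, limit: int = 128):
    """Return the encoding named at the very start of `data`, or None.

    The first `limit` bytes are decoded as Latin-1, which maps every byte
    value to exactly one character and therefore never fails, so the
    declaration can be read before the real encoding is known.
    """
    head = data[:limit].decode("latin-1")
    # match() anchors at position 0: the declaration must come before any
    # other text, markup, or white space.
    m = _ENC_DECL.match(head)
    if m is None:
        return None
    return m.group(1) or m.group(2)
```

  For example, sniff_declared_encoding(b'<?XML encoding="ISO-8859-1"><doc/>')
gives back the declared name, and a document that starts with anything else
gives None.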

    BUT these hacks must apply only if the encoding is not given in metadata.
Metadata could be the command-line, a catalog, or the MIME header. The old
TEI assumption that we've got just a lump of bytes is not so common any
more. By having the "hack" version of the spec as the fallback, we have an
easy way of handling things properly whether our users are smart:
   MIME headers rule!
or not-so-smart:
   "I've got a floppy disk, can you help me read the document?"
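
   The precedence I have in mind could be sketched like this in Python.
Everything here is illustrative: the function name, the idea of passing
the metadata candidates as a list (in whatever order the application
trusts them), and the optional `sniff` callback for the in-document
declaration are my assumptions. The UTF-8 default is the one from the
quoted draft.

```python
import codecs


def choose_encoding(data: bytes, metadata=(), sniff=None) -> str:
    """Pick an encoding: external metadata first, in-document hacks second.

    `metadata` is a sequence of encoding names (or None) taken from sources
    such as the command line, a catalog, or a MIME header. `sniff` is an
    optional function that reads an in-document declaration from the bytes.
    """
    # 1. Any encoding supplied by metadata wins outright; the byte order
    #    mark and the in-document declaration are not consulted at all.
    for name in metadata:
        if name:
            return name

    # 2. Fallback hacks, used only when no metadata is available.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    if sniff is not None:
        declared = sniff(data)
        if declared:
            return declared

    # 3. Neither a byte order mark nor a declaration: UTF-8.
    return "utf-8"
```

   So the floppy-disk user calls choose_encoding(data) and gets the
fallback behavior, while an HTTP client passes the charset it was told,
e.g. choose_encoding(data, metadata=["iso-8859-1"]).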

   I hate character set issues, but I have to agree with Gavin that
explicitly ignoring the main protocol of the Web is a loser, especially
when it has the potential for a nice solution of the problem.

    We can even explicitly define acceptable channels for metadata.
Ignoring the FEFF at the beginning should be required when the metadata is
present, as should be ignoring the <?XML encoding ...> hack. It should be
legitimate not to add this information at all, if transmitting over a
channel that can convey encoding information (like HTTP). A user's
save-to-disk option might well have to add on a yucky header of the sort
Gavin deplores (because most filesystems lack metadata, and probably will

    Is this sort of a compromise reasonable?

    -- David

RE delenda est.

David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
Received on Thursday, 17 October 1996 23:11:50 UTC
