- From: David G. Durand <dgd@cs.bu.edu>
- Date: Thu, 17 Oct 1996 23:16:11 -0400
- To: w3c-sgml-wg@w3.org
At 10:04 PM 10/17/96, Gavin Nicol wrote: >>XML does not explicitly sanction the use of any other encodings. It is >>recognized, however, that many documents exist in other encodings. To >>support processors in dealing with this situation, an XML document may >>contain at its beginning, before any other text, markup, PIs, or white >>space, an Encoding Declaration PI matching >> >>EncDecl ::= >> '<?XML' S 'encoding' Eq ("'" Encoding "'")|('"' Encoding '"') S? '>' >> >>An XML processor may choose to read Encoding Declaration PIs and accept >>nonstandard encodings so declared. In validating processors such >>behavior must be at user option. >... >>An XML document which lacks both the Byte Order Mark and an Encoding >>Declaration PI must be in the UTF-8 encoding. It is an error for a >>document to be in an encoding other than that declared in its Encoding >>Declaration PI. > >This CANNOT be REQUIRED behaviour. This is a gross hack!!! I also cannot >condone the clause "does not explicitly sanction". > >Seems to me like here is another mailing list that I've wasted a lot >of time on... Well, I hope you won't leave, until we've all decided to stop wasting time! I agree with Gavin however that this is not good behavior: The most likely transport method for XML will be HTTP or one of it's replacements. HTTP _has_ proper metatdat facilities. If the MIME stuff isn't good enough, we can define our own MIME header for character encodings. I don't have a problem with allowing hacks to determine the character set (but we can't use character definitions we must say that the initial bytes must be read as Latin-1 characters, say, to determine the encoding. We also need to talk about pad bytes, if needed for some encodings. BUT these hacks must apply if the encoding is not given in metadata. Metadata could be the command-line, a catalog, or the MIME header. The old TEI assumption that we've got just a lump of bytes is not so common any more. By having the "hack" version of the spec as the fallback, we have an easy way of handling things properly whether are users are smart: MIME headers rule! or not-so-smart: "I've got a floppy disk, can you help me read the document?" I hate character set issues, but I have to agree with Gavin that explicitly ignoring the main protocol of the Web is a loser, especially when it has the potential for a nice solution of the problem. We can even explicitly define acceptable channels for metadata. Ignoring the FEFF at the beginning should be required when the metadata is present, as should be ignoring the <?XML encoding ...> hack. It should be legitimitate not to add this information at all, if transmitting over a channel that can convey encoding information (like HTTP). A user's save-to-disk option might well have to add on a yucky header of the sort Gavin deplores (because most filesytems lack metadata, and probably will forever). Is this sort of a compromise reasonable? -- David RE delenda est. _________________________________________ David Durand dgd@cs.bu.edu \ david@dynamicDiagrams.com Boston University Computer Science \ Sr. Analyst http://www.cs.bu.edu/students/grads/dgd/ \ Dynamic Diagrams --------------------------------------------\ http://dynamicDiagrams.com/ MAPA: mapping for the WWW \__________________________ http://www.dynamicdiagrams.com/services_map_main.html
Received on Thursday, 17 October 1996 23:11:50 UTC