- From: Martin J. Duerst <duerst@w3.org>
- Date: Wed, 03 Feb 1999 22:11:12 +0900
- To: medavis2@us.ibm.com
- Cc: Francois Yergeau <yergeau@alis.com>, Larry Masinter <masinter@parc.xerox.com>, Paul Hoffman / IMC <phoffman@imc.org>, MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, ietf-charsets@iana.org
At 13:21 99/02/02 -0800, medavis2@us.ibm.com wrote:

> Francois Yergeau <yergeau@alis.com> on 02/02/99 12:34:14 PM

> Since the problem with BOMs is their ambiguousness -- is it a real BOM or
> an intended ZWNBSP? -- I currently lean toward a "SHOULD NOT put a BOM"
> unless it's mandatory (such as in XML), in which case it is also
> unambiguous.

I lean more towards "MUST NOT". There is no requirement from XML
on "UTF-16BE" or "UTF-16LE".

> *** I disagree (if I understand you correctly).
>
> If we have the three labels, then as a sender my role is clear. If the text
> might come from a source that uses BOM (XML file, Windows file) send as
> UTF-16. If it doesn't (any other Unicode string!), then I will send
> UTF-16BE/LE (depending on the polarity).
>
> As a receiver, my role is also clear. If I receive UTF-16BE/LE, then any
> initial <FE,FF> is a real ZWNBSP. If I receive UTF-16, then any initial
> <FE,FF> is a BE BOM, any initial <FF,FE> is an LE BOM.

Exactly. If we have different labels, but they all more or less
mean the same thing, that doesn't make sense.

> Martin Dürst:
> >> We wouldn't have to change XML, only to add a clarification to
> >> say that "UTF-16" in the XML spec means only the case
> >> charset="UTF-16", and not the others.
>
> That doesn't work. The producer of an XML entity is not necessarily the
> MIME processor that will tag it, and may not know whether the entity will
> be tagged UTF-16 or UTF16(BE|LE). Does it put a BOM?

It puts a BOM or not depending on the environment it is in.
On a plain file system, I personally would put a BOM.
The MIME processor that sends things out should know the
environment, and should either use the appropriate tag
(i.e. just "UTF-16" if it's the file system above), or
apply its own policy and do the work needed for that
(e.g. strip off the BOM and add the appropriate tag,
"UTF-16BE" or "UTF-16LE").
The MIME processor has quite a few choices.
What's important is that it knows what it's dealing with,
on both sides. That's the same problem for all other charsets,
isn't it?

> *** Even if XML did not require a BOM, it would not be unambiguous! Look at
> Appendix F in
> http://www.xml.com/axml/target.html#sec-guessing. The file would just have
> to have the initial '<?xml' like all other encodings. To quote:
>
> "Because each XML entity not in UTF-8 or UTF-16 format must begin with an
> XML encoding declaration, in which the first characters must be '<?xml',
> any conforming processor can detect, after two to four octets of input,
> which of the following cases apply. In reading this list, it may help to
> know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the
> Byte Order Mark required of UTF-16 data streams is "#xFEFF".
>
> ...
> 00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark (and thus, strictly
> speaking, in error)
> 3C 00 3F 00: UTF-16, little-endian, no Byte Order Mark (and thus, strictly
> speaking, in error)
> ..."

Yes, but it's not in error if there is an external label,
because the external label has precedence.

Regards,   Martin.

#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org
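
For illustration only, the receiver-side rule discussed in this message could be sketched roughly as below. This is a minimal sketch, not part of any specification: the function names (decode_labelled_utf16, sniff_xml_utf16, effective_label) are hypothetical, and rejecting BOM-less data labelled plain "UTF-16" is an assumption, since the thread does not settle that case.

```python
import codecs
from typing import Optional


def decode_labelled_utf16(label: str, data: bytes) -> str:
    """Decode octets according to the three-label scheme discussed above.

    Hypothetical helper, for illustration only.
    """
    label = label.upper()
    if label == "UTF-16BE":
        # No BOM is expected, so an initial FE FF is kept as a real ZWNBSP.
        return data.decode("utf-16-be")
    if label == "UTF-16LE":
        # Same rule for little-endian: an initial FF FE stays in the text.
        return data.decode("utf-16-le")
    if label == "UTF-16":
        # Under the plain label, the initial octets are a BOM and are dropped.
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode("utf-16-be")
        if data.startswith(codecs.BOM_UTF16_LE):
            return data[2:].decode("utf-16-le")
        # Assumption: the thread leaves this case open, so reject it here.
        raise ValueError("data labelled UTF-16 has no BOM")
    raise ValueError("unsupported label: " + label)


def sniff_xml_utf16(data: bytes) -> Optional[str]:
    """Guess byte order of a BOM-less XML entity from its first four octets."""
    head = data[:4]
    if head == b"\x00\x3c\x00\x3f":   # '<' '?' in big-endian order
        return "UTF-16BE"
    if head == b"\x3c\x00\x3f\x00":   # '<' '?' in little-endian order
        return "UTF-16LE"
    return None


def effective_label(external_label: Optional[str], data: bytes) -> Optional[str]:
    """The external label, when present, takes precedence over sniffing."""
    return external_label or sniff_xml_utf16(data)
```

With this sketch, the same octets FE FF 00 41 decode to "A" when labelled "UTF-16", but to ZWNBSP followed by "A" when labelled "UTF-16BE", which is exactly why the labels must not all mean the same thing.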
Received on Wednesday, 3 February 1999 13:46:49 UTC