- From: Francois Yergeau <yergeau@alis.com>
- Date: Tue, 02 Feb 1999 22:14:16 -0500
- To: medavis2@us.ibm.com
- Cc: Larry Masinter <masinter@parc.xerox.com>, "Martin J. Duerst" <duerst@w3.org>, Paul Hoffman / IMC <phoffman@imc.org>, MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, ietf-charsets@iana.org
At 13:21 02/02/99 -0800, medavis2@us.ibm.com wrote:
>Francois Yergeau <yergeau@alis.com> on 02/02/99 12:34:14 PM
>>Since the problem with BOMs is their ambiguousness -- is it a real BOM or
>>an intended ZWNBSP? -- I currently lean toward a "SHOULD NOT put a BOM"
>>unless it's mandatory (such as in XML), in which case it is also
>>unambiguous.
>
>*** I disagree (if I understand you correctly).
>
>If we have the three labels, then as a sender my role is clear. If the text
>might come from a source that uses BOM (XML file, Windows file) send as
>UTF-16.

Two problems here:

1) Use UTF-16 only if you *know* that the source has a BOM or that it's BE;
otherwise you might send LE without a BOM. "Might come" is not enough: not
every Windows file will have a BOM, it depends on the creating application.

2) Are we sure that we want to drop the more specific BE|LE tags, and
thereby force the receiver to peek inside the object, just because we know
there is a BOM? I'm not convinced yet.

> If it doesn't (any other Unicode string!), then I will send
>UTF-16BE/LE (depending on the polarity).

And perhaps generate illegal stuff, because there turns out to be a BOM
after all. I'm questioning the enforceability of forbidding BOMs. Let's not
make a rule that cannot be obeyed by reasonable implementations in most
cases.

>As a receiver, my role is also clear. If I receive UTF-16BE/LE, then any
>initial <FE,FF> is a real ZWNBSP.

...is purported to be a real ZWNBSP.

> If I receive UTF-16, then any initial
><FE,FF> is a BE BOM, any initial <FF,FE> is an LE BOM.

No argument here.

>Let's face it--the BOM is a hack designed to work with systems where text
>streams are untagged. And unfortunately, it also has an equally valid other
>semantic (a price of the merger with 10646, since SC2 objected to having a
>character with only the semantic of the BOM.)

Sigh. Too true. Things would be much simpler if a BOM could only be a BOM.
Too late.

Facing this, there doesn't seem to be any ideal solution, guaranteed to
always work. We have to decide on the best compromise. I think it's
slightly better not to forbid BOMs completely with the BE/LE tags, but I'm
willing to go for forbidding them if that's the consensus. At this point,
the most important thing is to register those $%?&* tags in a workable
manner and make UTF-16 usable on the Internet.

One last thing:

>*** Even if XML did not require a BOM, it would not be unambiguous! Look at
>Appendix F in http://www.xml.com/axml/target.html#sec-guessing. The file
>would just have to have the initial '<?xml' like all other encodings.

Not necessarily. The initial '<?xml' is not required for UTF-16 (and UTF-8)
XML entities; such entities may begin with a comment, white space or an
element start tag. It's the mandated BOM that makes UTF-16 entities
auto-detectable. Requiring the initial '<?xml' would be an incompatible
change to the XML spec, invalidating existing data and applications.

--
François
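[A minimal sketch of the receiver-side rules discussed above, not part of the original message; the function name, labels handling and return strings are illustrative only.]

    # Sketch: how a receiver might interpret the first two octets of a
    # UTF-16 entity, given the charset label it arrived with.
    def interpret_leading_bytes(label, data):
        if len(data) < 2:
            return "too short to tell"
        head = data[:2]

        if label in ("UTF-16BE", "UTF-16LE"):
            # Byte order is fixed by the label, so an initial U+FEFF is
            # (purportedly) a real ZERO WIDTH NO-BREAK SPACE, not a BOM.
            if (label == "UTF-16BE" and head == b"\xfe\xff") or \
               (label == "UTF-16LE" and head == b"\xff\xfe"):
                return "ZWNBSP (content character)"
            return "no leading U+FEFF"

        if label == "UTF-16":
            # Byte order is not fixed, so an initial FE FF / FF FE is a BOM.
            if head == b"\xfe\xff":
                return "big-endian BOM"
            if head == b"\xff\xfe":
                return "little-endian BOM"
            return "no BOM; receiver must guess or assume a byte order"

        return "not a UTF-16 label"

    # Example: an XML entity sent as plain UTF-16 with its mandated BOM.
    print(interpret_leading_bytes("UTF-16",
                                  b"\xfe\xff\x00<\x00?\x00x\x00m\x00l"))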