RE: Byte order marks when parse="text" from Jonathan Marsh on 2002-01-09 (www-xml-xinclude-comments@w3.org from January 2002)

From: Jonathan Marsh <jmarsh@microsoft.com>
Date: Wed, 9 Jan 2002 10:49:23 -0800
To: "Elliotte Rusty Harold" <elharo@metalab.unc.edu>
Cc: <www-xml-xinclude-comments@w3.org>
Message-ID: <330564469BFEC046B84E591EB3D4D59C049AC233@red-msg-08.redmond.corp.microsoft.com>

The next draft will make it clear that BOMs are not included, but all
other characters are.  We note that your assertion that a BOM is not
permitted in an XML document is false - a BOM is U+FEFF, not U+FFFE.  We
keep the prohibition on illegal characters.
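
A minimal sketch of the rule described above, for illustration only and
not text from the draft: it decodes a resource fetched for parse="text",
drops a single leading BOM (U+FEFF), keeps every other character, and
raises an error on any character outside the XML 1.0 Char production
(which excludes U+FFFE). The function name text_for_inclusion and its
encoding parameter are hypothetical, and the sketch assumes the caller
already knows the resource's encoding.

```python
import re

BOM = "\ufeff"  # the byte order mark is U+FEFF, not U+FFFE

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def text_for_inclusion(data: bytes, encoding: str = "utf-8") -> str:
    """Return the characters to include for parse="text" (hypothetical helper)."""
    text = data.decode(encoding)
    # Drop only an initial BOM; a U+FEFF elsewhere is an ordinary character.
    if text.startswith(BOM):
        text = text[1:]
    # All other characters are included, so an illegal one is an error
    # rather than something to skip silently.
    match = _ILLEGAL_XML_CHAR.search(text)
    if match:
        raise ValueError(
            "character U+%04X at offset %d is not permitted in XML"
            % (ord(match.group()), match.start())
        )
    return text
```

With this sketch, a UTF-8 resource that starts with the bytes EF BB BF
yields its text without the BOM, while a resource containing a vertical
tab or U+FFFE is rejected.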

Thanks,
Jonathan Marsh

> -----Original Message-----
> From: Elliotte Rusty Harold [mailto:elharo@metalab.unc.edu]
> Sent: Sunday, September 02, 2001 7:31 AM
> To: www-xml-xinclude-comments@w3.org
> Subject: Byte order marks when parse="text"
> 
> This message addresses the question of what to do with a byte order
> mark that is present at the start of a Unicode text document of any
> kind, XML or otherwise that is included with <xinclude:include
> href="doc.txt" parse="text"/>.
> 
> The first sentence in 4.3 states, "When parse='text', the include
> location is dereferenced and the resource is fetched. This resource is
> treated as plain text and converted to a set of character information
> items without attempting to parse the resource as XML." However, what
> if the first character in plain text is a byte order mark? In this
> case, "Characters that are not permitted in XML documents also are an
> error." Thus including any document beginning with a byte order mark
> would produce an error. This is clearly a bad thing.
> 
> It might be argued that the first sentence does not require that *all*
> characters in the included document must be converted to character
> information items. However, if this interpretation is allowed, then
> there's nothing except the implementer's good sense to say which
> characters to convert. For instance, an implementer could plausibly
> decide to ignore all non-XML-legal text characters such as vertical tab
> and form feed rather than raising an error.
> 
> I suggest this be rewritten to make it explicit that all characters in
> the included text document except an initial byte order mark must be
> included and that an initial byte order mark must be deleted before
> inclusion.
> --
> 
> +-----------------------+------------------------+-------------------+
> | Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
> +-----------------------+------------------------+-------------------+
> |          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
> |              http://www.ibiblio.org/xml/books/bible2/              |
> |   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
> +----------------------------------+---------------------------------+
> |  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
> |  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |
> +----------------------------------+---------------------------------+

Received on Wednesday, 9 January 2002 14:40:09 UTC