Re: BOM clarification from Norm Tovey-Walsh on 2023-05-09 (public-ixml@w3.org from May 2023)

From: Norm Tovey-Walsh <norm@saxonica.com>
Date: Tue, 09 May 2023 16:45:20 +0100
To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
Cc: public-ixml@w3.org
Message-ID: <m2zg6dmrov.fsf@saxonica.com>

> I note in passing that while we think that empirically the unexpected
> appearance of BOMs only occurs in UTF-8 data streams, I think that our
> rule can be more general:  if a BOM appears as the first character in
> any data stream, it is either definitely (in the case of an input
> grammar) or almost certainly (in the case of an input string) not
> intended as data and better ignored -- that holds true for any
> encoding including UTF-16 not just UTF-8.  (It's Norm's action to draft
> this, not mine, so this is just a suggestion.)

I believe that the only way for a BOM to appear as the first character
of a decoded UTF-16 string is for the UTF-16 BOM itself to be followed
by *another* U+FEFF character. In that case, I think it would be an
error to ignore it.

I think a processor is only licensed to ignore a BOM at the beginning of
an input string if it believes that the input is UTF-8 encoded.
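
To make the distinction concrete, here is a rough sketch in Python. It
is only an illustration, not part of the proposed wording: the helper
name decode_input is hypothetical, and it assumes the processor already
knows (or believes it knows) the encoding of its input. It relies on
the fact that Python's "utf-16" codec consumes the byte order mark
itself, while the plain "utf-8" codec leaves a leading U+FEFF in the
decoded text.

```python
import codecs

BOM = "\ufeff"


def decode_input(data: bytes, encoding: str) -> str:
    """Decode input bytes, dropping a leading U+FEFF only for UTF-8 input.

    Hypothetical helper for illustration; not taken from any ixml processor.
    """
    text = data.decode(encoding)
    # Normalise aliases such as "utf8" or "UTF-8" to the codec's canonical name.
    if codecs.lookup(encoding).name == "utf-8" and text.startswith(BOM):
        # In UTF-8 a leading U+FEFF is almost certainly a stray signature.
        return text[1:]
    # For UTF-16 the codec has already consumed the real BOM, so any U+FEFF
    # still at the start of the text was a second one and is kept as data.
    return text


# UTF-8 input with a signature: the U+FEFF is ignored.
utf8_input = codecs.BOM_UTF8 + "a".encode("utf-8")
# UTF-16 input whose BOM is followed by another U+FEFF: the second one is kept.
utf16_input = codecs.BOM_UTF16_LE + "\ufeffa".encode("utf-16-le")

utf8_result = decode_input(utf8_input, "utf-8")
utf16_result = decode_input(utf16_input, "utf-16")
```

With this sketch, utf8_result has no leading U+FEFF, while utf16_result
still begins with one, because only the first U+FEFF in the UTF-16
stream served as the byte order mark.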

Hopefully my proposed wording is clear (enough).

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica

Received on Tuesday, 9 May 2023 15:48:42 UTC