Re: BOM clarification

> I note in passing that while we think that empirically the unexpected
> appearance of BOMs only occurs in UTF8 data streams, I think that our
> rule can be more general:  if a BOM appears as the first character in
> any data stream, it is either definitely (in the case of an input
> grammar) or almost certainly (in the case of an input string) not
> intended as data and better ignored -- that holds true for any
> encoding including UTF-16 not just UTF-8.  (It's Norm's action to draft
> this, not mine, so this is just a suggestion.)

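I believe that the only way for a BOM to appear at the beginning of a
UTF-16 encoded string is for the UTF-16 BOM to be followed by
*another* U+FEFF character: the decoder consumes the first one as the
byte-order signature, so only the second reaches the processor as a
character. In that case, I think it would be an error to ignore it.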

I think a processor is only licensed to ignore a BOM at the beginning of
an input string if it believes that the input is UTF-8 encoded.
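
To make the distinction concrete, here is a rough sketch in Python. It
is only an illustration, not proposed spec text: the function name
strip_leading_bom and its source_encoding parameter are hypothetical,
and it assumes the processor knows (or believes it knows) the encoding
the string was decoded from.

```python
import codecs

BOM = "\ufeff"

def strip_leading_bom(text: str, source_encoding: str) -> str:
    """Drop a leading U+FEFF only when the input is believed to be UTF-8.

    Hypothetical helper: 'source_encoding' is whatever the processor
    believes the original encoding of the input string was.
    """
    # Normalise spellings such as "UTF8" or "utf_8" to the codec name "utf-8".
    if codecs.lookup(source_encoding).name == "utf-8" and text.startswith(BOM):
        return text[1:]
    return text

# UTF-8: Python's plain "utf-8" codec leaves the signature in the string,
# so the processor sees U+FEFF as the first character.
utf8_text = (b"\xef\xbb\xbf" + b"abc").decode("utf-8")

# UTF-16: the "utf-16" codec consumes the first BOM to pick the byte
# order, so a leading U+FEFF here means the data contained a second one.
utf16_text = (b"\xff\xfe" + "\ufeffabc".encode("utf-16-le")).decode("utf-16")

utf8_result = strip_leading_bom(utf8_text, "UTF-8")
utf16_result = strip_leading_bom(utf16_text, "UTF-16")
```

In the UTF-8 case the leading U+FEFF is dropped; in the UTF-16 case it
is kept, because it can only be there as data.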

Hopefully my proposed wording is clear (enough).

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica

Received on Tuesday, 9 May 2023 15:48:42 UTC