Re: BOM clarification from C. M. Sperberg-McQueen on 2023-05-09 (public-ixml@w3.org from May 2023)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Tue, 09 May 2023 17:31:22 -0600
To: Norm Tovey-Walsh <norm@saxonica.com>
Cc: public-ixml@w3.org
Message-ID: <87bkithxct.fsf@blackmesatech.com>

OK.  You have more experience with this than I do, so I will defer to
your judgement.

I am only thinking that we are promulgating a rule which we believe is
made necessary because some libraries for writing and reading character
streams are doing the wrong thing (or, I guess, more pedantically - the
libraries that write the character stream and those that read it are not
singing from the same page), and the result is that some ixml processors
are seeing BOMs when we think they shouldn't.
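
A minimal sketch of the mismatch described above, using Python's standard codecs as a stand-in for the libraries in question (the grammar text and variable names are hypothetical, chosen only for illustration). The writer prepends a BOM; a reader that decodes as plain UTF-8 passes it through as a character, while a reader that expects an optional BOM removes it.

```python
# Hypothetical one-rule grammar; its content does not matter here.
grammar = "S: 'a'."

# A writer that prepends a BOM (Python's "utf-8-sig" codec does this).
written = grammar.encode("utf-8-sig")

# A reader that decodes as plain UTF-8 keeps the BOM as the first character.
naive = written.decode("utf-8")

# A reader that expects an optional BOM strips it during decoding.
aware = written.decode("utf-8-sig")

print(written[:3] == b"\xef\xbb\xbf")  # the UTF-8 encoding of U+FEFF
print(naive.startswith("\ufeff"))      # the processor would see a BOM
print(aware == grammar)                # the processor would not
```

All three lines print True: the bytes are the same in both cases, and only the reader's assumption differs.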

Under those circumstances, I worry that it might be over-optimistic to
assume that the only circumstances in which libraries will do this wrong
thing are the ones we have observed.  (In particular, will UTF-16 readers
or writers never be broken?  If Microsoft bends its collective
brainpower towards inventing another new way to save files that will
cause problems for Java file IO libraries, who knows what might happen?)

But sufficient unto the day is the evil thereof.  

Michael

Norm Tovey-Walsh <norm@saxonica.com> writes:

>> I note in passing that while we think that empirically the unexpected
>> appearance of BOMs only occurs in UTF-8 data streams, I think that our
>> rule can be more general:  if a BOM appears as the first character in
>> any data stream, it is either definitely (in the case of an input
>> grammar) or almost certainly (in the case of an input string) not
>> intended as data and better ignored -- that holds true for any
>> encoding including UTF-16 not just UTF-8.  (It's Norm's action to draft
>> this, not mine, so this is just a suggestion.)
>
> I believe that the only way for a BOM to appear at the beginning of a
> UTF-16 encoded string would be if the UTF-16 BOM was followed by
> *another* U+FEFF character. In this case, I think it would be an error
> to ignore it.
>
> I think a processor is only licensed to ignore a BOM at the beginning of
> an input string if it believes that the input is UTF-8 encoded.
>
> Hopefully my proposed wording is clear (enough).
>
>                                         Be seeing you,
>                                           norm
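
As a rough illustration of the rule in the quoted text (not the proposed wording itself), the sketch below strips a leading U+FEFF only when the input is believed to be UTF-8. The function name strip_leading_bom and its encoding parameter are hypothetical. The UTF-16 case relies on Python's "utf-16" codec, which writes its own BOM on encoding and consumes it on decoding, so any U+FEFF still present after decoding was a second one and is treated as data.

```python
BOM = "\ufeff"

def strip_leading_bom(text: str, encoding: str) -> str:
    """Drop one leading U+FEFF only if the input is believed to be UTF-8."""
    normalized = encoding.lower().replace("_", "-")
    if normalized in ("utf-8", "utf8") and text.startswith(BOM):
        return text[1:]
    return text

# UTF-8 input whose reader left the BOM in place: the BOM is ignored.
print(strip_leading_bom(BOM + "abc", "UTF-8") == "abc")

# UTF-16 input: the encoder writes a BOM, then encodes the U+FEFF that is
# part of the text; the decoder consumes only the first of the two.
raw = (BOM + "abc").encode("utf-16")
decoded = raw.decode("utf-16")
print(decoded == BOM + "abc")

# Under the rule, the surviving U+FEFF is kept as data.
print(strip_leading_bom(decoded, "utf-16") == BOM + "abc")
```

Each line prints True. Whether a processor can reliably know which encoding it was given is the open question raised earlier in this message; the sketch simply takes the encoding as a parameter.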

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Tuesday, 9 May 2023 23:57:34 UTC