Re: BOMs from John Lumley on 2023-04-12 (public-ixml@w3.org from April 2023)

From: John Lumley <john@saxonica.com>
Date: Wed, 12 Apr 2023 17:26:06 +0100
To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
Cc: Norm Tovey-Walsh <norm@saxonica.com>, Steven Pemberton <steven.pemberton@cwi.nl>, public-ixml@w3.org
Message-Id: <AD6A102F-33EB-4CDC-BDA6-FEAB7F184FD1@saxonica.com>

Given that my browser-based processor handled the BOM, in both input and grammar files, with no special cases, perhaps it is part of the ‘implementation framework/platform/environment’ responsibility.
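
As a rough illustration of what "handled by the platform" could look like outside a browser, here is a minimal sketch in Python. The language choice is an assumption (the thread names none), and `decode_ignoring_bom` is a hypothetical helper, not part of any ixml processor. It relies on Python's standard `utf-8-sig` codec, which drops one leading BOM on decoding and leaves any later U+FEFF alone.

```python
def decode_ignoring_bom(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a single leading BOM if one is present.

    Hypothetical helper for illustration: the same call would serve for
    both a grammar file and an input file before either reaches the parser.
    """
    # "utf-8-sig" strips EF BB BF only at the very start of the data;
    # a U+FEFF anywhere else is passed through unchanged.
    return raw.decode("utf-8-sig")


if __name__ == "__main__":
    with_bom = b"\xef\xbb\xbf" + "ixml: s.".encode("utf-8")
    without_bom = "ixml: s.".encode("utf-8")
    print(decode_ignoring_bom(with_bom) == decode_ignoring_bom(without_bom))
```

This only covers text inputs known to be UTF-8; a grammar meant for binary inputs, where those three bytes are not a BOM, would need the raw bytes instead.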

John Lumley 


> On 12 Apr 2023, at 17:21, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
> 
> Er, if UTF-8 files created on Windows will very often have BOMs, then
> it's presumably not enough to have the BOM ignored at the beginning of a
> grammar; it will be necessary to ignore it at the beginning of
> every input as well.
> 
> I would rather find some way of saying that this is handled by
> lower-level systems and is invisible to ixml -- a bit like whether the
> machine is big- or little-endian.
> 
> If an implementation's I/O routines don't handle BOMs, then surely an
> implementor can work around that with an ad hoc routine when opening a
> stream?
> 
> Presumably I'm missing something.  What is it?
> 
> Michael
> 
> 
> Norm Tovey-Walsh <norm@saxonica.com> writes:
> 
>>> We could change the ixml grammar to start:
>>> 
>>>    ixml: BOM?, s, prolog?, rule++RS, s.
>>>    -BOM: -#FEFF.
>>> 
>>> so that the processor doesn't complain about them, but I'm less sure
>>> what to do about input.
>> 
>> I think we should say that if the grammar is in UTF-8 and begins with a
>> BOM, the BOM must be ignored by the processor. I don’t see any reason to
>> surface this wart in the grammar.
>> 
> >>> My current feeling is that we should warn users that, if their inputs
> >>> are likely to start with a BOM, they should add it to their grammar,
> >>> and that we don't automatically ignore it.
>> 
>> If I understood the Slack discussion, it’s very hard to tell Windows
>> *not* to put the BOM on the front of UTF-8 files, so anyone using
>> Windows is going to have this problem. That means everyone who writes a
>> grammar is going to end up putting the “ignore BOM” wart on the front of
>> it. That strikes me as even worse than putting it in our grammar.
>> 
>> The only reservation I have about saying a processor must ignore the BOM
>> on inputs is that there’s nothing preventing someone from writing a
>> grammar to parse binary inputs where that sequence isn’t a BOM.
>> 
> >> But that seems like something that’s only going to affect the tiniest
>> minority of users, unlike the BOM thing which becomes everyone’s problem
>> as soon as iXML has enough regular users on Windows.
>> 
>>                                        Be seeing you,
>>                                          norm
> 
> 
> -- 
> C. M. Sperberg-McQueen
> Black Mesa Technologies LLC
> http://blackmesatech.com
>

Received on Wednesday, 12 April 2023 16:26:16 UTC