BOMs

On Slack there is a user who has been bitten by BOMs.

A BOM is an optional byte order mark (#FEFF) at the start of UTF-16 and 
UTF-32 files that indicates the byte order used.

The Unicode Standard permits the BOM in UTF-8, but does not require or 
recommend its use; byte order has no meaning in UTF-8. The IETF recommends 
that if a protocol always uses UTF-8, then it "SHOULD forbid use of U+FEFF 
as a signature."
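
As an illustration only (Python, not part of ixml), this is roughly what 
U+FEFF looks like as bytes in each encoding form. The constants come from 
Python's standard codecs module; the name BOM_BYTES is made up for this 
sketch.

```python
import codecs

# U+FEFF serialised in each encoding form, using the constants that
# Python's standard codecs module provides. BOM_BYTES is a hypothetical
# name chosen for this example.
BOM_BYTES = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16BE": codecs.BOM_UTF16_BE,
    "UTF-16LE": codecs.BOM_UTF16_LE,
    "UTF-32BE": codecs.BOM_UTF32_BE,
    "UTF-32LE": codecs.BOM_UTF32_LE,
}

if __name__ == "__main__":
    for name, bom in BOM_BYTES.items():
        # Show each byte sequence in hex, one byte per group.
        print(f"{name:9} {bom.hex(' ').upper()}")
```

In UTF-16 and UTF-32 the two orderings give different byte sequences, 
which is what lets a reader detect the byte order; in UTF-8 there is only 
one possible sequence, so it can only act as a signature.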

We could change the ixml grammar to start:

 ixml: BOM?, s, prolog?, rule++RS, s.
 -BOM: -#FEFF.

so that the processor doesn't complain about them, but I'm less sure what 
to do about input.

My current feeling is that we should warn users that we don't 
automatically ignore BOMs, and that if their inputs are likely to start 
with one, they should allow for it in their grammar.
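
For users who want to check their inputs first, a minimal sketch in 
Python might look like the following. It is only an illustration, not 
something a processor does: the helper names starts_with_bom and 
strip_bom are hypothetical, and it assumes the input has already been 
decoded to a string.

```python
BOM = "\ufeff"  # U+FEFF, the character a grammar would write as #FEFF


def starts_with_bom(text: str) -> bool:
    """Report whether decoded input begins with a BOM (hypothetical helper)."""
    return text.startswith(BOM)


def strip_bom(text: str) -> str:
    """Return the input without a single leading BOM, if one is present."""
    return text[1:] if text.startswith(BOM) else text


if __name__ == "__main__":
    # Python's plain "utf-8" codec keeps the BOM as a leading U+FEFF,
    # so the parser would see it as the first character of the input.
    raw = b"\xef\xbb\xbfhello"
    decoded = raw.decode("utf-8")
    print(starts_with_bom(decoded))
    print(strip_bom(decoded))
```

If starts_with_bom reports true for typical inputs, that is the case where 
the grammar would need an optional #FEFF at the start; stripping the 
character before parsing is an alternative that leaves the grammar alone.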

Steven

Received on Wednesday, 12 April 2023 08:47:20 UTC