- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Tue, 09 May 2023 17:31:22 -0600
- To: Norm Tovey-Walsh <norm@saxonica.com>
- Cc: public-ixml@w3.org
OK. You have more experience with this than I do, so I will defer to your judgement.

I am only thinking that we are promulgating a rule which we believe is made necessary because some libraries for writing and reading character streams are doing the wrong thing (or, I guess, more pedantically - the libraries that write the character stream and those that read it are not singing from the same page), and the result is that some ixml processors are seeing BOMs when we think they shouldn't.

Under those circumstances, I worry that it might be over-optimistic to assume that the only circumstances in which libraries will do this wrong thing are the ones we have observed. (In particular, will UTF16 readers or writers never be broken? If Microsoft bends its collective brainpower towards inventing another new way to save files that will cause problems for Java file IO libraries, who knows what might happen?)

But sufficient unto the day is the evil thereof.

Michael

Norm Tovey-Walsh <norm@saxonica.com> writes:

> [[PGP Signed Part:Undecided]]
>> I note in passing that while we think that empirically the unexpected
>> appearance of BOMs only occurs in UTF8 data streams, I think that our
>> rule can be more general: if a BOM appears as the first character in
>> any data stream, it is either definitely (in the case of an input
>> grammar) or almost certainly (in the case of an input string) not
>> intended as data and better ignored -- that holds true for any
>> encoding including UTF-16 not just UTF-8. (It's Norm's action to draft
>> this, not mine, so this is just a suggestion.)
>
> I believe that the only way for a BOM to appear at the beginning of a
> UTF-16 encoded string would be if the UTF-16 BOM was followed by
> *another* U+FEFF character. In this case, I think it would be an error
> to ignore it.
>
> I think a processor is only licensed to ignore a BOM at the beginning of
> an input string if it believes that the input is UTF-8 encoded.
>
> Hopefully my proposed wording is clear (enough).
>
> Be seeing you,
> norm

--
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
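
For illustration only, here is a minimal Python sketch of the two positions discussed above. It is not taken from the ixml specification or from any processor. The helper name `strip_leading_bom` and its `utf8_only` flag are hypothetical: with `utf8_only=True` it follows the narrower rule (ignore a leading U+FEFF only when the input is believed to be UTF-8), and with `utf8_only=False` it follows the more general suggestion (ignore a leading U+FEFF in any data stream). The sketch assumes the input has already been decoded to a string and that the caller knows which encoding was used.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the byte order mark as a decoded character


def strip_leading_bom(text: str, encoding: str, utf8_only: bool = True) -> str:
    """Hypothetical helper: drop one leading U+FEFF from already-decoded text.

    utf8_only=True  -> only strip when the declared encoding is UTF-8.
    utf8_only=False -> strip a leading U+FEFF whatever the encoding was.
    """
    if not text.startswith(BOM):
        return text
    # Normalise spellings such as "UTF8" or "utf_8" to the codec's canonical name.
    is_utf8 = codecs.lookup(encoding).name == "utf-8"
    if utf8_only and not is_utf8:
        return text
    return text[1:]


# Python's plain "utf-8" codec keeps a leading BOM as U+FEFF in the result.
utf8_decoded = b"\xef\xbb\xbfS: 'a'.".decode("utf-8")

# Python's "utf-16" codec consumes the encoding's own BOM, so a U+FEFF that
# survives decoding was a second U+FEFF that followed the BOM in the stream.
utf16_decoded = (BOM + "x").encode("utf-16").decode("utf-16")

narrow_utf8 = strip_leading_bom(utf8_decoded, "UTF-8")
narrow_utf16 = strip_leading_bom(utf16_decoded, "utf-16")
general_utf16 = strip_leading_bom(utf16_decoded, "utf-16", utf8_only=False)
```

Under the narrow rule the UTF-16 string keeps its leading U+FEFF, on the reasoning that it must have been written deliberately; under the general rule it is dropped. Other languages' I/O libraries may decode differently from Python's codecs, which is the concern raised in the message.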
Received on Tuesday, 9 May 2023 23:57:34 UTC