- From: Richard Ishida <ishida@w3.org>
- Date: Mon, 23 Feb 2004 11:18:03 -0000
- To: "'Bjoern Hoehrmann'" <derhoermi@gmx.net>, "'Martin Duerst'" <duerst@w3.org>
- Cc: <public-i18n-geo@w3.org>
I think this is the same thing Deborah was talking about wrt BBC pages. RI > -----Original Message----- > From: public-i18n-geo-request@w3.org > [mailto:public-i18n-geo-request@w3.org] On Behalf Of Bjoern Hoehrmann > Sent: 23 February 2004 04:37 > To: Martin Duerst > Cc: public-i18n-geo@w3.org > Subject: BOM vs. file inclusions (was: Re: Hexdump web service) > > > > * Martin Duerst wrote: > >> It is nowadays quite common that authors have trouble with some > >>encoding issues, literal U+00A0 characters in web pages, multiple > >>Unicode signatures due to server side file inclusion mechanisms, > > > >This one sounds scary. But I assume you haven't just made it up. Can > >you tell us more, e.g. with examples? We are still thinking > about what > >exactly to do with the BOM in various cases, and so input in > this area > >is quite appreciated. > > Sure, just have a look at > > > http://www.w3.org/TR/2004/WD-xhtml-modularization-20040218/sch ema_module_defs.html#a_module_XHTML_Common_Attri> bute_Definitions > > The source of the document contains > > [...] > <pre class="dtd"> > <?xml version="1.0" encoding="UTF-8"?> > <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" > targetNamespace="http://www.w3.org/1999/xhtml" > xmlns:xml="http://www.w3.org/XML/1998/namespace" > xmlns="http://www.w3.org/1999/xhtml"> > [...] > >  is a UTF-8 encoded BOM when interpreted as > ISO-8859-1 and then escaped. A slightly different case is > reported as bug in the CSS Validator, > > > http://www.w3.org/mid/5.2.0.9.0.20040220172527.00b23f28@mail.b tinternet.com A simplified (and stable) test case is at http://www.websitedev.de/markup/validator/tests/double-utf-8-bom.html The document starts with the octet sequence EF BB BF EF BB BF which are two UTF-8 encoded U+FEFFs. This is likely the result of file inclusion. For example, there might be a header.inc like <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html lang="EN" xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" /> and a document.php like <?php include "header.inc" ;?> <title>...</title> </head> <body> <p>...</p> </body> </html> If now both header.inc and document.php start with a BOM you are in trouble, the include() function operates at the binary level, it makes no attempt to determine the encoding of the document or any effort of transcoding, it just copies the octets 1:1 into the ouput, hence the EF BB BF EF BB BF sequence. This ended up as bug report for the CSS Validator because the MarkUp Validator http://www.websitedev.de/markup/validator/tests/double-utf-8-bom.html,valida te seems to happily ignore it and considers my test document Valid XHTML 1.0 Strict... I am not sure how common this might be, but there were at least two similar bug reports for the CSS Validator last year...
Received on Monday, 23 February 2004 06:18:04 UTC