- From: Michael Adams <linux_mike@paradise.net.nz>
- Date: Tue, 11 Dec 2007 08:21:04 +1300
- To: www-validator@w3.org
On Mon, 10 Dec 2007 10:51:44 +0200
Jukka K. Korpela wrote:

> David Dorward wrote:
>
> > On 9 Dec 2007, at 05:59, Peter Epps wrote:
> >> Validating http://www.buildersdiscountmart.com/buildernew/cabinets/
> >> Error [63]: "character data is not allowed here"
> >>
> >> What's up? Did I miss something in the spec?
> >
> > You seem to have some non-printing characters just before the title
> > element.
>
> More exactly, there are two BOM characters there. BOM = Byte Order
> Mark = zero-width no-break space = U+FEFF, intended for use at the
> start of UTF-16 or UTF-32 data to make sure that the correct byte
> order is used in interpreting the data. It is allowed at the start of
> UTF-8 data too but has no use there. In the _midst_ of Unicode data,
> it is permitted though discouraged, and there it means what the name
> zero-width no-break space says. The important thing here is that it is
> not a whitespace character in SGML terms.
>
> Since it's a data character, it's not permitted there, since we're in
> the midst of the <head> element, where no data characters as such may
> appear - only elements.
>
> In UTF-8, BOM appears as the three-octet sequence EF BB BF. When
> viewed in a program that mistakenly treats the data as ISO-8859-1
> encoded, it looks like the otherwise highly unlikely three-character
> sequence ï»¿. You can see this if you manually change the encoding in
> your browser (e.g. with the View/Encoding command) to ISO-8859-1.
>
> > Try deleting everything from the < of the title element back
> > to the > of the previous end tag. Then add those two characters, and
> > any whitespace you want, back in.
>
> Well, technically it is sufficient to remove the BOM characters. How
> you do this depends on the authoring software. Beware that there are
> other occurrences of BOM in the document. Since they are in contexts
> where character data _is_ allowed, the validator does not catch them,
> but they can still cause problems. Who knows what they'll cause e.g.
> in script code?
>
> The ultimate problem is probably that the document has been created by
> putting together some pieces generated by programs that insert BOM at
> the start of data, for some reason. When concatenating such pieces,
> the BOM characters should be removed.
>
> Jukka K. Korpela ("Yucca")
> http://www.cs.tut.fi/~jkorpela/

Experience tells me also that PHP "includes" are often the culprit,
where the included file is saved as UTF-8 with a BOM. I am sure there
are other culprits of this type, but I deal only with PHP as my
pre-processor.

-- 
Michael

All shall be well, and all shall be well, and all manner of things shall
be well
 - Julian of Norwich 1342 - 1416
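
For anyone who wants to check their own files, here is a rough Python
sketch of the points above: what the BOM looks like as bytes, how it
shows up when misread as ISO-8859-1, how to locate stray BOMs, and how
to drop them when joining pieces. The function names (bom_offsets,
concat_fragments) are my own invention for illustration, and it assumes
the input is valid UTF-8 held as bytes.

```python
import codecs

# The UTF-8 encoding of U+FEFF: the three octets EF BB BF.
BOM = codecs.BOM_UTF8

# Misreading those octets as ISO-8859-1 gives three visible characters.
MISREAD = BOM.decode("iso-8859-1")


def bom_offsets(data: bytes) -> list:
    """Return the byte offset of every UTF-8 BOM found in data."""
    offsets = []
    pos = data.find(BOM)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(BOM, pos + len(BOM))
    return offsets


def concat_fragments(fragments) -> bytes:
    """Join byte fragments, dropping a leading BOM from each one.

    This mimics what an include/concatenation step should do when the
    pieces were saved by tools that prepend a BOM.
    """
    cleaned = []
    for frag in fragments:
        if frag.startswith(BOM):
            frag = frag[len(BOM):]
        cleaned.append(frag)
    return b"".join(cleaned)


if __name__ == "__main__":
    pieces = [BOM + b"<head>", BOM + b"<title>Cabinets</title>"]
    naive = b"".join(pieces)
    print(bom_offsets(naive))        # where the stray BOMs sit
    print(concat_fragments(pieces))  # the same pieces, BOM-free
```

A byte-level search is enough here because, in valid UTF-8, the
sequence EF BB BF can only be the encoding of U+FEFF.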
Received on Monday, 10 December 2007 19:16:44 UTC