Re: [VE][63] on long title element

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Mon, 10 Dec 2007 10:51:44 +0200
Message-ID: <008201c83b09$e54afb20$0500000a@DOCENDO>
To: "Peter Epps" <pgepps@gmail.com>
Cc: <www-validator@w3.org>

David Dorward wrote:

> On 9 Dec 2007, at 05:59, Peter Epps wrote:
>> Validating http://www.buildersdiscountmart.com/buildernew/cabinets/
>> Error [63]: "character data is not allowed here"
>>
>> What's up?  Did I miss something in the spec?
>
> You seem to have some non-printing characters just before the title
> element.

More exactly, there are two BOM characters there. BOM = Byte Order Mark 
= zero-width no-break space = U+FEFF, intended for use at the start of 
UTF-16 or UTF-32 data to make sure that the correct byte order is used 
in interpreting data. It is allowed at the start of UTF-8 data too but 
has no use there. In the _midst_ of Unicode data, it is permitted though 
discouraged and there it means what the name zero-width no-break space 
says. The important thing here is that it is not a whitespace character 
in SGML terms.

Being a data character, it's not permitted there, because we're in 
the midst of the <head> element, where no data characters as such may 
appear - only elements.

In UTF-8, BOM appears as the three-octet sequence EF BB BF. When viewed 
in a program that mistakenly treats the data as ISO-8859-1 encoded, it 
looks like the otherwise highly unlikely three-character sequence ï»¿. 
You can see this if you manually change the encoding in your browser 
(e.g. with the View/Encoding command) to ISO-8859-1.
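
If you would rather check the octets than trust a browser menu, a few lines of Python can reproduce both views. This is only a sketch; the variable names are mine, and it prints code points rather than the glyphs themselves so that it does not depend on the console encoding.

```python
import codecs

# U+FEFF encoded as UTF-8 gives the three octets EF BB BF.
bom_utf8 = "\ufeff".encode("utf-8")
print(" ".join(f"{b:02X}" for b in bom_utf8))  # EF BB BF

# The standard library exposes the same octets as a constant.
print(bom_utf8 == codecs.BOM_UTF8)  # True

# Misreading those octets as ISO-8859-1 turns each octet into one
# character: U+00EF, U+00BB, U+00BF.
misread = bom_utf8.decode("iso-8859-1")
print([f"U+{ord(c):04X}" for c in misread])
```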

> Try deleting everything from the < of the title element back
> to the > of the previous end tag. Then add those two characters, and
> any whitespace you want, back in.

Well, technically it is sufficient to remove the BOM characters. How you 
do this depends on the authoring software. Beware that there are other 
occurrences of BOM in the document. Since they are in contexts where 
character data _is_ allowed, the validator does not catch them, but they 
can still cause problems. Who knows what they'll cause e.g. in script 
code?
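
Since the validator stays silent about BOMs in places where character data is allowed, one way to find them is to scan the decoded document yourself. The following is a minimal sketch, assuming the page is valid UTF-8; find_stray_boms is a hypothetical helper name, and the sample markup is made up for illustration.

```python
def find_stray_boms(data: bytes) -> list:
    """Return character offsets of U+FEFF that are not at the very start.

    Assumes `data` is UTF-8. A single BOM at offset 0 is tolerated,
    since it is allowed at the start of the data.
    """
    text = data.decode("utf-8")
    return [i for i, ch in enumerate(text) if ch == "\ufeff" and i > 0]


# Made-up sample: one leading BOM plus two stray ones before <title>.
sample = "\ufeff<head>\ufeff\ufeff<title>t</title></head>".encode("utf-8")
print(find_stray_boms(sample))  # [7, 8]
```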

The ultimate problem is probably that the document has been created by 
putting together some pieces generated by programs that insert BOM at 
the start of data, for some reason. When concatenating such pieces, the 
BOM characters should be removed.
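
As an illustration of removing BOMs while concatenating, here is a sketch in Python. It assumes every piece is UTF-8 and relies on the "utf-8-sig" codec, which drops one leading BOM when decoding; concat_without_boms is a hypothetical name, and your authoring tools may well offer their own way to do this.

```python
import codecs


def concat_without_boms(pieces):
    """Join UTF-8 byte strings, dropping a leading BOM from each piece.

    Decoding with "utf-8-sig" strips one BOM at the start of a piece and
    leaves the piece unchanged if it has none. BOMs elsewhere in a piece
    are not touched.
    """
    text = "".join(piece.decode("utf-8-sig") for piece in pieces)
    return text.encode("utf-8")


# Made-up fragments, each written by a tool that prepends a BOM.
parts = [codecs.BOM_UTF8 + b"<head>", codecs.BOM_UTF8 + b"<title>x</title>"]
print(concat_without_boms(parts))  # b'<head><title>x</title>'
```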

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/ 
Received on Monday, 10 December 2007 08:52:03 GMT
