Re: [VE][63] on long title element from Michael Adams on 2007-12-10 (www-validator@w3.org from December 2007)

From: Michael Adams <linux_mike@paradise.net.nz>
Date: Tue, 11 Dec 2007 08:21:04 +1300
To: www-validator@w3.org
Message-id: <20071211082104.663abc1c.linux_mike@paradise.net.nz>

On Mon, 10 Dec 2007 10:51:44 +0200
Jukka K. Korpela wrote:

> 
> David Dorward wrote:
> 
> > On 9 Dec 2007, at 05:59, Peter Epps wrote:
> >> Validating http://www.buildersdiscountmart.com/buildernew/cabinets/
> >> Error [63]: "character data is not allowed here"
> >>
> >> What's up?  Did I miss something in the spec?
> >
> > You seem to have some non-printing characters just before the title
> > element.
> 
> More exactly, there are two BOM characters there. BOM = Byte Order
> Mark = zero-width no-break space = U+FEFF, intended for use at the
> start of UTF-16 or UTF-32 data to make sure that the correct byte
> order is used in interpreting data. It is allowed at the start of
> UTF-8 data too but has no use there. In the _midst_ of Unicode data,
> it is permitted though discouraged and there it means what the name
> zero-width no-break space says. The important thing here is that it is
> not a whitespace character in SGML terms.
> 
> Since it's a data character, it's not permitted there, since we're in 
> the midst of the <head> element, where no data characters as such may 
> appear - only elements.
> 
> In UTF-8, BOM appears as the three-octet sequence EF BB BF. When
> viewed in a program that mistakenly treats the data as ISO-8859-1
> encoded, it looks like the otherwise highly unlikely three-character
> sequence ï»¿. You can see this if you manually change the encoding in
> your browser (e.g. with the View/Encoding command) to ISO-8859-1.
> 
> > Try deleting everything from the < of the title element back
> > to the > of the previous end tag. Then add those two characters, and
> > any whitespace you want, back in.
> 
> Well, technically it is sufficient to remove the BOM characters. How
> you do this depends on the authoring software. Beware that there are
> other occurrences of BOM in the document. Since they are in contexts
> where character data _is_ allowed, the validator does not catch them,
> but they can still cause problems. Who knows what they'll cause e.g.
> in script code?
> 
> The ultimate problem is probably that the document has been created by
> putting together some pieces generated by programs that insert BOM at 
> the start of data, for some reason. When concatenating such pieces,
> the BOM characters should be removed.
> 
> Jukka K. Korpela ("Yucca")
> http://www.cs.tut.fi/~jkorpela/ 
> 
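
To make the octets above concrete, here is a minimal Python sketch (my
own illustration, not anything the validator runs). It shows U+FEFF as
EF BB BF in UTF-8, how those octets look when misread as ISO-8859-1,
and a way to list every BOM in a document. The helper name find_boms
is hypothetical, and the search assumes the input is valid UTF-8.

```python
BOM_UTF8 = b"\xef\xbb\xbf"

# U+FEFF encodes to the three octets EF BB BF in UTF-8.
assert "\ufeff".encode("utf-8") == BOM_UTF8

# Misread as ISO-8859-1, the same octets become three visible characters
# (U+00EF, U+00BB, U+00BF).
as_latin1 = BOM_UTF8.decode("iso-8859-1")


def find_boms(data):
    """Return the byte offsets of every UTF-8 BOM in data (hypothetical helper).

    Assumes data is valid UTF-8, where EF BB BF can only encode U+FEFF.
    """
    offsets = []
    pos = data.find(BOM_UTF8)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(BOM_UTF8, pos + 1)
    return offsets
```

Any offset other than 0 in the result points at a BOM in the midst of
the data, which is the kind the validator only reports when it lands
where character data is not allowed.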

Experience also tells me that PHP "includes" are often the culprit,
where the included file is saved as UTF-8 with a BOM. I am sure there
are other culprits of this type, but I deal only with PHP as my
pre-processor.
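
As a rough sketch of the "remove the BOM when concatenating" advice,
here it is in Python rather than PHP, purely for illustration. The
names strip_bom and join_pieces are hypothetical, and the pieces are
assumed to be UTF-8 byte strings such as the contents of included
files.

```python
import codecs


def strip_bom(piece):
    """Drop a single leading UTF-8 BOM from a byte string, if present."""
    if piece.startswith(codecs.BOM_UTF8):
        return piece[len(codecs.BOM_UTF8):]
    return piece


def join_pieces(pieces):
    """Concatenate pieces, removing the BOM at the start of each one."""
    return b"".join(strip_bom(p) for p in pieces)


# Two pieces that each begin with a BOM, as an editor might save them.
page = join_pieces([b"\xef\xbb\xbf<head>", b"\xef\xbb\xbf<title>x</title>"])
```

When the pieces are decoded to text anyway, Python's "utf-8-sig" codec
does the same job for one leading BOM:
b"\xef\xbb\xbfabc".decode("utf-8-sig") gives "abc". For PHP itself the
simpler cure is usually to save the included files without a BOM in
the first place.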

-- 
Michael

All shall be well, and all shall be well, and all manner of things shall
be well

 - Julian of Norwich 1342 - 1416

Received on Monday, 10 December 2007 19:16:44 UTC