Re: Error Message Feedback from Lachlan Hunt on 2005-11-07 (www-validator@w3.org from November 2005)

From: Lachlan Hunt <lachlan.hunt@lachy.id.au>
Date: Tue, 08 Nov 2005 01:50:26 +1100
To: Sverker Fridqvist <sverker@fridqvist.se>
CC: www-validator@w3.org
Message-ID: <436F69B2.1040900@lachy.id.au>

Sverker Fridqvist wrote:
> Compare the error reports for these two urls:
> 
>     http://sverker.fridqvist.se/test/withutf8.php

This one sends the HTTP header:
   Content-Type: text/html; charset=utf-8

>     http://sverker.fridqvist.se/test/withiso8859.php

This one sends:
   Content-Type: text/html; charset=iso-8859-1

> Both files contain Byte-Order Marks (BOMs) designating UTF-8 encoding.

No, the first contains the BOM (U+FEFF) encoded in UTF-8. The second 
contains the characters "ï»¿" encoded as ISO-8859-1, which happen to use 
the same octets (EF BB BF) as the UTF-8 BOM.  Chances are that the 
author intended this to be the UTF-8 BOM, but the authoritative HTTP 
headers state otherwise.
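
To illustrate, here is a minimal Python sketch (just a demonstration, not 
anything the validator itself runs) that decodes the same three octets 
under each of the two declared encodings.  The variable names are only 
for illustration.

```python
# The three octets found at the start of both test files.
BOM_OCTETS = b"\xef\xbb\xbf"

# Decoded as UTF-8, the octets form the single character U+FEFF.
as_utf8 = BOM_OCTETS.decode("utf-8")

# Decoded as ISO-8859-1, each octet maps to a character of its own.
as_latin1 = BOM_OCTETS.decode("iso-8859-1")

print([f"U+{ord(c):04X}" for c in as_utf8])    # ['U+FEFF']
print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+00EF', 'U+00BB', 'U+00BF']
```

Same octets, different characters: which ones you get depends entirely 
on the encoding the parser has been told to use.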

> The BOM is recognized for the first file, but not for the second one. 

Correct.

> It would be helpful if the validator recognized the BOM also in the 
> second case, and reported that the not-allowed characters in the prolog 
> are a BOM.

The problem is that treating those octets as the UTF-8 BOM would require 
ignoring the fact that the document needs to be parsed as ISO-8859-1, or 
whatever other encoding is declared.
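
For what it's worth, a check like that would have to look at the raw 
octets before any decoding, alongside the declared charset.  A rough 
Python sketch of the idea follows.  The function name bom_hint and the 
wording of the message are hypothetical, and this is not how the 
validator is implemented.

```python
import codecs
from typing import Optional


def bom_hint(raw: bytes, declared_charset: str) -> Optional[str]:
    """Hypothetical helper: return a hint if the raw octets begin with the
    UTF-8 BOM while the declared charset is something other than UTF-8."""
    try:
        # Normalise aliases such as "UTF8" or "utf_8" to one canonical name.
        declared = codecs.lookup(declared_charset).name
    except LookupError:
        # Unknown charset label: nothing sensible to compare against.
        return None

    if declared != "utf-8" and raw.startswith(codecs.BOM_UTF8):
        return (
            "The document begins with the octets EF BB BF, which form a "
            "UTF-8 BOM, but the declared encoding is "
            f"{declared_charset}."
        )
    return None


print(bom_hint(b"\xef\xbb\xbf<!DOCTYPE html>", "iso-8859-1"))
print(bom_hint(b"\xef\xbb\xbf<!DOCTYPE html>", "utf-8"))  # None
```

Note that this can only ever be a hint: under the declared encoding 
those octets are perfectly valid characters, so the check is a guess 
about what the author meant, not a statement about what was sent.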

> If this is not possible, or easily done, the error message could make a 
> helpful hint towards a BOM:
> 
> "Character ... not allowed in prolog.  The character may be part of a 
> Unicode Byte-Order Mark (BOM). Try changing the character encoding 
> setting of your editor to not include BOMs."

Better yet, tell authors to configure their server to send the correct 
character encoding information.

-- 
Lachlan Hunt
http://lachy.id.au/

Received on Monday, 7 November 2005 14:50:44 UTC