[Bug 22223] New: Latin-1 characters (æ, þ etc.) are rejected as errors by validator from bugzilla@jessica.w3.org on 2013-05-31 (www-validator-cvs@w3.org from May 2013)

From: <bugzilla@jessica.w3.org>
Date: Fri, 31 May 2013 05:17:39 +0000
To: www-validator-cvs@w3.org
Message-ID: <bug-22223-169@http.www.w3.org/Bugs/Public/>

https://www.w3.org/Bugs/Public/show_bug.cgi?id=22223

            Bug ID: 22223
           Summary: Latin-1 characters (æ, þ etc.) are rejected as errors
                    by validator
    Classification: Unclassified
           Product: Validator (Nu)
           Version: unspecified
          Hardware: PC
                OS: All
            Status: NEW
          Severity: major
          Priority: P2
         Component: General
          Assignee: mike+validator@w3.org
          Reporter: ahangama@gmail.com
        QA Contact: www-validator-cvs@w3.org

An HTML page that uses any characters from the ISO-8859-1 character set used to
validate as correct when written and tested as HTML 4.01. When such a page was
later written and tested as HTML5, the validator advised using windows-1252
instead of ISO-8859-1. If there was no charset declaration, the page was
assumed to be windows-1252 and passed.

Until recently, UTF-8 was encouraged as the charset declaration, but
windows-1252 was still accepted. Now the rule is enforced by issuing errors and
warnings like these:
1. Using windows-1252 instead of the declared encoding iso-8859-1.
2. Legacy encoding windows-1252 used. Documents should use UTF-8.
3. utf8 "\xE6" does not map to Unicode.

What does 3. above mean? This is a catch-22. If you declare UTF-8, it is an
error because æ, þ etc. are treated as being outside Unicode. I thought we were
talking about the UTF-8 encoding of characters. How does Unicode factor in here?
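
One possible reading of message 3, offered as an assumption and not a confirmed
diagnosis of the test pages: the file is stored with single-byte ISO-8859-1 /
windows-1252 bytes (so æ is the lone byte 0xE6) while the declaration says
UTF-8. A lone 0xE6 is not a complete UTF-8 sequence, so a UTF-8 decoder cannot
map it to a Unicode code point. The sketch below uses only Python's built-in
codecs; the helper name decodes_as_utf8 is hypothetical.

```python
def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if raw is a well-formed UTF-8 byte sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same two characters stored under two different encodings.
latin1_bytes = "æþ".encode("iso-8859-1")  # one byte per character
utf8_bytes = "æþ".encode("utf-8")         # two bytes per character

# Show the raw bytes in hex and whether a UTF-8 decoder accepts them.
print(latin1_bytes.hex(), decodes_as_utf8(latin1_bytes))
print(utf8_bytes.hex(), decodes_as_utf8(utf8_bytes))
```

If this reading applies, the declaration and the bytes actually saved on disk
disagree, and re-saving the file as UTF-8 would be one way to make them match.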

RFC 3629 is very clear about how to encode ASCII and Latin-1 (SBCS) characters
into UTF-8. It appears that ASCII is accepted and the Latin-1 Supplement range
is rejected for some unpublished reason.
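
For reference, a minimal sketch of the RFC 3629 rule for this range: code
points U+0000..U+007F are encoded as one byte, and U+0080..U+07FF as the two
bytes 110xxxxx 10xxxxxx. The function name encode_utf8_upto_7ff is made up for
this illustration, and the result is compared against Python's own encoder.

```python
def encode_utf8_upto_7ff(code_point: int) -> bytes:
    """Encode a code point in U+0000..U+07FF using the RFC 3629 bit layout."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        # One byte: 0xxxxxxx (ASCII is unchanged).
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx.
        lead = 0xC0 | (code_point >> 6)
        trail = 0x80 | (code_point & 0x3F)
        return bytes([lead, trail])
    raise ValueError("this sketch only covers U+0000..U+07FF")


for ch in "Aæþ":
    manual = encode_utf8_upto_7ff(ord(ch))
    # Cross-check the hand-rolled result against the built-in codec.
    assert manual == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {manual.hex()}")
```

Under this rule the ISO-8859-1 byte 0xE6 (æ) becomes the two bytes C3 A6 in
UTF-8, so a page declared as UTF-8 needs those two bytes and not the single
byte E6.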

Please check these pages to understand the problem.
http://ahangama.com/charset-iso-8859-1.htm
http://ahangama.com/charset-none.htm
http://ahangama.com/charset-utf-8.htm
http://ahangama.com/charset-windows-1252.htm

Thank you.

-- 
You are receiving this mail because:
You are the QA Contact for the bug.

Received on Friday, 31 May 2013 05:17:40 UTC