- From: <bugzilla@jessica.w3.org>
- Date: Fri, 31 May 2013 05:17:39 +0000
- To: www-validator-cvs@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=22223
Bug ID: 22223
Summary: Latin-1 characters (æ, þ etc.) are rejected as errors
by validator
Classification: Unclassified
Product: Validator (Nu)
Version: unspecified
Hardware: PC
OS: All
Status: NEW
Severity: major
Priority: P2
Component: General
Assignee: mike+validator@w3.org
Reporter: ahangama@gmail.com
QA Contact: www-validator-cvs@w3.org
An HTML page that uses any characters from the ISO-8859-1 character set used to
validate as correct when written and tested as HTML 4.01. Later, when such a
page was written for HTML5 and tested, the validator advised using
windows-1252 instead of ISO-8859-1. If there was no charset declaration, the
page was assumed to be windows-1252 and passed.
Until recently, UTF-8 was encouraged as the charset declaration, but
windows-1252 was still accepted. Now the rule is enforced by issuing errors
/ warnings like these:
1. Using windows-1252 instead of the declared encoding iso-8859-1.
2. Legacy encoding windows-1252 used. Documents should use UTF-8.
3. utf8 "\xE6" does not map to Unicode.
What does 3. above mean? This is a catch-22. If you declare UTF-8, it is an
error because æ, þ and the like are reported as being outside Unicode. I
thought we were talking about the UTF-8 encoding of characters. How does
Unicode factor in here?
RFC 3629 is very clear about how to encode ASCII and Latin-1 (SBCS) characters
into UTF-8. It appears that ASCII is accepted and the Latin-1 extension is
rejected for some unpublished reason.
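
A possible reading of message 3, sketched in Python. This rests on an
assumption about the test pages that the validator output does not confirm:
that the file declaring UTF-8 is actually saved in a single-byte encoding.
In that case æ is stored as the lone byte 0xE6, which is not a complete UTF-8
sequence, whereas RFC 3629 encodes U+00E6 as the two bytes 0xC3 0xA6. The
helper name is_valid_utf8 is made up for illustration.

```python
# Illustrative sketch only; the sample text and helper name are assumptions.
text = "æþ"  # U+00E6, U+00FE

# The same two characters as bytes in each encoding.
latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # two bytes per character


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_bytes.hex(" "), is_valid_utf8(latin1_bytes))
print(utf8_bytes.hex(" "), is_valid_utf8(utf8_bytes))

# Reading the single-byte data as windows-1252 recovers the text.
print(latin1_bytes.decode("windows-1252") == text)
```

If this is what is happening, the characters themselves are in Unicode; it is
the stored bytes that do not form valid UTF-8. Re-saving the page as UTF-8 in
the editor would then be expected to clear message 3.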
Please check these pages to understand the problem.
http://ahangama.com/charset-iso-8859-1.htm
http://ahangama.com/charset-none.htm
http://ahangama.com/charset-utf-8.htm
http://ahangama.com/charset-windows-1252.htm
Thank you.
--
You are receiving this mail because:
You are the QA Contact for the bug.
Received on Friday, 31 May 2013 05:17:40 UTC