- From: <bugzilla@jessica.w3.org>
- Date: Fri, 31 May 2013 05:17:39 +0000
- To: www-validator-cvs@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=22223

            Bug ID: 22223
           Summary: Latin-1 characters (æ, þ etc.) are rejected as errors by validator
    Classification: Unclassified
           Product: Validator (Nu)
           Version: unspecified
          Hardware: PC
                OS: All
            Status: NEW
          Severity: major
          Priority: P2
         Component: General
          Assignee: mike+validator@w3.org
          Reporter: ahangama@gmail.com
        QA Contact: www-validator-cvs@w3.org

An HTML page containing any characters from the ISO-8859-1 character set used to validate as correct when written and tested as HTML 4.01.

Later, when such a page was written for HTML5 and tested, the validator advised using windows-1252 instead of ISO-8859-1. If there was no charset declaration, the page was assumed to be windows-1252 and passed.

Until recently, UTF-8 was encouraged as the charset declaration, but windows-1252 was still accepted.

Now the rule is enforced by issuing errors and warnings like these:

1. Using windows-1252 instead of the declared encoding iso-8859-1.
2. Legacy encoding windows-1252 used. Documents should use UTF-8.
3. utf8 "\xE6" does not map to Unicode.

What does 3. above mean?

This is a catch-22. If you declare UTF-8, it is an error because æ, þ etc. are reported as being outside Unicode. I thought we were talking about the UTF-8 encoding of characters. How does Unicode factor in here? RFC 3629 is very clear about how to encode ASCII and Latin-1 (SBCS) characters into UTF-8. It appears that ASCII is accepted and the Latin-1 extension is rejected for some unpublished reason.

Please check these pages to understand the problem:

http://ahangama.com/charset-iso-8859-1.htm
http://ahangama.com/charset-none.htm
http://ahangama.com/charset-utf-8.htm
http://ahangama.com/charset-windows-1252.htm

Thank you.

-- 
You are receiving this mail because:
You are the QA Contact for the bug.
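
A possible reading of message 3 in the report, sketched in Python. This assumes (it is not confirmed by the report) that the test page declares UTF-8 while its bytes are still stored in a single-byte encoding such as ISO-8859-1 or windows-1252. In that case æ is the lone byte 0xE6, which is not a complete UTF-8 sequence, whereas the RFC 3629 encoding of U+00E6 is the two bytes 0xC3 0xA6. The helper name `is_valid_utf8` is made up for this illustration and is not part of the validator.

```python
# Illustration only: compares the bytes for U+00E6 (ae ligature) in two encodings.
AE = "\u00e6"

# ISO-8859-1 and windows-1252 both store this character as one byte.
latin1_bytes = AE.encode("iso-8859-1")   # b'\xe6'

# UTF-8 (RFC 3629) stores the same character as two bytes.
utf8_bytes = AE.encode("utf-8")          # b'\xc3\xa6'


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(latin1_bytes, is_valid_utf8(latin1_bytes))
    print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

If this assumption holds, re-saving the file as UTF-8 in the editor (rather than only changing the charset declaration) would be the thing to try.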
Received on Friday, 31 May 2013 05:17:40 UTC