- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Tue, 20 Nov 2007 14:42:03 +0900
- To: Felix Sasaki <fsasaki@w3.org>, public-html@w3.org, public-i18n-core@w3.org
Dear I18N WG, HTML WG,

I have looked at the minutes of your recent meeting. There are a couple of topics I want to comment on; I have created separate threads for them.

At 02:06 07/11/10, Felix Sasaki wrote:
>
>... are at http://www.w3.org/2007/11/09-i18n-minutes.html and below as text.

> <Hixie> When a user agent would otherwise use the ISO-8859-1
> encoding, it must instead use the Windows-1252 encoding."
>
> Henri: that part is a violation of charmod

I strongly agree.

> Addison doesn't consider that a violation of charmod

I'm not sure how this could NOT be a violation. I'd also be quite sure that most if not all of the authors of Charmod Fundamentals would agree. The principle that charset labels must mean what they do is very fundamental to "4.4.2 Character encoding identification" (http://www.w3.org/TR/charmod/#sec-EncodingIdent), and besides taking center stage in C025 (http://www.w3.org/TR/charmod/#C025, for senders) and C030 (http://www.w3.org/TR/charmod/#C030, for receivers), shines through in other criteria in that section.

> Addison: There are superset encodings and they're often tagged with
> the subset encodings.

The character model does not use the concepts of subset encoding or superset encoding. And the fact that there is such practice can't be used to judge whether it conforms to CharMod or not. We all know that sloppy tagging is a reality, not only in the 'charset' area, but using that to claim conformance to CharMod in such a roundabout way is a bad idea.

> ... using the superset interpretation doesn't conflict with using
> the subset interpretation

How not? If you meant that you can tag something as windows-1252 even if it doesn't contain any graphic characters only available in windows-1252, that would be okay; I don't remember any requirement in CharMod to label as tightly as possible. But the reverse doesn't fly: there are many cases where there is more than one possible superset encoding.

As iso-8859-1 is clearly a superset encoding of US-ASCII, by your argumentation, we would conclude that it's okay to interpret stuff labeled as US-ASCII as iso-8859-1. But by the same logic, we can also argue that it is okay to interpret stuff labeled as US-ASCII as iso-8859-2, and so on. It's easy to see that this doesn't make sense.

I could also to some extent understand this if you use these terms with regard to undefined codepoints. We all know Unicode has some as-of-yet undefined codepoints, and we all understand that by using charset labels such as "UTF-8" or "UTF-16",..., we include future additions of characters to Unicode, even if this creates the risk that some characters may not render properly on all platforms.

Also, many of us know that windows-1252 still has some unused, reserved code positions (see e.g. the light blue-green squares at http://en.wikipedia.org/wiki/Windows-1252). At some point, it had even more. Extending the above reasoning for Unicode, I think it's fair to argue that we expect the label windows-1252 to remain usable in case Microsoft (who created and controls windows-1252) assigns some of the codepoints that are currently still open.

However, iso-8859-1 does NOT have any codepoints reserved for future assignments. Like any other member of the 8859 series, it is designed to leave the C1 area (byte values 0x80-0x9F) for non-graphic characters. It is rather unclear whether they are assigned to any specific control characters, whether one should just consider them mapped to the corresponding codepoints in Unicode (some of which are still unassigned), or whether they can be freely used with some other collection of control characters. The whole issue is mostly irrelevant because in actual iso-8859-1 data, these codepoints for control characters are virtually never used. What is in any case very clear is that iso-8859-1 does, for example, NOT have the Euro symbol at position 0x80, and so on.

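The byte-level differences described above can be checked with a short sketch. Python is an assumption of this illustration and is not used anywhere in the thread; the codec labels are Python's aliases for these charsets, and the helper names `decode_as` and `undefined_in_windows_1252` are hypothetical. Python's iso-8859-1 codec takes the second of the readings mentioned above and maps the C1 bytes to the corresponding Unicode codepoints.

```python
# Illustrative sketch only: Python's built-in codecs stand in for the
# charset definitions discussed above. Helper names are hypothetical.

def decode_as(data: bytes, label: str) -> str:
    """Decode bytes strictly according to the given charset label."""
    return data.decode(label)


def undefined_in_windows_1252() -> list:
    """Return the byte values in 0x80-0x9F that Python's windows-1252
    codec treats as undefined (the still-unassigned positions)."""
    undefined = []
    for value in range(0x80, 0xA0):
        try:
            bytes([value]).decode("windows-1252")
        except UnicodeDecodeError:
            undefined.append(value)
    return undefined


if __name__ == "__main__":
    # Byte 0x80: a C1 control under iso-8859-1, the Euro sign under windows-1252.
    for label in ("iso-8859-1", "windows-1252"):
        ch = decode_as(b"\x80", label)
        print(f"0x80 as {label}: U+{ord(ch):04X}")

    # Two "supersets" of US-ASCII disagree as soon as a byte leaves the ASCII range.
    for label in ("iso-8859-1", "iso-8859-2"):
        ch = decode_as(b"\xa1", label)
        print(f"0xA1 as {label}: U+{ord(ch):04X}")

    print("undefined in windows-1252:",
          [f"0x{v:02X}" for v in undefined_in_windows_1252()])
```

Under these assumptions, the same byte yields a different character depending on which label the receiver chooses to honor, which is the substitution the quoted requirement mandates.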
So requiring the use of Windows-1252 to interpret data labeled as iso-8859-1 is squarely in violation of the relevant conditions in Charmod. In my view, any other conclusion would mean that Charmod Fundamentals isn't worth the paper it's (occasionally) printed on, or the electrons used to send it.

> ... We're not proposing a substantive change, just providing more
> justification for what you're doing.

I very much hope this gets reexamined. There may be various ways to work this out, but just claiming that there is no violation of Charmod in this case is a very bad start.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

Received on Tuesday, 20 November 2007 06:13:56 UTC