From: Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de>
Date: Sat, 1 Aug 2009 17:08:49 +0200
To: public-html-comments@w3.org
Bil Corry:
> Dr. Olaf Hoffmann wrote on 7/31/2009 1:10 PM:
> > With the still open questions I mainly try to find out whether it is
> > possible to specify that a 'HTML5' document has an encoding like
> > 'ISO-8859-1' and not 'Windows-1252'.
>
> You do it the same way as you would for any character set, by
> specifying the content encoding as ISO-8859-1. Typically this is done
> via the Content-Type header:
>
> Content-Type: text/html; charset=ISO-8859-1
>
> That header means, "This HTML document is in the ISO-8859-1 character
> set." By inference, it also means that it isn't Windows-1252, or
> UTF-8, etc.

This I know, and it is true for other formats, but the current draft of
'HTML5' has a specific rule that this declaration means 'Windows-1252'
and not 'ISO-8859-1' - and this rule seems to supersede what the server
indicates, if a viewer is able to identify the document as 'HTML5'.

> > As far as I understand the specification, this is not possible
>
> I think what you mean is it isn't possible to force the UA to use the
> ISO-8859-1 charset when specified and you're right.

It is never possible for an author to force an arbitrary viewer to do
something (specific). A careful author can only write well-defined and
valid documents. And an author can test them with some currently known
viewers. The results of these tests are not necessarily representative
for other viewers that are not accessible to the author (for example
because they are not published yet).

> As I mentioned in my previous email, IE will display Windows-1252
> when MacRoman is specified which is clearly wrong -- the glyphs don't
> even come close to matching each other. As an author you have to work
> around that. And as an author you must ensure the encoding is correct
> to get consistent results.

Maybe this browser does not know this encoding; in that case, however,
the user should be warned that something stupid might happen. Such a
warning is clearly the problem of the implementor, not the author. Of
course, the author can avoid such problems by using more common
encodings like UTF-8 or ISO-8859-1.

> > but then the other questions become interesting, because typically
> > it is relevant what the server indicates, not what is mentioned
> > explicitly or implicitly in the document.
>
> I haven't ever tested what happens when the content-type header
> doesn't match the meta, but considering it's incorrect, an author can
> hardly expect positive results. I wonder if HTML5 specifies the
> behavior in this case?

I think 'HTML5' only specifies a specific rule for these few strings:
that 'Windows-1252' is used instead of them. Only if the author provides
these strings is this indication superseded, independent of how the
encoding was indicated. If the server sends UTF-8 in the Content-Type
header, the specific 'HTML5' rule does not apply and everything is
interpreted as UTF-8, no matter what is mentioned within the document.
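A small Python sketch may illustrate what this alias rule changes in
practice (the byte values below are only constructed examples):

    # Bytes 0x80-0x9F are C1 control characters in ISO-8859-1 but
    # printable glyphs in Windows-1252; outside that range the two
    # decodings agree.
    data = b"caf\xe9 \x96 na\xefve"   # 0x96: en dash in Windows-1252

    print(data.decode("iso-8859-1"))  # 'café', U+0096 control, 'naïve'
    print(data.decode("cp1252"))      # 'café – naïve' with the en dash

For byte values outside 0x80-0x9F both decodings give the same result,
which is why the alias rule rarely damages genuine ISO-8859-1 documents.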
> > As already mentioned, years ago there were browsers/versions
> > without this bug, and for typical 'ISO-8859-1' a slightly wrong
> > decoding is not really a problem, as already discussed. A practical
> > problem currently only appears if a document has the wrong encoding
> > information and a legacy browser without the bug is used to present
> > it. Because not all legacy browsers have this bug, it is a
> > misinformation of authors to make them believe that everything is
> > solved for buggy documents, because all browsers compensate this
> > bug with yet another browser bug. However, authors of documents
> > with wrong encoding information are guilty anyway; this is not
> > really a problem of a specification. It is just more difficult to
> > teach them to write proper documents. Indifferent or ignorant
> > authors are not necessarily a problem for a specification, they are
> > a problem mainly for the general audience.
>
> This also describes the issue with browsers doing a best-guess with
> rendering HTML content that is malformed. The solution to both
> malformed HTML and misidentified charsets is to run the page through
> validator.w3.org -- both get flagged if wrong. If you want to see,
> try this:
>
> http://validator.w3.org/check?uri=http%3A%2F%2Fwww.corry.biz%2Fcharset_mismatch.lasso%3Fcharset%3DISO-8859-1
>
> It (correctly) returns the error:
>
> Using windows-1252 instead of the declared encoding iso-8859-1.
> Line 22, Column 87: Unmappable byte sequence: 9d.
>
> So if your authors care about checking their markup with
> validator.w3.org, they will also have their charset checked as well.

Well, the validator has some bugs too. If no encoding is specified for
text/*, ISO-8859-1 should be used (according to HTTP). I noticed some
months ago already that the validator reported wrong errors for
XHTML+RDFa instead of indicating that it cannot validate this yet
(without an optional doctype within the document; the version
indication is done with the version attribute, not a doctype). I noted
errors for SVG documents too. The validator is a pretty good help in
many cases (less useful for SVG, because it checks almost no attribute
values), but an author still has to be careful with the results...
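The 'Unmappable byte sequence: 9d' error above can be reproduced
outside the validator; a short Python sketch (the single byte 0x9D is
taken from the error message):

    raw = b"\x9d"

    print(raw.decode("iso-8859-1"))   # U+009D, a defined C1 control char
    try:
        raw.decode("cp1252")          # 0x9D is unassigned in Windows-1252
    except UnicodeDecodeError as err:
        print(err)

So once the validator silently switches from the declared ISO-8859-1 to
Windows-1252, this byte no longer maps to any character at all.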
> > The questions are more about the problem of how to indicate that a
> > 'HTML5' document really has the encoding ISO-8859-1. This can be
> > important for long-living documents and archival storage. Because
> > in 50 or 100 or 1000 years one cannot rely on the behaviour of
> > browsers of the year 2009, but it might still be possible to decode
> > well-defined documents with completely different programs. To
> > simplify this, one should have simple and intuitive indications and
> > not such a bloomer like writing 'ISO-8859-1' if you mean
> > 'Windows-1252'.
>
> A thousand years from now, if they do what HTML5 does now and use
> Windows-1252 when ISO-8859-1 is specified, they'll be guaranteed to
> correctly view the document (assuming it's in either ISO-8859-1 or
> Windows-1252). The same cannot be said for viewing Windows-1252 as
> ISO-8859-1.
>
> > With the current draft, one can only recommend 'HTML5'+UTF-8 or
> > another format/version like XHTML+RDFa for long-living documents
> > and archival storage (which is not necessarily bad either, just
> > something interesting to know for some people).
>
> UTF-8 isn't free from issues either -- I've seen Windows-1252 served
> as UTF-8 which produces illegal byte sequences. Or here's an example
> where the page (Windows-1252) doesn't specify a charset at all; in
> Firefox it's rendered as UTF-8 with broken bytes and in IE it's
> rendered with the correct charset of Windows-1252:
>
> http://cspinet.org/new/200907301.html
>
> Which browser do you think they test their site with? Which browser
> do you think the end user thinks is broken?

The Gecko I use indicates ISO-8859-1, as does Konqueror - no UTF-8 or
Windows-1252; Opera notes a problem (unsupported) and notes too that
Windows-1252 is used. Because this tag soup is served as text/html
without encoding information, I think Gecko and Konqueror are perfectly
correct with ISO-8859-1. There is no indication that this might be
'HTML5', therefore no specific rule from the 'HTML5' draft needs to be
applied. The document does not indicate at all which version of HTML it
uses; looking into the source code, I think it uses a proprietary
private slang of the author (even the comments are written wrongly ;o)
It cannot be expected from a program that any version information can
be derived from this tag soup. The presentation cannot be wrong,
because it is undefined (a viewer may use any set of rules to interpret
this or reject it as nonsense; a viewer can use 'HTML5' to try to
interpret it, but can use any other set of rules as well, because the
author does not indicate any version information at all). The encoding,
however, seems to be defined (only) by the HTTP protocol, not by the
document itself.

Of course, ISO-8859-1 (and Windows-1252) are only compatible with UTF-8
for a basic set of characters. This can be seen more easily with
documents not in English but in other languages like German, French or
Spanish, which are covered pretty well by ISO-8859-1 but contain
several important characters that are encoded differently in ISO-8859-1
and UTF-8. Because there is no indication that this might be XML, there
is no need to think that UTF-8 is a proper choice for decoding.
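A short Python sketch can show how small this compatible 'basic set'
(US-ASCII) really is; the German word below is only a constructed
example:

    # Any non-ASCII character already breaks the overlap between
    # ISO-8859-1 and UTF-8.
    text = "Müller"

    latin1 = text.encode("iso-8859-1")  # b'M\xfcller'    - one byte 'ü'
    utf8 = text.encode("utf-8")         # b'M\xc3\xbcller' - two bytes

    print(latin1.decode("utf-8", errors="replace"))  # broken: 'M�ller'
    print(utf8.decode("iso-8859-1"))                 # mojibake: 'MÃ¼ller'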
Received on Saturday, 1 August 2009 15:36:26 UTC