- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Wed, 13 Mar 2019 10:22:34 +0000
- To: Frank da Cruz <fdc@columbia.edu>
- CC: Matitiahu Allouche <matitiahu.allouche@gmail.com>, "www-international@w3.org" <www-international@w3.org>
Hello Frank,

On 2019/03/12 23:16, Frank da Cruz wrote:
> Hi Martin, long time!

Yes indeed. It must have been ten or twenty years since the last time.

> Here is a page that exhibits the reported behavior:
>
> http://kermitproject.org/newftpclient-es.html
>
> Firefox "View page info" reports:
>
> Type: text/html
> Render Mode: Standards compliance mode
> Text encoding: UTF-8
>
> Chrome doesn't have an equivalent query.
>
> The page displays legibly in both browsers, including the Spanish accented
> letters,

Same here, also in IE, and an old version of Opera (before they
switched to the Google Blink code base).

> and gets no errors or warnings in the validator.w3.org/nu
> validator.
>
> There seems to be an unwritten rule that if charset="utf-8" and lang="es"
> (or other ISO 8859-1 language) and the page contains single-byte 8-bit
> ("right half") bytes, then they are interpreted as ISO 8859-1, and
> similarly for other ISO-8859-x's, for example this one encoded in ISO
> 8859-5:
>
> http://kermitproject.org/kermitbook-ch3-ru-test.html
>
> It turns out this rule (heuristic, magic) is implemented in Firefox and
> Chrome, but not in Lynx, so therefore it's the web browser and not the web
> server doing it.

Here I get somewhat different results. The page contains

    <!-- Converted to HTML5 Tue Mar 12 09:20:01 2019 -->
    <META charset="utf-8">

on lines 4 and 5. March 12 was yesterday. I get the same date and time
in three different browsers, without a proxy, and also with wget. With
'od -hc' I can also identify sequences of two consecutive 8-bit bytes
wherever there's an accented character, even when downloading with
wget. So to me, it looks much more like a server thing than a browser
thing.

> About EMACS.... In UTF-8 mode it's fine for Roman, Cyrillic, and similar
> LTR writing systems but I get pretty awful results with (e.g.) Chinese,
> Japanese, Korean, Arabic, and Hebrew -- characters disappear from the
> screen while editing, especially when these characters are close to ASCII
> characters. This is on Linux or NetBSD accessed from a Windows-based
> terminal emulator (Kermit-95 of course :-) which displays the same
> characters just fine outside of EMACS. Here's an example of a file that is
> extremely difficult for me to edit using this setup (the language buttons
> at the top and the translation credits table at the bottom):
>
> http://www.columbia.edu/cu/computinghistory/krawitz/
>
> Obviously I could switch to some other editor but I've been using EMACS for
> 40+ years and depend on countless features that other editors don't have.

For this, I'd try contacting the unicode@unicode.org mailing list.
Recently, there was an extensive discussion there about terminal
emulators and editors, in particular emacs. So you might find somebody
knowledgeable there.

Regards, Martin.

> - Frank
>
> On Tue, Mar 12, 2019 at 6:42 AM Martin J. Dürst <duerst@it.aoyama.ac.jp>
> wrote:
>
>> Hello Frank,
>>
>> On 2019/03/11 19:22, Frank da Cruz wrote:
>>> I have tons of pages encoded in ISO-8859-1. I know they all
>>> should be converted to UTF-8 but I'm putting it off because
>>> I do all my editing in EMACS and it's not 100% with UTF-8 yet.
>>
>> If EMACS isn't 100% okay with UTF-8, then which editor would be?
>>
>>> By accident I changed a Spanish-language HTML5 web page to say:
>>>
>>> <meta charset=utf-8>
>>>
>>> without changing the encoding of the page. I was surprised
>>> to see that the page still displays correctly in both Firefox
>>> and Chrome, and when I check the page properties, both say UTF-8,
>>> which tells me that the Web server isn't overriding the page's
>>> internal declaration.
>>>
>>> No validator that I have tried tells me there is anything
>>> wrong with the page.
>>>
>>> Do you know why the web browsers are showing this page in proper
>>> Spanish when the encoding is not the declared charset?
>>
>> I have no idea. Is this page public? Or do you have some other page that
>> is public and behaves the same? Can you give us a pointer?
>>
>> Regards, Martin.
>>
>>> Thanks,
>>>
>>> Frank da Cruz
>>
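
The byte-level check mentioned in the reply (with 'od -hc', an accented character shows up as two consecutive 8-bit bytes when the file is really UTF-8, but as a single 8-bit byte in ISO 8859-1) can be sketched in Python roughly as follows. This is only an illustration: the function names `classify_encoding` and `high_byte_runs` and the sample string are assumptions made for this sketch, not anything taken from the pages discussed. Note also that a strict UTF-8 decode is a strong hint rather than a proof, since some ISO 8859-1 byte sequences happen to be valid UTF-8 as well.

```python
from typing import List


def classify_encoding(data: bytes) -> str:
    """Roughly classify bytes as 'ascii', 'utf-8', or 'not-utf-8'."""
    if all(b < 0x80 for b in data):
        return "ascii"  # no 8-bit bytes at all, so nothing to distinguish
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not-utf-8"  # e.g. ISO 8859-1 text mislabelled as UTF-8
    return "utf-8"


def high_byte_runs(data: bytes) -> List[bytes]:
    """Return each maximal run of consecutive bytes >= 0x80."""
    runs: List[bytes] = []
    current = bytearray()
    for b in data:
        if b >= 0x80:
            current.append(b)
        elif current:
            runs.append(bytes(current))
            current = bytearray()
    if current:
        runs.append(bytes(current))
    return runs


# Hypothetical sample text with two accented Spanish letters.
sample = "versión en español"
utf8_bytes = sample.encode("utf-8")
latin1_bytes = sample.encode("iso-8859-1")

for label, raw in (("UTF-8", utf8_bytes), ("ISO 8859-1", latin1_bytes)):
    print(label, classify_encoding(raw), [r.hex() for r in high_byte_runs(raw)])
```

Run on the bytes of a downloaded page (for instance, the output of wget read in binary mode), the same two functions would show whether the accented letters arrive as two-byte or one-byte sequences, independent of what any browser reports.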
Received on Wednesday, 13 March 2019 10:23:02 UTC