- From: Najib Tounsi <ntounsi@gmail.com>
- Date: Wed, 13 Mar 2019 19:35:37 +0100
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>, Frank da Cruz <fdc@columbia.edu>
- Cc: Matitiahu Allouche <matitiahu.allouche@gmail.com>, "www-international@w3.org" <www-international@w3.org>
- Message-ID: <74adc1d8-d776-bdff-abb3-e7f6d8916f75@emi.ac.ma>
Hello all,

It seems that the server returns Content-Type without a charset (only text/html).

Trying 166.84.62.124...
Connected to kermitproject.org.
Escape character is '^]'.
HEAD / HTTP/1.1
host:example

HTTP/1.1 200 OK
Date: Wed, 13 Mar 2019 18:04:11 GMT
Server: Apache/2.4.38 (Unix) OpenSSL/1.0.1u
Accept-Ranges: bytes
Connection: close
_*Content-Type: text/html*_

Connection closed by foreign host.

The same happens with a GET request, which additionally returns:

Transfer-Encoding: chunked

Maybe the HTTP/1.1 default charset is ISO-8859-1. Hence your result. But the page declaration (UTF-8) should take precedence in the browser.

Najib

On 3/13/19 11:22, Martin J. Dürst wrote:
> Hello Frank,
>
> On 2019/03/12 23:16, Frank da Cruz wrote:
>> Hi Martin, long time!
> Yes indeed. It must have been ten or twenty years ago last time.
>
>> Here is a page that exhibits the reported behavior:
>>
>> http://kermitproject.org/newftpclient-es.html
>>
>> Firefox "View page info" reports:
>>
>> Type: text/html
>> Render Mode: Standards compliance mode
>> Text encoding: UTF-8
>>
>> Chrome doesn't have an equivalent query.
>>
>> The page displays legibly in both browsers, including the Spanish accented
>> letters,
> Same here, also in IE, and an old version of Opera (before they switched
> to the google blink code base).
>
>> and gets no errors or warnings in the validator.w3.org/nu
>> validator.
>>
>> There seems to be an unwritten rule that if charset="utf-8" and lang="es"
>> (or other ISO 8859-1 language) and the page contains single-byte 8-bit
>> ("right half") bytes, then they are interpreted as ISO 8859-1, and
>> similarly for other ISO-8859-x's, for example this one encoded in ISO
>> 8859-5:
>>
>> http://kermitproject.org/kermitbook-ch3-ru-test.html
>>
>> It turns out this rule (heuristic, magic) is implemented in Firefox and
>> Chrome, but not in Lynx, so therefore it's the web browser and not the web
>> server doing it.
> Here I get somewhat different results.
> The page contains
>
> <!-- Converted to HTML5 Tue Mar 12 09:20:01 2019 -->
> <META charset="utf-8">
>
> On line 4 and 5. March 12 was yesterday. I get the same date and time on
> three different browsers, without a proxy. The same also for wget. I
> also can identify sequences of two consecutive 8-bit bytes with 'od -hc'
> where there's an accented character, even when downloading with wget. So
> to me here, it looks much more like a server thing than a browser thing.
>
>
>> About EMACS.... In UTF-8 mode it's fine for Roman, Cyrillic, and similar
>> RTL writing systems but I get pretty awful results with (e.g.) Chinese,
>> Japanese, Korean, Arabic, and Hebrew -- characters disappear from the
>> screen while editing, especially when these characters are close to ASCII
>> characters. This is on Linux or NetBSD accessed from a Windows-based
>> terminal emulator (Kermit-95 of course :-) which displays the same
>> characters just fine outside of EMACS. Here's an example of a file that is
>> extremely difficult for me to edit using this setup (the language buttons
>> at the top and the translation credits table at the bottom):
>>
>> http://www.columbia.edu/cu/computinghistory/krawitz/
>>
>> Obviously I could switch to some other editor but I've been using EMACS for
>> 40+ years and depend on countless features that other editors don't have.
> For this, I'd try to contact the unicode@unicode.org mailing list.
> Recently, there was an extensive discussion about terminal emulators and
> editors, in particular emacs. So you might find somebody knowledgeable
> there.
>
> Regards, Martin.
>
>> - Frank
>>
>> On Tue, Mar 12, 2019 at 6:42 AM Martin J. Dürst <duerst@it.aoyama.ac.jp>
>> wrote:
>>
>>> Hello Frank,
>>>
>>> On 2019/03/11 19:22, Frank da Cruz wrote:
>>>> I have tons of pages encoded in ISO-8859-1. I know they all
>>>> should be converted to UTF-8 but I'm putting it off because
>>>> I do all my editing in EMACS and it's not 100% with UTF-8 yet.
>>> If EMACS isn't 100% okay with UTF-8, then which editor would be?
>>>
>>>> By accident I changed a Spanish-language HTML5 web page to say:
>>>>
>>>> <meta charset=utf-8>
>>>>
>>>> without changing the encoding of the page. I was surprised
>>>> to see that the page still displays correctly in both Firefox
>>>> and Chrome, and when I check the page properties, both say UTF-8,
>>>> which tells me that the Web server isn't overriding the page's
>>>> internal declaration.
>>>>
>>>> No validator that I have tried tells me there is anything
>>>> wrong with the page.
>>>>
>>>> Do you know why the web browsers are showing this page in proper
>>>> Spanish when the encoding is not the declared charset?
>>> I have no idea. Is this page public? Or do you have some other page that
>>> is public and behaves the same? Can you give us a pointer?
>>>
>>> Regards, Martin.
>>>
>>>> Thanks,
>>>>
>>>> Frank da Cruz

--
Najib TOUNSI (ntounsi at emi.ac.ma)
W3C Office in Morocco (http://www.w3c.org.ma/)
Ecole Mohammadia d'Ingénieurs, BP. 765 Agdal-RABAT Morocco
Mobile: +212 (0) 661 22 00 30
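
P.S. The two checks discussed in this thread (does the Content-Type header carry a charset, and are the page bytes actually well-formed UTF-8) can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: the function names `charset_from_content_type` and `is_valid_utf8` are hypothetical, and the sample bytes are made up rather than taken from the kermitproject.org pages. It does not reproduce what any browser does internally.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, or None.

    Hypothetical helper: it only parses the header string, it does not
    contact any server.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Header as returned in the HEAD request above: no charset parameter.
    print(charset_from_content_type("text/html"))
    # A header that does declare a charset, for comparison.
    print(charset_from_content_type("text/html; charset=UTF-8"))

    # Made-up sample text, encoded both ways.
    latin1_bytes = "español".encode("iso-8859-1")  # one byte for ñ
    utf8_bytes = "español".encode("utf-8")         # two bytes for ñ
    print(is_valid_utf8(latin1_bytes), is_valid_utf8(utf8_bytes))
```

A lone ISO-8859-1 byte for an accented letter is normally not well-formed UTF-8, whereas the "two consecutive 8-bit bytes" Martin saw with 'od -hc' are what a UTF-8 encoded accented Latin letter looks like, so a check of this kind on the downloaded file may help tell the two cases apart.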
Received on Wednesday, 13 March 2019 18:36:05 UTC