Re: [Moderator Action] Odd thing I noticed from Najib Tounsi on 2019-03-13 (www-international@w3.org from January to March 2019)

From: Najib Tounsi <ntounsi@gmail.com>
Date: Wed, 13 Mar 2019 19:35:37 +0100
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>, Frank da Cruz <fdc@columbia.edu>
Cc: Matitiahu Allouche <matitiahu.allouche@gmail.com>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <74adc1d8-d776-bdff-abb3-e7f6d8916f75@emi.ac.ma>
Hello all,

It seems that the server returns a Content-Type header without a charset
parameter (only text/html).

    Trying 166.84.62.124...
    Connected to kermitproject.org.
    Escape character is '^]'.
    HEAD / HTTP/1.1
    host:example

    HTTP/1.1 200 OK
    Date: Wed, 13 Mar 2019 18:04:11 GMT
    Server: Apache/2.4.38 (Unix) OpenSSL/1.0.1u
    Accept-Ranges: bytes
    Connection: close
    _*Content-Type: text/html*_

    Connection closed by foreign host.

The same happens with a GET request, which additionally returns:

    Transfer-Encoding: chunked
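
As a rough illustration of what "no charset" means here, the following is a minimal Python sketch that extracts the charset parameter from a Content-Type value. The helper name charset_from_content_type is hypothetical, and the standard-library email parser is only one convenient way to do this, not necessarily how any browser does it.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration; it relies on the standard
    library's MIME header parsing rather than any browser's logic.
    """
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset lowercased,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset()


# The header seen in the transcript above carries no charset:
print(charset_from_content_type("text/html"))
# A header that does declare one:
print(charset_from_content_type("text/html; charset=UTF-8"))
```

With the header from the transcript, the helper returns None, so the client has to look elsewhere (for example at the page's own meta declaration) to decide on an encoding.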

Maybe the HTTP/1.1 default charset, ISO-8859-1, is being applied, hence your result.
But the page's own declaration (UTF-8) should take precedence in the browser.
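
One way to tell which encoding the bytes on the wire really use is to try decoding them strictly as UTF-8. The sketch below is only an illustration under that assumption; the function name looks_like_utf8 and the sample word are made up, and it uses in-memory bytes rather than fetching the actual page.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Made-up sample: the same Spanish word in the two encodings discussed.
latin1_bytes = "español".encode("iso-8859-1")  # the accented letter is one byte, 0xF1
utf8_bytes = "español".encode("utf-8")         # the accented letter is two bytes, 0xC3 0xB1

print(looks_like_utf8(latin1_bytes))
print(looks_like_utf8(utf8_bytes))
```

A lone 8-bit byte such as 0xF1 followed by an ASCII letter is not a valid UTF-8 sequence, so ISO-8859-1 text with accented letters normally fails this check, while genuine UTF-8 shows two consecutive 8-bit bytes per accented letter.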

Najib

On 3/13/19 11:22, Martin J. Dürst wrote:
> Hello Frank,
>
> On 2019/03/12 23:16, Frank da Cruz wrote:
>> Hi Martin, long time!
> Yes indeed. It must have been ten or twenty years ago last time.
>
>> Here is a page that exhibits the reported behavior:
>>
>>     http://kermitproject.org/newftpclient-es.html
>>
>> Firefox "View page info" reports:
>>
>>     Type: text/html
>>     Render Mode: Standards compliance mode
>>     Text encoding: UTF-8
>>
>> Chrome doesn't have an equivalent query.
>>
>> The page displays legibly in both browsers, including the Spanish accented
>> letters,
> Same here, also in IE, and an old version of Opera (before they switched
> to the google blink code base).
>
>> and gets no errors or warnings in the validator.w3.org/nu
>> validator.
>>
>> There seems to be an unwritten rule that if charset="utf-8" and lang="es"
>> (or other ISO 8859-1 language) and the page contains single-byte 8-bit
>> ("right half") bytes, then they are interpreted as ISO 8859-1, and
>> similarly for other ISO-8859-x's, for example this one encoded in ISO
>> 8859-5:
>>
>>     http://kermitproject.org/kermitbook-ch3-ru-test.html
>>
>> It turns out this rule (heuristic, magic) is implemented in Firefox and
>> Chrome, but not in Lynx, so therefore it's the web browser and not the web
>> server doing it.
> Here I get somewhat different results. The page contains
>
> <!-- Converted to HTML5 Tue Mar 12 09:20:01 2019 -->
> <META charset="utf-8">
>
> on lines 4 and 5. March 12 was yesterday. I get the same date and time on
> three different browsers, without a proxy. The same also for wget. I
> also can identify sequences of two consecutive 8-bit bytes with 'od -hc'
> where there's an accented character, even when downloading with wget. So
> to me here, it looks much more like a server thing than a browser thing.
>
>
>> About EMACS.... In UTF-8 mode it's fine for Roman, Cyrillic, and similar
>> LTR writing systems but I get pretty awful results with (e.g.) Chinese,
>> Japanese, Korean, Arabic, and Hebrew -- characters disappear from the
>> screen while editing, especially when these characters are close to ASCII
>> characters.  This is on Linux or NetBSD accessed from a Windows-based
>> terminal emulator (Kermit-95 of course :-) which displays the same
>> characters just fine outside of EMACS.  Here's an example of a file that is
>> extremely difficult for me to edit using this setup (the language buttons
>> at the top and the translation credits table at the bottom):
>>
>>     http://www.columbia.edu/cu/computinghistory/krawitz/
>>
>> Obviously I could switch to some other editor but I've been using EMACS for
>> 40+ years and depend on countless features that other editors don't have.
> For this, I'd try to contact the unicode@unicode.org mailing list.
> Recently, there was an extensive discussion about terminal emulators and
> editors, in particular emacs. So you might find somebody knowledgeable
> there.
>
> Regards,    Martin.
>
>> - Frank
>>
>> On Tue, Mar 12, 2019 at 6:42 AM Martin J. Dürst <duerst@it.aoyama.ac.jp>
>> wrote:
>>
>>> Hello Frank,
>>>
>>> On 2019/03/11 19:22, Frank da Cruz wrote:
>>>> I have tons of pages encoded in ISO-8859-1.  I know they all
>>>> should be converted to UTF-8 but I'm putting it off because
>>>> I do all my editing in EMACS and it's not 100% with UTF-8 yet.
>>> If EMACS isn't 100% okay with UTF-8, then which editor would be?
>>>
>>>> By accident I changed a Spanish-language HTML5 web page to say:
>>>>
>>>> <meta charset=utf-8>
>>>>
>>>> without changing the encoding of the page.  I was surprised
>>>> to see that the page still displays correctly in both Firefox
>>>> and Chrome, and when I check the page properties, both say UTF-8,
>>>> which tells me that the Web server isn't overriding the page's
>>>> internal declaration.
>>>>
>>>> No validator that I have tried tells me there is anything
>>>> wrong with the page.
>>>>
>>>> Do you know why the web browsers are showing this page in proper
>>>> Spanish when the encoding is not the declared charset?
>>> I have no idea. Is this page public? Or do you have some other page that
>>> is public and behaves the same? Can you give us a pointer?
>>>
>>> Regards,   Martin.
>>>
>>>> Thanks,
>>>>
>>>> Frank da Cruz

-- 
Najib TOUNSI (ntounsi at emi.ac.ma)
W3C Office in Morocco (http://www.w3c.org.ma/)
Ecole Mohammadia d'Ingénieurs, BP. 765 Agdal-RABAT Morocco
Mobile: +212 (0) 661 22 00 30
Received on Wednesday, 13 March 2019 18:36:05 UTC