Re: [Moderator Action] Odd thing I noticed

Hello Frank,

On 2019/03/12 23:16, Frank da Cruz wrote:
> Hi Martin, long time!

Yes indeed. The last time must have been ten or twenty years ago.

> Here is a page that exhibits the reported behavior:
> 
>    http://kermitproject.org/newftpclient-es.html
> 
> Firefox "View page info" reports:
> 
>    Type: text/html
>    Render Mode: Standards compliance mode
>    Text encoding: UTF-8
> 
> Chrome doesn't have an equivalent query.
> 
> The page displays legibly in both browsers, including the Spanish accented
> letters,

Same here, also in IE and in an old version of Opera (from before they
switched to Google's Blink code base).

> and gets no errors or warnings in the validator.w3.org/nu
> validator.
> 
> There seems to be an unwritten rule that if charset="utf-8" and lang="es"
> (or other ISO 8859-1 language) and the page contains single-byte 8-bit
> ("right half") bytes, then they are interpreted as ISO 8859-1, and
> similarly for other ISO-8859-x's, for example this one encoded in ISO
> 8859-5:
> 
>    http://kermitproject.org/kermitbook-ch3-ru-test.html
> 
> It turns out this rule (heuristic, magic) is implemented in Firefox and
> Chrome, but not in Lynx, so therefore it's the web browser and not the web
> server doing it.

Here I get somewhat different results. The page contains

<!-- Converted to HTML5 Tue Mar 12 09:20:01 2019 -->
<META charset="utf-8">

on lines 4 and 5. March 12 was yesterday. I get the same date and time in
three different browsers, without a proxy, and also with wget. Using
'od -hc', I can also identify sequences of two consecutive 8-bit bytes
wherever there's an accented character, even when downloading with wget.
So to me here, it looks much more like a server thing than a browser thing.
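
The same check can be done without reading 'od' output by eye. What
follows is only a minimal sketch in Python, not something used in this
thread: the function names are made up for illustration, and the sample
bytes stand in for a page saved locally (for example with wget). A strict
UTF-8 decode fails on typical ISO 8859-1 text with accented letters, so a
page that decodes cleanly and yields two-byte sequences at the accented
characters is very likely really stored as UTF-8.

```python
def classify_high_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'not-utf-8' for the given raw bytes."""
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")  # strict decoding: raises on malformed input
    except UnicodeDecodeError:
        return "not-utf-8"
    return "utf-8"


def multibyte_sequences(data: bytes):
    """Yield (byte offset, raw bytes, character) for each non-ASCII character.

    Assumes data is valid UTF-8; raises UnicodeDecodeError otherwise.
    """
    offset = 0
    for ch in data.decode("utf-8"):
        raw = ch.encode("utf-8")
        if len(raw) > 1:
            yield offset, raw, ch
        offset += len(raw)


# Illustrative sample only; a real check would read the downloaded file,
# e.g. data = open("newftpclient-es.html", "rb").read()
as_utf8 = "Versión".encode("utf-8")
as_latin1 = "Versión".encode("iso-8859-1")

print(classify_high_bytes(as_utf8))
print(classify_high_bytes(as_latin1))
for offset, raw, ch in multibyte_sequences(as_utf8):
    print(offset, raw.hex(), ch)
```

If the downloaded file classifies as UTF-8, the bytes were already UTF-8
before any browser saw them, and no browser-side guessing is needed to
explain the correct display.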


> About EMACS.... In UTF-8 mode it's fine for Roman, Cyrillic, and similar
> LTR writing systems but I get pretty awful results with (e.g.) Chinese,
> Japanese, Korean, Arabic, and Hebrew -- characters disappear from the
> screen while editing, especially when these characters are close to ASCII
> characters.  This is on Linux or NetBSD accessed from a Windows-based
> terminal emulator (Kermit-95 of course :-) which displays the same
> characters just fine outside of EMACS.  Here's an example of a file that is
> extremely difficult for me to edit using this setup (the language buttons
> at the top and the translation credits table at the bottom):
> 
>    http://www.columbia.edu/cu/computinghistory/krawitz/
> 
> Obviously I could switch to some other editor but I've been using EMACS for
> 40+ years and depend on countless features that other editors don't have.

For this, I'd try to contact the unicode@unicode.org mailing list. 
Recently, there was an extensive discussion about terminal emulators and 
editors, in particular emacs. So you might find somebody knowledgeable 
there.

Regards,    Martin.

> - Frank
> 
> On Tue, Mar 12, 2019 at 6:42 AM Martin J. Dürst <duerst@it.aoyama.ac.jp>
> wrote:
> 
>> Hello Frank,
>>
>> On 2019/03/11 19:22, Frank da Cruz wrote:
>>> I have tons of pages encoded in ISO-8859-1.  I know they all
>>> should be converted to UTF-8 but I'm putting it off because
>>> I do all my editing in EMACS and it's not 100% with UTF-8 yet.
>>
>> If EMACS isn't 100% okay with UTF-8, then which editor would be?
>>
>>> By accident I changed a Spanish-language HTML5 web page to say:
>>>
>>> <meta charset=utf-8>
>>>
>>> without changing the encoding of the page.  I was surprised
>>> to see that the page still displays correctly in both Firefox
>>> and Chrome, and when I check the page properties, both say UTF-8,
>>> which tells me that the Web server isn't overriding the page's
>>> internal declaration.
>>>
>>> No validator that I have tried tells me there is anything
>>> wrong with the page.
>>>
>>> Do you know why the web browsers are showing this page in proper
>>> Spanish when the encoding is not the declared charset?
>>
>> I have no idea. Is this page public? Or do you have some other page that
>> is public and behaves the same? Can you give us a pointer?
>>
>> Regards,   Martin.
>>
>>> Thanks,
>>>
>>> Frank da Cruz
>>

Received on Wednesday, 13 March 2019 10:23:02 UTC