- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Wed, 13 Mar 2019 10:22:34 +0000
- To: Frank da Cruz <fdc@columbia.edu>
- CC: Matitiahu Allouche <matitiahu.allouche@gmail.com>, "www-international@w3.org" <www-international@w3.org>
Hello Frank,
On 2019/03/12 23:16, Frank da Cruz wrote:
> Hi Martin, long time!
Yes indeed. It must have been ten or twenty years since the last time.
> Here is a page that exhibits the reported behavior:
>
> http://kermitproject.org/newftpclient-es.html
>
> Firefox "View page info" reports:
>
> Type: text/html
> Render Mode: Standards compliance mode
> Text encoding: UTF-8
>
> Chrome doesn't have an equivalent query.
>
> The page displays legibly in both browsers, including the Spanish accented
> letters,
Same here, also in IE and in an old version of Opera (from before they
switched to Google's Blink code base).
> and gets no errors or warnings in the validator.w3.org/nu
> validator.
>
> There seems to be an unwritten rule that if charset="utf-8" and lang="es"
> (or other ISO 8859-1 language) and the page contains single-byte 8-bit
> ("right half") bytes, then they are interpreted as ISO 8859-1, and
> similarly for other ISO-8859-x's, for example this one encoded in ISO
> 8859-5:
>
> http://kermitproject.org/kermitbook-ch3-ru-test.html
>
> It turns out this rule (heuristic, magic) is implemented in Firefox and
> Chrome, but not in Lynx, so it's the web browser and not the web
> server doing it.
Here I get somewhat different results. The page contains
<!-- Converted to HTML5 Tue Mar 12 09:20:01 2019 -->
<META charset="utf-8">
on lines 4 and 5. March 12 was yesterday. I get the same date and time in
three different browsers, without a proxy, and also with wget. With
'od -hc' I can also identify sequences of two consecutive 8-bit bytes
wherever there's an accented character (as UTF-8 would have, where
ISO 8859-1 would have a single byte), even when downloading with wget. So
from here, it looks much more like a server thing than a browser thing.
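
For anyone who wants to repeat the byte-level check without reading 'od' output by eye, here is a minimal sketch in Python. It only illustrates the idea; the function name classify_encoding and the sample word are made up for this example and do not come from the page in question. It relies on the fact that an accented Latin letter is a single 8-bit byte in ISO 8859-1 but a two-byte sequence in UTF-8, and that a lone ISO 8859-1 byte is almost never valid UTF-8.

```python
def classify_encoding(data: bytes) -> str:
    """Roughly classify bytes as 'ascii', 'utf-8', or 'not-utf-8'.

    Hypothetical helper for illustration only. 'not-utf-8' means the
    data contains 8-bit bytes that do not form valid UTF-8 sequences,
    which is what ISO 8859-x text with accented letters looks like.
    """
    try:
        data.decode("ascii")
        return "ascii"          # no 8-bit bytes at all
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"          # every 8-bit byte is part of a valid sequence
    except UnicodeDecodeError:
        return "not-utf-8"      # stray 8-bit bytes, e.g. ISO 8859-1


# The Spanish word "a\u00f1o" (n with tilde) in both encodings.
utf8_sample = "a\u00f1o".encode("utf-8")         # two bytes for the accented letter
latin1_sample = "a\u00f1o".encode("iso-8859-1")  # one byte for the accented letter

if __name__ == "__main__":
    print(classify_encoding(utf8_sample))
    print(classify_encoding(latin1_sample))
    # A strict UTF-8 decoder shows a replacement character for the stray byte.
    print(latin1_sample.decode("utf-8", errors="replace"))
```

To try this on a real page, one could read the file saved by wget in binary mode and pass its contents to the function. If a page declared as UTF-8 comes back as 'utf-8' here, then the bytes on the wire already are UTF-8 and no browser heuristic is needed to explain the correct display.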
> About EMACS.... In UTF-8 mode it's fine for Roman, Cyrillic, and similar
> LTR writing systems but I get pretty awful results with (e.g.) Chinese,
> Japanese, Korean, Arabic, and Hebrew -- characters disappear from the
> screen while editing, especially when these characters are close to ASCII
> characters. This is on Linux or NetBSD accessed from a Windows-based
> terminal emulator (Kermit-95 of course :-) which displays the same
> characters just fine outside of EMACS. Here's an example of a file that is
> extremely difficult for me to edit using this setup (the language buttons
> at the top and the translation credits table at the bottom):
>
> http://www.columbia.edu/cu/computinghistory/krawitz/
>
> Obviously I could switch to some other editor but I've been using EMACS for
> 40+ years and depend on countless features that other editors don't have.
For this, I'd try to contact the unicode@unicode.org mailing list.
Recently, there was an extensive discussion about terminal emulators and
editors, in particular emacs. So you might find somebody knowledgeable
there.
Regards, Martin.
> - Frank
>
> On Tue, Mar 12, 2019 at 6:42 AM Martin J. Dürst <duerst@it.aoyama.ac.jp>
> wrote:
>
>> Hello Frank,
>>
>> On 2019/03/11 19:22, Frank da Cruz wrote:
>>> I have tons of pages encoded in ISO-8859-1. I know they all
>>> should be converted to UTF-8 but I'm putting it off because
>>> I do all my editing in EMACS and it's not 100% with UTF-8 yet.
>>
>> If EMACS isn't 100% okay with UTF-8, then which editor would be?
>>
>>> By accident I changed a Spanish-language HTML5 web page to say:
>>>
>>> <meta charset=utf-8>
>>>
>>> without changing the encoding of the page. I was surprised
>>> to see that the page still displays correctly in both Firefox
>>> and Chrome, and when I check the page properties, both say UTF-8,
>>> which tells me that the Web server isn't overriding the page's
>>> internal declaration.
>>>
>>> No validator that I have tried tells me there is anything
>>> wrong with the page.
>>>
>>> Do you know why the web browsers are showing this page in proper
>>> Spanish when the encoding is not the declared charset?
>>
>> I have no idea. Is this page public? Or do you have some other page that
>> is public and behaves the same? Can you give us a pointer?
>>
>> Regards, Martin.
>>
>>> Thanks,
>>>
>>> Frank da Cruz
>>
Received on Wednesday, 13 March 2019 10:23:02 UTC