Re: UTF-8 signature in web and email from Michael $michka$ Kaplan on 2001-05-16 (www-international@w3.org from April to June 2001)

From: Michael $michka$ Kaplan <michka@trigeminal.com>
Date: Wed, 16 May 2001 04:33:57 -0700
To: <duerst@w3.org>, "Roozbeh Pournader" <roozbeh@sharif.edu>, "Unicode List" <unicode@unicode.org>, <www-international@w3.org>
Message-ID: <002a01c0ddfc$1a2037b0$919335d8@redmond.corp.microsoft.com>

Most likely, "Dr. International" (drintl@microsoft.com), who sent the mail
from Microsoft, did so from his/her Outlook machine (probably Outlook 2002;
I do not think Outlook 2000 ever did this). Perhaps someone could follow up
with the Outlook folks on their decision to include a BOM at the beginning
of the UTF-8 section of HTML mail? Assuming that Dr. International is on
the Unicode List, he/she might be the best person to follow up!
:-)

Clearly there is no standard suggesting such a thing, and while I do see
Martin's suggestions below as something of a reversal from other people's
ideas of best practices, the BOM for UTF-8 and other encodings is clearly
intended for cases of plain text, not text that has a higher-level protocol
that contains encoding information.
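
As a rough illustration of that distinction, here is a minimal Python sketch
of a receiver that trusts the higher-level label first and consults the BOM
only for unlabeled text. The function name, its parameters, and the
"us-ascii" fallback are assumptions made for this example; the precedence
order is one possible policy, not something a standard cited in this thread
mandates.

```python
import codecs
from typing import Optional


def choose_encoding(declared_charset: Optional[str], body: bytes,
                    fallback: str = "us-ascii") -> str:
    """Pick an encoding name for a message body (illustrative policy only)."""
    # A label from the higher-level protocol (for example the MIME charset
    # parameter) is used as-is, normalized to lower case.
    if declared_charset:
        return declared_charset.strip().lower()
    # Unlabeled text: the UTF-8 signature (EF BB BF) is the only hint left.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Assumed default when there is neither a label nor a signature.
    return fallback
```

With a label present, the signature plays no part in the decision, which is
why it is redundant in labeled HTML mail.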

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Roozbeh Pournader" <roozbeh@sharif.edu>; "Unicode List"
<unicode@unicode.org>; <www-international@w3.org>
Sent: Tuesday, May 15, 2001 6:55 PM
Subject: Re: UTF-8 signature in web and email


> Hello Roozbeh
>
> At 04:02 01/05/15 +0430, Roozbeh Pournader wrote:
>
> >Well, I received a UTF-8 email from Microsoft's Dr International today.
> >It was a "multipart/alternative", with both the "text/plain" and "text/html"
> >in UTF-8. Well, nothing interesting yet, but the interesting point was
> >that the HTML version had a UTF-8 signature, but the text version lacked
> >it. So, the HTML version had it three times: mime charset as UTF-8,
> >UTF-8 signature, and <meta> charset markup.
>
> This is definitely overblown. There is about 5% of a justification
> for having a 'signature' on a plain-text, standalone file (the reason
> being that it's somewhat easier to detect that the file is UTF-8 from the
> signature than to read through the file and check the byte patterns
> (which is an extremely good method to distinguish UTF-8 from everything
> else)). For self-labeled data (HTML, XML, CSS) and in the context
> of MIME (with the charset parameter), a UTF-8 signature doesn't
> make sense at all.
>
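
The two detection methods compared above can be sketched in Python roughly as
follows. The function names are illustrative only. Note that the byte-pattern
check also accepts pure ASCII, since ASCII is a subset of UTF-8.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """Cheap check: does the data start with the bytes EF BB BF?"""
    return data.startswith(codecs.BOM_UTF8)


def looks_like_utf8(data: bytes) -> bool:
    """Byte-pattern check: is the whole buffer well-formed UTF-8?"""
    try:
        # Strict decoding raises on any malformed byte sequence.
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

The first check reads three bytes; the second has to scan the whole buffer,
which is the "somewhat easier" trade-off described above.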
>
> >Questions:
> >
> >1. What are the current recommendations for these?
>
> - When producing UTF-8 files/documents, *never* produce a 'signature'.
>    There are quite a few receivers that cannot deal with it, or that deal
>    with it by displaying something. And there are many other problems.
>
> - When receiving UTF-8, you probably should check for a 'signature'
>    and remove it. There are too many applications that send one out,
>    unfortunately.
>
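
A minimal Python sketch of these two recommendations, with hypothetical
helper names: drop a leading signature on input, and never write one on
output.

```python
import codecs


def strip_utf8_signature(data: bytes) -> bytes:
    """Receiving side: remove a leading UTF-8 signature if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def encode_without_signature(text: str) -> bytes:
    """Producing side: Python's "utf-8" codec never prepends a signature
    (the separate "utf-8-sig" codec is the one that does)."""
    return text.encode("utf-8")
```

When decoding straight to text, Python's "utf-8-sig" codec does the same
stripping: it removes a leading signature if present and otherwise behaves
like "utf-8".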
>
> >2. Most important of all, does W3C allow UTF-8 signatures before
> >"<!DOCTYPE>"? And if yes, what should be done if they mismatch the
> >charset as can be described in the <meta> tag?
>
> For text/html, neither the HTML spec nor the IETF definition of UTF-8
> (RFC 2279) says anything as far as I know. The reason was that nobody
> thought about a UTF-8 signature at that time.
>
> For XML, the 'signature' is now listed in App F.1
> http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info
> But this is not normative, and fairly recent, and so you should never
> expect an XML processor to accept it (except as a plain character
> in the file when there is no XML declaration).
>
>
> Regards,   Martin.
>
>
>
>

Received on Wednesday, 16 May 2001 07:34:44 UTC