- From: Michael \(michka\) Kaplan <michka@trigeminal.com>
- Date: Wed, 16 May 2001 04:33:57 -0700
- To: <duerst@w3.org>, "Roozbeh Pournader" <roozbeh@sharif.edu>, "Unicode List" <unicode@unicode.org>, <www-international@w3.org>
Most likely, "Dr. International" (drintl@microsoft.com), who sent the mail from Microsoft, did so from his/her Outlook machine (probably Outlook 2002; I do not think Outlook 2000 ever did this). Perhaps someone could follow up with the Outlook folks on their decision to include a BOM at the beginning of the UTF-8 section of HTML mail? Assuming that Dr. International is on the Unicode List, he/she might be the best person to follow up! :-)

Clearly there is no standard suggesting such a thing, and while I do see Martin's suggestions below as something of a reversal from other people's ideas of best practices, the BOM for UTF-8 and other encodings is clearly intended for cases of plain text, not text that has a higher-level protocol that contains encoding information.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Roozbeh Pournader" <roozbeh@sharif.edu>; "Unicode List" <unicode@unicode.org>; <www-international@w3.org>
Sent: Tuesday, May 15, 2001 6:55 PM
Subject: Re: UTF-8 signature in web and email

> Hello Roozbeh
>
> At 04:02 01/05/15 +0430, Roozbeh Pournader wrote:
>
> >Well, I received a UTF-8 email from Microsoft's Dr International today. It
> >was a "multipart/alternative", with both the "text/plain" and "text/html"
> >in UTF-8. Well, nothing interesting yet, but the interesting point was
> >that the HTML version had a UTF-8 signature, but the text version lacked
> >it. So, the HTML version had it three times: mime charset as UTF-8,
> >UTF-8 signature, and <meta> charset markup.
>
> This is definitely overblown. There is about 5% of a justification
> for having a 'signature' on a plain-text, standalone file (the reason
> being that it's somewhat easier to detect that the file is UTF-8 from the
> signature than to read through the file and check the byte patterns
> (which is an extremely good method to distinguish UTF-8 from everything
> else)). For self-labeled data (HTML, XML, CSS) and in the context
> of MIME (with the charset parameter), an UTF-8 signature doesn't
> make sense at all.
>
>
> >Questions:
> >
> >1. What are the current recommendations for these?
>
> - When producing UTF-8 files/documents, *never* produce a 'signature'.
>   There are quite some receivers that cannot deal with it, or that deal
>   with it by displaying something. And there are many other problems.
>
> - When receiving UTF-8, you probably should check for a 'signature'
>   and remove it. There are too many applications that send one out,
>   unfortunately.
>
>
> >2. Most important of all, does W3C allow UTF-8 signatures before
> >"<!DOCTYPE>"? And if yes, what should be done if they mismatch the
> >charset as can be described in the <meta> tag?
>
> For text/html, neither the HTML spec nor the IETF definition of UTF-8
> (RFC 2279) says anything as far as I know. The reason was that nobody
> thought about an UTF-8 signature at that time.
>
> For XML, the 'signature' is now listed in App F.1
> http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info
> But this is not normative, and fairly recent, and so you should never
> expect an XML processor to accept it (except as a plain character
> in the file when there is no XML declaration).
>
>
> Regards, Martin.
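
For anyone who wants to see the receiving-side advice in the quoted message spelled out, here is a minimal Python sketch. It is only an illustration under my own assumptions, not anything taken from Outlook or from a spec: the function names (strip_utf8_signature, looks_like_utf8, decode_received) are hypothetical, and the "byte pattern" check is approximated by a strict UTF-8 decode.

```python
import codecs

# The UTF-8 'signature' (BOM) is the three bytes EF BB BF.
UTF8_SIGNATURE = codecs.BOM_UTF8


def strip_utf8_signature(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature, if one is present."""
    if data.startswith(UTF8_SIGNATURE):
        return data[len(UTF8_SIGNATURE):]
    return data


def looks_like_utf8(data: bytes) -> bool:
    """Check the byte patterns: True if data is well-formed UTF-8.

    Assumption: a strict decode stands in for scanning the byte patterns.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_received(data: bytes) -> str:
    """Hypothetical receiver: drop any signature, then decode as UTF-8."""
    return strip_utf8_signature(data).decode("utf-8")
```

A receiver along these lines tolerates senders that emit a signature, while a producer following the same advice would simply never write one. Python's built-in "utf-8-sig" codec does the same stripping on decode.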
Received on Wednesday, 16 May 2001 07:34:44 UTC