Re: UTF-8 signature in web and email

From: Martin Duerst <duerst@w3.org>
Date: Wed, 16 May 2001 10:55:37 +0900
Message-Id: <4.2.0.58.J.20010515145957.03015100@sh.w3.mag.keio.ac.jp>
To: Roozbeh Pournader <roozbeh@sharif.edu>, Unicode List <unicode@unicode.org>, <www-international@w3.org>

Hello Roozbeh,

At 04:02 01/05/15 +0430, Roozbeh Pournader wrote:

>Well, I received a UTF-8 email from Microsoft's Dr International today. It
>was a "multipart/alternative", with both the "text/plain" and "text/html"
>in UTF-8. Well, nothing interesting yet, but the interesting point was
>that the HTML version had a UTF-8 signature, but the text version lacked
>it. So, the HTML version had it three times: mime charset as UTF-8,
>UTF-8 signature, and <meta> charset markup.

This is definitely overblown. There is about 5% of a justification
for having a 'signature' on a plain-text, standalone file: it is
somewhat easier to detect that the file is UTF-8 from the signature
than to read through the file and check the byte patterns (although
checking the byte patterns is an extremely good method to distinguish
UTF-8 from everything else). For self-labeled data (HTML, XML, CSS)
and in the context of MIME (with the charset parameter), a UTF-8
signature doesn't make sense at all.
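
The two detection approaches mentioned above can be sketched roughly as
follows, in Python. This is only an illustration: the function names are
made up for this example, and the byte-pattern check simply relies on
Python's strict UTF-8 decoder rejecting malformed sequences.

```python
import codecs

def has_utf8_signature(data: bytes) -> bool:
    """Cheap check: does the data start with the bytes EF BB BF?"""
    return data.startswith(codecs.BOM_UTF8)

def looks_like_utf8(data: bytes) -> bool:
    """Byte-pattern check: is the whole input well-formed UTF-8?"""
    try:
        data.decode("utf-8")  # strict mode raises on malformed sequences
    except UnicodeDecodeError:
        return False
    return True

# Text in a legacy 8-bit encoding almost never passes the pattern check.
utf8_bytes = "héllo".encode("utf-8")
latin1_bytes = "héllo".encode("latin-1")
print(looks_like_utf8(utf8_bytes), looks_like_utf8(latin1_bytes))
```

Note that pure ASCII input also passes the byte-pattern check, since
ASCII is a subset of UTF-8.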


>Questions:
>
>1. What are the current recommendations for these?

- When producing UTF-8 files/documents, *never* produce a 'signature'.
   There are quite a few receivers that cannot deal with it, or that deal
   with it by displaying stray characters. And there are many other problems.

- When receiving UTF-8, you probably should check for a 'signature'
   and remove it. There are too many applications that send one out,
   unfortunately.
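
As a rough sketch of these two recommendations in Python (the helper
names are hypothetical, chosen only for this example): strip a leading
signature on input, and encode output without one.

```python
import codecs

def strip_utf8_signature(data: bytes) -> bytes:
    """Receiver side: drop a leading EF BB BF if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def encode_without_signature(text: str) -> bytes:
    """Producer side: Python's plain "utf-8" codec never adds a signature."""
    return text.encode("utf-8")

received = codecs.BOM_UTF8 + "<!DOCTYPE html>".encode("utf-8")
print(strip_utf8_signature(received).decode("utf-8"))
```

In Python, decoding with the "utf-8-sig" codec has the same effect on
the receiving side: it accepts input with or without the signature and
removes it if it is there.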


>2. Most important of all, does W3C allow UTF-8 signatures before
>"<!DOCTYPE>"? And if yes, what should be done if they mismatch the
>charset as can be described in the <meta> tag?

For text/html, neither the HTML spec nor the IETF definition of UTF-8
(RFC 2279) says anything about this, as far as I know. The reason is that
nobody was thinking about a UTF-8 signature when they were written.

For XML, the 'signature' is now listed in Appendix F.1:
http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info
But this appendix is not normative, and the addition is fairly recent,
so you should never rely on an XML processor accepting it (except as a
plain character in the file when there is no XML declaration).


Regards,   Martin.
Received on Wednesday, 16 May 2001 02:40:58 GMT