Re: internet media types and encoding from MURATA Makoto on 2003-04-20 (www-tag@w3.org from April 2003)

From: MURATA Makoto <murata@hokkaido.email.ne.jp>
Date: Mon, 21 Apr 2003 08:43:09 +0900
To: www-tag@w3.org
Cc: simonstl@simonstl.com, dan@dankohn.com, duerst@w3.org, Murata <murata@hokkaido.email.ne.jp>, ishida@w3.org
Message-Id: <20030421081114.3AD5.MURATA@hokkaido.email.ne.jp>
Chris Lilley wrote:

> The following language is proposed for section "4.2. Processing model"
> of the Web Architecture document. It also includes a lead-in sentence
> to be added to the overall Section "4. Representations" introductory
> material.
> http://www.w3.org/2001/tag/webarch/

I propose omitting the second paragraph of the proposed text.

	For application/xml and non-text "+xml" media types, charset
	parameters must not be produced unless they give the same information
	as the xml encoding declaration would give.
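
The rule in the paragraph above could be checked along the following
lines. This is only a rough Python sketch under stated assumptions: the
function names are hypothetical, the body is assumed to be in an
ASCII-compatible encoding, the BOM is ignored, and a missing encoding
declaration is taken to mean UTF-8.

```python
import re

# Matches an XML declaration and captures its optional encoding
# pseudo-attribute. Works only on ASCII-compatible byte streams.
_XML_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*["\'][^"\']*["\']'
    rb'(?:\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\'])?'
)


def declared_encoding(body: bytes):
    """Return the encoding named in the XML declaration, or None."""
    m = _XML_DECL.match(body)
    if m and m.group(1):
        return m.group(1).decode("ascii")
    return None


def charset_may_be_sent(charset: str, body: bytes) -> bool:
    """True if the charset parameter agrees with the in-band declaration."""
    # Assumption: no encoding declaration (and no BOM) means UTF-8.
    declared = declared_encoding(body) or "UTF-8"
    return charset.lower() == declared.lower()
```

A server following the rule would drop the charset parameter whenever
charset_may_be_sent() returns False.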

Here is the reason for my proposal.

Ideally, there should be either

  1) a single in-band encoding declaration mechanism for all 
     textual formats

or

  2) a single out-band encoding declaration mechanism for 
     all protocols,

but not both.  Everybody knows the current situation is far from this
ideal one.  HTML, CSS, and XML have different mechanisms for in-band
encoding declarations.  Most file systems and some old protocols lack
out-band encoding declaration mechanisms.  Different specs and
implementations have different precedence rules for the same format.
Many programs rely on charset sniffing.  This unfortunate situation
certainly has to be improved.

However, I do not understand which direction the TAG is heading in.
Since the TAG is about the technical architecture of the Web, I would
request that the TAG give a general direction rather than making a
final decision on specific issues.  Clearly, the I18N WG should be
involved.

I have some experience with non-ASCII WWW pages consisting of XML,
HTML, CSS, JSP, Servlets, CGI, XSLT, and Javascript.  At present, there
are many different mechanisms, and WWW developers have to use each of
them appropriately.  Since this is immensely difficult, IT magazines in
Japan have often published articles about corrupted WWW pages.  In my
opinion, the difficulty is not caused by a particular mechanism but
rather by inconsistencies among many mechanisms.  However, the TAG has
considered the XML media type issue only.  I would request that the TAG
study all existing mechanisms and then give a general direction.

The encoding declaration issue certainly arises in non-XML formats. For
example, when the RELAX NG TC of OASIS designed the compact syntax of
RELAX NG, it wondered whether it should provide an in-band declaration
mechanism. At present, it does not have any such mechanism, since the
RELAX NG TC was uneasy about introducing yet another mechanism for
in-band encoding declaration.  I suppose that other compact formats
(e.g., XQuery and probably RDF Schema) will have the same problem.  I
also know that many formats (e.g., Javascript and VBScript) rely on
charset sniffing.

I wrote that there should ideally be a single mechanism for in-band 
encoding declaration.  For this reason, I now think that XML declarations 
are hopelessly broken.  They should have been

  @charset "ISO-8859-1";
  <?xml version="1.0"?>

rather than 

  <?xml version="1.0" encoding="ISO-8859-1"?>

The former is in sync with CSS2 and correctly separates the octet
sequence layer and the character sequence layer, while XML 1.0 confuses
these two layers.  If all textual formats (including Javascript, Perl,
Ruby, etc.) had adopted the same mechanism for in-band encoding
declarations, the current situation would at least have been more
consistent.  (The in-band vs. out-band issue will not disappear,
however.)
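
To illustrate what I mean by separating the two layers, here is a
minimal Python sketch. It assumes a hypothetical CSS2-style first line
of the form @charset "NAME"; written in ASCII and terminated by a
newline; no such prefix exists for XML today, and the function name is
made up for this example.

```python
import re

# Hypothetical prefix line, written in ASCII: @charset "NAME";
_PREFIX = re.compile(rb'^@charset "([A-Za-z0-9._:-]+)";\r?\n')


def decode_with_prefix(octets: bytes, default: str = "utf-8") -> str:
    """Read the label at the octet layer, then decode what follows."""
    m = _PREFIX.match(octets)
    if m is None:
        # No prefix line: fall back to the assumed default encoding.
        return octets.decode(default)
    label = m.group(1).decode("ascii")
    # Everything after the prefix line belongs to the character layer.
    return octets[m.end():].decode(label)
```

The point of the design is that the format-specific parser (XML, CSS,
a script language) only ever sees characters; the label is consumed
before parsing starts.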

At present, the only general-purpose mechanism is the charset parameter,
which is (vaguely) recommended by RFC 2130 and "Character Model for the
World Wide Web 1.0".  If W3C agrees that the charset parameter should
be discouraged, it is probably a good idea to explicitly say so in a
standards-track RFC, "Character Model for the World Wide Web 1.0", and
the WWW architecture document of W3C.


For the record, the following paragraph quoted from
http://lists.w3.org/Archives/Public/www-tag/2003Apr/0034.html
is incorrect.  When the charset parameter of application/xml 
is omitted, XML processors MUST use the BOM or encoding 
declarations.

> In section "3.2. Application/xml Registration" RFC 3023 also lists,
> unusually, a charset parameter, declares it autoritative in all cases
> over HTTP even when not present, 
...
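
The behavior I describe (charset parameter if present; otherwise the
BOM or the encoding declaration) can be sketched roughly as follows.
This is a simplified Python illustration, not a conforming
implementation: the function name is hypothetical, UTF-32 and
BOM-less UTF-16 are not handled, and UTF-8 is assumed when nothing
else is found.

```python
import codecs
import re

# Byte order marks checked in order; UTF-32 is deliberately left out.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Encoding pseudo-attribute of an XML declaration (ASCII-compatible input).
_ENC = re.compile(
    rb'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def effective_encoding(charset, body: bytes) -> str:
    """Pick the encoding for an application/xml entity."""
    if charset:
        # An explicit charset parameter takes precedence.
        return charset
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    m = _ENC.match(body)
    if m:
        return m.group(1).decode("ascii")
    return "utf-8"  # assumption: default when nothing is declared
```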


[1] http://www.asahi-net.or.jp/~eb2m-mrt/charsetDetection.html
-- 
MURATA Makoto <murata@hokkaido.email.ne.jp>
Received on Sunday, 20 April 2003 20:06:59 UTC