Re: Requesting a revision of RFC3023 from Bjoern Hoehrmann on 2003-09-19 (www-tag@w3.org from September 2003)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Fri, 19 Sep 2003 20:55:39 +0200
To: John Cowan <jcowan@reutershealth.com>
Cc: ietf-xml-mime@imc.org, WWW-Tag <www-tag@w3.org>
Message-ID: <3f7e48c4.1751679261@smtp.bjoern.hoehrmann.de>

* John Cowan wrote:
>Rather than having thousands of ad hoc mechanisms for encoding declarations
>in each of the thousands of text formats now extant, file systems should have
>a convenient mechanism for recording the encoding of each file, and character
>processing libraries should have convenient reading and writing operations that
>do the necessary conversions.

Impractical. File systems commonly do not support recording such
information, and even if they did, this would cause interoperability
problems with file systems and protocols that do not provide such
means. If you transfer the document to your web server using FTP, the
information is lost and the document will break. Further, file system
information is typically almost invisible to authors and would thus
have the same problem as the charset parameter. If I edit a document
in an XML-unaware text editor, change the encoding declaration and
some text nodes, and save the file, the file system information and the
encoding declaration are likely to contradict each other, and the
document would break.
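
To illustrate the contradiction described above, here is a minimal
sketch in Python. The function names are hypothetical, the external
label stands in for whatever a file system might record, and the byte
scan assumes an ASCII-compatible encoding, which a real XML processor
cannot assume. Unknown encoding names would raise LookupError.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the
# start of the document (ASCII-compatible encodings only).
_XML_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_xml_encoding(data):
    """Return the encoding named in the XML declaration, or None."""
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else None

def labels_conflict(external_label, data):
    """True if the external label and the declaration name different codecs."""
    internal = declared_xml_encoding(data)
    if external_label is None or internal is None:
        return False
    # codecs.lookup normalizes aliases such as "latin-1" and "ISO-8859-1".
    return codecs.lookup(external_label).name != codecs.lookup(internal).name

doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'.encode("iso-8859-1")
print(labels_conflict("utf-8", doc))    # external label is stale
print(labels_conflict("latin-1", doc))  # alias of the declared encoding
```

Nothing in the editing workflow above would update the external label,
so the first case is the one an author is likely to end up with.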

You are basically suggesting changing all file systems and the software
that interacts with them, and expecting everyone to upgrade the software
and the file system information of all documents. If an acceptable
solution may go this far, you might as well suggest outlawing all
non-Unicode encodings, which would be much simpler, more consistent, and
more interoperable. This would also work if the text is not stored in a
file system but rather generated by software, something your solution
does not consider.

>Otherwise, generic text-processing tools become impossible,

They are impossible today.

>because each tool has to have a vast library that understands the
>mechanics of the encoding declaration specific to the format it is trying to
>read.

They are not trying to read the format, they are trying to read byte
streams as character streams. If they are trying to read the format,
they have to support that format anyway, including its mechanisms for
determining the character encoding. If you consider HTTP a file system, it
already implements your solution: all text is identified using text/*
types, and either the file system provides encoding information (the
charset parameter) or text processors are required to treat the document
as ISO-8859-1 encoded. Text processors would actually only get character
streams from the HTTP implementation and would not have to worry about
character encodings and the like. Does it work? No. Not least because
W3C publishes Recommendations that make it impossible to write
conforming HTTP implementations. That way madness lies.
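
The HTTP rule described above (use the charset parameter if present,
otherwise treat text/* as ISO-8859-1) can be sketched as follows. This
is only an illustration under that stated rule; the function names are
hypothetical, and the header parsing borrows Python's standard email
package rather than a real HTTP stack.

```python
from email.message import Message

def http_text_charset(content_type):
    """Return the charset for a text/* Content-Type value, lowercased.

    Falls back to ISO-8859-1 when no charset parameter is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_maintype() != "text":
        raise ValueError("not a text/* media type: %r" % content_type)
    return msg.get_content_charset("iso-8859-1")

def decode_http_text(content_type, body):
    """Turn the byte stream into a character stream, as the HTTP layer would."""
    return body.decode(http_text_charset(content_type))

print(http_text_charset("text/html; charset=UTF-8"))
print(http_text_charset("text/xml"))
print(decode_http_text("text/plain", b"caf\xe9"))
```

A text processor sitting behind decode_http_text would only ever see
characters, which is the arrangement the quoted proposal asks file
systems to provide.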

Received on Friday, 19 September 2003 14:58:54 UTC