Re: Requesting a revision of RFC3023

From: John Cowan <jcowan@reutershealth.com>
Date: Fri, 19 Sep 2003 15:20:57 -0400
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: ietf-xml-mime@imc.org, WWW-Tag <www-tag@w3.org>
Message-ID: <20030919192057.GQ32762@skunk.reutershealth.com>

Bjoern Hoehrmann scripsit:

> Impractical. File systems commonly do not support encoding such
> information

In fact most file systems support extended attributes today.

> [A]nd even if they did, this would cause interoperability
> problems with file systems and protocols which do not provide such
> means. If you transfer the document using FTP to your web server the
> information is lost and the document will break.

No worse than today's situation, and FTP could be enhanced or abandoned
in favor of HTTP PUT.

> Further, file system
> information is typically almost invisible to authors and would thus
> have the same problem as the charset parameter. If I edit a document
> in an XML unaware text editor, change the encoding declaration and
> some text nodes and save the file, file system and encoding declaration
> are likely to contradict each other and the document would break.

No worse than today's situation.

> You are basically suggesting to change all file systems and software
> that interacts with it and expect everyone to upgrade the software and
> the file system information of all documents.

*You* are suggesting that every text file format that has ever existed --
innumerable assembly languages, C, C++, Java, Fortran, Lisp, Scheme, Prolog,
Perl, Python, Smalltalk, awk, sed, ... sh, csh, bash, zsh, ... mail archives,
news archives, ... TeX, LaTeX, nroff/troff, ... -- be revised to find someplace
to stuff a charset indication, and then that every one of the billions of
documents in each of those formats be changed to carry that information.

> If an applicable solution
> may go this far, you should rather suggest to outlaw all non-Unicode
> encodings, much simpler, more consistent and more interoperable. This
> would also work if the text is not stored in the file system but rather
> generated by software, something your solution does not consider.

Indeed, which is why Plan 9 sensibly makes everything UTF-8 and Windows NT/2K/XP
makes most things UTF-16, at least under the covers.

> >Otherwise, generic text-processing tools become impossible,
> 
> They are impossible today.

The impossible does not happen, yet I make good use of generic text-processing
tools every hour of every working day.

> They are not trying to read the format, they are trying to read byte
> streams as character streams. If they are trying to read the format,
> they have to support that format anyway, including mechanisms to
> determine the character encoding.

Not so.  If I want to process a Fortran 77 program as text (to find the
identifiers which occur only once, e.g.) then I can use generic tools
(tr, sort, uniq) and supply the character encoding out of band.  This is
annoying, but it works.  If the tools had to understand where backpatched
Fortran 77 text hides its in-band character encoding declaration, the
results would be as I describe: huge amounts of useless hair.
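
A rough Python sketch of that pipeline, for illustration only.  The function
name and the sample program are hypothetical, and the tokenizing is as crude
as tr/sort/uniq would be (keywords and comment words are counted too).  The
point is that the encoding arrives as an argument, out of band, and nothing
in the bytes is inspected to find it.

```python
import re
from collections import Counter
from typing import List


def identifiers_used_once(raw: bytes, encoding: str) -> List[str]:
    """Return the words that occur exactly once in a source file, sorted.

    `encoding` is supplied by the caller (out of band); the function never
    looks inside `raw` for an encoding declaration.
    """
    text = raw.decode(encoding)
    # Fortran 77 is case-insensitive, so fold case first (like `tr A-Z a-z`).
    words = re.findall(r"[a-z][a-z0-9]*", text.lower())
    counts = Counter(words)  # the `sort | uniq -c` step
    return sorted(word for word, n in counts.items() if n == 1)


# Hypothetical sample program, stored here as UTF-16 to show that the same
# tool works once the caller names the encoding.
sample = "      PROGRAM DEMO\n      X = 1\n      Y = X + 1\n      END\n"
print(identifiers_used_once(sample.encode("utf-16"), "utf-16"))
# -> ['demo', 'end', 'program', 'y']
```

The same function handles any other encoding Python knows about without
learning anything about Fortran.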

> If you consider HTTP a file system, it
> already implements your solution; all text is identified using text/*
> types and either the file system provides encoding information (charset
> parameter) or text processors are required to treat the document as
> ISO-8859-1 encoded. Text processors would actually only get character
> streams from the HTTP implementation and would not have to worry about
> character encodings and stuff. Does it work? No. 

It does not work because HTTP is layered over file systems that do not store
encoding declarations persistently.
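
For concreteness, the rule in the quoted paragraph (use the charset parameter
if present, otherwise treat text/* as ISO-8859-1) can be sketched in Python
as below.  The function name is hypothetical and this is only an illustration
of the rule as stated, not any particular HTTP implementation; it leans on
the standard library's email header parser to read the Content-Type value.

```python
from email.message import Message


def decode_http_text(body: bytes, content_type: str) -> str:
    """Decode a text/* body using only the Content-Type header value.

    The encoding comes from the charset parameter when one is given and
    falls back to ISO-8859-1 otherwise, as the quoted paragraph describes.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_maintype() != "text":
        raise ValueError("not a text/* media type: " + content_type)
    charset = msg.get_content_charset("iso-8859-1")  # default when absent
    return body.decode(charset)


# With a charset parameter the bytes are read as declared ...
print(decode_http_text("caf\u00e9".encode("utf-8"), "text/plain; charset=UTF-8"))
# ... and without one they are read as ISO-8859-1.
print(decode_http_text(b"caf\xe9", "text/html"))
```

The consumer sees characters either way; whether the header was right depends
entirely on whatever sat underneath the server and supplied it.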

-- 
"We are lost, lost.  No name, no business, no Precious, nothing.  Only empty.
Only hungry: yes, we are hungry.  A few little fishes, nassty bony little
fishes, for a poor creature, and they say death.  So wise they are; so just,
so very just."  --Gollum        jcowan@reutershealth.com  www.ccil.org/~cowan
Received on Friday, 19 September 2003 15:22:13 GMT
