- From: Misha Wolf <Misha.Wolf@reuters.com>
- Date: Mon, 27 Mar 2006 01:45:42 +0100
- To: www-international@w3.org
Does anyone feel like replying?

Misha

-----Original Message-----
From: ietf-types-bounces@alvestrand.no [mailto:ietf-types-bounces@alvestrand.no] On Behalf Of Bruce Lilly
Sent: 27 March 2006 00:51
To: ietf-types@alvestrand.no
Cc: ietf-822; Jacob Palme; ned+ietf-822@mrochek.com
Subject: Re: Charset mandatory in unix/linux

On Sun March 12 2006 10:10, ned+ietf-822@mrochek.com wrote:

> (cc'ing the ietf-types list since this doesn't seem like an appropriate topic
> for ietf-822)

[this response to types, cc to 822, Reply-To set to types]

[Jacob Palme wrote, regarding charset]
> > However, such a parameter is not mandatory in
> > Unix or Linux.
>
> I could say the same thing about media types. File extensions or type codes are
> commonly used to determine the media type. This is a huge problem that has led
> to serious security glitches as well as poor user experiences.

Agreed.

> > This is causing more and more problems, when
> > people have a mixture of files with different charsets,
> > which you easily get when you download files from the
> > Internet or receive them via e-mail.
>
> The reality is it is causing less and less problems as things gradually shift
> towards Unicode-based charsets and away from the vast array of less capable
> charsets. The security issues caused by non-use or misuse of media type labels
> are a far bigger problem, and worse, one that doesn't appear to be going away.

Agreed about the security issues w.r.t. non-use/misuse of type labels. However,
I have a different perspective regarding Unicode (see below).

> > Would it be possible to get the people responsible for the
> > file systems in Unix and Linux to add a mandatory charset
> > attribute to all text files?
>
> Knowing the charset buys you very little without also knowing the media type.
> You seem to be focused on plain text here and hence you're ignoring the larger
> media type issue.
> Lots of media types have parameters and even when the media
> type can be determined - it frequently cannot be done reliably - it is often
> done in a way that doesn't allow additional parameters to be attached.

I note that Unix and Unix-like systems don't have the notion of a "text file";
unlike some other systems, there is no distinction between "text" and "binary"
files. Moreover, it is one of the characteristics of Unix that file system
semantics apply not only to files per se, but also to devices (disk drives,
communications ports, etc.) and, in recent implementations, to interfaces to
system information for processes.

> > Best is probably to add a
> > generalized property list to files, so that also other
> > properties than charset can be added in the future.
>
> The ability to attach metadata to files is indeed a very useful feature, one
> that has been around for decades on some platforms at least. (I'm not going to
> bother with the history here.) And it is already available on Linux - at a
> minimum the ext2, ext3, and XFS file systems support it. (There are probably
> others but I'm too lazy to go look them up.)

ReiserFS is one of the more important Linux file systems...

> So in the sense of getting the filesystem to support this sort of tagging, your
> problem is already solved in many cases. But this is the easy part. You now
> have to get applications to agree on a specific use of metadata tags for
> charsets or media types or whatever. Good luck on getting that to happen.

Specifically regarding Unix-like systems, there is a long history of
representing metadata as character strings containing attribute/value pairs;
that is how environment variables are passed, how command-line options with
parameters are passed to programs, etc. IIRC, Tom Duff had a paper in the
"papers" volume of one of the recent editions of the Unix manuals about the use
of such pairs within graphics data files for conveying such information.
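
As a rough illustration of the filesystem support mentioned above, the sketch
below stores a media type and a charset as attribute/value strings in Linux
extended attributes. It assumes Python on Linux and a file system that accepts
user extended attributes. The attribute names `user.mime_type` and
`user.charset`, and the helper names, are illustrative choices only; nothing in
this thread establishes an agreed convention.

```python
import os

# Hypothetical attribute names, picked for illustration only.
ATTR_MEDIA_TYPE = "user.mime_type"
ATTR_CHARSET = "user.charset"


def tag_file(path, media_type, charset):
    """Attach media type and charset to a file as extended attributes.

    Values are restricted to ASCII so that the metadata itself can be
    read without knowing a charset for it.
    """
    os.setxattr(path, ATTR_MEDIA_TYPE, media_type.encode("ascii"))
    os.setxattr(path, ATTR_CHARSET, charset.encode("ascii"))


def read_tags(path):
    """Return the stored tags; a tag that is missing or unsupported is None."""
    tags = {}
    for name in (ATTR_MEDIA_TYPE, ATTR_CHARSET):
        try:
            tags[name] = os.getxattr(path, name).decode("ascii")
        except OSError:
            # No such attribute, or the file system does not support xattrs.
            tags[name] = None
    return tags
```

Calling `tag_file("notes.txt", "text/plain", "iso-2022-kr")` followed by
`read_tags("notes.txt")` would return both values where the file system
cooperates. Getting every application to read and write the same names is the
hard part, as noted above.
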
So if the filesystem metadata can be represented as character strings containing
attribute/value pairs, that's a very good fit (with one important caveat) to
media type parameters as that is precisely what those parameters are.

> > The advantage would be that programs which transport files
> > across the Internet, such as e-mail, ftp and http, would
> > more often use the correct charset and not munge the files
> > by giving them an incorrect charset. The commonly occurring
> > problem with incorrect charset would be reduced. Also local
> > problems such as text editors would benefit from knowing
> > the charset of a file.
>
> First of all, email and http do not "transfer files" per se. They transfer data
> objects and each protocol defines the metadata it considers appropriate to
> attach to data objects.

It's a bit more complicated than that specifically for text; email is fairly
consistent regarding line endings -- the message format is quite specific on
that point, as are the MIME specifications (RFC 2046 in particular). HTTP
however does not specify line endings for text, and that is a source of various
inconsistencies and problems. There is no way to specify line ending with HTTP;
the protocol specification expressly permits implementation discrepancies. That
is a potent source of trouble when transferring unlabeled or mislabeled binary
content containing 0x0D octets via HTTP.

[...]

> This situation means that in situations where retention of file metadata is
> important some sort of additional container has to be used. A vast number of
> such container formats have been defined - tar files, zip files,
> AppleSingle/AppleDouble, etc.

These typically (and specifically for tar and zip) do not include media type
information or charset or other type parameter information. The information in
the tar format, for example, carries time stamps, permissions, and file type
(where "type" means plain file vs. directory vs. device, etc.).
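
To make the point about tar concrete, here is a small sketch using Python's
standard `tarfile` module with the classic ustar header format. The member
name, timestamp, and helper names are made up for the example. It builds a
one-member archive in memory and lists what the header records for that member.

```python
import io
import tarfile


def build_archive():
    """Build a one-member ustar archive in memory (example data only)."""
    payload = "한글\n".encode("iso2022_kr")  # Korean text in ISO-2022-KR
    info = tarfile.TarInfo(name="greeting.txt")
    info.size = len(payload)
    info.mtime = 1143420342  # arbitrary example timestamp
    info.mode = 0o644
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def describe_member(info):
    """Collect the per-member metadata read back from the tar header."""
    return {
        "name": info.name,
        "size": info.size,
        "mtime": info.mtime,                 # time stamp
        "mode": oct(info.mode),              # permissions
        "is_regular_file": info.isfile(),    # "type" in the tar sense
        "is_directory": info.isdir(),
    }


with tarfile.open(fileobj=io.BytesIO(build_archive()), mode="r") as tar:
    for member in tar.getmembers():
        print(describe_member(member))
```

The header says when the member was modified, who may read it, and that it is a
plain file. It does not say that the octets are ISO-2022-KR text, so a
recipient has to learn that some other way.
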
Media types can be used to label such containers (IIRC there is a defined media
type for the "AppleDouble" stuff), and media type parameters can be used to
convey additional information; alternatively, media types could be defined to
label an octet stream as a particular type, with metadata carried via
parameters.

[...]

> So what's the bottom line? The bottom line is that you appear to be focusing on
> the wrong problem along several dimensions. First, charset information
> specifically isn't as interesting or essential as you claim, and the degree to
> which it is interesting is dropping for a variety of reasons. Second, you
> appear to have missed the larger and much more important problem of not having
> correct parameterized type information available. (And we haven't even
> discussed the many other sorts of metadata, like say language information, that
> is also useful to have.) Third, your focus on getting metadata support into
> filesystems is mostly misplaced - this is a solved problem in a lot of cases.
> And fourth, you don't seem to appreciate the difficulty of getting everyone to
> agree to actually use filesystem metadata to solve any of these problems. This
> last is a complete showstopper and I despair of there ever being significant
> progress in this area because of it.

In reverse order:

Getting agreement to use metadata has several issues: aside from the slow
evolutionary process of having protocols convey the metadata and having
applications store and retrieve that metadata, there is the issue of some sort
of standard(s) for APIs for the metadata storage/retrieval. That's not really in
IETF's bailiwick; perhaps it's something that might spark some interest in ECMA
or a similar SDO.
So far as Unix-like systems are concerned, the use of character-stream
representation of attribute/value pairs would seem to be a good fit as noted
above, with one caveat; in MIME Content-Type fields, one knows how to interpret
the "character strings" because they are specified to be comprised of characters
from a limited repertoire -- that is, one does not need to know the "charset" of
the character strings themselves; they are composed of a small, well-defined set
of characters which fit in octets (in 7 bits, in fact). So long as one can count
on being able to interpret the character strings as a stream of octets, I won't
despair. Conversely, if the very problem that Jacob has described, viz. the
inability to determine what sort of "character strings" one is looking at,
extends to the metadata storage, I will abandon all hope of a solution.

As far as additional metadata (language, etc.) is concerned, the attribute/value
pair paradigm still works, but at a higher level. In email and HTTP, for
example, there are header fields -- comprised of attribute/value pairs
(specifically header field name and field body) -- which convey not only MIME
media type and parameters, but other MIME fields (e.g. Content-Language) as well
as non-MIME data.

I do think Jacob has a valid concern; having unlabeled text files in various
charsets is a big problem, and from my perspective it is getting worse, not
better. The holy grail of a single unified character set that will supposedly
solve the problem sounds nice until one looks at the details. Fortunately, the
notion of a "charset" being somewhat more complex than the notion of a
"character set" helps a little; at least knowing the charset, one can
distinguish among utf-7, utf-8, utf-32be, utf-32le, utf-16be, and utf-16le, all
of which have "Unicode" as the underlying character code. But that doesn't help
much, precisely because "Unicode" is itself a "vast array" (ever-increasing in
number) of character code sets.
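
The distinction between those charsets can be shown with a short Python sketch.
The sample character and the helper name are arbitrary choices for the example,
and the codec names are Python's spellings (for instance `utf-16-be` for the
charset label utf-16be). One hangul code point is serialized under each label,
and the octets differ every time.

```python
TEXT = "한"  # U+D55C, a single hangul syllable

# Python codec names for the charsets listed above.
CHARSETS = ["utf-7", "utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]


def encode_all(text):
    """Map each charset name to the octets it produces for the same text."""
    return {name: text.encode(name) for name in CHARSETS}


for name, octets in encode_all(TEXT).items():
    print(f"{name:10} {octets.hex(' ')}")

# The same octets read under the wrong label decode without error
# but yield a different character.
mislabeled = TEXT.encode("utf-16-be").decode("utf-16-le")
print(mislabeled == TEXT)
```

The charset label is what lets a receiver turn the octets back into code
points. It says nothing about which version of the character code those code
points should be interpreted against.
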
Saying "Unicode" doesn't tell me if that's pre-"Korean mess" (see RFC 2279)
"Unicode" or post-"Korean mess" "Unicode". Or whether that's the "Unicode" that
has among its design principles a uniform code width of 16 bits and an encoding
strictly of text (specifically excluding musical notation), or the "Unicode"
that has a much wider code width and includes non-textual cruft such as (yes,
you guessed it) musical notation. Or whether it's one of the "Unicode"s that has
an attempt at encoding language information (versions 3.1 and 3.2), or one of
the "Unicode"s (earlier and later) that do not. And so on.

Ned, you're quite right that simply adding attributes isn't a panacea;
specifying "Unicode" version doesn't help. Obviously an implementation using
"Unicode" version 2.1 won't be able to make sense of the "features" which crept
into "Unicode" version 3.2 -- and not everybody is able to "upgrade". Aside from
the much greater resources required to support an increased code width (think of
small, battery-powered hand-held devices like cell phones), and the costs
(monetary and otherwise) that an "upgrade" might entail, some hardware and/or
software might be incapable of supporting a version change (not merely
insufficient hardware resources, but perhaps a vendor has dropped support or has
gone out of business). And there are related costs (monetary and otherwise)
associated with changing entire suites of software, font support, etc.

And as I see it, the problem is much worse than when I had to deal with ANSI
X3.4 and the ISO-8859 variants; at least I COULD have software that dealt with
the various encodings, fonts that support the glyphs, locales to switch among,
etc. -- as far as I know, it's not even possible to have multiple versions of
Unicode and to transcode between them on the same machine. Even if that is
theoretically possible, the fundamental version problem remains; "utf-8" still
doesn't tell me (or my software) anything about the underlying "Unicode"
version.
Whereas "ANSI X3.4" has a specific meaning -- its code width didn't suddenly
double one day.

To put the issue in the perspective of Jacob's problem, suppose Jacob has
received a text file in Korean and the issue of labeling the charset and
language is solved. If it is labeled as "ISO-2022-KR", he can proceed to make
sense of the file; conversely if it is labeled as "utf-7" he cannot because he
lacks information to determine whether the result of the transformation to
"Unicode" should be interpreted as groups of 16 bits or some other code width,
as well as which code points represent various hangul characters.
Received on Monday, 27 March 2006 00:45:58 UTC