- From: Bob Jung <bobj@mcom.com>
- Date: Mon, 9 Jan 1995 21:48:12 -0800
- To: www-mling@square.ntt.jp
- Cc: html-wg@oclc.org, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
The goal of my proposal is to (1) provide a means for new servers and browsers to correctly handle existing (unmodified) Web data in various character set encodings, and (2) not break current servers and browsers (any more than they already are) with regard to handling these code sets. The proposal does not try to fix things that are broken in existing clients/servers.

I agree with Larry Masinter <masinter@parc.xerox.com> that we should replace the Accept-charset=xxx request header with the accept-parameter charset=xxx request header in this proposal. Larry, thanks for the update.

Here are my replies to the thoughtful comments of:

        Daniel W. Connolly <connolly@hal.com>
        Ken Itakura <itakura@jrdv04.enet.dec-j.co.jp>

Daniel>|7.1.1. The charset parameter
Daniel>|
Daniel>| [...]
Daniel>|
Daniel>| The default character set, which
Daniel>| must be assumed in the absence of a charset parameter, is US-ASCII.
Daniel>
Daniel>This conflicts somewhat with your proposal. However, the RFC goes on
Daniel>to say...
Daniel>
Daniel>| The specification for any future subtypes of "text" must specify
Daniel>| whether or not they will also utilize a "charset" parameter, and may
Daniel>| possibly restrict its values as well.
Daniel>
Daniel>I wonder if changing the default from "US-ASCII" to
Daniel>"implementation-dependent" can be considered "restricting the values"
Daniel>of the charset parameter.

I agree that if the charset parameter is not specified, the default ***should*** be US-ASCII (or ISO8859-1, if it's been changed). Unfortunately, since charset was reserved for future use, Japanese servers had no choice but to serve non-Latin files without a charset parameter!

Why don't we enforce the default for servers using a future version of the HTTP protocol, and let current versions be "implementation dependent" in order to preserve backwards compatibility?

Daniel>I suppose the relevant scenario is where an info provider serves up
Daniel>an ISO2022-JP document with a plain old:
Daniel>        Content-Type: text/plain
Daniel>header. I gather that this is current practice.

Yes, this is the current practice (for text/html too). Additionally, some files are served in the SJIS and EUC encodings with the same headers.

Daniel>That intent is already mucked up somewhat by the fact that normal html
Daniel>documents are allowed to have bytes>127, which are normally
Daniel>interpreted as per ISO8859-1. So we already have the situation where a
Daniel>conforming HTTP client, say on a DOS box, might retrieve a text/html
Daniel>document and pass it over to a conforming MIME user agent, which would
Daniel>then blast it to the screen. The user would lose, because the bytes>127
Daniel>would get all fouled up.

Yes, this situation is broken for current browsers/servers, and I do not propose to fix it. Under my proposal, a new DOS browser would send:

        accept-charset=x-pc850

a new server would send back:

        Content-Type: text/html; charset=ISO8859-1

and the new DOS browser would convert the document to PC 850 for rendering.

Daniel>But... back to the case of ISO2022-JP encoded data tagged as plain
Daniel>"text/html". The business of slapping "charset=ISO8859-1" on the end
Daniel>would muck things up. So where do we assign fault?
Daniel>
Daniel>My vote is to assign fault at the information provider for serving up
Daniel>-JP encoded data without tagging it as such.

We are not trying to fix existing browsers/servers. If a new charset-enabled server slaps the wrong charset header on (or fails to slap one on for non-Latin1 data) when talking to a new charset-enabled browser, it is the server's fault.
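To make the DOS example above concrete, here is a rough Python sketch of the browser-side conversion; the function name and parsing details are my own illustration, not part of the proposal:

    def render_response(content_type_header, body_bytes):
        # e.g. content_type_header == "text/html; charset=ISO8859-1"
        charset = "ISO8859-1"  # proposed default from a new server
        for param in content_type_header.split(";")[1:]:
            name, _, value = param.strip().partition("=")
            if name.lower() == "charset":
                charset = value.strip('"')
        text = body_bytes.decode(charset)        # interpret as tagged
        return text.encode("cp850", "replace")   # re-encode for DOS display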
Daniel>So all those information providers serving up ISO2022-JP data without
Daniel>tagging it as such are violating the protocol. This doesn't prevent
Daniel>NetScape and other vendors from hacking in some heuristic way to
Daniel>handle such a protocol violation. But the spec shouldn't condone this
Daniel>behaviour.

Unfortunately, the spec is lagging behind the implementations. The spec did not provide a means for existing servers to resolve this problem, and pragmatically, I cannot introduce a server or client product that breaks established conventions.

As mentioned above, can't this be handled with HTTP versioning?

        HTTP V1.0    && no charset parameter == implementation defined
        HTTP V3.0(?) && no charset parameter == ISO8859-1

Daniel>Ok... so now let's suppose all the information providers agree to
Daniel>clean up their act. Somehow, they have to get their HTTP servers to
Daniel>tag -JP documents as such.
Daniel>
Daniel>How do they do this? File extension mappings? It's not relevant to the
Daniel>HTML or HTTP specs, but I think the overall proposal is incomplete
Daniel>until we have a workable proposal for how to enhance the major httpd
Daniel>implementations to correctly label non-ISO8859-1 documents.

Yes, I explicitly left this out of my proposal, but you're right: we need to discuss the implications.

Ken> - Before encouraging servers to label correctly for non-ISO8859-1, we
Ken>   must give servers a way to know what they should label it. Otherwise,
Ken>   nobody can blame a server that distributes illegal information.
Ken>
Ken>The third one is the difficult problem. The situation for mail may be
Ken>simple, since the user knows what encoding he is using, so he can
Ken>specify the correct label before sending. (The user who doesn't know
Ken>about encoding at all must not use the default encoding.) But the
Ken>situation for web documents is difficult. I think neither file extension
Ken>mapping nor classification by directory structure is suitable.
Ken>My current idea is 'server default' + 'directory default' + 'mapping file'.
Ken>But I myself don't like my idea. Does anyone have a more elegant idea?

Initially, I assume most web data will be configured by what directory it lives in; this should be a relatively easy extension to how existing servers parse their config files (see the sketch after this reply). Files like the Japanese .fj newsgroups (in ISO2022-JP) are already organized by directories, and so are a lot of Japanese Web pages.

A web site with versions of the same files in different encodings (e.g., SJIS, EUC and JIS) or languages (e.g., English and Japanese) could create separately rooted trees with the equivalent files in each tree. The top page could say "click here" for SJIS/EUC/JIS or English/Japanese.

A file-by-file basis would be supported too, but I'd expect it to be used infrequently; maintaining such a configuration database would be a Web server administrator's nightmare.

I don't like the idea of new file extensions, although current server software probably could support them. I think the data should really identify itself and not rely upon extensions. Also, we don't want to make people rename their files. For example, how are you going to rename the news archives?
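Here is a minimal sketch of the directory-based labeling I have in mind, with a hypothetical config format (longest matching directory prefix wins, falling back to a server-wide default):

    CHARSET_BY_DIRECTORY = {
        "/archives/fj/":  "ISO-2022-JP",   # .fj news archives, JIS-encoded
        "/docs/jp/sjis/": "Shift_JIS",
        "/docs/jp/euc/":  "EUC-JP",
    }
    SERVER_DEFAULT_CHARSET = None  # old behaviour: no charset parameter sent

    def charset_for(path):
        """Return the charset label for a served file, or None to omit it."""
        best = None
        for prefix, charset in CHARSET_BY_DIRECTORY.items():
            if path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
                best = (prefix, charset)
        return best[1] if best else SERVER_DEFAULT_CHARSET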
<TANGENT= warning not relevant to current proposal>

Ultimately, I'd like the content itself to specify the encoding. One idea is an HTML <charset> tag that would take precedence over the MIME header:

        <html>
        <charset=xxx>
        <head>
        <title> DOCUMENT TITLE GOES HERE </title>
        </head>
        <body>
        <h1> MAJOR HEADING GOES HERE </h1>
        THE REST OF THE DOCUMENT GOES HERE
        </body>
        </html>

</TANGENT>

Daniel>Then web clients will start seeing:
Daniel>
Daniel>        Content-Type: text/html; charset="ISO2022-JP"

Only new charset-enabled clients will see this.

Daniel>Many of them will balk at this and punt to "save to file?" mode.
Daniel>
Daniel>Is that a bad thing? For standard NCSA Mosaic 2.4, no, because
Daniel>it can't do any reasonable rendering of these documents anyway.
Daniel>
Daniel>But what about the multi-localized version of Mosaic? Does it handle
Daniel>charset=... reasonably? What's the cost of enhancing it to do so and
Daniel>deploying the enhanced version?
Daniel>
Daniel>The proposal says that the server should not give the charset=
Daniel>parameter unless the client advertises support for it. I think that
Daniel>will cause more trouble than it's worth (see the above scenario of
Daniel>untagged -JP documents being passed from HTTP clients to MIME user
Daniel>agents on a DOS box.)

Why is this more trouble? It's broken now and remains broken. In either case the client would ignore the charset information and guess at the encoding (for most clients the guess would be 8859-1).

But the purpose of NOT returning the charset parameter is to avoid breaking current clients' parsing of the MIME Content-Type. If the server always slapped charset on, a current client would parse the header:

        Content-Type: text/html; charset=ISO8859-1

and think the content type was the entire string 'text/html; charset=ISO8859-1', not just 'text/html', and would fail to read Latin1 files! To be backwards compatible, servers should not send the charset parameter to old browsers.

Daniel>One outstanding question is: does text/html include all charset=
Daniel>variations or just latin1? That is, when a client says:
Daniel>
Daniel>        Accept: text/html
Daniel>
Daniel>is it implying acceptance of all variations of html, or just latin1?
Daniel>
Daniel>To be precise, if a client only groks latin1, and it says accept:
Daniel>text/html, and the server sends ISO2022-JP encoded text, and the user
Daniel>loses, is the fault in the client for not supporting ISO2022-JP, or at
Daniel>the server for giving something the client didn't ask for?
Daniel>
Daniel>First, "text/html" is just shorthand for "text/html; charset=ISO8859-1",
Daniel>so the client didn't advertise support for -JP data.
Daniel>
Daniel>But "giving something the client didn't ask for" is _not_ an HTTP
Daniel>protocol violation (at least not if you ask me; the ink still isn't
Daniel>dry on the HTTP 1.0 RFC though...). It's something that the client
Daniel>should be prepared for.

As you put it, "It's something that the client should be prepared for." I'm still assuming that the accept-parameter: charset=xxx request header dictates whether the server sends back the charset parameter. An old browser should continue to get the 2022-JP data untagged. A new charset-enabled browser should get tagged 2022-JP data even if it only advertised 8859-1.
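In other words, the server-side rule is roughly the following sketch (names are illustrative, not from the proposal text): append the charset parameter only when the request advertised charset support.

    def content_type_header(media_type, charset, client_sent_accept_charset):
        """Build the Content-Type line for a response.

        media_type: e.g. "text/html"
        charset: the encoding the file is actually labeled with, or None
        client_sent_accept_charset: True if the request carried an
            accept-parameter charset=... header (i.e., a new browser)
        """
        if client_sent_accept_charset and charset is not None:
            return "Content-Type: %s; charset=%s" % (media_type, charset)
        return "Content-Type: %s" % media_type   # old browsers: untagged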
Daniel>It has the responsibility to interpret the
Daniel>charset correctly, or save the data to a file, or report "sorry, I
Daniel>don't grok this data" to the user. If it blindly blasts ISO2022-JP
Daniel>tagged data to an ASCII/Latin1 context, then it's broken.

I agree. I've purposely read EUC and JIS pages on my Mac (SJIS) so that I could save the source and look (grok) at it later. (Not a usual user...) I'm glad you bring up this point, so we can consider the implications. But what the client does in this situation should be implementation dependent and not part of this proposal.

Daniel>Does this mean that charset negotiation is completely unnecessary?
Daniel>No. It's not necessary in any of the above scenarios, but it would be
Daniel>necessary in the case where information can be provided in, for
Daniel>example, unicode UCS-2, UTF-8, UTF-7, or ISO2022-JP, but the client
Daniel>only groks UTF-8.
Daniel>
Daniel>In that case, something like:
Daniel>
Daniel>        Accept-Charset: ISO8859-1, ISO2022-JP
Daniel>
Daniel>or perhaps
Daniel>
Daniel>        Accept-Parameter: charset=ISO8859-1, charset=ISO2022-JP
Daniel>
Daniel>I'm not convinced of the need for the generality of the latter syntax.
Daniel>Besides: we ought to allow preferences to be specified a la:
Daniel>
Daniel>        Accept-Charset: ISO8859-1; q=1
Daniel>        Accept-Charset: Unicode-UCS-2; q=1
Daniel>        Accept-Charset: Unicode-UTF-8; q=0.5
Daniel>        Accept-Charset: Unicode-UTF-7; q=0.4
Daniel>        Accept-Charset: ISO2022-JP; q=0.2
Daniel>
Daniel>which says "if you've got latin1 or UCS2, I like that just fine. If
Daniel>you have UTF-8, UTF-7, or -JP, I'll take it, but I won't like it as
Daniel>much."

Ken>I want to add one more thing about this issue. We could have documents
Ken>which use multiple charsets in the future. We must define the way to
Ken>label such a document.
Ken>It can be like ...
Ken>        Content-Type: text/html; charset="ISO2022-JP", charset="ISO8859-6"
Ken>Is this OK?

I'd rather leave this as a possible future direction. Multilingual support has generated a lot of heated discussion. If we can agree on a means to support the existing mono-lingual, mono-encoded Web data, that will allow us to create products to fill an immediate need. Can we phrase something that leaves this open and discuss it in another thread?
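For reference, a minimal sketch (my illustration, not Daniel's text) of how a server might pick an encoding from q-valued Accept-Charset lines like those quoted above:

    def pick_charset(accept_charset_lines, available):
        """Choose the available charset with the highest client preference."""
        prefs = {}
        for line in accept_charset_lines:
            value = line.split(":", 1)[1]          # drop "Accept-Charset:"
            name, _, params = value.partition(";")
            q = 1.0                                # default preference
            for p in params.split(";"):
                k, _, v = p.strip().partition("=")
                if k == "q" and v:
                    q = float(v)
            prefs[name.strip().lower()] = q
        ranked = sorted(available, key=lambda c: prefs.get(c.lower(), 0.0),
                        reverse=True)
        return ranked[0] if ranked else None

    # pick_charset(["Accept-Charset: ISO8859-1; q=1",
    #               "Accept-Charset: Unicode-UTF-8; q=0.5"],
    #              ["Unicode-UTF-8", "ISO2022-JP"])  ->  "Unicode-UTF-8"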
Regards,
Bob

Bob Jung                         +1 415 528-2688   fax +1 415 254-2601
Netscape Communications Corp.    501 E. Middlefield    Mtn View, CA 94041

Received on Tuesday, 10 January 1995 00:11:56 UTC