- From: Martin J. Duerst <duerst@w3.org>
- Date: Fri, 04 Sep 1998 15:13:08 +0900
- To: "Sam Sun" <ssun@CNRI.Reston.VA.US>
- Cc: "Larry Masinter" <masinter@parc.xerox.com>, "Harald Tveit Alvestrand" <Harald.Alvestrand@maxware.no>, "Jon Davis" <jdavis@inetinit.org>, "URI distribution list" <uri@Bunyip.Com>
At 11:16 98/09/03 -0400, Sam Sun wrote: > Hi Martin, > > Very nice to hear from you... I think what we are really interested is the > legal HREF syntax (under A element) in HTML document. According to the > HTML4.0 spec, the HREF is defined as "href = uri [CT]" where "uri" is based > on RFC1630 (I suppose it need to update to RFC2396 now). The HTML 4.0 spec already contains this. See Reference [URI] in http://www.w3.org/TR/REC-html40/references.html. > So the "uri" is > used to govern the HTML document syntax, and I guess we all agree that it's > not practical to MANDATE UTF-8 as the only encoding allowed? Depending on what exactly "mandate UTF-8" means, this is indeed not practical, because it only leaves a choice between: 1) Always expand everything to %HH. 2) Having strips of UTF-8 in documents with other encodings, which will lead to total chaos everywhere. > Actually, the last paragraph in section 3.5 > (ftp://ftp.parc.xerox.com/pub/masinter/draft-masinter-url-i18n-03.txt) of > your draft also said: > > "For example, a URI which contains a string in Japanese might actually > arrive with a variety of encodings, due to the variety of > interpretations of deployed systems. While this recommendation > specifies a canonical encoding of Japanese using %HH-encoded UTF-8, in > practice many URIs will be presented which contain characters encoded > using Shift-JIS or EUC-JP, either with %HH encoding or not. Thus, to > transition to the new regime, URI-interpreting software for Japanese > should accept all three of the EUC-JP, Shift-JIS and UTF-8 encodings." This paragraph currently encompases two things: - Some URIs in Shift_JIS or so that are already out there, and/or browsers that interpret such things on the octet level only, for which there might also be servers that react when getting the stuff in Shift_JIS. - The (hopefully not so far away) case that page writer sees some URI in a newspaper, types that into his document (which happens to be in Shift_JIS), and the browser interprets this as Shift_JIS, converts it to ISO 10646 characters and then to UTF-8 (and adds %HH where necessary) and then goes on from there. > Does this really mean that URI may be entered in any native encoding? If so, > I think it would be helpful to provide the syntax definition used to declare > the encoding of the URI. This allows URI parsers to convert to UTF-8 (or any > other encoding used by the protocol) correctly without checking the document > context. Otherwise, it could be hard for URI parsers to figure out the > encoding of any particular URI, especially in multilingual document or on > platforms with multiple input methods installed. Do you mean a syntax definition in octets, or in characters? For octets, things would get extremely nasty. Even ASCII characters have different octets in ASCII, EBCDIC, and UTF-16. For characters, it's basically the syntax of RFC 2396, where the general characters (the category that contains A-Z,...) are extended by the whole ISO 10646 repertoire minus certain cases. The certain cases can be divided into stuff that we will hopefully be able to specify exactly (e.g. precomposed/decomposed stuff,...), and stuff that is up to the commonsense of the users, as currently with 0O or lI1. And I don't think you can do without document context. An URI in an EBCDIC document has to be processed differently when you want to send it over HTTP than an URI in an ASCII document. And the context is ususally available, i.e. if you don't know whether it's EBCDIC or ASCII, you will have great difficulties for anybody to read the document. > For example, the URI in HTML document may be defined as: > > <uri scheme> ":" [ <encoding> "@" ] <uri scheme specific string> > > The <encoding> is optional, and is not needed if the <uri scheme specific > string> uses UTF-8. Things like these were considered. But there are a number of problems: - What does the encoding parameter mean? Is it the encoding that the bytes following the "@" are currently used for, or is it the encoding that the server is expecting. - If you start down that road, what about cases where different parts of the URI are in different encodings. - If it's the current encoding, it will make transcoding very hard work. In RFC 2070, HTML was designed to be transcoded blindly. - Currently, you don't need this for EBCDIC. What is the result if part of the octets are to be interpreted according to the encoding of the document, and others according to the tag, but these two octet sets overlap. - Nobody would want to write http:us-ascii@//www.w3.org/. Why should that be necessary for Japanese (or whatever else)? How would it look on cardboard boxes? To understand how things should work out, I would like you to have a look at http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-URI.pdf, in particular the top of page 8, entitled "A Trip of a Japanese URI". Regards, Martin.
Received on Friday, 4 September 1998 02:51:11 UTC