- From: Sam X. Sun <ssun@CNRI.Reston.VA.US>
- Date: Fri, 6 Mar 1998 06:14:31 -0500
- To: "Larry Masinter" <masinter@parc.xerox.com>, "Al Gilman" <asgilman@access.digex.net>, "Martin J. Duerst" <duerst@w3.org>
- Cc: <uri@Bunyip.Com>
Hi, Larry,

I'm assuming the word "URL" in your draft is going to be replaced by the word "URI". I have some questions on it.

> Section 2. Syntax.
> ... For 8-bit-URLs, it is necessary to hex-encode reserved
> characters, delims, unwise special characters, white space,
> and characters that might otherwise be confusing when
> printed and then typed. (See RFC-DURST for details.)

Does the 8-bit-URL have to be UTF-8 encoded? Is the RFC-DURST reference available somewhere, Martin?

> 3.1 Requirements for URL entry
> ... However, for all other sequences of characters,
> entry should result in characters, in logical order,
> from the ISO 10646 character repertoire, encoded using
> the UTF-8 method [RFC 2044], and then subsequently
> encoded using the URL escape mechanism [RFC-URL-SYNTAX].

Where does the translation from a local encoding, say JIS, to UTF-8, and then to URL hex escape take place? Does this mean that, in the HTML document, the URL still has to be a hex-encoded ASCII string?

> 3.3 Requirements for display of URLs
> Software that displays URLs to users (or other kinds of
> transcription, e.g., deciding what to print in your
> magazine) should of course be robust: don't tell users
> about a URL they can't type!

This is where I'm not sure we can enforce anything. Let's say we have a community who don't speak English, and their computing environment supports their own encoding, take JIS for example. They are very likely to use JIS to type in or define URI names that are intended for use within that community. My feeling is that we can't force everyone to use ASCII or UTF-8 to encode all URIs, just as we can't force all HTML documents to be encoded in ASCII or UTF-8. So a better way might be to default to UTF-8, but allow a native encoding as long as that encoding is specified somewhere in the name reference. It seems more natural to consider that URI syntax governs how a URI is to be specified in the HTML document, not necessarily what gets sent over the wire.
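The entry steps quoted from section 3.1 above (local encoding, then UTF-8, then URL hex escape) can be sketched roughly as follows. This is only an illustration, not anything taken from the draft: the function names are made up, Shift_JIS stands in for "JIS", and escaping every byte outside the unreserved set is an assumption about how aggressive the escaping should be.

```python
from urllib.parse import quote, unquote_to_bytes


def local_to_escaped(raw: bytes, local_encoding: str) -> str:
    """Hypothetical helper: locally encoded bytes -> UTF-8 -> %HH escapes."""
    # Step 1: interpret the typed bytes in the user's local encoding.
    text = raw.decode(local_encoding)
    # Step 2: re-encode the characters as UTF-8.
    utf8_bytes = text.encode("utf-8")
    # Step 3: hex-escape everything outside the unreserved ASCII set
    # (safe="" is an assumption; a real scheme may keep some delimiters).
    return quote(utf8_bytes, safe="")


def escaped_to_text(escaped: str) -> str:
    """Hypothetical inverse: undo the %HH escapes and decode as UTF-8."""
    return unquote_to_bytes(escaped).decode("utf-8")


if __name__ == "__main__":
    typed = "日本".encode("shift_jis")  # what a Shift_JIS terminal would send
    escaped = local_to_escaped(typed, "shift_jis")
    print(escaped)
    print(escaped_to_text(escaped))
```

Note that the sketch only works because the caller names the local encoding. Nothing in the escaped result records it, which is the gap the questions above are pointing at.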
It is under this assumption that, when we defined the "hdl:" syntax (ftp://ietf.org/internet-drafts/draft-sun-handle-system-00.txt, many updates due though), we specified that it defaults to UTF-8 encoding, but when preceded with, say, "jis@", what follows will be in JIS encoding. I know Martin is going to beat me up since it's "too" cumbersome, but I can't think of a better solution :).

I don't think every URI needs to be readable or typable by everyone in the world, and we can't force everyone to make it so. That decision is to be made by whoever gives the URI its name. I guess this is the fundamental difference from what's specified in RFC1630.

On the other hand, if we don't define a way to specify the native encoding used for a particular URI, we might end up with the current HTML mess, where many natively encoded documents don't specify their encoding and users have to guess which encoding is used when looking at non-ASCII documents.

Again, I'm confused about what URI syntax is for. Is it for what is specified in the HTML document, or is it for the URI that gets transferred over the wire? Could we make this clearer in the draft? But even if it is the syntax for what gets transferred over the wire, I think it's still up to the scheme-specific service to decide what the syntax is, not the URI itself.

Regards,
Sam
ssun@cnri.reston.va.us

-----Original Message-----
From: Larry Masinter <masinter@parc.xerox.com>
To: Al Gilman <asgilman@access.digex.net>
Cc: uri@Bunyip.Com <uri@Bunyip.Com>
Date: Tuesday, March 03, 1998 4:51 PM
Subject: URL character set

>Al Gilman
># The restriction to the current RFC-822-header-safe subset of
># ASCII is temporary under the plans as I hear them. But it does
># not make sense to open this up to a schemewise free-for-all or the
># clients will choke on the necessary library. Saying that some
># clients will support some schemes defeats the purpose. The point
># of URIs is so that more clients can support more schemes.
>
># I think that
>
># "Character Set" Considered Harmful
># http://www.w3.org/MarkUp/html-spec/charset-harmful.html
>
># may be relevant here.
>
>Check out ftp://ds.internic.net/draft-masinter-url-i18n-00.txt
>
>I think we might want to remove the idea of a 'new kind of URL', though,
>and call it an EURI. If I get comments this week, I'll try to incorporate
>them in a revised version.
Received on Friday, 6 March 1998 06:24:13 UTC