Re: iDNR, an alternative name resolution protocol from Martin J. Duerst on 1998-09-04 (uri@w3.org from September 1998)

From: Martin J. Duerst <duerst@w3.org>
Date: Fri, 04 Sep 1998 15:13:08 +0900
To: "Sam Sun" <ssun@CNRI.Reston.VA.US>
Cc: "Larry Masinter" <masinter@parc.xerox.com>, "Harald Tveit Alvestrand" <Harald.Alvestrand@maxware.no>, "Jon Davis" <jdavis@inetinit.org>, "URI distribution list" <uri@Bunyip.Com>
Message-Id: <199809040647.PAA17609@sh.w3.mag.keio.ac.jp>
At 11:16 98/09/03 -0400, Sam Sun wrote:
> Hi Martin,
> 
> Very nice to hear from you... I think what we are really interested is the
> legal HREF syntax  (under A element) in HTML document. According to the
> HTML4.0 spec, the HREF is defined as "href = uri [CT]" where "uri" is based
> on RFC1630 (I suppose it need to update to RFC2396 now).

The HTML 4.0 spec already contains this. See Reference [URI] in
http://www.w3.org/TR/REC-html40/references.html.


> So the "uri" is
> used to govern the HTML document syntax, and I guess we all agree that it's
> not practical to MANDATE UTF-8 as the only encoding allowed?

Depending on what exactly "mandate UTF-8" means, this is indeed not
practical, because it only leaves a choice between:

1) Always expand everything to %HH.
2) Having strips of UTF-8 in documents with other encodings,
   which will lead to total chaos everywhere.



> Actually, the last paragraph in section 3.5
> (ftp://ftp.parc.xerox.com/pub/masinter/draft-masinter-url-i18n-03.txt) of
> your draft also said:
> 
> "For example, a URI which contains a string in Japanese might actually
> arrive with a variety of encodings, due to the variety of
> interpretations of deployed systems. While this recommendation
> specifies a canonical encoding of Japanese using %HH-encoded UTF-8, in
> practice many URIs will be presented which contain characters encoded
> using Shift-JIS or EUC-JP, either with %HH encoding or not. Thus, to
> transition to the new regime, URI-interpreting software for Japanese
> should accept all three of the EUC-JP, Shift-JIS and UTF-8 encodings."

This paragraph currently encompases two things:

- Some URIs in Shift_JIS or so that are already out there, and/or
  browsers that interpret such things on the octet level only,
  for which there might also be servers that react when getting
  the stuff in Shift_JIS.

- The (hopefully not so far away) case that page writer sees some
  URI in a newspaper, types that into his document (which happens
  to be in Shift_JIS), and the browser interprets this as Shift_JIS,
  converts it to ISO 10646 characters and then to UTF-8 (and adds
  %HH where necessary) and then goes on from there.


> Does this really mean that URI may be entered in any native encoding? If so,
> I think it would be helpful to provide the syntax definition used to declare
> the encoding of the URI. This allows URI parsers to convert to UTF-8 (or any
> other encoding used by the protocol) correctly without checking the document
> context. Otherwise, it could be hard for URI parsers to figure out the
> encoding of any particular URI, especially in multilingual document or on
> platforms with multiple input methods installed.

Do you mean a syntax definition in octets, or in characters?
For octets, things would get extremely nasty. Even ASCII characters
have different octets in ASCII, EBCDIC, and UTF-16.
For characters, it's basically the syntax of RFC 2396, where the
general characters (the category that contains A-Z,...) are extended
by the whole ISO 10646 repertoire minus certain cases. The certain
cases can be divided into stuff that we will hopefully be able to
specify exactly (e.g. precomposed/decomposed stuff,...), and stuff
that is up to the commonsense of the users, as currently with 0O or
lI1.

And I don't think you can do without document context. An URI in an
EBCDIC document has to be processed differently when you want to
send it over HTTP than an URI in an ASCII document. And the context
is ususally available, i.e. if you don't know whether it's EBCDIC
or ASCII, you will have great difficulties for anybody to read the
document.



> For example, the URI in HTML document may be defined as:
> 
> <uri scheme> ":" [ <encoding> "@" ] <uri scheme specific string>
> 
> The <encoding> is optional, and is not needed if the <uri scheme specific
> string> uses UTF-8.

Things like these were considered. But there are a number of problems:

- What does the encoding parameter mean? Is it the encoding that
  the bytes following the "@" are currently used for, or is it
  the encoding that the server is expecting.

- If you start down that road, what about cases where different parts
  of the URI are in different encodings.

- If it's the current encoding, it will make transcoding very hard work.
  In RFC 2070, HTML was designed to be transcoded blindly.

- Currently, you don't need this for EBCDIC. What is the result if
  part of the octets are to be interpreted according to the encoding
  of the document, and others according to the tag, but these two
  octet sets overlap.

- Nobody would want to write http:us-ascii@//www.w3.org/. Why should
  that be necessary for Japanese (or whatever else)? How would it
  look on cardboard boxes?


To understand how things should work out, I would like you to have a
look at http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-URI.pdf,
in particular the top of page 8, entitled "A Trip of a Japanese URI".



Regards,   Martin.
Received on Friday, 4 September 1998 02:51:11 UTC