Re: iDNR, an alternative name resolution protocol

Sam Sun (ssun@CNRI.Reston.VA.US)
Fri, 4 Sep 1998 17:24:47 -0400

Message-ID: <063d01bdd84a$71a944a0$1c1e1b0a@ssun.CNRI.Reston.Va.US>
From: "Sam Sun" <ssun@CNRI.Reston.VA.US>
To: "Martin J. Duerst" <>
Cc: "URI distribution list" <uri@Bunyip.Com>
Date: Fri, 4 Sep 1998 17:24:47 -0400
Subject: Re: iDNR, an alternative name resolution protocol


We are in the middle of drafting the URL Syntax for the Handle System, which
is the syntax for Handles referenced in HTML documents, not the syntax for
Handles transmitted over the wire (the latter uses UTF-8). Because not every
customer of the Handle System uses a UTF-8 capable platform, we have to
define the syntax so that they can enter Handles in their native encoding.
(Customers will not enter hex-encoded names because, one, they won't know
the encoding and, two, such names don't look right.) The Handle System
Resolver will have to do the job of translating the native encoding to the
protocol encoding. For the Resolver to do this, the native encoding has to
be specified with the Handle reference. Otherwise, if we simply default the
encoding of the Handle reference to the encoding of the surrounding HTML
document, the Resolver will have to parse the document. Also, it might break
if a user copies and pastes the reference from one browser window to
another.
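
A minimal sketch of the translation step described above, assuming the
Resolver receives the raw octets of the Handle together with the name of
their native encoding. The function name, the sample Handle, and the use of
Shift_JIS as the native encoding are illustrative assumptions, not part of
the Handle System specification.

```python
def to_protocol_encoding(handle_octets: bytes, native_encoding: str) -> bytes:
    """Re-encode a Handle from its declared native encoding to UTF-8.

    Hypothetical helper: raises UnicodeDecodeError if the octets are not
    valid in the declared encoding, and LookupError if the encoding name
    is unknown to Python.
    """
    text = handle_octets.decode(native_encoding)
    return text.encode("utf-8")


# A made-up Handle as it might be typed on a Shift_JIS platform.
name = "10.1000/日本語"
native_octets = name.encode("shift_jis")

wire_octets = to_protocol_encoding(native_octets, "shift_jis")
print(wire_octets == name.encode("utf-8"))
```

Without the declared encoding the Resolver can only guess: decoding the same
octets as, say, Latin-1 succeeds silently but yields a different name on the
wire.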

>Do you mean a syntax definition in octets, or in characters?
>For octets, things would get extremely nasty. Even ASCII characters
>have different octets in ASCII, EBCDIC, and UTF-16.
>For characters, it's basically the syntax of RFC 2396, where the
>general characters (the category that contains A-Z,...) are extended
>by the whole ISO 10646 repertoire minus certain cases. The certain
>cases can be divided into stuff that we will hopefully be able to
>specify exactly (e.g. precomposed/decomposed stuff,...), and stuff
>that is up to the commonsense of the users, as currently with 0O or

The "URI interpreter" might want to view URIs as octets, except that we have
to define a small set of octets that are used as separators (is this
doable?). However, the encoding information is necessary for the "protocol
specific interpreter" to translate the URI into its protocol encoding.

RFC2396 doesn't address the syntax we are discussing here. Section 1 of
RFC2396 states that "This document does not discuss the issues and
recommendation for dealing with characters outside of the US-ASCII character
set [ASCII]; those recommendations are discussed in a separate document."
And I suppose this "separate document" is the draft you are working on :).

>> For example, the URI in HTML document may be defined as:
>> <uri scheme> ":" [ <encoding> "@" ] <uri scheme specific string>
>> The <encoding> is optional, and is not needed if the <uri scheme specific
>> string> uses UTF-8.
>Things like these were considered. But there are a number of problems:
>- What does the encoding parameter mean? Is it the encoding that
>  the bytes following the "@" are currently used for, or is it
>  the encoding that the server is expecting.

The encoding is the one that the bytes following the "@" are currently in.
It's not what is sent to the server over the wire. The message sent to
the server is generated by the protocol specific filter, and should follow
the protocol specification identified by the URI scheme.
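
As a rough sketch of how a protocol specific filter might read the proposed
form <uri scheme> ":" [ <encoding> "@" ] <uri scheme specific string>, the
following splits the raw octets and decodes the remainder using the declared
encoding, defaulting to UTF-8. The "hdl" scheme name, the sample Handle, and
the allowed characters of an encoding label are assumptions made for
illustration. The sketch also does not resolve the ambiguity with schemes
that already use "@" for other purposes.

```python
import re

# Assumed shape of an encoding label: ASCII letters, digits, and . _ + -
_LABEL = re.compile(rb"[A-Za-z][A-Za-z0-9._+-]*")


def parse_reference(raw: bytes):
    """Split raw octets into (scheme, encoding, scheme-specific text)."""
    scheme, colon, rest = raw.partition(b":")
    if not colon:
        raise ValueError("missing ':' after the scheme")
    encoding = "utf-8"  # no label means the string is already UTF-8
    label, at, tail = rest.partition(b"@")
    if at and _LABEL.fullmatch(label):
        encoding, rest = label.decode("ascii"), tail
    # Decode with the declared encoding; the text can then be re-encoded
    # as UTF-8 (or whatever the protocol requires) for the wire.
    return scheme.decode("ascii"), encoding, rest.decode(encoding)


labelled = b"hdl:shift_jis@" + "10.1000/日本語".encode("shift_jis")
print(parse_reference(labelled))
print(parse_reference("hdl:10.1000/日本語".encode("utf-8")))
```

Both calls should yield the same scheme-specific text, which is the point of
carrying the label with the reference: the octets differ, the name does not.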

>- If you start down that road, what about cases where different parts
>  of the URI are in different encodings.

It will not support mixed encodings. That's a limitation, but it's a step
more flexible than allowing only one encoding (i.e. UTF-8). Unlike HTML
documents, identifiers probably don't mix encodings that much anyway.

>- If it's the current encoding, it will make transcoding very hard work.
>  In RFC 2070, HTML was designed to be transcoded blindly.
>- Currently, you don't need this for EBCDIC. What is the result if
>  part of the octets are to be interpreted according to the encoding
>  of the document, and others according to the tag, but these two
>  octet sets overlap.

I don't think I quite get your points here. By transcoding, do you mean
translating from one encoding to another? The HTML interpreter doesn't
transcode any HREF reference. It's the protocol specific filter that does
the work.

>- Nobody would want to write http:us-ascii@// Why should
>  that be necessary for Japanese (or whatever else)?

This is where it ENCOURAGES using UTF-8, since no extra typing would be

> How would it  look on cardboard boxes?

I don't know the answer to this. Maybe we can work it out? It seems to me
that encoding is a computing issue. When you type in, or copy and paste from
one (browser) window to another, the encoding needs to be carried along. But
the encoding probably does not need to be specified in any cardboard printing.