Re: Relative URI vs. URN, and URI uniformity.

to follow up on what Sam Sun said:

> I think it might be helpful to add that the "#fragment" and
> relative URI are two kind of entities within any certain name
> and are processed differently.  For example,
> href="relative-uri" is processed by binding "relative-uri" to
> its base-uri, and sending the complete URI across the wire. But
> for href="foo:aaa#bbb", according to the current URI draft,
> the "#bbb" would be cut off from the URI reference (since it's
> not part of URI), and only the "foo:aaa" get sent over the
> wire.

The idea that "gets sent over the wire" has anything to do with
URIs in general is IMHO a popular misconception.  URIs are
identifiers.  They are strings that key the retrieval of
information in some way as to make the reference widely usable
throughout the Internet.

But file: URLs don't involve exercising telecommunications
protocols, and one of the primary beauties of the Web is the way
the difference between local and remote references is hidden by
the unifying power of a single scheme of references.

Many news: URLs are totally ambiguous as to whether TCP/IP will
be required to retrieve the identified resource.

For example, an HTML- or XML-processing application recognizes an
URI-reference within a document context conditioned by the rules
of that language.  But the reference is then exercised through a
resource-retrieval service, not, in general, a comm protocol.
Lots of these resource-retrieval services employ comm protocols,
but not necessarily all of them.
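
To make the distinction concrete, here is a minimal sketch in Python of
a client that hands a URI to a schemewise retrieval service.  The names
retrieve, RETRIEVERS, retrieve_file and retrieve_http are hypothetical
and chosen only for illustration; the http handler is deliberately left
as a stub.  The point is only that the caller never needs to know
whether a given handler touches the network.

```python
from urllib.parse import urlsplit
from urllib.request import url2pathname


def retrieve_file(parts):
    # file: is served from the local file system; no comm protocol is exercised.
    with open(url2pathname(parts.path), "rb") as handle:
        return handle.read()


def retrieve_http(parts):
    # Stub (assumption): a real handler would exercise HTTP over TCP/IP here.
    raise NotImplementedError("network retrieval is not part of this sketch")


# Hypothetical registry mapping a scheme name to its retrieval service.
RETRIEVERS = {"file": retrieve_file, "http": retrieve_http}


def retrieve(uri):
    # Pick the retrieval service by scheme alone; the caller sees one interface.
    parts = urlsplit(uri)
    handler = RETRIEVERS.get(parts.scheme)
    if handler is None:
        raise ValueError("no retrieval service registered for scheme %r" % parts.scheme)
    return handler(parts)
```

A call such as retrieve("file:///etc/hosts") would read from local disk,
while retrieve("http://www.w3.org/") would go through whatever the http
handler does; the reference syntax is the same in both cases.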

> My observation is that relative URI defines a client side
> process for compounding names. Based on libwww.lib
> implementation, relative URI never leaves the client side by
> itself, but have to bind to the URI scheme in its base URI
> before it can be of any use. So, if URI is considered a machine
> to machine protocol syntax, is relative URI an URI?

Use of the #fragment can be safely deferred until after the
enclosing object containing the name 'fragment' has been
retrieved.  For lots of schemes such as file: and ftp: it would
break the retrieval to mention the #fragment in external
dealings.  This combination of ingredients has made it customary
to strip the #fragment and then deal with the rest in a
schemewise fashion.  I agree with your arguments that this is not
logically necessary.  But we have to face the fact that it is
essentially-universal current practice.
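
The customary client-side handling described above can be sketched in a
few lines of Python using the standard urllib.parse module.  The
function name prepare_reference and the example.org base URI are
assumptions made for illustration; this shows common practice, not
anything the URI draft mandates.

```python
from urllib.parse import urldefrag, urljoin


def prepare_reference(base_uri, reference):
    # Step 1: bind a relative reference to its base URI (client side).
    absolute = urljoin(base_uri, reference)
    # Step 2: strip the #fragment; it is applied only after retrieval.
    retrievable, fragment = urldefrag(absolute)
    return retrievable, fragment


# Relative reference: resolved against the (hypothetical) base, fragment set aside.
print(prepare_reference("http://example.org/docs/index.html", "chapter2.html#sec3"))

# Absolute reference in another scheme: only the fragment is split off.
print(prepare_reference("http://example.org/docs/index.html", "foo:aaa#bbb"))
```

In both cases the first element of the result is what gets handed to
the schemewise retrieval service, and the second is kept by the client
for use once the enclosing object has arrived.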

>  The URN working group defined the syntax for identifiers to be
> transferred over the wire. If I understand correctly, URN
> syntax is designed mainly as a machine to machine protocol
> syntax. 

Then it's not an URI.  An URI needs to be person-to-machine ready.

> If there were any relative URN to be defined, it would
> mean that the URN service could not be stateless, and have to
> keep history of previous transactions in order to construct
> compound names, which doesn't seem very practical.

> This leads to the question to what URI is.
> 

> First, an observation: Some URI schemes, like “http:” or
> “urn:”, have the client side syntax follow the machine to
> machine protocol syntax. Some other URI schemes don’t. For
> example, the ftp server will not know to convert %23 to
> ‘#’, and when you send
> “ftp:user%23&pass%23word@foo.com”, the ftp server at
> “foo.com” will not recognize you are user “user#”,
> and entering password “pass#word”. Another example is
> LDAP whose protocol uses UTF-8 encoding, but the URL syntax
> seems to follow the http URL.

Yes, this is all handled already.  Characters that are URL-unsafe
are restored before the ftp communication protocol is exercised
by the gateway acting as ftp: schemewise retrieval service.
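
As a rough illustration of that restoring step, the sketch below
decodes the escaped user name and password before they would be handed
to the ftp protocol.  The function name ftp_login_from_url is
hypothetical, and the URL is written in the usual
ftp://user:password@host/ form, which is an assumption on my part; the
quoted example spells it differently.

```python
from urllib.parse import unquote, urlsplit


def ftp_login_from_url(url):
    # urlsplit leaves %XX escapes in place, so decode each part explicitly.
    parts = urlsplit(url)
    user = unquote(parts.username or "")
    password = unquote(parts.password or "")
    # These decoded values are what the gateway would pass to the ftp server.
    return parts.hostname, user, password


print(ftp_login_from_url("ftp://user%23:pass%23word@foo.com/"))
```

The ftp server itself never sees "%23"; by the time the protocol is
exercised the gateway has already turned it back into "#".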

> It seems more natural to consider URI as a client side referral
> syntax. For any URI “foo:foo-specific-name”, the URI is
> responsible only to refer “foo-specific-name” to
> “foo:” module, but nothing more. Individual scheme should
> be allowed to decide how to parse its scheme specific data, and
> how to process the “#fragment”. Each scheme should be
> allowed to decide its own set of reserved/excluded characters,
> its character set encoding, and whether the client-side syntax
> follows the protocol syntax.

> If this is the case, it seems that for URI, the only reserved
> characters needed is byte ‘%25’, which is character % in
> ASCII encoding. And the only excluded character needed is byte
> ‘%22’, which is character ” in ASCII encoding. The
> ‘%25’ is needed to allow non-printable characters be
> entered and be understood. The ‘%22’ is necessary for
> separating URI from its surrounding context.

> Also, URI doesn’t have to be constrained to a subset of
> ASCII characters only, but should let individual URI scheme to
> decide how to support international character sets. Based on
> what I saw, the only strong arguments for URI to be ASCII only
> is that it is printable and can be entered from almost any (not
> all!) keyboard. These might be nice user interface features for
> “http:” URL, not necessarily for all other URIs. To be
> short, not every document is written to be readable by anyone
> around the world, nor would it necessary to require _every_
> NAME to be defined printable and enterable by anyone around the
> world. It should be a decision of the name issuer, not the
> underlying technology.

Yes, but every URI needs to be written to be interpretable in
many, many contexts used for global information distribution.
You want name visibility to be unhampered or it's not an URI.
The internationalization of URIs may have room for variations
lower down, but the internationalization scheme has to boot from
somewhere known or we don't have URIs anymore.

URIs are used in a range of contexts.  Two of the most common are
HREF attributes in HTML documents and Location: etc. headers in
HTTP messages.  Header-safe character restrictions are an
efficiency and error-reducing measure in this major domain of
application.  This restriction should not be removed until the
replacement scheme has been demonstrated not to materially
degrade service in this scenario.

The restriction to the current RFC-822-header-safe subset of
ASCII is temporary under the plans as I hear them.  But it does
not make sense to open this up to a schemewise free-for-all or the
clients will choke on the necessary library.  Saying that some
clients will support some schemes defeats the purpose.  The point
of URIs is so that more clients can support more schemes.

I think that

 "Character Set" Considered Harmful
 http://www.w3.org/MarkUp/html-spec/charset-harmful.html

may be relevant here.

>  Essentially, I’m suggesting that the uniformity of URI
> should be only on its scheme binding syntax, as is commonly
> accepted in the web context, but not extend into the scheme
> specific content.

You need a better migration plan for existing clients.  To
declare this abruptly will cause gaps in service and loss
of customer confidence.

Al

Received on Tuesday, 3 March 1998 14:29:40 UTC