URI terminology demystified

Pat writes:
| I am unclear why N-triples refers to URIs 
| by a different name, but it does, and I'm just following along here.
	-- Wed, 19 Sep 2001 19:35:49 -0500

It's sorta embarrasing to explain what a mess the URI specs
are, but it's nothing deep or subtle, so let's see if I can explain...

Q: What's a URI?

A: Everybody knows what a URI is; most folks call it
a URL. They look like this:

	http://www.w3.org/
	ftp://gatekeeper.dec.com/
	tel:+1-617-253-2613
	http://www.w3.org/2001/02pd/

Q: Gee; those are sorta long and tedious... can I
abbreviate them, kinda like filenames in Unix and DOS?

A: sure... in an HTML (or RDF) document available at
http://www.w3.org/2001/foo, you
can shorten http://www.w3.org/2001/02pd/gv
to just 02pd/gv as in

	<a href="02pd/gv">...</a>

or
	<rdf:Description rdf:about="02pd/gv">...</rdf:Description>

The value of the href/resource attribute is called
a _URI reference_*; all URIs are URI references,
but not all URI references are URIs.
Whenever you use a relative URI reference
(i.e. one that isn't a full, absolute URI),
you have to be clear about what URI you mean
for it to be relative to so that the consumer
can turn it back into absolute form; this
is called the _base URI_.

Q: Say... speaking of HTML href, I've seen these
hash-mark deelies (#foo). What are those all about?

A: In a URI reference, the hash mark
(aka octothorpe; a fifty-cent word you can
have for free ;-) separates a _fragment
identifier_ from the rest of the URI reference.
For example:

	<a href="02pd/gv#color">...</a>

Q: So the absolute form of that would be

  http://www.w3.org/2001/02pd/gv#color

right?

A: right.

Q: and that's an (absolute) URI, right?

A: you would think so; and as of RFC1630, the term
URI included things with fragment identifiers.
But RFC1630 is just informational; somewhere
along the road to RFC2396, the Internet Draft
Standard, the term "URI" was narrowed to exclude
things with fragment identifiers.

Q: so what do we call it?

A: you can call it a URI reference; it is
one of those.

Q: but URI references include relative
deelies like "foo/bar"; I want a term
that excludes those but includes things
with fragment identifiers. That's
what we're using in ntriples after all, no?

A: Yes, that's what we're using
in ntriples, but there is no ratified term for
URI-plus-optional-fragment-identifier.
Sorry. If it's any consolation, I'm on
record wanting such a term for quite
some time now:

cf URI terminology, esp. in XML specs
   Dan Connolly (Mon, Jan 10 2000)
   http://lists.w3.org/Archives/Public/uri/2000Jan/0002.html


Q: sigh... ok... but what about ampersands in
href/resource attributes?

	<a href="lookup?name=Doe&amp;age=27">...</a>

A: that's just an XML quoting mechanism. It really
has nothing to do with URIs. The value of that
href attribute is
	lookup?name=Doe&age=27
just like the value of the dc:title attribute in
	<rdf:Description dc:title="AT&amp;T Stuff"/>
is
	AT&T Stuff

It's unfortunate that the CGI designers couldn't
be convinced to switch from & to ; for this purpose;
it would have allowed folks to copy-and-paste
without dealing with these quoting issues. Again,
my I-told-you-so is on record:


            NOTE - The URI from a query form submission can be
            used in a normal anchor style hyperlink.
            Unfortunately, the use of the `&' character to
            separate form fields interacts with its use in SGML
            attribute values as an entity reference delimiter.
            For example, the URI `http://host/?x=1&y=2' must be
            written `<a href="http://host/?x=1&#38;y=2"' or `<a
            href="http://host/?x=1&amp;y=2">'.

            HTTP server implementors, and in particular, CGI
            implementors are encouraged to support the use of
            `;' in place of `&' to save users the trouble of
            escaping `&' characters this way.

	-- http://www.ics.uci.edu/pub/ietf/html/rfc1866.txt


Q: OK... I think I'm catching on... but what
about internationalization? RFC2396 says that URIs
are built from US-ASCII (7 bit) characters; but
XML attribute values are full unicode strings.
What happens if somebody writes

	<a href="d&uuml;rst">...</a>


A: Well, remember the asterisk (*)
when I said the value of an href/resource attribute
is a URI reference? I lied. In general, it's
something that can be converted to a URI
reference as follows:

	- start with the Unicode string that
	is the attribute value (i.e. convert &amp;
	to & and &uuml; to whatever unicode character
	that is per XML quoting rules)

	- encode it as UTF-8; this yields a
	sequence of octets

	- for each octet that corresponds to
	a legal URI character (see the uric
	production in RFC2396) use that
	character (all URI characters are US-ASCII, recall).

	- for other octets, write them as %hh
	where hh are the two (US-ASCII) characters
	that make a hexadeciaml numeral for that octet.

So the resulting URI reference from your
question is

  d%fc%ferst

[or something; at this point, I'd have to look up
the codes to be sure.]

see the W3C charmod spec (and the HTML 4.01 spec, 
and the XLink spec, and a recent IURI internet draft) for official
specification of this unicode-to-URI stuff.


Q: What a mess! Are you serious? For a technology so
architecturally core to RDF and the Web, that's
quite a kludge-tower!

A: er... what can I say? That's the state of the art
as I understand it.



-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/

Received on Wednesday, 19 September 2001 22:26:56 UTC