RE: Information Resources? (naming things on the Web) from Jon Hanna on 2004-09-13 (www-tag@w3.org from September 2004)

From: Jon Hanna <jon@hackcraft.net>
Date: Mon, 13 Sep 2004 02:22:10 +0100
To: <www-tag@w3.org>
Message-ID: <006401c49930$1b0f6450$0201a8c0@Lugh>
> a way of telling which URIs are
> about things that are not web
> pages (things qua things minus the web), and which are about web
> pages qua web pages.

This doesn't seem good to me as a member of the
URIs-can-identify-anything camp and I don't think it will satisfy others
or the other camps.

In a debate between people who say URIs identify anything (including web
pages, though rarely), people who say URIs identify "conceptual
documents" (is that what you mean by "web pages"?), a few remaining
people who don't accept that saying URIs identify given HTML documents
(is that what you mean by "web pages"?) is demonstrably wrong, and a few
positions in between, the best such a system can hope to do is add to
the debate, not help resolve it.

Determining what a *particular* URI identifies is not the issue; most
camps have solved this to our own satisfaction at least, though we would
not agree with members of the other camps.

> Expanded Web Proper Names allow the entire format to
> be given in some form as a http: URI as as well.

Doing so requires you to buy into one side of the current debate.

> 3) A URI can potentially be used as a name of a thing (which for this
>    discussion is something not "on the web") in an ontology, whether
>    or not any actual statements are made about such a thing.

No. Either a URI can be used as a name for a thing and this makes it "on
the web" or it can't be used as a name, depending on your position in
the debate. Whether an ontology is involved or not is irrelevant, though
I would certainly say that using URIs in ontologies only makes sense if
you allow URIs to mean anything; I would accuse some who hold other
positions of jumping through hoops to make ontologies work, but I'm
trying my best not to be partisan right now.

> 4) Alternatively a URI could be used as a name for a representation
>    about a thing, whether or not a representation is actually
>    retrieved from it.

No. Either a URI can be used as a name for a representation of a thing
because a representation is clearly a thing in itself just as a car, a
hippogryph and Sir Isaac Newton are things, or it cannot be a URI of a
representation, though it can be a URI of a conceptual document for
which only one representation is available.

> If we are making statements (RDF or otherwise), we want to be able to 
> make a statement about either a thing denoted by a URI or the 
> representation denoted by a URI.

To one camp there is a clear relationship between what a URI identifies
(whether that can be anything, or a conceptual document) and its
representation. RDF is quite capable of doing this if a suitable
ontology is produced. (In fairness I must 410 my own attempt at such an
ontology, it is quite badly flawed). To the other camp URIs do not
denote things.

> Depending on the
> context, that may or may not be obvious.

URIs do not depend on context. Anything that depends on context is not a
Universal Identifier, since it fails to identify universally (or
uniformly for that matter).

> So, we need a
> mechanism to tell whether a URI is about a thing or a representation 
> of a thing. One solution (explored by WPN and
> Larry) is a new scheme (such as wpn: or tbl:).

This adds nothing, and removes much. It's just another URI.

> The fact of the matter, and the problem, is that URIs are "universal",
> they can be used for *lots* of things, from naming namespaces 
> and ontologies to retrieving webpages. Thus keeping the 
> definition vague is actually rather useful.

There is nothing vague about a URI being used to name an ontology and to
retrieve a webpage - even if it's the same URI. There is similarly
nothing vague about Inland Revenue, health boards, social services and
other agencies retrieving different information when they input my PPSN
(what some of you might know as a National Insurance or Social
Insurance number), which is an identifier of me. It's just not a
universal identifier so I shouldn't expect it to work in Canada or for
the keyboard I'm typing on to be issued one. Hence a URI can identify an
ontology and when input into a given system (the web) it can retrieve a
particular piece of information. (I'm going to give up trying to argue
this from both sides; it's tiring trying to do justice to views I don't
support, so I'm just going to be partisan from here on in.)

> If we are determined to stick to a http URI scheme, we
> should at least have canonical representations for "things"
> to resolve ambiguity.

I think that would remove a useful indirection.

Looking at the document at
<http://www.cogsci.ed.ac.uk/~ht/webpropernames/index.html>:

"Having done this, I know at a glance if the page is actually about the
Eiffel Tower, or a hotel near the Eiffel Tower, as opposed to the
object-oriented programming language Eiffel, or the film The Lavender
Hill Mob, and so on. Yet this knowledge depends on fundamental aspects
of human intelligence such as language understanding, scene recognition
and so forth, which have proved distressingly resistant to automation."

rdf:type

"These two metadata sentences in fact say the same thing--the first URIs
of each triple stand for the Eiffel Tower, the third URIs of each stand
for Gustave Eiffel, its architect. However there is no obvious way for
an automatic process to detect this fact."

There's no obvious way for a human process to detect this either
(whether they are looking at representations for the URIs you give or
they are reading natural language descriptions). There are mechanisms
for both humans and machines to be told this, and also for both humans
and machines to realise this by drawing inferences from commonality
about what is stated about the problematic resources for which we have
multiple identifiers.

"Just knowing when two pages describe the same thing would be a huge
step forward."

That's not always easy with human interpretation of natural language.
It's not always hard with machine processing of RDF and OWL.

"Proper names are names that refer uniquely to one referent, at least in
an ideal situation."

URIs are identifiers that refer uniquely to one referent, at least when
nobody's made a right hames of things. The distinction between "tiddles"
and "cat" is that "tiddles" refers to one cat and "cat" refers to the
class of entities which share various characteristics - for example that
they are a proper subset of the class called "mammals", have strong
limbs, are carnivorous and so on. The distinction between proper names
and names is really a distinction between types of referent and as long
as we have a means of making that distinction with referents (we have)
then we can make that distinction with names (though really the latter
distinction is the more valuable except when deciding whether to
capitalise a word in certain natural languages). Besides which we need
both types of names anyway, unless we are to be prohibited from naming
one sort of referent (classes).

"Our take on the ordinary understanding of URIs is that a URI addresses
a Web-based encoding of a description or depiction of a denotation."

I disagree with this "take". This is what HTTP does with URIs, not what
URIs do with HTTP.

"Also web pages will be used informally to cover both encodings and
expressions in one term, and so will both cover the everyday language
use of the term (as for HTML pages) but also refer to a wider set of
phenomena (such as a URI addressing an audio stream)."

Ah so that's what you mean by web pages. Please forgive me for not
editing the earlier piece where I placed two possible interpretations
into the http-range debate, I think I'd rather let them stand as both
are worth mentioning. I did say that saying that URIs identify these is
demonstrably wrong, so I shall demonstrate it. Point a web browser and
an RDF parser (the validator at <http://www.w3.org/RDF/Validator> will
do) at <http://www.hackcraft.net/jon/>: the former will receive an HTML
document, the latter an RDF document. 1 URI, 1 world wide web, 2 web
pages - clearly the URI has not identified a web page. (The matter for
debate is whether it has identified me, or identified a conceptual
document about me [if the latter is true then the RDF is inaccurate]).
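
The demonstration above can be sketched without touching the network.
What follows is only an illustration in Python, assuming a server that
picks a representation from the Accept header; the negotiate() helper
and both bodies are invented, only the URI comes from this message.

```python
# Hypothetical in-memory stand-in for a server doing content negotiation.
# The URI is from the message; the bodies and negotiate() are invented.
URI = "http://www.hackcraft.net/jon/"

REPRESENTATIONS = {
    URI: {
        "text/html": "<html><body>Jon Hanna</body></html>",
        "application/rdf+xml": "<rdf:RDF><!-- statements --></rdf:RDF>",
    }
}


def negotiate(uri, accept):
    """Return (media_type, body) for the first acceptable type listed."""
    available = REPRESENTATIONS[uri]
    for item in accept.split(","):
        media_type = item.split(";")[0].strip()  # q-values are ignored here
        if media_type in available:
            return media_type, available[media_type]
    raise LookupError("406 Not Acceptable")


# Same URI, two clients, two different "web pages".
browser = negotiate(URI, "text/html,application/xhtml+xml")
parser = negotiate(URI, "application/rdf+xml")
```

Both calls use the same URI and get back different documents, which is
the whole point: whatever the URI identifies, it is not either of them.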

"We can now be more precise about what's going on with respect to Web
searches. When searching, a user typically wants to fetch expressions
constituting descriptions (such as HTML or XML pages) or depictions
(such as JPG or SVG images) that actually describe or depict some thing
they are interested in."

I'm not sure I buy, say, a piece of fiction as matching that. A piece of
fiction I'd say is a "conceptual document" because sometimes a cigar is
just a cigar. (This isn't ceding much to the "conceptual document" camp
though - conceptual documents are things too, so we can play your game
when we want to and you can't play ours :-). A software patch is even
further from this. I will agree that your description does match many
common searches though.

"http://www.w3.org/People/thompson/
 http://purl.org/dc/elements/1.1/creator
http://www.ltg.ed.ac.uk/~ht/
"

This is an unusual way of expressing RDF triples, in particular it makes
the difference between use and mention of URIs unclear.

"we have to interpret the first URI as a mention but the second as a
use:"

The first URI is being mentioned? This is true in the natural language
sentence "http://www.w3.org/People/thompson/ is 34 characters long and
uses no character escaping to make use of characters which have special
meaning or are prohibited in URIs". To mention a URI we put it in a
literal, possibly of type <http://www.w3.org/2001/XMLSchema#anyURI>.

Since this isn't allowed with the subject of an RDF triple I assume you
mean:

<http://www.w3.org/People/thompson/>
<http://purl.org/dc/elements/1.1/creator> <http://www.ltg.ed.ac.uk/~ht/>
.

Which means that the entity identified by
<http://www.w3.org/People/thompson/> was created by the entity
identified by <http://www.ltg.ed.ac.uk/~ht/>. If the first entity is a
(conceptual or otherwise) document about Henry S. Thompson and
<http://www.ltg.ed.ac.uk/~ht/> is Henry S. Thompson then this may well
be true. If either the former is Henry S. Thompson or the latter is a
document then this is not true. We can deal with these situations
though:

<http://www.w3.org/People/thompson/> <xx:rep> <_genid:1> .
<_genid:1> <dc:creator> <http://www.ltg.ed.ac.uk/~ht/> .

("<http://www.w3.org/People/thompson/> has a representation, this
representation was made by <http://www.ltg.ed.ac.uk/~ht/>"). Of course
we can usefully merge the two named and one unnamed nodes here with a
bit more knowledge, but we don't have to in order to be correct.

<http://www.w3.org/People/thompson/> <dc:creator> <_genid:2> .
<_genid:2> <foaf:homepage> <http://www.ltg.ed.ac.uk/~ht/> .

("<http://www.w3.org/People/thompson/> was created by some entity. That
entity has the homepage <http://www.ltg.ed.ac.uk/~ht/>"). (This entails
that the entity in question is a foaf:Agent; that is a person or
organisation, though again this is incidental to our being correct).
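
As a rough sketch of the two readings, here they are in Python as
plain (subject, predicate, object) tuples. The objects() and creators()
helpers are invented for illustration, blank nodes are written as
strings starting with "_:", and xx:rep is the same placeholder property
used above rather than a term from any real vocabulary.

```python
PAGE = "http://www.w3.org/People/thompson/"
HT = "http://www.ltg.ed.ac.uk/~ht/"

# Reading 1: PAGE has a representation, and HT made that representation.
reading_1 = {
    (PAGE, "xx:rep", "_:genid1"),
    ("_:genid1", "dc:creator", HT),
}

# Reading 2: PAGE was made by some entity whose homepage is HT.
reading_2 = {
    (PAGE, "dc:creator", "_:genid2"),
    ("_:genid2", "foaf:homepage", HT),
}


def objects(graph, subject, predicate):
    """All o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def creators(graph, subject):
    """Whatever the graph says created the given subject."""
    return objects(graph, subject, "dc:creator")
```

In neither graph is <http://www.ltg.ed.ac.uk/~ht/> stated to be the
creator of <http://www.w3.org/People/thompson/> directly; the blank
node carries the indirection in both cases.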

"In the context of the Web, there is clearly a non-arbitrary, although
not strictly necessary, relationship between the descriptive terms and
whatever the recovered web pages denote. Insofar as we've hinted that a
Web Proper Name is a collection of search terms, this analogy is
encouraging, particularly because the first step, from search terms to
URIs, is automated and distributed."

I'm not convinced the connection is strong enough to be usefully
reliable.

Google searches not only change over time, but also according to the
language you search in, whether you allow for results Google knows
aren't in that language to be returned, whether you are restricting your
search to pages from a particular country, and whether you ask for
sexual content to be filtered out (and I'm told that "Tour Eiffel" is
Parisian slang for an erection, so your example could be affected).
Other search engines have yet more variants. Not all of these variants
are explicitly dealt with in your system, and this may affect it.

"Moreover, when we want to refer to Tim Berners-Lee, we don't have to
redescribe him using his title or the book he's written. A name alone
determines its referent, at least where all parties involved attach the
name to the same referent. Furthermore, this is achieved without appeal
to descriptions."

The second sentence would be true if I mentioned "Tim Berners-Lee" to a
colleague. It would not have been true if I mentioned him to my mum. In
order to know if it's true I need to refer to the context in which I am
using the name (albeit subconsciously); the context is therefore a part
of the naming. I am not using "Tim Berners-Lee", I am using "Tim
Berners-Lee"#WebAwareAudience. The only context I can imagine endorsing
when it comes to naming web resources is that which allows relative URI
references to operate. Anything else has to be included explicitly. This
is generally heavier than making the name work outside of context.

Section 4 generally.

I'm tempted to suggest
http://www.google.com/search?q=eiffel+tower+paris+-hotel+-webcam as an
alternative URI, though I would take that to mean "the results of
searching for eiffel tower paris -hotel -webcam on google" rather than
to mean eiffel+tower. I *have* used such URIs to refer to things before,
though I'm not sure whether or not this is a good idea. I am sure it's
not a great idea, and don't see what you offer beyond that, except for
the date (which isn't going to tell me much in 10 years', or even 10
days', time) and the guarantee of the degree of accuracy (likewise).
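
For what it's worth, the search URI above is just the search terms in
form-encoded dress, as this small Python sketch using the standard
urllib.parse module shows. It builds and decodes the URI only; nothing
is fetched, and the variable names are mine.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

terms = "eiffel tower paris -hotel -webcam"

# Encode the terms as the q parameter of a search URI.
uri = "http://www.google.com/search?" + urlencode({"q": terms})

# Decode the query string again to get the original terms back.
recovered = parse_qs(urlsplit(uri).query)["q"][0]
```

The round trip recovers the terms exactly, so the URI carries no more
and no less than the terms themselves; what a search on them returns is
a separate matter and changes from day to day.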

"Allows use of Web names to be easily distinguished from mention of
URIs"

Sorry, I just don't buy this. I can use it and I can mention it. I can
use http URIs and I can mention them. The only difference I see is that
I can use http URIs more fruitfully and indeed with less ambiguity with
current technology.

"Allows for efficient and reliable determination of whether two URIs
identify resources which are about the same thing"

I don't buy this either. Certain wpn URIs may be similar enough in terms
and even in short name (though many things have multiple short names
used often in normal speech that are not at all alike) that they are
probably about the same thing. Unless the terms are exactly the same,
the dates particularly close and the percentages such that it is
mathematically impossible that they refer to different entities, I'm
not buying. Even when there is a definite match this could be because of
overlapping terms, or a term which is a hyponym of another sense of
itself (the sense of "cat" that has "lion" as a hyponym has another
sense of "cat" as another hyponym).

In comparison smushing on inverse functional properties or owl:sameAs
statements is pretty reliable. Granted I may not have all such
statements available to do this perfectly, but that is true of the terms
used in WPNs too.
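
A minimal sketch of smushing on an inverse functional property, in
Python. Everything in it is an assumption made for illustration: the
node names, the example.org values, the smush() helper, and treating
foaf:mbox as the only inverse functional property in play.

```python
IFPS = {"foaf:mbox"}  # properties treated as inverse functional here

# Invented data: three blank nodes, two of which share a mailbox.
triples = [
    ("_:a", "foaf:name", "Jon Hanna"),
    ("_:a", "foaf:mbox", "mailto:jon@example.org"),
    ("_:b", "foaf:mbox", "mailto:jon@example.org"),
    ("_:b", "foaf:homepage", "http://example.org/jon/"),
    ("_:c", "foaf:mbox", "mailto:someone.else@example.org"),
]


def smush(triples, ifps):
    """Map each subject to a canonical node, merging on shared IFP values."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    seen = {}  # (property, value) -> first subject seen with that pair
    for s, p, o in triples:
        find(s)
        if p in ifps:
            if (p, o) in seen:
                parent[find(s)] = find(seen[(p, o)])
            else:
                seen[(p, o)] = s
    return {node: find(node) for node in list(parent)}


canonical = smush(triples, IFPS)
# Rewrite every triple onto its canonical subject; duplicates collapse.
merged = {(canonical[s], p, o) for s, p, o in triples}
```

Here _:a and _:b end up as one node because they share a mailbox, and
_:c stays apart. The merge is only as good as the statements available,
but each merge follows from something actually asserted.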

4.2

HTTP URIs on the other hand are always strong.

4.3

Why not just use <http://www.ihmc.us/users/phayes/PatHayes.html> and be
done with it? Of course there are those who say you can't use
<http://www.ihmc.us/users/phayes/PatHayes.html> at all. So depending on
one's position on HTTP-Range this is either pointless or impossible.

5.

This answers some of my less pressing objections above, but it's just
another representation. I'm not sure it's even a particularly good one
(though I'm biased against RDDL at the best of times).

6.1

I'm not sure this is a problem. I think it's a category error about what
a URI means and any resolution to the http-range issue will remove it.

6.2

I don't see how this is particularly authoritative. I don't see how it
is any more useful than any other representation to either human or
machine processing.

6.3

"Interesting", "useful" and "nice to work with" seem to me more
important criteria for bookmarks than the degree to which they are
relevant to a given search. For that matter many, at times most, of my
bookmarks are items I found that were completely irrelevant to what I
was researching at the time but which it seemed it would be good to read
otherwise. I'm not sure what problem is being solved here.

6.4

Sem-web development already has a large bottom-up component in pretty
much everything but the core specs (and those are produced in an open
manner).

I think the sameAs is very dicey indeed. I don't think the criteria for
saying "yep, that's the Eiffel Tower I mean" are strong enough to say
"This URI denotes the Eiffel Tower". It could be applied to URIs
meaning "The Eiffel Tower during construction", "The Eiffel Tower with
the flag of the Third Reich during the occupation of Paris", "The Eiffel
Tower", "View from my hotel window during my holiday" and "Man jumps
from Eiffel Tower".

Further cases would cause some of these URIs to be declared to be
owl:sameAs construction, the Nazi occupation of France, holidays and
suicide. This could happen especially quickly, given that the way Google
works makes pages with exactly this sort of ambiguity more likely to be
returned in a set of results. The result is a semantic web grey-sludge
scenario.
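
To make the sludge concrete, here is a small Python sketch that treats
owl:sameAs as symmetric and transitive and collects everything that
ends up in one equivalence class. The ex: identifiers and the
equivalence_class() helper are hypothetical; the labels paraphrase the
examples above.

```python
from collections import defaultdict

# Hypothetical sameAs assertions of the loose kind described above.
same_as = [
    ("ex:EiffelTower", "ex:TowerDuringConstruction"),
    ("ex:EiffelTower", "ex:ViewFromHotelWindow"),
    ("ex:EiffelTower", "ex:ManJumpsFromTower"),
    ("ex:TowerDuringConstruction", "ex:Construction"),
    ("ex:ViewFromHotelWindow", "ex:Holidays"),
    ("ex:ManJumpsFromTower", "ex:Suicide"),
]


def equivalence_class(pairs, start):
    """Everything reachable from start via symmetric, transitive sameAs."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    found, todo = {start}, [start]
    while todo:
        for nxt in neighbours[todo.pop()]:
            if nxt not in found:
                found.add(nxt)
                todo.append(nxt)
    return found


sludge = equivalence_class(same_as, "ex:Holidays")
```

Six individually plausible assertions are enough for holidays,
construction and suicide to be declared the same thing as the tower.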
Received on Monday, 13 September 2004 01:22:36 UTC