RE: [Minutes] 24 Feb 2003 TAG teleconf (site metadata, namespaceDocument-8) from Patrick.Stickler@nokia.com on 2003-02-27 (www-tag@w3.org from February 2003)

From: <Patrick.Stickler@nokia.com>
Date: Thu, 27 Feb 2003 09:55:28 +0200
To: <ij@w3.org>, <www-tag@w3.org>
Message-ID: <A03E60B17132A84F9B4BB5EEDE57957B5FBB40@trebe006.europe.nokia.com>
>   2.1 Site metadata hook
> 
>      ...
>
>    [Chris]
>           there is no way to give a URI of a site as opposed to a URI
>           for a welcome page for it
>           hmm... sites are significant resources, no? so they should
>           have URIs.....
> 
>    [Roy]
>           /

I would propose that

   http://example.com    denotes the HTTP server

thus

   <http://example.com> a x:WebServer .

and that a separate URI scheme is needed to denote the
actual physical machine, since the http: URI scheme
is rooted in an HTTP server (web authority) and the
underlying "reality" of what actual machine that
HTTP server is running on is below the "atomic" level
of the http: URI scheme. (see below)

As we wish to make a distinction between the HTTP server
and the body of knowledge served by that server, i.e.
between server and site, then I agree that

   http://example.com/   denotes the web site, managed
                         by the HTTP server http://example.com

thus

   <http://example.com/> a x:WebSite .

Having those two distinct URIs then allows one to speak
of the HTTP server specifically, such as its configuration,
and of the web site specifically, such as access rights,
conditions of use of content, robot/crawling prefs, etc.

And some subspace within that site can also be asserted as
a web site, such as

   http://example.com/~fred/ denotes Fred's web site

i.e.

   <http://example.com/~fred/> a x:WebSite .

When one does a GET on either http://example.com or
http://example.com/ we are simply redirected to a default home 
web page, which may be denoted by any of

   http://example.com/index.html
   http://example.com/index.htm
   http://example.com/index.jsp
   http://example.com/foo.blargh

or whatever the HTTP server has been configured to use as the
default page.

When one does an MGET ;-) on http://example.com one gets a description
of the HTTP server.

When one does an MGET on http://example.com/ one gets a description
of the web site, including robot preferences, RSS feeds, whatever.

When one does an MGET on http://example.com/~fred/ one gets a description
of Fred's web site, including Fred's robot preferences, etc.

When one does an MGET on http://example.com/index.html one gets a description
of a web page.

Etc...
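
As a rough illustration of the denotations above, here is a minimal
Python sketch. The function name, the x:WebPage label, and the idea of
passing in an explicit set of asserted sub-sites are my assumptions for
illustration; only x:WebServer and x:WebSite appear in the proposal.

```python
from urllib.parse import urlsplit


def denotes(uri, asserted_sites=frozenset()):
    """Return the class a URI would denote under the proposal (sketch only).

    asserted_sites is a hypothetical set of sub-site URIs that someone has
    explicitly asserted to be web sites, e.g. http://example.com/~fred/.
    """
    parts = urlsplit(uri)
    if parts.scheme != "http" or not parts.netloc:
        raise ValueError(f"not an http: URI with an authority: {uri!r}")
    if parts.path == "":
        # http://example.com (no trailing slash) denotes the HTTP server
        return "x:WebServer"
    if parts.path == "/" or uri in asserted_sites:
        # the root site, or a sub-space asserted to be a site
        return "x:WebSite"
    # anything else is treated as an ordinary page (hypothetical label)
    return "x:WebPage"


fred = "http://example.com/~fred/"
for u in ("http://example.com", "http://example.com/", fred,
          "http://example.com/index.html"):
    print(u, "a", denotes(u, asserted_sites={fred}), ".")
```

An MGET on each URI would then return a description of a resource of
the corresponding kind.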

>    [TBray]
>           No, "/" isn't the site it's the server, they're not the same
>           things

Is that formally defined in some spec somewhere? Why can't we
say that a URI of the form "http://"{AUTH}"/" denotes the root site
of a given server "http://"{AUTH}? That seems pretty intuitive and
consistent.

As I understand it, the web server behavior is to interpret both
http://example.com and http://example.com/ as resolving to the
same entity.

But that resolution process could be seen as a redirection to a default
home page, and that entity as a representation of that home page.

Yet each of those URIs can still denote the server and site respectively. 

The redirection gets around the need for those URIs to denote the
home page, and avoids any ambiguity.

>    [timMIT]
>           Server isn't a perfect name either ... tends to be a computer.

Tends to, yes, but one physical computer can host many virtual
web servers, with all of those servers' domain names mapped to
the single server IP address.

The actual physical server level seems completely opaque to
http: URI semantics, which are rooted in particular HTTP servers,
not in the machines hosting those servers.

If we want an explicit URI to denote a physical machine, we
need something other than an http: URI, *IF* we want that
machine's identity to be independent of any particular
web server identity. E.g.

   host:example.com   denotes the physical machine to which the
                      domain name example.com resolves

One could then make statements about that particular machine,
such as the owner, location, physical characteristics, etc.
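
To make the server/machine distinction concrete, here is a minimal
Python sketch that derives such a machine URI from an http: URI. The
host: scheme is only the illustrative scheme suggested above, not a
registered one, and the function name is an assumption.

```python
from urllib.parse import urlsplit


def host_uri(http_uri):
    """Map an http: URI to the hypothetical host: URI of its machine.

    Only the domain name is kept; the port and path belong to the HTTP
    server and site, not to the machine.
    """
    hostname = urlsplit(http_uri).hostname
    if hostname is None:
        raise ValueError(f"no authority in {http_uri!r}")
    return f"host:{hostname}"


print(host_uri("http://example.com/~fred/"))
```

Note that this only names the machine by domain name; two virtual
servers on one machine would still yield two different host: URIs
unless their names were first resolved to the same address.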

>    [TBray]
>           Chris: echoing problem of site/server disconnect, bad
>           architecture to require everyone to write one file
>           Chris: if a Site is an important thing, it should have a
>           URI; right now there's no such thing
>           Chris: per our axioms
>           Roy: When robots.txt was invented.. (Chris: everyone had
>           their own server) .. the idea was to knock politely on some
>           part of a naming authority's domain
>           Roy: haven't seen a proposal yet with equivalent semantics

Interesting, I thought the MGET proposal was precisely that.

1. Take the knowledge now expressed in a robots.txt file.
2. Express that knowledge as RDF statements about the web site.
3. Expose that knowledge via a "semantic web enabled" server.
4. Do an MGET on the URI of the web site to obtain that knowledge.
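
Steps 1 and 2 can be sketched roughly as follows in Python. The
x:disallowedFor property and the function name are hypothetical, the
statements are plain tuples rather than real RDF, and only the
User-agent and Disallow fields are handled, with each Disallow attached
to the most recent User-agent line.

```python
from urllib.parse import urljoin


def robots_to_statements(site_uri, robots_txt):
    """Turn simple robots.txt rules into (subject, predicate, object) tuples."""
    statements = [(site_uri, "rdf:type", "x:WebSite")]
    agent = None
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            agent = value
        elif field == "disallow" and agent is not None and value:
            # resolve the path against the site URI so the statement is
            # about a resource with its own URI
            target = urljoin(site_uri, value)
            statements.append((target, "x:disallowedFor", agent))
    return statements


example = "User-agent: *\nDisallow: /private/\n"
for s in robots_to_statements("http://example.com/", example):
    print(s)
```

A server supporting MGET would then return such statements (in real
RDF) as the description of http://example.com/.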

Seems like a very polite way to knock on a naming authority's door
to ask about, well, *anything* within the domain of that naming
authority. Not just about crawling preferences.

And since one can also describe subsites for tenants of the main
server, one can ask specifically about those sites as well, using
the same machinery.

And the MGET machinery is fully open and, if the server/site owners
permit, lets each tenant express their own knowledge
about their own individual sites and content.

So the owner of http://example.com/ can state the site-global
preferences, which may very well permit sub-site crawling. And
John can state the preferences for http://example.com/~john/
(using the very same vocabulary, no less) which complement
the knowledge expressed about the global site.
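
One way a client might combine the two descriptions is sketched below
in Python. The preference keys, the function name, and the rule that
the more specific site overrides the global one on conflict are all
assumptions; nothing above fixes a merge policy.

```python
def effective_preferences(uri, site_prefs):
    """Merge the preferences of every described site that contains uri.

    site_prefs maps a site URI (ending in "/") to a dict of preferences.
    Sites are applied from shortest to longest URI, so a sub-site's
    statements complement, and on conflict override, the global ones.
    """
    merged = {}
    for site in sorted(site_prefs, key=len):
        if uri.startswith(site):
            merged.update(site_prefs[site])
    return merged


prefs = {
    "http://example.com/": {"crawl": "allowed", "contact": "webmaster"},
    "http://example.com/~john/": {"crawl": "disallowed"},
}
print(effective_preferences("http://example.com/~john/cv.html", prefs))
print(effective_preferences("http://example.com/index.html", prefs))
```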

I.e., it's the kind of solution that I understand Chris wants.

But it does not just solve the issue of a more open, generic,
standardized way to express robots.txt knowledge about a web
site; it offers an open, generic, standardized way to express
knowledge about *ANY* resource whatsoever in the domain
of a given web authority.

If we're going to change the web architecture, why not "kill
a thousand birds with one stone" rather than just one or two
birds?

I say that MGET and friends represent the stone we need.

Cheers,

Patrick

--
Patrick Stickler, Nokia/Finland, (+358 40) 801 9690, patrick.stickler@nokia.com
Received on Thursday, 27 February 2003 02:55:48 UTC