
RE: Proposed issue: site metadata hook

From: <Patrick.Stickler@nokia.com>
Date: Wed, 19 Feb 2003 16:17:10 +0200
Message-ID: <A03E60B17132A84F9B4BB5EEDE57957B5FBB32@trebe006.europe.nokia.com>
To: <chris@w3.org>
Cc: <www-tag@w3.org>, <timbl@w3.org>



> -----Original Message-----
> From: ext Chris Lilley [mailto:chris@w3.org]
> Sent: 19 February, 2003 14:13
> To: Stickler Patrick (NMP/Tampere)
> Cc: www-tag@w3.org; timbl@w3.org
> Subject: Re: Proposed issue: site metadata hook
> 
> 
> On Tuesday, February 18, 2003, 7:19:20 AM, Patrick wrote:
> 
> PSnc> It seems we are talking past each other.
> 
> It would seem so.
> 
> PSnc> I'm going to suggest that we both are in favor of the architecture
> PSnc> *allowing* all users to be able to control their own personal web
> PSnc> spaces, even when they do not own the server.
> 
> I can agree to that.
> 
> PSnc> But that the architecture itself does not mandate specific rights
> PSnc> of control for all users against the wishes of the server owner.
> 
> PSnc> Thus, if the server owner wishes to allow user-specific control,
> PSnc> the architecture should take that into consideration, and support
> PSnc> that level of resolution.
> 
> PSnc> But the architecture should not permit users to circumvent the
> PSnc> explicit wishes of the server owner.
> 
> PSnc> Yes?
> 
> That seems a good summary. I particularly liked 'explicit wishes' as
> opposed to 'assumed wishes'. In other words if the server owner wants
> to deny user-specific control they need to say so explicitly.
> 
> >> Let's consider an architecture where the corporation owns / and
> >> accounting owns /corporate/accounting and marketing owns /comm/pr
> >> 
> >> Let's assume that the corporation decides that it does not want /
> >> crawled. Let's assume that marketing wants /comm/pr crawled.
> 
> PSnc> Then I would say too bad for /comm/pr. If the owner says "this
> PSnc> server will not be crawled" then it shouldn't, no matter what
> PSnc> any user says.
> 
> My point was that the corporation as a whole was expressing wishes
> about a pruned tree (minus leaves or leaf subtrees that are delegated
> elsewhere).
> 
> Currently it's not really possible to express the difference between
> 'this whole (sub)tree' and 'this whole (sub)tree up to the point where
> the rules change'.
> 
> 
> PSnc> HOWEVER, if the corporation is saying "only areas explicitly 
> PSnc> specified to be crawled, by the users responsible for those
> PSnc> areas, may be crawled" then that is something different.
> 
> Right.
> 
> PSnc> I understand (now a bit better) that what you are asking for
> PSnc> is the architecture to be able to allow users to express their
> PSnc> wishes over their own content, and for robots to take that
> PSnc> information into account *IFF* the server owner permits it to.
> PSnc> (it's the IFF I thought you were leaving out...)
> 
> Exactly.
> 
> PSnc> But that the present architecture is too coarse to allow for
> PSnc> efficient management of user-specific wishes in that regard
> PSnc> and thus needs to be refined.
> 
> PSnc> Right?
> 
> Correct.


OK. It looks like we have been in violent agreement ;-)


> PSnc> A specific question to help me determine that: If the server owner
> PSnc> says "no crawlers at all on this server" and a tenant says "all my
> PSnc> own content can be crawled", should the tenant's content be crawled?
> 
> See above regarding current poverty of expressiveness. The same
> language has to do double duty to keep crawlers off rootwards areas,
> and all areas.

Right. If the current architecture doesn't provide the needed level
of expressiveness, then it has to be improved, but not necessarily
with yet another crawling-specific solution.

> >> And you would do that how?
> 
> PSnc> By obtaining and inspecting the RDF description of the site,
> PSnc> examining those properties that describe robot behavior and
> PSnc> recursively obtaining and inspecting the RDF descriptions of
> PSnc> whatever resources are relevant to answering the question,
> PSnc> including the description about the particular user space, etc.
> 
> Ok so this now brings us on to the area of efficiency. See Apache
> .htaccess files and CERN .meta directories for similar problems.
> 
> It is undesirable if, to find the metadata for an area n steps from
> the root, I need to make n requests (in either direction, root towards
> area or area towards root).

But you wouldn't. If you know the URI denoting the site or space you
are interested in (e.g. http://example.com/~John) then it takes a
single MGET call to get the description of that resource. And if you
don't know the URIs of the individually managed subspaces on a site,
you can ask about the site to get them.
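
As a concrete illustration, that single call might look like this on
the wire (MGET is the method from the URIQA proposal; the host, the
vocabulary namespace, and the crawlingAllowed property here are
purely hypothetical):

```
MGET /~John HTTP/1.1
Host: example.com

HTTP/1.1 200 OK
Content-Type: application/rdf+xml

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.com/terms#">
  <rdf:Description rdf:about="http://example.com/~John">
    <ex:crawlingAllowed>true</ex:crawlingAllowed>
  </rdf:Description>
</rdf:RDF>
```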

One can first obtain the server description by an MGET on the server
URI (http://example.com) to inspect the crawling preferences of the
server itself as well as to obtain the URIs of the individually
managed subspaces on the server. If the server description doesn't
indicate that crawling is disallowed entirely, one can then MGET
the description of each subspace on the server, and if crawling
is allowed (or not explicitly disallowed) the bot can crawl that
subtree of the server. And it can crawl both the representations
of resources in that subtree with GET and the descriptions of
those same resources with MGET.
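
A minimal sketch of that decision logic, assuming the descriptions
have already been fetched via MGET and parsed into plain dictionaries
(the crawlingAllowed property, the subspace URIs, and the data below
are all hypothetical):

```python
# Hypothetical, already-fetched MGET descriptions, keyed by URI.
# In practice each entry would come from an MGET request plus RDF parsing.
descriptions = {
    "http://example.com": {
        "crawlingAllowed": None,  # no explicit server-wide policy
        "subspaces": [
            "http://example.com/~John",
            "http://example.com/~Jane",
        ],
    },
    "http://example.com/~John": {"crawlingAllowed": True, "subspaces": []},
    "http://example.com/~Jane": {"crawlingAllowed": False, "subspaces": []},
}

def crawlable_subspaces(server_uri):
    """Return the subspace URIs a bot may crawl: if the server
    explicitly disallows crawling, nothing is crawled; otherwise each
    subspace is crawled unless it explicitly disallows crawling."""
    server = descriptions[server_uri]
    if server["crawlingAllowed"] is False:
        return []  # an explicit server-wide prohibition wins
    allowed = []
    for uri in server["subspaces"]:
        desc = descriptions.get(uri, {})
        # crawl if allowed, or not explicitly disallowed
        if desc.get("crawlingAllowed") is not False:
            allowed.append(uri)
    return allowed

print(crawlable_subspaces("http://example.com"))
# ['http://example.com/~John']
```

The same check would simply recurse if subspaces themselves declared
further subspaces.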

But the means by which one would express that knowledge about crawling
preferences and obtain that knowledge is not specific to crawlers.
Rather, it's just knowledge about resources, where those resources
just happen to be web sites, subsites, etc., and generic SW machinery
is used to achieve a highly effective solution to this problem (as it
will for many others).

> >> PSnc> Why do we need anything more than the semantic web extensions
> >> PSnc> to the present web architecture
> >> 
> >> If I knew clearly what those were then I might be able to answer you.
> >> But at present there does not seem to be a list of them.
> 
> PSnc> They have been mentioned repeatedly in this very thread:
> 
> Ok those are proposals. They don't exist yet, and there is no clear
> specification of them so they are difficult to discuss.

I agree. I wasn't quite ready to introduce them, but was prompted
by TimBL's post... it's definitely suboptimal to propose
non-trivial technology by email... ;-)

I am finishing up an open source implementation that I will shortly
publish, along with documentation. I plan to demo it at the 
technical plenary, informally, to those interested.

Cheers,

Patrick
Received on Wednesday, 19 February 2003 09:17:25 GMT
