RE: Where should you look for metadata? from Danny Ayers on 2001-02-16 (www-rdf-interest@w3.org from February 2001)

From: Danny Ayers <danny@panlanka.net>
Date: Sat, 17 Feb 2001 00:47:04 +0600
To: "Sean B. Palmer" <sean@mysterylights.com>, "David Megginson" <david@megginson.com>, <www-rdf-interest@w3.org>
Message-ID: <EBEPLGMHCDOJJJPCFHEFEEEOCMAA.danny@panlanka.net>

Hi,
I finally got around to resubscribing (prompted mainly by Sean & co's
material), and straight away there's a topic close to my heart ;-)

One aspect that doesn't appear to have been raised is the use of automation
in determining a site's content. Ok, so some of the technology's in its
infancy, but a lot more is knocking around research circles, and it's only a
matter of time before this gets sucked into the mainstream. Why not have
trusted third parties that get their metadata using bots?

Taking the numbered points:

<- > 1. What is the subject matter of the page?

It's easy enough to scan the text off a page, do a bit of prioritization of
keywords (according to 'visual' importance - i.e. <H1> stuff before body
text). Keywords can be matched against known categories. With a bit of
statistical juggling or even NLU I reckon it would be perfectly feasible to
get results of a usable accuracy.
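
A rough sketch of the keyword idea above, in Python. The tag weights and the
category word lists are made up for illustration (a real service would need
far larger vocabularies and some proper statistics), and text inside
<script> or <style> is not filtered out here.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: words in headings count for more than body text.
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "h3": 2}

# Hypothetical category vocabularies, purely illustrative.
CATEGORIES = {
    "metadata": {"rdf", "metadata", "schema", "xml"},
    "cooking": {"recipe", "oven", "flour", "bake"},
}


class WeightedWords(HTMLParser):
    """Counts words on a page, weighted by the heading tag they sit in."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open weighted tags
        self.counts = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        weight = TAG_WEIGHTS[self.stack[-1]] if self.stack else 1
        for word in re.findall(r"[a-z]+", data.lower()):
            self.counts[word] += weight


def guess_subject(html):
    """Return the best-scoring category name, or None if nothing matches."""
    parser = WeightedWords()
    parser.feed(html)
    parser.close()
    scores = {
        name: sum(parser.counts[w] for w in words)
        for name, words in CATEGORIES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Calling guess_subject() on a page whose title and <H1> mention RDF and
metadata should pick the "metadata" category even if the body text wanders
off topic, which is the point of weighting by 'visual' importance.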

<- > 2. Is it suitable for the young or easily-offended?

The same kind of technique as in 1. applies, and if anything it's a bit
easier, as 9/10 of the stuff that offends easily-offended people can be
easily characterised by its vocabulary.
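
As a minimal sketch of the vocabulary idea, again in Python: the word list
and the 5% threshold below are placeholders I've invented, not anything
tested against real pages.

```python
import re

# Placeholder vocabulary, purely illustrative.
FLAGGED = {"gore", "explicit", "xxx"}


def flagged_ratio(text):
    """Fraction of the words in text that appear in the flagged list."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in FLAGGED for w in words) / len(words)


def looks_unsuitable(text, threshold=0.05):
    """True when the flagged-word ratio reaches the (assumed) threshold."""
    return flagged_ratio(text) >= threshold
```

A bot run by a trusted third party could publish the ratio itself rather
than a yes/no answer, leaving the threshold to whoever consumes the metadata.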

<- > 3. Is its content free, or does it require payment?

This one I'll skip, as any company that has a pay site will almost certainly
make this information visible - otherwise how are they going to make any
money?

<- > 4. Does the owner share any cookies or other personal information with
<- > other parties?

Ok, this one is probably beyond automation, but the basic trusted third
party idea can surely still be used - the registry of mail servers that are
open to relaying spam seems to work quite well (I can't remember the
name/url, though it forced me to patch a security hole pretty quickly on one
occasion).

<- > 5. How highly have users rated the content?

A user has to somehow and somewhere rate the site, and that information has
to be stored and the system protected from abuse. Not entirely unlike the
DMOZ directory. The only thing that needs doing here is making it so easy
(or remunerative) for users to do the rating that they can do it without
much (perceived) effort.
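
To make the storage and abuse points a little more concrete, here is a toy
in-memory store in Python. The class name, the 1 to 5 scale and the
one-vote-per-user rule are all my own assumptions, and a real system would
also need some way of knowing that a user id is genuine.

```python
from collections import defaultdict


class RatingStore:
    """Toy rating store: one score per user per URL (hypothetical design)."""

    def __init__(self):
        self._ratings = defaultdict(dict)  # url -> {user_id: score}

    def rate(self, url, user_id, score):
        # Reject out-of-range scores; the 1..5 scale is an assumption.
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        # A repeat vote from the same user overwrites the earlier one,
        # so a single account cannot stuff the ballot.
        self._ratings[url][user_id] = score

    def average(self, url):
        """Mean score for url, or None if nobody has rated it."""
        scores = self._ratings.get(url)
        if not scores:
            return None
        return sum(scores.values()) / len(scores)
```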

Cheers,
Danny.

---
Danny Ayers
http://www.isacat.net

<- -----Original Message-----
<- From: www-rdf-interest-request@w3.org
<- [mailto:www-rdf-interest-request@w3.org]On Behalf Of Sean B. Palmer
<- Sent: 16 February 2001 20:28
<- To: David Megginson; www-rdf-interest@w3.org
<- Subject: Re: Where should you look for metadata?
<-
<-
<- > 1. What is the subject matter of the page?
<- > 2. Is it suitable for the young or easily-offended?
<- > 3. Is its content free, or does it require payment?
<-
<- I'd say that these three would most likely be encoded in the resource, or
<- at least linked to from the resource by whatever means. Why would you trust
<- an outside resource to derive the subject matter of the page? Having said
<- that, <meta> tags are often abused too, but they could be seen as being
<- separate from the main content. Of course, for a lot of HTML pages, the
<- very notion of structured data breaks down...
<-
<- > 4. Does the owner share any cookies or other personal information with
<- > other parties?
<- > 5. How highly have users rated the content?
<-
<- Yes, these are most likely external resources. Like I say "Sometimes it
<- makes sense to encode data about a resource at that resource address, at
<- other times it doesn't."... some data about data is best left to external
<- resources, whereas other data is best derived from the resource itself.
<- There's a thin dividing line... but it is also wrong to say that a URL is
<- not a good place to store information about that URL: it simply depends on
<- a) the type of data, b) trust, c) the whim of the author.
<-
<- --
<- Kindest Regards,
<- Sean B. Palmer
<- @prefix : <http://webns.net/roughterms/> .
<- [ :name "Sean B. Palmer" ] :hasHomepage <http://infomesh.net/sbp/> .
<-
Received on Friday, 16 February 2001 13:49:21 UTC