Bayesian classification and the semantic web

One of the barriers to acceptance for RDF is the lack of RDF annotation
out there on the web.  I thought I'd toss out this idea and see what the
group thinks.  

I use a Bayesian classifier to reduce spam.  It filters my incoming mail,
dividing it into categories (spam/ham).  These classifiers are, in
general, pretty darn accurate.
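
Roughly, the idea works like this (a minimal sketch in Python, not my
actual filter; the training data is invented):

import math
from collections import Counter

def train(documents):
    # documents: list of (text, label) pairs, label "spam" or "ham"
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in documents:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    # Pick the label with the higher posterior, using add-one smoothing.
    total_docs = sum(label_counts.values())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    best_label, best_score = None, float("-inf")
    for label in ("spam", "ham"):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

counts, labels = train([("buy cheap pills now", "spam"),
                        ("meeting agenda attached", "ham")])
print(classify("cheap pills", counts, labels))   # -> spam

The same machinery generalizes to any set of categories, not just
spam/ham.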

Now imagine that someone builds a big corpus of URIs and starts to
categorize them.  Someone like, say, Google, or DMOZ.  The text at each
URI is then used to train Bayesian classifiers.
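
A rough sketch of how that corpus might be built, assuming a DMOZ-style
list of (URI, category) pairs and a plain-text fetch; fetch_text and
build_corpus are names I've made up for illustration:

from collections import Counter
from urllib.request import urlopen

def fetch_text(uri):
    # Naive fetch: real code would strip HTML tags, handle encodings,
    # robots.txt, and so on.
    with urlopen(uri) as response:
        return response.read().decode("utf-8", errors="ignore")

def build_corpus(labelled_uris):
    # labelled_uris: list of (uri, category) pairs, e.g. from a DMOZ dump
    corpus = {}
    for uri, category in labelled_uris:
        words = Counter(fetch_text(uri).lower().split())
        corpus.setdefault(category, Counter()).update(words)
    return corpus   # category -> aggregate word counts, ready for training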

Google then provides a simple service that, given a URI, reads the text
at the URI, performs classification (via a nested set of classifiers),
marks up the contents with RDF, and returns the result.  This amounts to
a best-guess, dynamic assignment of semantics to that URI.
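
The service might look something like this, reusing fetch_text and the
corpus from the sketch above; dc:subject and the classify_uri name are
just my choices for illustration, not a proposal for a particular
vocabulary:

import math

def best_category(text, corpus):
    # Multi-category naive Bayes scoring over the aggregated word counts,
    # with add-one smoothing (priors omitted for brevity).
    vocab = set()
    for counts in corpus.values():
        vocab.update(counts)
    best, best_score = None, float("-inf")
    for category, counts in corpus.items():
        total = sum(counts.values())
        score = sum(math.log((counts[word] + 1) / (total + len(vocab)))
                    for word in text.lower().split())
        if score > best_score:
            best, best_score = category, score
    return best

def classify_uri(uri, corpus):
    # Fetch the page, guess a category, and return an RDF statement
    # about the URI (N-Triples syntax; dc:subject is one plausible choice).
    category = best_category(fetch_text(uri), corpus)
    return '<%s> <http://purl.org/dc/elements/1.1/subject> "%s" .' % (uri, category)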

The corpus can be tuned to produce better and better results, but even
without much tuning the scheme should be pretty accurate.  Of course,
Google already has a huge database of word/count/frequency information
that could be used to seed this process, to considerable effect.

The probability-based approach should yield substantially better results
than keyword identification schemes.

RJ
