RE: A Simple Analogy

: <- Refining the query may well be the easy bit. Controlling metadata
: <- spoofing in the target data sounds really hard to me: that's one
: <- thing the number crunching approach to search has in its favour,
: <- the potential for hybrid approaches (such as graphing citations)
: <- notwithstanding.
:
: Why should metadata spoofing be an issue? Its practice on the web appears to
: be on the decline, maybe as the pron vendors have started realising that
: targeting is a more efficient sales strategy than pissing people off.

You don't think that has a lot to do with search engines skirting around
META tags? Or eliminating mass repetition of terms? Or lawsuits? Or indeed
the use of evaluations that are more difficult to spoof, like the way Google
uses backlinking? Pissing people off isn't an issue when a response rate
of 3% is sufficient margin.
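
To make the backlink point concrete, here is a minimal sketch in Python. It
only counts distinct pages linking to a target, which is an illustration of
the idea and not Google's actual ranking method; the function name
inlink_counts and the example hosts are made up.

```python
from collections import defaultdict

def inlink_counts(links):
    """Count the distinct pages linking to each target.

    `links` is an iterable of (source, target) pairs. Self-links are
    ignored, and repeated links from the same source count once.
    """
    sources = defaultdict(set)
    for source, target in links:
        if source != target:
            sources[target].add(source)
    return {target: len(srcs) for target, srcs in sources.items()}

# Hypothetical toy link graph.
links = [
    ("a.example", "c.example"),
    ("b.example", "c.example"),
    ("c.example", "c.example"),  # self-link, ignored
    ("a.example", "b.example"),
    ("a.example", "c.example"),  # duplicate, counted once
]
counts = inlink_counts(links)
```

The score comes from what other pages do, so the page author cannot raise
it just by editing their own META tags.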

Web metadata spoofing is an issue because people spoof, all the time. The
interesting difference is that machines aren't well equipped to detect
it. Either you (a) add heuristics to your code to block spoofing, and
wind up in an arms race, or (b) if the results from numerical/statistical
methods seem "good enough", you just don't use supplied metadata at all;
and that seems to be the state of play in commercial IR.
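
As a rough illustration of the heuristic route, the Python sketch below
flags a META keywords string in which one term dominates. The function name
looks_stuffed and both thresholds are my own assumptions, not anything a
real engine is known to use.

```python
from collections import Counter

def looks_stuffed(keywords, max_share=0.3, min_terms=10):
    """Return True when a single term dominates a comma-separated keyword list.

    Lists shorter than `min_terms` are never flagged, since shares are
    meaningless on very small samples. Both thresholds are illustrative.
    """
    terms = [t.strip().lower() for t in keywords.split(",") if t.strip()]
    if len(terms) < min_terms:
        return False
    _, top_count = Counter(terms).most_common(1)[0]
    return top_count / len(terms) > max_share

stuffed = looks_stuffed("cars, " * 12 + "boats")
honest = looks_stuffed("rdf, metadata, search")
```

A spoofer who learns the thresholds can simply stay under them, which is
exactly the arms race.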


: In any case, the originator of the data doesn't have to be the only source
: of metadata - think DMOZ.

Sure, and DMOZ is a good example. But something is needed to weight and
evaluate all that 3rd party stuff...reification isn't sufficient.


: Also there is a lot of unrealised potential in
: that there number crunching - I've been looking at applying self-organising
: maps [1] to searching/automatic cataloguing, and I reckon it's perfectly
: feasible to classify according to semantic content through e.g. SOM-like
: conceptual mapping. This is only one of many available techniques - but in
: general generating metadata from content pretty much precludes spoofing.

Unsupervised learners are cool (especially if you've ever tried to train a
neural net to do anything beyond parity ;). The thing with SOMs and their
ilk: you're going down a path that ends up as likely as not using mainly
statistical methods.
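
For anyone who hasn't played with one, here is a minimal 1-D SOM sketch in
Python, using only the standard library. It is a toy under stated
assumptions, not the method from the work cited as [1]: the documents are
hypothetical term-count vectors over a four-word vocabulary, and the map
size, decay schedules and function names (bmu, train_som) are my own
choices.

```python
import math
import random

def bmu(weights, x):
    """Index of the unit whose weight vector is closest to x (squared Euclidean)."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def train_som(data, n_units=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a 1-D self-organising map on equal-length numeric vectors."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)   # neighbourhood shrinks, floor at 0.5
        for x in data:
            winner = bmu(weights, x)
            for i, w in enumerate(weights):
                # Gaussian neighbourhood along the 1-D map around the winner
                h = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights

# Hypothetical term counts over the vocabulary ["rdf", "metadata", "neural", "net"].
docs = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
weights = train_som(docs)
units = [bmu(weights, d) for d in docs]
```

Each document ends up assigned to a map unit, and the intent is that
documents with similar term counts tend to share a unit or sit on
neighbouring ones. Note that everything here is vectors and distances: no
supplied metadata is consulted at all.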

I wasn't referring to auto-regeneration of metadata from a data set in my
previous mail, but that's still an interesting approach. I wonder why you
need RDF for that though, and not just matrices, unless RDF becomes an
XML file format first and a representation format second. I expect the
argument is that you can use it for either.

Eric Miller spoke about some very interesting work auto-generating metadata
from data in Amsterdam last year. It was a cross-disciplinary team of
linguists and IR folk, iirc. If he's following this, maybe he'll throw in a
link to that work ;).

Actually, on that matter, anybody have links to hybrid or layered
methods (numerical + symbolic) wrt metadata and information retrieval?

regards,
Bill de hÓra

Received on Sunday, 15 April 2001 09:27:09 UTC