- From: dehora <bill@dehora.fsnet.co.uk>
- Date: Sun, 15 Apr 2001 14:26:43 +0100
- To: "Danny Ayers" <danny@panlanka.net>, "RDF-Interest" <www-rdf-interest@w3.org>
: <- Refining the query may well be the easy bit. Controlling metadata
: <- spoofing in the target data sounds really hard to me: that's one
: <- thing the number crunching approach to search has in its favour,
: <- the potential for hybrid approaches (such as graphing citations)
: <- notwithstanding.
:
: Why should metadata spoofing be an issue? Its practice on the web appears to
: be on the decline, maybe as the pron vendors have started realising that
: targeting is a more efficient sales strategy than pissing people off.

You don't think that has a lot to do with search engines skirting around
META tags? Or eliminating mass repetition of terms? Or lawsuits? Or indeed
use of evaluations that are more difficult to spoof, like the way Google
uses backlinking? Pissing people off isn't an issue when a response rate of
3% is sufficient margin.

Web metadata spoofing is an issue because people spoof, all the time. The
interesting difference is that machines aren't well equipped to detect it,
unless you:

a: add heuristics to your code to block spoofing, and you wind up in an
   arms race, or,
b: if the results seem "good enough" from numerical/statistical methods,
   just not use supplied metadata at all; and that seems to be the state of
   play in commercial IR.

: In any case, the originator of the data doesn't have to be the only source
: of metadata - think DMOZ.

Sure, and DMOZ is a good example. But something is needed to weight and
evaluate all that 3rd party stuff... reification isn't sufficient.

: Also there is a lot of unrealised potential in
: that there number crunching - I've been looking at applying self-organising
: maps [1] to searching/automatic cataloguing, and I reckon it's perfectly
: feasible to classify according to semantic content through e.g. SOM-like
: conceptual mapping. This is only one of many available techniques - but in
: general generating metadata from content pretty much precludes spoofing.

Unsupervised learners are cool (especially if you've ever tried to train a
neural net to do anything beyond parity ;). The thing with SOMs and their
ilk: you're going down a path that ends up, as likely as not, using mainly
statistical methods.

I wasn't referring to auto-generation of metadata from a data set in my
previous mail, but that's still an interesting approach. I wonder why you
need RDF for that though, and not just matrices. Unless RDF becomes an XML
file format first and a representation format second. I expect the argument
is you can use it for either.

Eric Miller spoke about some very interesting work auto-generating metadata
from data in Amsterdam last year. It was a cross-disciplinary team of
linguists and IR folk, iirc. If he's following this, maybe he'll throw in a
link to that work ;).

Actually, on that matter, does anybody have links to hybrid or layered
methods (numerical + symbolic) wrt metadata and information retrieval?

regards,
Bill de hÓra
Received on Sunday, 15 April 2001 09:27:09 UTC