- From: Calogero Alex Baldacchino <alex.baldacchino@email.it>
- Date: Sat, 03 Jan 2009 20:22:25 +0100
Dan Brickley wrote:
> On 3/1/09 14:02, Julian Reschke wrote:
>> Tab Atkins Jr. wrote:
>>> The most successful alternative is nothing at all. ^_^ We can
>>> extract copious data from web pages reliably without metadata, either
>>> using our human senses (in personal use) or natural-language-based
>>> processing (in search engine use). It has not yet been established
>>> that sufficient and significant enough problems *exist* to justify a
>>> solution, let alone one that requires an addition to html. That is
>>> what Ian is specifically looking for.
>>
>> That's what you and Ian claim. Many disagree.
>
> My main problem with the natural language processing option is that it
> feels too close to waiting for Artificial Intelligence. I'd rather add
> 6 attributes to HTML and get on with life.
>
> But perhaps a more practical concern is that it unfairly biases things
> towards popular languages - lucky English, lucky Spanish, etc., and
> those that lend themselves more to NLP analysis. *The Web is for
> everyone*, and people shouldn't be forced to read and write English to
> enjoy the latest advances in *Web automation*. Since HTML5 is going
> through W3C, such considerations need to be taken pretty seriously.

My concern is: is RDFa really suitable for everyone and for Web automation? My own answer, at first glance, is no. RDF(a) can perhaps nicely address very niche needs, where determining how far the data can be trusted is not a problem; but in general, misuses AND deliberate abuses may harm automation heavily, since an automaton is unlikely to be able to understand whether the metadata express the real meaning of a web page or not (without a certain degree of AI). If an external mechanism is needed to determine the trust level of metadata, that is, to establish when an automation's results are good or bad, such a mechanism may involve human beings at some stage, thus breaking the automation (this is somewhat similar to the problem of defining an "oracle machine" described by Turing, according to whom such a machine isn't an automaton).

On the other hand, a very custom model designed for very custom needs (and not requiring wide support) may be less prone to abuse, since it's unlikely that anyone is willing to cheat himself. Thus, having third parties agree on a certain model and related APIs, and implement the APIs on their own sides, might be more reliable in some cases (although those third parties would still have to agree that their respective metadata are reliable, and find a way to evaluate whether they really are).

Dan Brickley wrote:
> On 3/1/09 16:54, Håkon Wium Lie wrote:
>> Also sprach Dan Brickley:
>>
>> > My main problem with the natural language processing option is that it
>> > feels too close to waiting for Artificial Intelligence. I'd rather add 6
>> > attributes to HTML and get on with life.
>>
>> :-)
>
> Another thought re NLP. RDFa (and similar, ...) are formats that can
> be used for writing down the conclusions of NLP analysis. For example
> here see the BBC's recent Muddy Boots experiment, using DBpedia
> (Wikipedia in RDF) data to drive autoclassification / named entity
> recognition. So here we can agree with Ian and others that text
> analysis has much to offer, and still use RDFa (or other semantic
> markup - i'll sidestep that debate for now) as a notation for marking
> up the words with a machine-friendly indicator of their NLP-guessed
> meaning.
>
> http://www.bbc.co.uk/blogs/journalismlabs/2008/12/muddy_boots.html
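(Just to make that point concrete: below is a rough sketch of what "writing down the conclusions of NLP analysis" in RDFa could look like. The particular entity, the FOAF vocabulary and the DBpedia URI are only my own illustration, not taken from the BBC experiment.)

  <p xmlns:foaf="http://xmlns.com/foaf/0.1/">
    <!-- the NLP step has guessed that "Gordon Brown" names a person
         described by a DBpedia resource; RDFa records that guess -->
    <span about="http://dbpedia.org/resource/Gordon_Brown"
          typeof="foaf:Person"
          property="foaf:name">Gordon Brown</span>
    visited the newsroom today.
  </p>

A consuming tool could then extract "the resource ...Gordon_Brown is a foaf:Person" and "its foaf:name is 'Gordon Brown'" without re-running any text analysis - which is exactly where my trust question below comes in.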
>> Personally, I think the 'class' attribute may still be a more
>> compelling option in a less-is-more way. It already exists and can
>> easily be used for styling purposes. Styling is bait for authors to
>> disclose semantics.
>
> I'm sure there's mileage to be had there. I'm somehow incapable of
> writing XSLT so GRDDL hasn't really charmed me, but 'class' certainly
> corresponds to a lot of meaningful markup. Naturally enough it is
> stronger at tagging bits of information with a category than at
> defining relationships amongst the things defined when they're
> scattered around the page. But that's no reason to dismiss it entirely.
>
> Did you see the RDF-EASE draft,
> http://buzzword.org.uk/2008/rdf-ease/spec? From which comes: "Ten
> second sales pitch: CSS is an external file that specifies how your
> document should look; *RDF-EASE is an external file that specifies
> what your document means.*"
>
> RDF-EASE uses CSS-based syntax. More discussion here,
> http://lists.w3.org/Archives/Public/semantic-web/2008Dec/0148.html
> including the question of whether it ought to be expressed using
> css3-namespace,
> http://lists.w3.org/Archives/Public/semantic-web/2008Dec/0175.html
>
> cheers,
>
> Dan
>
> --
> http://danbri.org/

My question is: how often can I trust such a file to specify what your document really means, without evaluating its content? I'd distinguish two cases (without pretending to make a complete classification):

- The semantics described by the metadata are used for server-side computations: there's no need to evaluate the content (since I'm trusting you when navigating your site, and you're unlikely to be purposely messing with yourself), nor to have client-side support for such metadata in the UA. This is the case of a centralised database. For instance, a *pedia page may send queries to the server, which processes them and sends the results back to the user.

- The UA must understand the metadata and automatically gather information mashed up in a page from several sources: each source must be actively evaluated and trusted (a bot can't do that). This is the case of a decentralised database. For instance, it's easy to imagine a spamming advertiser who apparently puts honest content into your pages (which maybe take reliable content from DBpedia), but uses fake metadata to cheat my browser and send me irrelevant information (or information I'm not interested in) when I ask for related content [1], perhaps without you even guessing what's going on (and you may be losing visitors because of that).

For obvious reasons, a trust evaluation mechanism can't be as simple as getting/creating a signature to be used in a secure connection, because someone must actively evaluate at least two things:

- that the metadata really reflect the resource's content, and
- that the metadata are properly used with respect to the external schema involved in modelling the data (otherwise, no relationship would be reliable -- though this might be a minor concern from a certain angle, since misused metadata might be less harmful than deliberately abused ones).

The result can be very expensive (like certifying a driver or an application for a certain platform), or lead to the free choice of avoiding any evaluation and simply trusting any third party. Both solutions may work, perhaps, for niche/limited cases, but I don't think either is a good basis for a "global" - and general-purpose - automation.
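(To be a bit more concrete about the RDF-EASE idea quoted above - an external file saying what your class names mean - here is a rough sketch. I'm only guessing at the flavour of the syntax from the "CSS-based" description and the css3-namespace question in the linked thread; the property names below are illustrative, not quoted from the draft.)

  Ordinary markup, with classes that already double as styling hooks:

    <div class="contact">
      <span class="fn">Dan Brickley</span>
    </div>

  A separate RDF-EASE-like file mapping those classes to RDF terms:

    /* illustrative only -- the draft's actual properties may differ */
    @namespace foaf url(http://xmlns.com/foaf/0.1/);

    .contact     { -rdf-typeof: foaf:Person; }
    .contact .fn { -rdf-property: foaf:name; }

That is the combination Håkon and Dan describe: the class attribute is the bait (it's already there for styling), and the external file supplies the machine-readable meaning. My trust question applies to that external file just as it does to inline RDFa.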
[1] That's not the same as using the @rel attribute without any relationship to other metadata: a UA may just present a link described as pointing to a resource related to the surrounding content, so that I can choose whether or not to follow it; but if the @rel attribute is used by an automated mechanism, in response to a query and in combination with other metadata, the UA must decide on its own whether a link is worth following, and I don't think there is any easy way to take automated decisions involving trust.

Best regards,
Alex
Received on Saturday, 3 January 2009 11:22:25 UTC