- From: Seth Russell <seth@robustai.net>
- Date: Sun, 22 Sep 2002 09:50:53 -0700
- To: rdfig <www-rdf-interest@w3.org>
Masahide Kanzaki wrote:

>I have several ideas on extracting (screen scraping) meta data from
>ordinary XHTML document, without any special syntax required. They will be
>more useful for automated processing if we could tell agents with
>media="meta" attribute which XSLT should be used for meta data extraction.
>

Good Idea :)

>(If an XHTML author uses multi-word <th>, such as "Publishing Place", this
>approach fails, but there will be some ways to manage these situations)
>

May I suggest that there is a fairly obvious transformation of multi-word
phrases to RDF-compatible IDs. Remove punctuation, lower-case the phrase
(so that it will produce more matches) and substitute '_' for each space.
Where a phrase contains a quotation, substitute '-' for each space inside
the quotation. For example:

    Publishing Place
    --> publishing_place

    "For Whom the Bell Tolls" by Earnest Hemengway
    --> for-whom-the-bell-tolls_by_earnest_hemengway

IDs of this nature have the advantage of being round-trippable between
RDF, N3, Semenglish, and I believe KIF. Yet they are very human friendly
too!

Seth Russell
http://robustai.net/sailor/
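The transformation described above can be sketched in Python as follows. This is only an illustration, not code from the original message: the function name `phrase_to_id` is hypothetical, and the sketch assumes that quotations are marked with straight double quotes, that punctuation inside a word is simply dropped, and that text after an unmatched opening quote counts as quoted.

```python
import re


def phrase_to_id(phrase: str) -> str:
    """Turn a multi-word phrase into an underscore/hyphen-separated ID.

    Hypothetical helper: the rules follow the message (strip punctuation,
    lower-case, '_' for spaces, '-' for spaces inside a quotation); the
    handling of edge cases is an assumption.
    """
    parts = []
    # Splitting on '"' puts quoted text at the odd indices of the result.
    for index, segment in enumerate(phrase.split('"')):
        inside_quotes = index % 2 == 1
        # Remove everything that is not a word character or whitespace,
        # then lower-case so that more phrases map to the same ID.
        cleaned = re.sub(r"[^\w\s]", "", segment).lower()
        words = cleaned.split()
        if not words:
            continue  # skip empty segments, e.g. before a leading quote
        separator = "-" if inside_quotes else "_"
        parts.append(separator.join(words))
    # Quoted and unquoted runs are themselves joined with '_'.
    return "_".join(parts)


print(phrase_to_id("Publishing Place"))
print(phrase_to_id('"For Whom the Bell Tolls" by Earnest Hemengway'))
```

With these assumptions the two calls print `publishing_place` and `for-whom-the-bell-tolls_by_earnest_hemengway`, matching the examples in the message. Note that the result is not guaranteed to be a legal XML name: a phrase that starts with a digit, for instance, would need further treatment before it could be used as an `rdf:ID`.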
Received on Sunday, 22 September 2002 12:51:34 UTC