Masahide Kanzaki wrote:

>I have several ideas on extracting (screen scraping) meta data from
>ordinary XHTML document, without any special syntax required. They will be
>more useful for automated processing if we could tell agents with
>media="meta" attribute which XSLT should be used for meta data extraction.
>

Good Idea :)

>(If an XHTML author uses multi-word <th>, such as "Publishing Place", this
>approach fails, but there will be some ways to manage these situations)
>

May I suggest that there is a fairly obvious transformation of multi-word phrases to RDF-compatible IDs: remove punctuation, lower-case the phrase (so it will produce more matches), and substitute '_' for space. Where a phrase contains a quotation, substitute '-' for space inside the quotation. For example:

Publishing Place --> publishing_place
"For Whom the Bell Tolls" by Earnest Hemengway --> for-whom-the-bell-tolls_by_earnest_hemengway

IDs of this nature have the advantage of being round-trippable between RDF, N3, Semenglish, and I believe KIF. Yet they are very human friendly too!

Seth Russell
http://robustai.net/sailor/

Received on Sunday, 22 September 2002 12:51:34 GMT
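
The phrase-to-ID transformation described in the message above (remove punctuation, lower-case, '_' for spaces, '-' for spaces inside a quotation) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `phrase_to_id` is hypothetical, quotations are assumed to be delimited by straight double quotes, and the message does not say how digits, existing underscores, or non-ASCII letters should be handled, so the sketch simply keeps them.

```python
import re


def phrase_to_id(phrase: str) -> str:
    """Turn a multi-word phrase into an underscore/hyphen separated ID.

    Hypothetical helper: the name and the edge-case handling are
    assumptions, not part of the original proposal.
    """
    # Splitting on "..." with a capture group puts the quoted text at the
    # odd indexes of the result and the unquoted text at the even indexes.
    parts = re.split(r'"([^"]*)"', phrase)
    pieces = []
    for index, part in enumerate(parts):
        # Lower-case, then drop everything that is not a word character
        # or whitespace (assumption: digits and '_' are kept).
        cleaned = re.sub(r"[^\w\s]", "", part.lower())
        words = cleaned.split()
        if not words:
            continue
        # '-' joins words inside a quotation, '_' joins words outside it.
        separator = "-" if index % 2 == 1 else "_"
        pieces.append(separator.join(words))
    # Quoted and unquoted segments are themselves joined with '_'.
    return "_".join(pieces)


if __name__ == "__main__":
    print(phrase_to_id("Publishing Place"))
    print(phrase_to_id('"For Whom the Bell Tolls" by Earnest Hemengway'))
```

Run on the two examples from the message, this sketch yields `publishing_place` and `for-whom-the-bell-tolls_by_earnest_hemengway`. An ID produced this way is not guaranteed to be a legal XML name (for instance, a phrase starting with a digit), so a real implementation would likely need an extra check before using it as an `rdf:ID`.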