
Finding RDFa content on the web

From: Tim Finin <finin@cs.umbc.edu>
Date: Mon, 29 May 2006 13:23:48 -0400
Message-ID: <447B2E24.3020608@cs.umbc.edu>
To: public-rdf-in-xhtml task force <public-rdf-in-xhtml-tf@w3.org>
CC: Tim Finin <finin@cs.umbc.edu>, Li Ding <dingli1@cs.umbc.edu>

We'd like to extend our Swoogle semantic web search engine
[1] to find and index content encoded in RDFa.  Swoogle's
database currently has extensive metadata on about 1M RDF
documents and 350K HTML documents with embedded RDF.

If we can develop an effective way to discover XHTML
documents with RDFa content, Swoogle could be used to track
and monitor RDFa's adoption, who is using it and how it's
being used.

Our problem is how to find pages likely to have RDFa
content.  Swoogle doesn't exhaustively crawl the Web for
documents with semantic web content but instead uses an
adaptive hybrid strategy [2] that starts with conventional
web search engines to discover initial seed documents.

The basic approach is to (1) use Google to find initial seed
documents; (2) drill down with subsequent site-specific
queries to find more; (3) employ a focused HTML crawler to
discover yet more; and (4) use an RDF scutter to discover
still more.
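
As a rough illustration only, the four stages above could be
chained as in the Python sketch below.  This is not Swoogle's
actual code: the function discover and its callable parameters
(search, html_links, rdf_links, is_candidate) are hypothetical
stand-ins for a search engine client, an HTML link extractor,
an RDF link extractor and a relevance test.  The only real
feature relied on is Google's site: query operator.

```python
from typing import Callable, Iterable, Set
from urllib.parse import urlparse

Search = Callable[[str], Iterable[str]]  # query -> result URLs
Links = Callable[[str], Iterable[str]]   # URL -> URLs it refers to


def discover(query: str, search: Search, html_links: Links,
             rdf_links: Links,
             is_candidate: Callable[[str], bool]) -> Set[str]:
    """Run the four discovery stages once and return the URLs found."""
    # (1) Seed documents from a conventional search engine.
    found = set(search(query))

    # (2) Drill down: repeat the query restricted to each seed's site.
    for host in sorted({urlparse(u).netloc for u in found}):
        found.update(search(f"{query} site:{host}"))

    # (3) Focused HTML crawl: follow links, keep only promising pages.
    for url in list(found):
        found.update(u for u in html_links(url) if is_candidate(u))

    # (4) RDF scutter: follow references between semantic web documents.
    for url in list(found):
        found.update(rdf_links(url))

    return found
```

Each stage makes a single pass over what the previous stages
found; a real crawler would presumably iterate, deduplicate and
rate-limit, all of which is left out here.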

My question is, are there Google queries that will be useful
for finding XHTML documents with RDFa content?  For example,
a Google query like 'rdf -rss filetype:rdf' produces lots of
RDF documents. I tried searches like '"rel=" "html xmlns:"'
but virtually all of the documents found use the rel
attribute in conventional ways.
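
Since a text query cannot tell an ordinary rel value from an
RDFa one, one option is to filter fetched candidates after the
search.  The Python sketch below is a heuristic under stated
assumptions, not a conforming RDFa parser: it assumes RDFa use
shows up as property, rel or rev values of the form
prefix:name, where the prefix was declared with an xmlns:
attribute somewhere earlier in the document.  The names
RDFaSniffer and looks_like_rdfa are made up for illustration.

```python
from html.parser import HTMLParser

# Assumption: these attributes carry prefixed names in RDFa markup.
CANDIDATE_ATTRS = {"property", "rel", "rev"}


class RDFaSniffer(HTMLParser):
    """Collect attribute values that use a declared namespace prefix."""

    def __init__(self):
        super().__init__()
        self.prefixes = set()  # prefixes declared via xmlns:prefix="..."
        self.hits = []         # (tag, attribute, token) triples

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so prefixes are
        # compared case-insensitively here.
        for name, _ in attrs:
            if name.startswith("xmlns:"):
                self.prefixes.add(name[len("xmlns:"):])
        for name, value in attrs:
            if name not in CANDIDATE_ATTRS or not value:
                continue
            for token in value.split():
                prefix, sep, _ = token.partition(":")
                if sep and prefix.lower() in self.prefixes:
                    self.hits.append((tag, name, token))


def looks_like_rdfa(markup: str) -> bool:
    """Return True if the markup contains at least one prefixed value."""
    sniffer = RDFaSniffer()
    sniffer.feed(markup)
    return bool(sniffer.hits)
```

A page whose only rel values are things like "stylesheet" yields
no hits, while one with xmlns:dc declared and
property="dc:creator" does.  Pages that declare namespaces for
other reasons could still produce false positives.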

If anyone has suggestions for search engine queries that
might be good at finding RDFa content, please let me know.
If there aren't any, maybe it would be good to develop a
convention by which an XHTML document can assert that it has
RDFa content and to encourage its use as a best practice.

Tim

[1] http://swoogle.umbc.edu/
[2] http://ebiquity.umbc.edu/paper/html/id/304/Search-Engines-for-Semantic-Web-Knowledge


-- 
  Tim Finin, Computer Science & Electrical Engineering, Univ of Maryland
  Baltimore County, 1000 Hilltop Cir, Baltimore MD 21250. finin@umbc.edu
  http://ebiquity.umbc.edu 410-455-3522 fax:-3969 http://umbc.edu/~finin
Received on Monday, 29 May 2006 17:24:01 GMT
