RDF Use Case: scraping metadata from the web

This is a use case concerning XML literals which we identified during 
our discussions this week, along with a little analysis of it.  The use 
case may need some refining to fully capture i18n concerns.

Consider an application which is building an RDF store of metadata about 
web pages.  It crawls the web, extracting title information from web 
pages and representing this data as RDF.
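
A minimal sketch of the extraction step, assuming the crawler already has 
the page source as a string and only wants the text of the first title 
element.  The names TitleExtractor and extract_title are made up for 
illustration; only Python's standard html.parser module is used.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the character data found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        if not self._done and not self._parts:
            return None
        return "".join(self._parts).strip()


def extract_title(page_source):
    """Return the title text of a page, or None if there is no title."""
    parser = TitleExtractor()
    parser.feed(page_source)
    parser.close()
    return parser.title
```

For the page used later in this message, extract_title returns the string 
"chat"; a page with no title element gives None.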

Let's say it is searching for <title> elements, which may contain 
arbitrary markup.  Trying, for example:

   <title><em>title</em></title>

Hmm, checking: Amaya behaves oddly in this situation, and Mozilla gets it 
wrong.  And the validator objects; it says you aren't allowed <em> in 
titles.  This is XHTML 1.1.  Let's try span.  No, that doesn't seem to be 
legal either.

   <title><span xml:lang="en">title</span></title>

Doesn't validate.  Checking, the content model for the title element is 
PCDATA.  OK, let's suppose it's:

[[
<head xml:lang="en">
   <title>chat</title>
</head>
]]

That validates.  But I note that XHTML 1.1 does not allow markup in titles!

How does the application represent this in RDF?

Since you can't use markup in a title element, use a plain literal :)
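
A rough sketch of the plain literal route, assuming the fragment is 
well-formed XML with no default namespace (as in the example above) and 
that the language comes from the nearest enclosing xml:lang.  The 
function names and the http://example.org/ URIs are placeholders, not 
anything agreed by the WG.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def title_with_lang(xml_text):
    """Return (title text, in-scope xml:lang or None) for a well-formed fragment."""
    root = ET.fromstring(xml_text)
    # Map each child to its parent so we can walk upwards looking for xml:lang.
    parents = {child: parent for parent in root.iter() for child in parent}
    title = root if root.tag == "title" else root.find(".//title")
    if title is None:
        return None, None
    node, lang = title, None
    while node is not None and lang is None:
        lang = node.get(XML_LANG)
        node = parents.get(node)
    return (title.text or ""), lang


def nt_plain_literal(subject, predicate, text, lang=None):
    """Serialise one triple with a plain literal object as an N-Triples line."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    suffix = "@" + lang if lang else ""
    return f'<{subject}> <{predicate}> "{escaped}"{suffix} .'
```

With the head/title fragment above, title_with_lang gives ("chat", "en"), 
and the resulting triple carries the literal "chat"@en.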

But let's assume we are far-sighted and that markup will be allowed in 
titles in the future.

Well, in that case, you could use an rdf:XMLLiteral and include a span 
element to hold the lang tag.
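
As a sketch of what that wrapping might look like, assuming the crawler 
has the title text and its language in hand.  The helper names are made 
up, the span is left un-namespaced as in the examples above, and no 
attempt is made at canonicalising the XML, which a real rdf:XMLLiteral 
would need.

```python
from xml.sax.saxutils import escape, quoteattr

RDF_XMLLITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"


def wrap_in_lang_span(text, lang):
    """Wrap plain title text in a span element carrying the language tag."""
    return f"<span xml:lang={quoteattr(lang)}>{escape(text)}</span>"


def nt_xml_literal(subject, predicate, xml_fragment):
    """Serialise one triple with an XMLLiteral-typed object as an N-Triples line."""
    escaped = xml_fragment.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}"^^<{RDF_XMLLITERAL}> .'
```

Here wrap_in_lang_span("chat", "en") yields the same span markup tried 
earlier, with "chat" as its content.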

Objection: But then you couldn't use that literal with XHTML 1.1.

Response: Record that information separately in the graph e.g.

   <rdf:Description rdf:about="...">
     <ex:title>
       <rdf:Description>
         <rdf:value rdf:parseType="Literal">chat</rdf:value>
         <ex:lang>en</ex:lang>
       </rdf:Description>
     </...
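
Spelled out as triples, the structure above might look like the output 
of the following sketch.  The ex: prefix is not bound anywhere in this 
message, so http://example.org/ is used as a stand-in, and the function 
name and blank node label are likewise invented for illustration.

```python
from xml.sax.saxutils import escape

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/"  # hypothetical namespace standing in for ex:


def title_node_triples(page_uri, title_text, lang, bnode="_:title1"):
    """Return N-Triples lines for a title node holding its value and language."""
    # Escape for XML first (the literal is an XML fragment), then for N-Triples.
    lexical = escape(title_text).replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"<{page_uri}> <{EX_NS}title> {bnode} .",
        f'{bnode} <{RDF_NS}value> "{lexical}"^^<{RDF_NS}XMLLiteral> .',
        f'{bnode} <{EX_NS}lang> "{lang}" .',
    ]
```

The point of the design is that the literal itself stays free of any 
wrapper element, and the language travels as a separate property of the 
same node.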


Objection:  You've changed the title.  You can't recover the exact 
markup that was there in the first place because you can't tell whether 
the span was added by the crawler or was already in the source.

Response: Most of the time, you won't care.  If you do care, you can 
record the extra information in the graph.

Objection: <span> is HTML-specific.  You might want to use the literal 
in another context.

Response: Really need to refine the use case here, but in general if you 
are not prepared to commit to a specific markup language, you can use 
the graph to represent the underlying structure.

Brian

Received on Thursday, 31 July 2003 13:07:18 UTC