
Re: RDF Use Case: scraping metadata from the web

From: Martin Duerst <duerst@w3.org>
Date: Thu, 31 Jul 2003 14:25:14 -0400
Message-Id: <4.2.0.58.J.20030731134004.069a2960@localhost>
To: Brian McBride <bwm@hplb.hpl.hp.com>, rdf core <w3c-rdfcore-wg@w3.org>, i18n <w3c-i18n-ig@w3.org>

Hello Brian,

Many thanks for writing this up. Some comments below:

At 18:05 03/07/31 +0100, Brian McBride wrote:

>This is a use case concerning xml literals which we identified during our 
>discussions this week, and a little analysis of it.  The use case may need 
>some refining to fully capture i18n concerns.
>
>Consider an application which is building an RDF store of metadata 
>about web pages.  It crawls the web, extracting title information from web 
>pages, and then represents this data as RDF.

And this application may scrape data from things other than (HTML)
Web pages. And two independent applications may scrape the same data.


>Let's say it is searching for <title> elements, which may contain arbitrary 
>markup.  Trying for example:
>
>   <title><em>title</em></title>

>But I note that XHTML 1.1 does not allow markup in titles!

Yes. As I mentioned in one of my mails, we are working with
the HTML WG to fix this in XHTML 2.0.

An application may also scrape <h1> elements. <h1> can contain
markup, and although often misused just to control font size,
its semantics are close to <title>.
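
To make the scraping step concrete, here is a minimal sketch of a crawler
step that pulls the inner markup out of <title> and <h1>. It assumes
well-formed XHTML input; the function names (inner_markup, scrape_titles)
and the sample page are hypothetical, and namespace declarations are left
out for brevity.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def inner_markup(elem):
    """Serialize the children of elem (text and markup) without elem's own tags."""
    parts = [escape(elem.text or "")]
    for child in elem:
        # tostring() includes the child's tail text, so nothing is lost.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

def scrape_titles(xhtml):
    """Return (element name, inner markup) for every <title> and <h1>."""
    root = ET.fromstring(xhtml)
    return [(el.tag, inner_markup(el))
            for el in root.iter() if el.tag in ("title", "h1")]

# Hypothetical sample page (no namespace, for brevity).
PAGE = ("<html><head><title><em>title</em></title></head>"
        "<body><h1>A <em>chat</em> about cats</h1></body></html>")

for name, markup in scrape_titles(PAGE):
    print(name, "->", markup)
```

The strings returned here are what the application then has to represent
in RDF, either as plain literals or as XML literals.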


>How does the application represent this in RDF?
>
>Since you can't use markup in a title element, use a plain literal :)
>
>But let's assume we are far-sighted and that markup will be allowed 
>in titles in the future.
>
>Well, in that case, you could use an rdf:XMLLiteral and include a span 
>element to hold the lang tag.
>
>Objection: But then you couldn't use that literal with XHTML 1.1.
>
>Response: Record that information separately in the graph e.g.
>
>   <rdf:Description rdf:about="...">
>     <ex:title>
>       <rdf:Description>
>         <rdf:value rdf:parseType="Literal">chat</rdf:value>
>         <ex:lang>en</ex:lang>
>       </rdf:Description>
>     </...
>
>
>Objection:  You've changed the title.  You can't recover the exact markup 
>that was there in the first place because you can't tell whether the span 
>was added by the crawler or was there in the first place.

Yes. This is called markup integrity (or violation of markup integrity).
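
A small sketch of why the original markup cannot be recovered: two
different source titles end up as the same stored literal. The helper
wrap_with_lang and the sample strings are hypothetical, and the crawler
behaviour is an assumption made for illustration.

```python
def wrap_with_lang(title_markup, lang):
    """Hypothetical crawler step: add a span to carry the language tag."""
    return '<span xml:lang="%s">%s</span>' % (lang, title_markup)

# Page 1: the author wrote a bare title; the crawler adds the span.
stored_from_plain = wrap_with_lang("chat", "en")

# Page 2: the author wrote the span; the crawler stores the title as found.
stored_from_span = '<span xml:lang="en">chat</span>'

# Both pages yield the same literal, so a consumer cannot tell
# which span came from the author and which from the crawler.
print(stored_from_plain == stored_from_span)
```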


>Response: Most of the time, you won't care.  If you do care, you can 
>record the extra information in the graph.

We can always record any kind of information in the graph. But the
problem there is that different people/applications will do this
differently. So now we have gone from the problem of not knowing what
kind of <dummy> markup to use (or what has been used) to the problem of
not knowing what kind of graph structure to use (or what has been used),
or, even worse, of not knowing what kind of <dummy> markup or graph
structure has been used.

For example, in a recent discussion,
Tim was proposing to do something like:

   <rdf:Description rdf:about="...">
     <ex:title>
       <rdf:Description>
         <&lang;en rdf:parseType="Literal">chat</&lang;en>
       </rdf:Description>
     </...

(i.e. making languages properties). There is also Patrick's
proposal about reification (see 
http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2003Jul/0069.html).
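
As a rough illustration of the interoperability problem, here is a sketch
in which the same fact (title "chat", language "en") is stored in two graph
shapes, and a consumer written for one shape finds nothing in the other.
Triples are modelled as plain tuples; the property names ex:lang and
lang:en, the blank node label, and the page URI are placeholders, not
actual vocabulary from either proposal.

```python
PAGE = "http://example.org/page"  # placeholder URI

# Shape A: language recorded as a separate ex:lang property.
graph_a = {
    (PAGE, "ex:title", "_:t"),
    ("_:t", "rdf:value", "chat"),
    ("_:t", "ex:lang", "en"),
}

# Shape B: the language is itself the property (placeholder name lang:en).
graph_b = {
    (PAGE, "ex:title", "_:t"),
    ("_:t", "lang:en", "chat"),
}

def objects(graph, subject, predicate_test):
    """Yield (predicate, object) pairs of subject whose predicate passes the test."""
    for s, p, o in graph:
        if s == subject and predicate_test(p):
            yield p, o

def title_lang_shape_a(graph, page):
    """Consumer that only understands shape A."""
    for _, node in objects(graph, page, lambda p: p == "ex:title"):
        for _, lang in objects(graph, node, lambda p: p == "ex:lang"):
            return lang
    return None

def title_lang_shape_b(graph, page):
    """Consumer that only understands shape B."""
    for _, node in objects(graph, page, lambda p: p == "ex:title"):
        for pred, _ in objects(graph, node, lambda p: p.startswith("lang:")):
            return pred[len("lang:"):]
    return None

print(title_lang_shape_a(graph_a, PAGE), title_lang_shape_a(graph_b, PAGE))
print(title_lang_shape_b(graph_b, PAGE), title_lang_shape_b(graph_a, PAGE))
```

Each consumer recovers the language only from the graph shape it was
written for, which is the situation a uniform mechanism avoids.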

This is quite contrary to what we need; xml:lang was defined in
XML, and not elsewhere, so that a wide range of applications could
make use of it in a uniform way. The handling of language information
in RDF M&S was also designed with the goal of allowing applications
to make use of it in a uniform way.


>Objection: <span> is HTML-specific.  You might want to use the literal in 
>another context.
>
>Response: Really need to refine the use case here, but in general if you 
>are not prepared to commit to a specific markup language, you can use the 
>graph to represent the underlying structure.

Same problems as above, I guess.


Regards,     Martin.
Received on Thursday, 31 July 2003 14:39:04 EDT
