Re: Parsing Freebase RDF from Richard Cyganiak on 2009-03-16 (public-lod@w3.org from March 2009)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Mon, 16 Mar 2009 13:06:15 +0000
To: "Rob Styles" <rob.styles@talis.com>
Cc: <giovanni.tummarello@deri.org>, "Jamie Taylor" <jamie@metaweb.com>, "Seo Sanghyeon" <sanxiyn@gmail.com>, <public-lod@w3.org>
Message-Id: <3EA5BFED-0BE1-47A0-A2A6-F6C05446F9CC@cyganiak.de>
On 16 Mar 2009, at 09:21, Rob Styles wrote:

> This is an interesting question and one which we've been thinking  
> about here at Talis as well.
>
> As we build linked data apps, with a view to the linked data being
> used as an API for other applications, we've thought that it is
> worth putting more into the response; typically we try to put
> everything you'd need to recreate the HTML representation.

Yes, I think that's an excellent approach.

> When you say it has important implications, can you expand on those?  
> I had been thinking it was harmless. As I see it a client that  
> expects only a DESCRIBE ?s should simply ignore the additional data  
> provided, whereas clients that are crawling and merging into a graph  
> will find they already have things as they expand what they know  
> about.

The main implication of choosing a less regular pattern is that others  
cannot accurately re-create the linked data view from an RDF dump of  
the dataset. For example, Sindice will index your dataset from your  
RDF dump if you publish one and announce it through a semantic  
sitemap. But Sindice will assume that each of your linked data  
documents only contains the immediate surrounding triples of the  
described resource. If you have additional triples in there, Sindice  
will not know it because that fact is not visible from just looking at  
the dump. The consequence is that searching in Sindice will sometimes
miss one of your documents even if it contains all the right keywords
and URIs.

But that shouldn't affect how you publish your linked data; after all,
the dumps are merely an optimization that allows easy bulk processing
of your linked data.
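
As a rough illustration of the Sindice point, here is a minimal Python
sketch. The triples, the ex: names and the subject_object_slice helper
are all made up for this example; it is not Sindice's actual code. It
compares what a subject-object slice of a dump yields for a URI with
what a richer linked data document might actually serve.

```python
# Hypothetical dataset: triples as (subject, predicate, object) tuples.
DUMP = [
    ("ex:blade_runner", "ex:name", '"Blade Runner"'),
    ("ex:blade_runner", "ex:starring", "ex:performance1"),
    ("ex:performance1", "ex:actor", "ex:harrison_ford"),
    ("ex:performance1", "ex:character", "ex:rick_deckard"),
]


def subject_object_slice(dump, uri):
    """Triples with `uri` as subject or object: what an indexer
    working only from the dump would assume the document contains."""
    return {t for t in dump if t[0] == uri or t[2] == uri}


# Assumption: the server's document for ex:blade_runner also embeds
# the full description of the film performance, i.e. all four triples.
SERVED = set(DUMP)

assumed = subject_object_slice(DUMP, "ex:blade_runner")
invisible = SERVED - assumed  # served, but not recoverable from the dump

print(sorted(invisible))
```

In this toy case a search for ex:harrison_ford would not lead to the
ex:blade_runner document, because the slice reconstructed from the dump
contains no triple that mentions him.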

> I can see that understanding what is likely to come back has big  
> optimisation benefits for things like Sindice.

Yes.

> What is the 'correct' thing to do?

For your linked data, you're doing the correct thing.

If you produce RDF dumps, and you want Sindice and others to be able  
to reproduce the structure of your linked data documents with maximum  
fidelity, then consider producing quad dumps in N-Quads format [1]  
instead of straight RDF dumps.
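
To sketch why quads help: if the fourth element of each quad names the
document a triple is served in (an assumption about how the dump is
produced, as are the example.com URIs and the documents_from_quads
helper), a consumer can rebuild each document exactly, without guessing
a slicing rule. The regex below only handles plain <IRI> terms and
simple "literal" terms, so it is not a full N-Quads parser.

```python
import re
from collections import defaultdict

# Hypothetical quad dump: subject, predicate, object, then the URI of
# the document that serves the triple.
NQUADS = """\
<http://example.com/film/1> <http://example.com/ns#starring> <http://example.com/perf/1> <http://example.com/doc/film/1> .
<http://example.com/perf/1> <http://example.com/ns#actor> <http://example.com/person/1> <http://example.com/doc/film/1> .
<http://example.com/perf/1> <http://example.com/ns#actor> <http://example.com/person/1> <http://example.com/doc/perf/1> .
"""

# Simplified term pattern: <IRI> or "literal" without escapes.
TERM = re.compile(r'<[^>]*>|"[^"]*"')


def documents_from_quads(text):
    """Group triples by their fourth element (the serving document)."""
    docs = defaultdict(set)
    for line in text.splitlines():
        terms = TERM.findall(line)
        if len(terms) != 4:
            continue  # skip blank lines and anything this sketch cannot read
        s, p, o, context = terms
        docs[context].add((s, p, o))
    return dict(docs)


docs = documents_from_quads(NQUADS)
for doc_uri, triples in docs.items():
    print(doc_uri, len(triples))
```

The same triple can appear in more than one document this way, which is
exactly the information a plain triple dump cannot carry.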

Best,
Richard

[1] http://sw.deri.org/2008/07/n-quads/





> rob
>
> On 14 Mar 2009, at 12:12, Giovanni Tummarello wrote:
>
>> Hi Jamie,
>>
>> I see that your RDF per URI is more "expressive" than the "usual":
>>
>> instead of just giving triples out of (or into) the subject of the
>> page, you also give the description of other notable entities inside.
>>
>> For example, in the Blade Runner movie you give the full description
>> of all the "film performances" (tying the real actor, the fictional
>> character and the movie). Each film performance then has its URI,
>> which is itself resolvable, so "in theory" giving the detail of the
>> "film performance" was not necessary, according to LOD, but in
>> practice it's definitely useful, as we know.
>>
>> Would you know the rule by which you decide to put multiple entities
>> in the description that you give out?
>> This has important implications.
>>
>> On the one hand, if there were a simple rule, always the same, it
>> would be easy for me to get your snapshot and index each URI's RDF
>> description by applying this same rule (what we do for LOD datasets,
>> which we simply split into "all the triples with subject or object
>> X"). Otherwise I can crawl and do my things internally, under the
>> assumption that what you are providing is not a bunch of unrelated
>> RDF files, but really "slices" of the same dataset.
>>
>> To assert this is the case (and allow me to play more freely with the
>> information) it would be useful to have a semantic sitemap linked in
>> your robots.txt stating the URI of the dataset, with the name and the
>> prefix at which you're serving its content as linked data.
>>
>> Example sitemap: here the "slicing" is set to "subject-object"; in
>> your case I guess not setting it is probably the most appropriate
>> option.
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
>>         xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
>>   <sc:dataset>
>>     <sc:datasetLabel>Example Corp. Product Catalog</sc:datasetLabel>
>>     <sc:datasetURI>http://example.com/catalog.rdf#catalog</sc:datasetURI>
>>     <sc:linkedDataPrefix slicing="subject-object">http://example.com/products/</sc:linkedDataPrefix>
>>     <sc:sampleURI>http://example.com/products/widgets/X42</sc:sampleURI>
>>     <sc:sparqlEndpointLocation slicing="subject-object">http://example.com/sparql</sc:sparqlEndpointLocation>
>>     <sc:dataDumpLocation>http://example.com/data/catalogdump.rdf.gz</sc:dataDumpLocation>
>>     <changefreq>weekly</changefreq>
>>   </sc:dataset>
>> </urlset>
>>
>> In your case, would it be technically simple to also provide an RDF
>> dump? "No, it's too time consuming" is a perfectly good answer :-)
>> (which means we have to live with it, e.g. by politely crawling).
>>
>> Giovanni
>>
>> On Fri, Mar 13, 2009 at 8:37 PM, Jamie Taylor <jamie@metaweb.com>  
>> wrote:
>>> Seo -
>>>
>>> Yes, this is a bug in the current LOD/RDF interface to Freebase. I
>>> believe it is fixed in the upcoming release, which can be previewed at
>>> http://rdftest.mqlx.com/ns/en.blade_runner.
>>>
>>> I checked the Turtle output with:
>>> rapper -i turtle http://rdftest.mqlx.com/ns/en.blade_runner
>>>
>>> Please give this sandbox version of the interface a try. I'm
>>> interested in feedback from others on the list as well.
>>>
>>> I hope to have the new version in production sometime next week.
>>>
>>> Jamie
>>>
>>> On Mar 10, 2009, at 10:31 PM, Seo Sanghyeon wrote:
>>>
>>>> Hello, new to the list,
>>>>
>>>> I am trying to figure out how to use the Freebase RDF service.
>>>> (See http://blog.freebase.com/2008/10/30/introducing_the_rdf_service/)
>>>>
>>>> $ curl -L http://rdf.freebase.com/ns/en.blade_runner -o en.blade_runner
>>>> $ rdfproc freebase parse en.blade_runner turtle
>>>>
>>>> It is Turtle, right? The above fails with:
>>>>
>>>> rdfproc: Parsing URI file:///home/tinuviel/devel/freebase/en.blade_runner with turtle parser
>>>> rdfproc: Error - URI file:///home/tinuviel/devel/freebase/en.blade_runner:2: The namespace prefix in "http:" was not declared.
>>>> URI file:///home/tinuviel/devel/freebase/en.blade_runner:2 raptor fatal error - turtle_qname_to_uri failed
>>>> rdfproc: Error - URI file:///home/tinuviel/devel/freebase/en.blade_runner:2: syntax error
>>>> rdfproc: Failed to parse into the graph
>>>> rdfproc: The parsing returned 2 errors and 0 warnings
>>>>
>>>> Help?
>>>>
>>>> --
>>>> Seo Sanghyeon
>
> Rob Styles
> tel: +44 (0)870 400 5000
> fax: +44 (0)870 400 5001
> mobile: +44 (0)7971 475 257
> msn: mmmmmrob@yahoo.com
> irc: irc.freenode.net/mmmmmrob,isnick
> web: http://www.talis.com/
> blog: http://www.dynamicorange.com/blog/
> blog: http://blogs.talis.com/panlibus/
> blog: http://blogs.talis.com/nodalities/
> blog: http://blogs.talis.com/n2/
>
Received on Monday, 16 March 2009 13:07:00 UTC