Re: Parsing Freebase RDF from Kingsley Idehen on 2009-03-16 (public-lod@w3.org from March 2009)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Mon, 16 Mar 2009 09:26:36 -0400
CC: Jamie Taylor <jamie@metaweb.com>, public-lod@w3.org
Message-ID: <49BE538C.1@openlinksw.com>
Richard Cyganiak wrote:
> On 16 Mar 2009, at 09:21, Rob Styles wrote:
>
>> This is an interesting question and one which we've been thinking 
>> about here at Talis as well.
>>
>> As we build linked data apps, with a view to the linked data being 
>> used as an api for other applications, we've thought that it is worth 
>> putting more into the response, typically we try to put everything 
>> you'd need to recreate the HTML representation.
>
> Yes, I think that's an excellent approach.
>
>> When you say it has important implications, can you expand on those? 
>> I had been thinking it was harmless. As I see it a client that 
>> expects only a DESCRIBE ?s should simply ignore the additional data 
>> provided, whereas clients that are crawling and merging into a graph 
>> will find they already have things as they expand what they know about.
>
> The main implication of choosing a less regular pattern is that others 
> cannot accurately re-create the linked data view from an RDF dump of 
> the dataset. For example, Sindice will index your dataset from your 
> RDF dump if you publish one and announce it through a semantic 
> sitemap. But Sindice will assume that each of your linked data 
> documents only contains the immediate surrounding triples of the 
> described resource. If you have additional triples in there, Sindice 
> will not know it because that fact is not visible from just looking at 
> the dump. The consequence is that searching in Sindice will sometimes 
> miss one of your documents even if it contains all the right 
> keywords/URIs.
>
> But that shouldn't affect how you publish your linked data, after all 
> the dumps are merely an optimization that allows easy bulk processing 
> of your linked data.
>
>> I can see that understanding what is likely to come back has big 
>> optimisation benefits for things like Sindice.
>
> Yes.
>
>> What is the 'correct' thing to do?
>
> For your linked data, you're doing the correct thing.
>
> If you produce RDF dumps, and you want Sindice and others to be able 
> to re-produce the structure of your linked data documents with maximum 
> fidelity, then consider producing quad dumps in N-Quads format [1] 
> instead of straight RDF dumps.
Jamie,

I would expect Quad Dumps to actually be quite natural for Freebase, right?
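
To make the idea concrete, here is a minimal sketch of what a quad dump adds over a plain triple dump: the fourth element names the document a triple was served in, so a consumer can regroup triples per document without guessing a slicing rule. The example.org URIs and the function names are made up for illustration; they are not real Freebase identifiers or part of any Freebase or Sindice API, and the serializer only handles URI-valued terms.

```python
from collections import defaultdict

# Hypothetical sample data: (subject, predicate, object, document) tuples.
# The example.org URIs are placeholders, not real Freebase identifiers.
QUADS = [
    ("http://example.org/ns/film", "http://example.org/ns/performance",
     "http://example.org/ns/perf1", "http://example.org/doc/film"),
    ("http://example.org/ns/perf1", "http://example.org/ns/actor",
     "http://example.org/ns/actor1", "http://example.org/doc/film"),
    ("http://example.org/ns/perf1", "http://example.org/ns/actor",
     "http://example.org/ns/actor1", "http://example.org/doc/perf1"),
]


def to_nquads(quads):
    """Serialize URI-only quads as N-Quads lines: <s> <p> <o> <graph> ."""
    return ["<%s> <%s> <%s> <%s> ." % q for q in quads]


def documents_from_quads(quads):
    """Group triples by their fourth element to recover each document."""
    docs = defaultdict(list)
    for s, p, o, g in quads:
        docs[g].append((s, p, o))
    return dict(docs)


if __name__ == "__main__":
    for line in to_nquads(QUADS):
        print(line)
    for doc, triples in documents_from_quads(QUADS).items():
        print(doc, len(triples))
```

Note that the same triple can appear in two documents (here the actor triple shows up in both the film document and the performance document), which is exactly the information a plain triple dump loses.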

Kingsley
>
> Best,
> Richard
>
> [1] http://sw.deri.org/2008/07/n-quads/
>
>>
>>
>> rob
>>
>> On 14 Mar 2009, at 12:12, Giovanni Tummarello wrote:
>>
>>> Hi Jamie,
>>>
>>> I see that your RDF per URI is more "expressive" than usual.
>>>
>>> Instead of just giving the triples out of (or into) the subject of the
>>> page, you also give the description of other notable entities inside it.
>>>
>>> For example, for the Blade Runner movie you give the full description of
>>> all the "film performances" (tying together the real actor, the fictional
>>> character and the movie). Each film performance has its own URI,
>>> which is itself resolvable, so "in theory" giving the detail of the
>>> "film performance" was not necessary according to LOD, but in
>>> practice it's definitely useful, as we know.
>>>
>>> Do you know the rule by which you decide to put multiple entities
>>> in the description that you give out?
>>> This has important implications.
>>>
>>> On the one hand, if there were a simple rule, always the same, it would
>>> make it easy for me to take your snapshot and index each URI's RDF
>>> description by applying that same rule (this is what we do for LOD datasets,
>>> which we simply split by "all the triples with subject or object X").
>>> Otherwise I can crawl
>>> and do my things internally, under the assumption that what you are
>>> providing are not a bunch of unrelated RDF files, but are really
>>> "slices" of the same dataset.
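
A minimal sketch of that "subject or object X" slicing rule, assuming the dump has already been parsed into plain (subject, predicate, object) tuples. The function name and the example.org URIs are hypothetical and stand in for real data:

```python
# Hypothetical triples standing in for a parsed RDF dump.
TRIPLES = [
    ("http://example.org/film", "http://example.org/performance",
     "http://example.org/perf1"),
    ("http://example.org/perf1", "http://example.org/actor",
     "http://example.org/actor1"),
    ("http://example.org/perf1", "http://example.org/character",
     "http://example.org/char1"),
]


def slice_subject_object(triples, uri):
    """Return every triple that has `uri` as its subject or as its object."""
    return [t for t in triples if t[0] == uri or t[2] == uri]


if __name__ == "__main__":
    for t in slice_subject_object(TRIPLES, "http://example.org/film"):
        print(t)
```

Under this rule the film's slice holds only the single triple linking it to the performance. The performance's own actor and character triples are not included, which is where a richer per-URI description diverges from what the rule predicts.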
>>>
>>> To assert that this is the case (and allow me to work more freely with the
>>> information), it would be useful to have a semantic sitemap linked from
>>> your robots.txt stating the URI of the dataset, its name, and the
>>> prefix at which you're serving its content as Linked Data.
>>>
>>> Here is an example sitemap. In it the "slicing" is set to "subject-object";
>>> in your case I guess not setting it is probably the most appropriate option.
>>>
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
>>>         xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
>>> <sc:dataset>
>>>   <sc:datasetLabel>Example Corp. Product Catalog</sc:datasetLabel>
>>>   <sc:datasetURI>http://example.com/catalog.rdf#catalog</sc:datasetURI>
>>>   <sc:linkedDataPrefix slicing="subject-object">http://example.com/products/</sc:linkedDataPrefix>
>>>   <sc:sampleURI>http://example.com/products/widgets/X42</sc:sampleURI>
>>>   <sc:sparqlEndpointLocation slicing="subject-object">http://example.com/sparql</sc:sparqlEndpointLocation>
>>>   <sc:dataDumpLocation>http://example.com/data/catalogdump.rdf.gz</sc:dataDumpLocation>
>>>   <changefreq>weekly</changefreq>
>>> </sc:dataset>
>>> </urlset>
>>>
>>> In your case, would it be technically simple to also provide an RDF dump?
>>> "No, it's too time consuming" is a perfectly good answer :-) (which
>>> means we have to live with it, e.g. by politely crawling).
>>>
>>> Giovanni
>>>
>>> On Fri, Mar 13, 2009 at 8:37 PM, Jamie Taylor <jamie@metaweb.com> wrote:
>>>> Seo -
>>>>
>>>> Yes, this is a bug in the current LOD/RDF interface to Freebase. I believe
>>>> it is fixed in the upcoming release, which can be previewed at
>>>> http://rdftest.mqlx.com/ns/en.blade_runner.
>>>>
>>>> I checked the Turtle output with:
>>>> rapper -i turtle http://rdftest.mqlx.com/ns/en.blade_runner
>>>>
>>>> Please give this sandbox version of the interface a try. I'm interested in
>>>> feedback from others on the list as well.
>>>>
>>>> I hope to have the new version in production sometime next week.
>>>>
>>>> Jamie
>>>>
>>>> On Mar 10, 2009, at 10:31 PM, Seo Sanghyeon wrote:
>>>>
>>>>> Hello, new to the list,
>>>>>
>>>>> I am trying to figure out how to use the Freebase RDF service.
>>>>> (See http://blog.freebase.com/2008/10/30/introducing_the_rdf_service/)
>>>>>
>>>>> $ curl -L http://rdf.freebase.com/ns/en.blade_runner -o en.blade_runner
>>>>> $ rdfproc freebase parse en.blade_runner turtle
>>>>>
>>>>> It is Turtle, right? The above fails with:
>>>>>
>>>>> rdfproc: Parsing URI
>>>>> file:///home/tinuviel/devel/freebase/en.blade_runner with turtle
>>>>> parser
>>>>> rdfproc: Error - URI
>>>>> file:///home/tinuviel/devel/freebase/en.blade_runner:2: The namespace
>>>>> prefix in "http:" was not declared.
>>>>> URI file:///home/tinuviel/devel/freebase/en.blade_runner:2 raptor
>>>>> fatal error - turtle_qname_to_uri failed
>>>>> rdfproc: Error - URI
>>>>> file:///home/tinuviel/devel/freebase/en.blade_runner:2: syntax error
>>>>> rdfproc: Failed to parse into the graph
>>>>> rdfproc: The parsing returned 2 errors and 0 warnings
>>>>>
>>>>> Help?
>>>>>
>>>>> -- 
>>>>> Seo Sanghyeon
>>>>>
>>
>> Rob Styles
>> tel: +44 (0)870 400 5000
>> fax: +44 (0)870 400 5001
>> mobile: +44 (0)7971 475 257
>> msn: mmmmmrob@yahoo.com
>> irc: irc.freenode.net/mmmmmrob,isnick
>> web: http://www.talis.com/
>> blog: http://www.dynamicorange.com/blog/
>> blog: http://blogs.talis.com/panlibus/
>> blog: http://blogs.talis.com/nodalities/
>> blog: http://blogs.talis.com/n2/
>
>
>


-- 


Regards,

Kingsley Idehen	      Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO 
OpenLink Software     Web: http://www.openlinksw.com
Received on Monday, 16 March 2009 13:27:12 UTC