Re: PHP RDF fetching code from Hugh Glaser on 2010-01-28 (public-lod@w3.org from January 2010)

From: Hugh Glaser <hg@ecs.soton.ac.uk>
Date: Thu, 28 Jan 2010 12:26:29 +0000
To: Stephane Corlosquet <scorlosquet@gmail.com>
CC: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <EMEW3|da3f70751b24b3437a0fdc964f295579m0RCQa02hg|ecs.soton.ac.uk|C78732F5.FC4D%>
Thanks for the pointer.
(Won’t actually look at the ARC code at the moment, as it may be hard to comply with Benji’s license.)

However, rather than being as clever as possible, somehow I thought I should respect what the publisher said, so perhaps first Content-Type, then extension, rather than ignoring them.
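A minimal sketch of that ordering (Content-Type first, then extension), not a tested implementation. The function name guessRdfFormat, the extension table, and the choice to send N3 media types to a Turtle parser are all assumptions; the returned strings are meant to line up with rapper's input format names. text/plain is deliberately left out of the media-type table, since it says too little on its own.

```php
<?php
// Hypothetical helper: pick a parser format name from what the publisher said.
// Returns 'rdfxml', 'turtle', 'ntriples', or null when nothing is conclusive.
function guessRdfFormat($contentType, $url)
{
    // Media types from the Accept header in the fetching code, minus
    // text/plain, which is too vague to be trusted on its own.
    $byType = array(
        'application/rdf+xml'  => 'rdfxml',
        'text/turtle'          => 'turtle',
        'application/x-turtle' => 'turtle',
        'application/turtle'   => 'turtle',
        'text/n3'              => 'turtle', // assumption: N3 handed to a Turtle parser
        'text/rdf+n3'          => 'turtle', // assumption, as above
    );
    // Extension table is an assumption; extend to taste.
    $byExt = array(
        'rdf' => 'rdfxml',
        'owl' => 'rdfxml',
        'ttl' => 'turtle',
        'n3'  => 'turtle',
        'nt'  => 'ntriples',
    );

    // 1. Content-Type, with any "; charset=..." parameter stripped.
    $parts = explode(';', (string) $contentType);
    $type = strtolower(trim($parts[0]));
    if (isset($byType[$type])) {
        return $byType[$type];
    }

    // 2. Extension of the URL path (query string and fragment ignored).
    $path = parse_url((string) $url, PHP_URL_PATH);
    $ext = is_string($path) ? strtolower(pathinfo($path, PATHINFO_EXTENSION)) : '';
    if (isset($byExt[$ext])) {
        return $byExt[$ext];
    }

    return null; // nothing conclusive: leave it to the parser's own guessing
}
```

With this, a text/plain response for http://example.org/data.ttl would come out as 'turtle', while a declared application/rdf+xml wins over any extension.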

The reason I wasn’t relying on rapper --guess is that the handover to rapper is part of the RDF store, and I will probably use other stores that don’t use rapper.
Also, I wanted to gather statistics on what RDF format people were using, and couldn’t see an option to rapper to tell me the input type that it guessed.

At the moment I record the Content-Type and the extension, and then let rapper or whatever do their magic – I guess that is enough.
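Recording the two values could look roughly like the sketch below. The function name recordFormatStats and the "type | extension" key layout are hypothetical, and how the tally is persisted is left open.

```php
<?php
// Hypothetical tally of what publishers send; $stats is a plain array that
// the caller would persist however it likes (file, database, ...).
function recordFormatStats(array &$stats, $contentType, $url)
{
    // Normalise "Text/Turtle; charset=UTF-8" to "text/turtle".
    $parts = explode(';', (string) $contentType);
    $type = strtolower(trim($parts[0]));
    if ($type === '') {
        $type = '(none)';
    }

    // Extension of the URL path, ignoring query string and fragment.
    $path = parse_url((string) $url, PHP_URL_PATH);
    $ext = is_string($path) ? strtolower(pathinfo($path, PATHINFO_EXTENSION)) : '';
    if ($ext === '') {
        $ext = '(none)';
    }

    // Count the pair, so mismatches such as text/plain + ttl stay visible.
    $key = $type . ' | ' . $ext;
    $stats[$key] = isset($stats[$key]) ? $stats[$key] + 1 : 1;
}

// Example URLs are made up for illustration.
$stats = array();
recordFormatStats($stats, 'text/plain', 'http://example.org/data.ttl');
recordFormatStats($stats, 'text/plain; charset=utf-8', 'http://example.org/more.ttl');
recordFormatStats($stats, 'application/rdf+xml', 'http://example.org/id/thing');
```

Keying on the pair rather than on each value separately keeps the interesting cases (a Turtle file served as text/plain, say) countable later.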

Cheers
Hugh

On 28/01/2010 02:25, "Stephane Corlosquet" <scorlosquet@gmail.com> wrote:

Hugh,

The ARC2 parser has a "built-in RDF format detector" [1]. You might want to look at the code to see how it's done.

Why not use the --guess option of rapper?

Steph.

[1] http://arc.semsol.org/docs/v2/parsing

On Wed, Jan 27, 2010 at 9:08 PM, Hugh Glaser <hg@ecs.soton.ac.uk> wrote:
On 27/01/2010 09:49, "Tom Heath" <tom.heath@talis.com> wrote:

> +1 for Moriarty, whether you're working with the Platform or not. Ian
> and the other contributors have done a great job - personally I'd
> start here before writing any new code.
Too true mate.

Now my next bit of pissing about.
Before writing it (if I can find the gumption).
Don't think this is in Moriarty, as the Talis Platform is, of course, well-behaved.

I run cURL, using an amended version of what was described before (as at the end of this message).

So now I need to deal with what comes back.
I actually hand it over to rapper, so would sort of like to know what the data is to improve the reliability by setting the rapper type parameter.
I am trying to avoid looking inside the file, although am happy to if someone can provide the code :-).
The Content-Type is unreliable – for example could (is likely to) be text/plain for a turtle file that someone has put on a standard web server.
So it is the usual problem of messing about with extensions, modified by extra information from the Content-Type.
Of course we need to worry about the final URL (curl_getinfo($ch)['url']), possibly as well as the requesting URI, as that might be where there is an extension.
So perhaps something that sets the Content-Type in curl_getinfo($ch) as best it can?
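The extension part might be sketched as below: look at the final URL first, then fall back to the requesting URI. Both function names are hypothetical, and the sketch only reads the URL path, not the file contents.

```php
<?php
// Hypothetical helper: extension of a URL's path, lower-cased, or '' if none.
function urlExtension($url)
{
    // parse_url() drops the query string and fragment for us.
    $path = parse_url((string) $url, PHP_URL_PATH);
    return is_string($path) ? strtolower(pathinfo($path, PATHINFO_EXTENSION)) : '';
}

// Prefer the URL cURL ended up at, then the URI that was asked for.
function extensionForFetch($finalUrl, $requestedUri)
{
    $ext = urlExtension($finalUrl);
    return $ext !== '' ? $ext : urlExtension($requestedUri);
}
```

After the fetch this would be called as extensionForFetch($info['url'], $_REQUEST['uri']), with $info being the array returned by curl_getinfo($ch).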

Any offers? (Pretty please!)
And maybe we can feed back to Moriarty, PEAR, etc, unless already there and I missed it.

On another worry, if the requesting URI does a 302 to a new URI, which then does 303, it looks an interesting challenge to capture the new URI as expected. I don’t intend to do this at the moment, but if anyone has done that, ...
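One possible approach, sketched under assumptions: switch on CURLOPT_HEADER alongside CURLOPT_FOLLOWLOCATION, so that cURL returns one header block per response in the chain, and then pick the status and Location out of each block. The function name redirectChain is hypothetical, and relative Location values are returned as-is rather than resolved.

```php
<?php
// Hypothetical parser for the header text cURL returns when CURLOPT_HEADER
// and CURLOPT_FOLLOWLOCATION are both on: one header block per response.
function redirectChain($rawHeaders)
{
    $chain = array();
    // Header blocks are separated by a blank line.
    $blocks = preg_split('/\r?\n\r?\n/', trim($rawHeaders));
    foreach ($blocks as $block) {
        // Each block should open with a status line such as "HTTP/1.1 302 Found".
        if (!preg_match('#^HTTP/\S+\s+(\d{3})#', $block, $m)) {
            continue; // not a response header block
        }
        $hop = array('status' => (int) $m[1], 'location' => null);
        if (preg_match('/^Location:\s*(.+?)\s*$/mi', $block, $l)) {
            $hop['location'] = $l[1];
        }
        $chain[] = $hop;
    }
    return $chain;
}
```

The header text itself could be cut from the front of $data using curl_getinfo($ch, CURLINFO_HEADER_SIZE). The hop whose status is 302 then gives the URI to remember, and the 303 hop gives the document it points to.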

Enjoy.
Hugh

PHP much preferred.

Fetching code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $_REQUEST['uri']);
curl_setopt($ch, CURLOPT_USERAGENT, "http://void.rkbexplorer.com/ submission agent 1.0");
// Return the response body from curl_exec() instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Follow redirects to wherever the data actually lives.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept: application/rdf+xml, text/n3, text/rdf+n3, text/turtle, application/x-turtle, application/turtle, text/plain"));
$data = curl_exec($ch);    // false if the transfer failed
$info = curl_getinfo($ch); // includes 'url' (final URL) and 'content_type'
curl_close($ch);

>
> My 2p worth :)
>
> Tom.
>
>
> 2010/1/26 Ian Davis <lists@iandavis.com>:
>> You may find something useful in my Moriarty project:
>>
>> http://code.google.com/p/moriarty/
>>
>> It's geared towards the Talis Platform but there is a lot of code in
>> there that has no dependencies on the platform, e.g.:
>>
>> http://code.google.com/p/moriarty/source/browse/trunk/httprequest.class.php
>>
>> some documentation for that class here:
>>
>> http://code.google.com/p/moriarty/wiki/HttpRequest
>>
>> Ian
>>
>>
>
>
Received on Thursday, 28 January 2010 12:27:28 UTC