Re: RDF.rb and format discovery

From: John Fieber <jrf@ursamaris.org>
Date: Wed, 30 Jun 2010 11:55:15 -0700
CC: "Hellekin O. Wolf" <hellekin@cepheide.org>, "public-rdf-ruby@w3.org" <public-rdf-ruby@w3.org>
Message-ID: <BDDF9503-FC79-4D35-87AE-2ADCC5EA2226@ursamaris.org>
To: Gregg Kellogg <gregg@kellogg-assoc.com>

On Jun 30, 2010, at 10:21 AM, Gregg Kellogg wrote:

> I agree, I think that RDF::Reader.for needs to be somewhat smarter.
> 
> The symbol case is limited to using an element of the classname (e.g. RDF::RDFXML => :rdfxml). It would be nice to specify alternate symbols (e.g., :rdf). Of course, this can be done through for(:extension => "rdf").

A proliferation of ways to say the same thing doesn't help code readability.  It leads to things like...ahem...Perl.

> RDF::Reader.open, when loading a remote resource, should look at the returned MIME type to do a format match, rather than requiring that it be provided explicitly. Arto seems to be of the opinion that this is done via LinkedData, but it seems to be a fair thing to do directly in RDF.rb.

The fact that RDF::Reader.open works on remote resources at all is thanks to rest-openuri which, as Arto aptly pointed out at one point, is as dumb as a brick.  I would counter that RDF::Reader shouldn't be in the business of opening remote resources at all.  When dealing with HTTP resources, HTTP content type negotiation should be your first and last stop for determining what the data is.  You can do that inside RDF::Reader.open, but then how do you solve the dumb-as-a-brick problem?  The alternative is to do it outside, and call RDF::Reader.for with information gleaned from whatever HTTP stack you happen to like.
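
A rough sketch of the "do it outside" option: take the Content-Type header value from whatever HTTP client you use, strip its parameters, and map the media type to a format symbol.  The MEDIA_TYPES table and the format_symbol_for helper are hypothetical names made up for illustration; neither is part of RDF.rb, and the symbols are assumptions about what each reader would register.

```ruby
# Hypothetical lookup table. A real one would presumably be derived from
# the registered format classes rather than hard-coded like this.
MEDIA_TYPES = {
  'application/rdf+xml' => :rdfxml,
  'text/turtle'         => :turtle,
  'text/n3'             => :n3
}.freeze

# Takes a raw Content-Type header value, e.g.
# "application/rdf+xml; charset=utf-8", and returns a format symbol,
# or nil when the media type is missing or unknown.
def format_symbol_for(content_type)
  return nil if content_type.nil?
  # Drop parameters such as charset; media types are case-insensitive.
  media_type = content_type.split(';').first.to_s.strip.downcase
  MEDIA_TYPES[media_type]
end
```

The resulting symbol is the kind of thing you would then hand to RDF::Reader.for, leaving the HTTP handling entirely to the caller.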

> I believe that Format specifications should also provide a RegExp to match against the beginning of the content (I use the first 1000 bytes in RdfContext). This would be used within RDF::Reader.open in case a format couldn't be found by other means. Consider the following:

[...]

> This could instead be found by looping through available Format subclasses and looking for a #match method.  Within RDFXML::Format, I could perform the following:
> 
> class Format < RDF::Format
>   MATCH = %r(<(\w+:)?RDF)
> 
>   content_type     'application/rdf+xml', :extension => :rdf
>   content_encoding 'utf-8'
> 
>   reader { RDF::RDFXML::Reader }
>   writer { RDF::RDFXML::Writer }
> 
>   def match(content)
>     content.to_s.match(MATCH)
>   end
> end
> 
> In RDF::Reader.open, first look for a reader using the options. Failing that, open the file and look for a MIME type. Failing that, loop through Format instances and see if the Format matches the string content.

This seems fine, though there would need to be some "magic detection" opt-in/opt-out, and some care for the case where the content is a really enormous IO stream.  It isn't any fun to have your process size explode to 3GB because of some .to_s lurking someplace you didn't expect it.
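
To make the concern concrete, here is a minimal sketch of opt-in sniffing that reads only a bounded prefix instead of calling .to_s on the whole stream.  The names SNIFF_BYTES, SNIFFERS and sniff_format are hypothetical, the 1000-byte limit is the figure mentioned earlier in the thread, and the sketch assumes a seekable IO such as a File or StringIO.

```ruby
require 'stringio'

SNIFF_BYTES = 1000  # prefix size suggested earlier in the thread

# Hypothetical table of patterns keyed by format symbol. Only the RDF/XML
# pattern comes from the thread; other formats would add their own.
SNIFFERS = {
  :rdfxml => /<(\w+:)?RDF/
}.freeze

# Returns a format symbol, or nil. Does nothing unless the caller opts in,
# and never reads more than SNIFF_BYTES from the stream.
def sniff_format(io, sniff: false)
  return nil unless sniff
  start  = io.pos                      # assumes a seekable IO
  prefix = io.read(SNIFF_BYTES).to_s   # read returns nil at EOF
  io.pos = start                       # put the stream back for the reader
  SNIFFERS.each { |name, pattern| return name if prefix =~ pattern }
  nil
end
```

A socket or pipe cannot be repositioned this way, so a non-seekable stream would need the prefix buffered and replayed to the reader instead.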

A more general problem lurking here is how to handle cases where multiple Format classes match a given criterion, be it MIME type, extension or magic match.
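
One possible way to frame it, purely as a sketch: have the lookup return every candidate, and make the tie-break an explicit, separate step.  FormatEntry, REGISTRY and the numeric priority are all invented here for illustration (including the idea that two formats both claim text/turtle); nothing like them exists in RDF.rb as far as this thread shows.

```ruby
# Hypothetical stand-in for a registered format class.
FormatEntry = Struct.new(:name, :content_types, :priority)

# Two made-up entries that both claim text/turtle, to force a collision.
REGISTRY = [
  FormatEntry.new(:n3,     ['text/n3', 'text/turtle'], 1),
  FormatEntry.new(:turtle, ['text/turtle'],            2)
].freeze

# Every format that claims the given media type, in registration order.
def candidates_for(content_type)
  REGISTRY.select { |format| format.content_types.include?(content_type) }
end

# Tie-break by highest priority; nil when nothing matches.
def resolve(content_type)
  candidates_for(content_type).max_by(&:priority)
end
```

Keeping candidates_for separate means a caller who cares can see the ambiguity and refuse to guess, instead of silently getting whichever format happened to be registered first.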

-john
Received on Wednesday, 30 June 2010 18:55:54 GMT
