Re: RDF.rb and format discovery

On Jun 30, 2010, at 11:55 AM, John Fieber wrote:

> 
> On Jun 30, 2010, at 10:21 AM, Gregg Kellogg wrote:
> 
>> I agree, I think that RDF::Reader.for needs to be somewhat smarter.
>> 
>> 	• The symbol case is limited to using an element of the classname (e.g. RDF::RDFXML => :rdfxml). It would be nice to specify alternate symbols (e.g., :rdf). Of course, this can be done through for(:extension => "rdf").
> 
> A proliferation of ways to say the same thing doesn't help in code readability.  It leads to things like...ahem...perl.

Fair enough, but it's screwed me up a couple of times. Hardly a necessary change.

>> 	• RDF::Reader.open, when loading a remote resource, should look at the returned MIME type to do a format match, rather than requiring that it be provided explicitly. Arto seems to be of the opinion that this is done via LinkedData, but it seems to be a fair thing to do directly in RDF.rb.
> 
> The fact that RDF::Reader.open works on remote resources at all is thanks to rest-openuri which, as Arto aptly pointed out at one point, is as dumb as a brick.  I would counter that RDF::Reader shouldn't be in the business of opening remote resources at all.  When dealing with HTTP resources, HTTP content type negotiation should be your first and last stop for determining what the data is.  You can do that inside RDF::Reader.open, but how do you solve the dumb-as-a-brick problem?  Or do it outside and use RDF::Reader.for with information gleaned from whatever HTTP stack you happen to like.

This may be better resolved through an additional gem, such as LinkedData, which makes Reader.open smarter.
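
For what it's worth, the "do it outside" route could be as small as the sketch below: take the Content-Type header from whatever HTTP stack you like and map it to a format symbol. The table and the method name format_for_content_type are hypothetical, made up for illustration; none of this is RDF.rb API, and the media types and symbols listed are only examples.

```ruby
# Illustrative table only (an assumption, not RDF.rb's registry); a real
# one would be built from whatever Format classes are loaded.
FORMAT_BY_CONTENT_TYPE = {
  'application/rdf+xml' => :rdfxml,
  'text/turtle'         => :n3,
  'text/n3'             => :n3
}

# Hypothetical helper: reduce a raw Content-Type header value to a format
# symbol, ignoring parameters such as "; charset=utf-8" and letter case.
def format_for_content_type(header)
  return nil if header.nil?
  media_type = header.split(';').first.to_s.strip.downcase
  FORMAT_BY_CONTENT_TYPE[media_type]   # nil when the type is unknown
end

symbol = format_for_content_type('application/rdf+xml; charset=utf-8')
puts symbol.inspect
```

The resulting symbol could then be handed to RDF::Reader.for, which keeps the HTTP handling out of the reader entirely.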

>> 	• I believe that Format specifications should also provide a RegExp to match against the beginning of the content (I use the first 1000 bytes in RdfContext). This would be used within RDF::Reader.open in case a format couldn't be found through other means. Consider the following:
> 
> [...]
> 
>> This could instead be found by looping through available Format subclasses and looking for a #match method.  Within RDFXML::Format, I could perform the following:
>> 
>> class Format < RDF::Format
>>  MATCH = %r(<(\w+:)?RDF)
>> 
>>  content_type     'application/rdf+xml', :extension => :rdf
>>  content_encoding 'utf-8'
>> 
>>  reader { RDF::RDFXML::Reader }
>>  writer { RDF::RDFXML::Writer }
>> 
>>  def match(content)
>>    content.to_s.match(MATCH)
>>  end
>> end
>> 
>> In RDF::Reader.open, first look for a reader using the options. Failing that, open the file and look for a MIME type. Failing that, loop through the Format subclasses and see whether one of them matches the start of the content.
> 
> This seems fine, though there would need to be some "magic detection" opt-in/opt-out, and care around when content is a really enormous IO stream.  It isn't any fun to have your process size explode to 3GB because of some .to_s lurking someplace you didn't expect it.

Fine, passing :autodetect => true in the options would do it. The to_s was a simplification; provision should be made for potentially large streams. If the stream responds to :read, for example, it could perform stream.read(1000) followed by stream.rewind to achieve the same thing.
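
Roughly, the opt-in sniffing with a bounded read might look like the sketch below. Everything named here (SNIFFERS, peek, detect_format, the :format option) is hypothetical and only stands in for the Format subclasses and their match methods. The N3 pattern is a guess for illustration, and the rewind assumes a seekable stream that started at position 0, so pipes and sockets would need different handling.

```ruby
require 'stringio'

# Hypothetical stand-ins for Format subclasses: each name is paired with a
# pattern tested against the start of the content.
SNIFFERS = {
  :rdfxml => %r(<(\w+:)?RDF),
  :n3     => /@prefix\s/        # illustrative guess, not a vetted pattern
}

# Return at most `limit` bytes (or characters, for a String) from the start
# of the input without materializing the whole thing.
def peek(input, limit = 1000)
  if input.respond_to?(:read) && input.respond_to?(:rewind)
    sample = input.read(limit) || ''   # read(n) returns nil at EOF
    input.rewind                       # assumes the stream began at 0
    sample
  else
    input.to_s[0, limit]
  end
end

# An explicit :format wins; content sniffing happens only on request.
def detect_format(input, options = {})
  return options[:format] if options[:format]
  return nil unless options[:autodetect]
  sample = peek(input)
  found = SNIFFERS.find { |_name, pattern| sample =~ pattern }
  found && found.first
end

io = StringIO.new('<rdf:RDF>' + ' ' * 5000)
puts detect_format(io, :autodetect => true).inspect
```

Without :autodetect the method returns nil, so the caller can raise or fall back as it sees fit.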

> 
> A more general problem lurking is how to handle cases where multiple Format classes match a given criterion, be it MIME type, extension or magic-match.

Certainly a separate problem, also better solved through some additional logic that provides for more granularity and prioritization. Perhaps outside the bounds of generalized Gems.

> -john
> 

Gregg

Received on Wednesday, 30 June 2010 19:31:36 UTC