Re: RDF.rb and format discovery

I agree; I think RDF::Reader.for needs to be somewhat smarter.


 *   The symbol lookup is limited to an element of the class name (e.g. RDF::RDFXML => :rdfxml). It would be nice to be able to specify alternate symbols (e.g. :rdf). Of course, this can already be done through for(:extension => "rdf").
 *   RDF::Reader.open, when loading a remote resource, should look at the returned Content-Type (MIME type) to find a matching format, rather than requiring the format to be provided explicitly. Arto seems to be of the opinion that this is done via LinkedData, but it seems a fair thing to do directly in RDF.rb.
 *   I believe that Format specifications should also provide a Regexp to match against the beginning of the content (I use the first 1000 bytes in RdfContext). This would be used within RDF::Reader.open in case a format couldn't be found through other means. Consider the following:

# Heuristically detect the input stream
def detect_format(stream)
  # Sample the first 1000 bytes if the stream can be rewound; otherwise use its string form
  if stream.respond_to?(:rewind)
    stream.rewind
    string = stream.read(1000)
    stream.rewind
  else
    string = stream.to_s
  end
  case string
  when /<(\w+:)?RDF/  then :rdfxml
  when /<(\w+:)?html/i then :rdfa
  when /@prefix/i     then :n3
  else                     :ntriples
  end
end

This could instead be done by looping through the available Format subclasses and calling a #match method on each. Within RDFXML::Format, I could do the following:

class Format < RDF::Format
  MATCH = %r(<(\w+:)?RDF)

  content_type     'application/rdf+xml', :extension => :rdf
  content_encoding 'utf-8'

  reader { RDF::RDFXML::Reader }
  writer { RDF::RDFXML::Writer }

  def match(content)
    content.to_s.match(MATCH)
  end
end
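
A rough sketch of how that loop might look, kept independent of RDF.rb so it runs on its own. SketchFormat, SketchRDFXML, SketchN3 and the detect method are hypothetical names made up for illustration; they are not RDF.rb's API, and the match method is written as a class method here as an assumption.

```ruby
# Hypothetical stand-in for RDF::Format; none of these names are RDF.rb's real API.
class SketchFormat
  @formats = []

  class << self
    attr_reader :formats

    # Record every subclass so that detect can loop over them later.
    def inherited(subclass)
      super
      SketchFormat.formats << subclass
    end

    # Default: a format that declares no pattern never matches.
    def match(_content)
      false
    end

    # Return the first registered format whose match method accepts the sample.
    def detect(sample)
      SketchFormat.formats.find { |format| format.match(sample.to_s) }
    end
  end
end

class SketchRDFXML < SketchFormat
  MATCH = %r{<(\w+:)?RDF}

  def self.match(content)
    content.match?(MATCH)
  end
end

class SketchN3 < SketchFormat
  MATCH = /@prefix/i

  def self.match(content)
    content.match?(MATCH)
  end
end
```

In this sketch the formats are tried in the order they were defined, so definition order doubles as priority when more than one pattern matches the same sample.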

In RDF::Reader.open, first look for a reader using the options. Failing that, open the file and look at its MIME type. Failing that, loop through the available Formats and see whether one of them matches the string content.
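
The fallback order above could be sketched roughly as follows. This is only an illustration under assumptions: choose_format, BY_CONTENT_TYPE and BY_CONTENT are hypothetical names, and the lookup tables are hard-coded stand-ins for whatever the registered Formats would actually report.

```ruby
# Hypothetical lookup tables; a real implementation would ask the registered Formats.
BY_CONTENT_TYPE = {
  'application/rdf+xml' => :rdfxml,
  'text/turtle'         => :n3,
  'text/n3'             => :n3
}.freeze

BY_CONTENT = {
  rdfxml: %r{<(\w+:)?RDF},
  rdfa:   /<(\w+:)?html/i,
  n3:     /@prefix/i
}.freeze

# Pick a format symbol: explicit option first, then MIME type, then content sniffing.
# Returns nil when none of the three steps finds anything.
def choose_format(format: nil, content_type: nil, sample: nil)
  return format if format

  # Drop parameters such as "; charset=utf-8" before the lookup.
  mime = content_type.to_s.split(';').first.to_s.strip.downcase
  return BY_CONTENT_TYPE[mime] if BY_CONTENT_TYPE.key?(mime)

  found = BY_CONTENT.find { |_name, pattern| sample.to_s.match?(pattern) }
  found && found.first
end
```

For example, choose_format(content_type: 'text/plain', sample: '<rdf:RDF>') falls through the first two steps and lands on the content patterns, while an explicit format: always wins.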

In most cases, this will do what the user expects.

Gregg

On Jun 30, 2010, at 2:03 AM, Hellekin O. Wolf wrote:

Hi,

I was looking into supporting more formats for FOAFSSL-ruby, including
the recently released rdf-rdfa and rdf-n3 gems.

But what I found looks like hell:

- there doesn't seem to be a reliable way of discovering the FOAF
file format,
- different formats will fail with different errors,
- when no format is given, RDF::Graph won't detect the right one (and
gives unpredictable results)

The original way of doing it in FOAFSSL-ruby is to try it, and
fall back to a different format on failure.  It works, but it's so ugly
my grand-mother died.  When I tried to add new formats, I had to find
another solution.

I went for the following (ugly) algorithm (now, my grand-mother is
already dead):

1. lookup the file extension in the given WebID
2. lookup the Content-Type after an HTTP HEAD to the WebID
3. GET the file and identify it from its contents
4. fail if the format isn't known by now.

That gives a pretty good image of a house of cards, if any.

Any idea how to deal properly with auto-discovery of formats?

==
hk

Received on Wednesday, 30 June 2010 17:22:12 UTC