content-type agnosticism and apache-era motif from carmen on 2011-03-13 (public-rdf-ruby@w3.org from March 2011)

From: carmen <_@whats-your.name>
Date: Sun, 13 Mar 2011 01:56:48 +0000
To: public-rdf-ruby@w3.org
Message-ID: <20110313015648.GA7349@11.Belkin>
an extension of Apache tradition really - point to a directory full of files, and serve it up. brutal simplicity got the web pretty far. eventually intra-site network effects attracted users, and most content-creation came to mean entering 140 chars or less into a <textarea> rather than creating HTML files. that shifted the axis to large hash-table/SQL stores keeping their data in monolithic blobs on the filesystem, impervious to interaction/editing by anything but their own API

call me oldschool-futurist but i do long for the era where i could use whatever tool i want to edit content, not a "content management system" that actually has only one way of doing things and one way of storing things, with cargo-cult engineers at the ready to make sure the SQL daemon is running, that it has the right database created with the right permissions, and that the right PHP engine modules are configured to connect to it with the matching permissions. all so i can copy/paste html snippets into a <textarea> in its JS-requiring interface, which will get chewed up and spit out later in a different "some-tags-missing" output, adorned with all sorts of annoying sidebars and toolbars i'll never click on but which got generated anyway. and then there's the caching infrastructure to make sure this doesnt take 3-5 seconds on each request.. well i could just outsource this whole process to some enterprise VC-backed "CMS as a service" startup and just accept it as normal practice


thanks to drobilla and swh and their LV2 efforts, i realized triples and RDF were a nice way to plumb data without writing Model classes over and over, which is what Rails thought i wanted to do, and which a variety of rails conceptual-clones like Merb & Ramaze copied without even thinking about it.

so i began with a Model class in Rails. i even tried using ActiveRecord to persist RDF on SQL. querying was remarkably slow and involved a ton of roundtrips. it was more expressive than SPARQL, since you had a turing-complete lang on your hands ahead of calls into AR, but creating a graph DB as an AR Model was definitely out of their design scope in a world where adding a single property meant "migrations", eg rubyscripts which modify the SQL table structure.

at this point, circa 2005, the Ruby bindings to Redland already existed, so kudos to dajobe for being ahead of me even in trying to make something halfway decent for Ruby. right away i realized the mess that was the explicit-finalizer GC, with its frequent double-free segfaults and/or leaks, and its SPARQL engine was even more abstracted from the data stores than AR since it supported so many underlying persistence options. a few sniffs of the SQL connection confirmed the insanity ensuing in latency/cpu/netbandwidth-chewing trivial SPARQL queries, combined with the fact that even basic things like.. find all the resources matching res and res#frag1 and res#frag2 meant a PCRE filter on all the URIs in the system, or a wrapper around insert that curated doc-graph membership triples ahead of everything else.

back in the day i could just glob doc*.. or just cat the doc into RAM, the way presbrey's datawiki seems to still do. maybe it wasnt a bad idea? that whole apache/httpd files on a FS thing

i'll (not) bore you with the details of getting Rails out of the way, accumulating a bunch of patches to Camping to make it more web-arch/conneg-ey compliant, and the switch to Mongrel handlers, then to RyDahl's Ebb/Flow proto-nodeJS-in-ruby handlers, and eventually Rack (i could still argue about its bizarre response typing - an Object which must respond to #each, not a file or plain string, or even an Object responding to #read, which could accept a size argument and be more obvious in naming about what is going on). but besides this semantic quibble Rack is still what i use: a nice right-on-HTTP abstraction without stuff i dont need, like "Routes" (an artifact of breaking up the URL into an ORM table-name and row ID to be fed onwards to the query).

i tossed out these 3 years of git history since it's not relevant, other than proving that what i use now is not some sort of gospel (lets drink 3 cups of Starbucks and sit down and implement the RDF Abstract-Model as ruby-classes) but the product of an evolution towards something that wouldnt drive me crazy, and is efficient for my pedestrian use cases[1]

so taking the Apache concept, how would you extend it for linked-data? all it does out of the box is map filename extensions to mime-types, set appropriate headers and hand a file to the client over the socket. this is problem #1. browser tries saving everything to disk that isnt already HTML. today's most common browsers save .rdf files as HTML, save for those with Tabulator installed, and nobody uses Firefox anymore since it's so slow to startup (4.0rc takes 20 seconds cold on my Atom netbook) and i personally prefer writing Ruby to JS, plus i browse even my own sites with JS disabled and routinely read my morning mail in a links session in a screen window. so i do want a bit more than "send JSON files over the wire to some JS GUI". naturally Ruby, with its plethora of largely rails-community borne libraries for HTML generation in any manner under the sun, was a good fit. personally i'm not crazy about Ruby, especially nonsense like blocks/procs/methods/lambdas when it should just have Functions, and a real typechecker, with an option for compile-time <http://blog.ezyang.com/2011/03/type-tech-tree/> type-sanity, but it must be doing something right as after reading TaPL, PFDS, HSOE, RWH, TPPL i'm still using ruby since it fits like a glove and gets the job done

to be backwards-compatible with Apache, files are still served, but there are now a number of ways to get around this

- include a file in a glob, implying more than one file

- dont request the file's mime-type or */*

- add view or format to querystring

eg datafile.rdf?view is enough or myst.owl?format=text/html

- set the server to 'cook' the file. in which case add ?raw to get original

and its pretty straightforward from there
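
the list above can be read as a small decision rule. what follows is only a sketch of one possible reading, not the actual server code: the method name serve_raw?, its keyword arguments and the order in which the rules are checked are all assumptions made for illustration.

```ruby
# hypothetical helper: should the file be handed over untouched (Apache-style)?
# the precedence of the rules below is an assumption, not taken from the server.
def serve_raw?(file_mime:, accept:, query:, glob: false, cook: false)
  return true  if query.key?('raw')      # ?raw asks for the original file
  return false if cook                   # server is set to 'cook' files
  return false if glob                   # a glob implies more than one file
  return false if query.key?('view') || query.key?('format')

  # strip q-values etc. from the Accept header and compare bare types
  accepted = accept.split(',').map { |t| t.split(';').first.strip }
  accepted.include?(file_mime) || accepted.include?('*/*')
end
```

so a plain browser request with Accept: */* still gets the file itself, while datafile.rdf?view or an Accept header naming only text/html would fall through to the graph pipeline.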

scope of request expands to resourceSet. 

if a comma key exists in the qs, we execute a _,p,o pattern-match on filesystem indices and the results become the resourceset. a range offset and limit and direction (ascending/descending) are additional options

barring that, we try a path query: if a count exists, a depth-first range of the FS trie, simply on the filenames themselves, is retrieved. eg going to /mail/2011/03/11?c=3 will grab 3 files, add next/prev pagination resources to the request model, and eventually if we click 'previous' enough we'd end up at a previous day's files
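
a rough sketch of that path query, assuming nothing about the real implementation: walk and page are made-up names, and the choice to let a page spill over into the next directory when the current one runs out is a guess based on the "previous day's files" remark.

```ruby
# depth-first listing of regular files under dir, siblings sorted by name
def walk(dir)
  Dir.children(dir).sort.flat_map do |name|
    path = File.join(dir, name)
    File.directory?(path) ? walk(path) : [path]
  end
end

# hypothetical pager: `count` files starting at the first file under `start`,
# plus the neighbouring files that prev/next links could point at
def page(root, start, count)
  files  = walk(root)
  prefix = File.join(root, start)
  first  = files.index { |f| f.start_with?(prefix) }
  return { files: [], prev: nil, next: nil } unless first

  { files: files[first, count],
    prev:  first.zero? ? nil : files[first - 1],
    next:  files[first + count] }
end
```

walking the whole tree on every request is obviously wasteful; it just keeps the sketch short.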

if we still havent had any special overrides, we look for the glob of the URI, and for common extensions off the base URI (.nt, .rdf, .html etc)
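
as an illustration of that last fallback, here is a minimal sketch. resource_set and EXTENSIONS are hypothetical names, the extension list is just the three named above, and a real server would want to sanitize the path before handing it to a glob.

```ruby
# the three extensions named in the text; the real list is longer ("etc")
EXTENSIONS = %w[.nt .rdf .html].freeze

# hypothetical: files matched by globbing the request path, plus the
# base path with each common extension appended
def resource_set(docroot, path)
  base     = File.join(docroot, path)
  globbed  = Dir.glob(base).select { |f| File.file?(f) }
  extended = EXTENSIONS.map { |ext| base + ext }.select { |f| File.file?(f) }
  (globbed + extended).uniq.sort
end
```

with doc.nt and doc.rdf on disk, a request for /doc would pick up both, and /doc* would also catch doc2.html.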

once we have a resourceSet, it's expanded into a graph with triplizers. the graph is a Hash, for optimal simplicity and a vast standard library of manipulation functions
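
to make the "graph is a Hash" point concrete, here is a small sketch. the nesting { subject => { predicate => [objects] } } is an assumed shape chosen for illustration, and add_triple is a made-up helper, not part of the actual code.

```ruby
# hypothetical shape: { subject => { predicate => [objects] } }
def add_triple(graph, s, p, o)
  ((graph[s] ||= {})[p] ||= []) << o
  graph
end

graph = {}
add_triple(graph, '/doc',       'dc:title',   'hello')
add_triple(graph, '/doc',       'dc:creator', 'carmen')
add_triple(graph, '/doc#frag1', 'dc:title',   'a fragment')

# plain Hash/Enumerable calls do the querying:
# a document plus its fragments, no PCRE filter over a store needed
doc_and_frags = graph.select { |s, _| s == '/doc' || s.start_with?('/doc#') }

# every title anywhere in the graph
titles = graph.flat_map { |_, po| po.fetch('dc:title', []) }
```

select, flat_map, merge, sort_by and friends all come for free, which is presumably the appeal over a dedicated graph class.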

naturally, not all resources correspond to local files. but since it is clear when a file has changed, you get 3-tier caching if you do use files: a file is triplized once per change, a request-time model is generated once, and any rendering of this model is generated only once
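
a sketch of just the first tier, under the assumption that "changed" means "mtime differs from last time". TripleCache is a hypothetical class for illustration; the model and rendering tiers could be keyed the same way on the mtimes of every file in the resourceSet.

```ruby
# hypothetical cache: a file is re-triplized only when its mtime changes
class TripleCache
  def initialize(&triplizer)
    @triplizer = triplizer   # block taking a path, returning triples
    @store     = {}          # path => [mtime, triples]
  end

  def triples(path)
    mtime  = File.mtime(path)
    cached = @store[path]
    return cached[1] if cached && cached[0] == mtime   # unchanged: reuse

    (@store[path] = [mtime, @triplizer.call(path)])[1] # changed: redo
  end
end
```

usage would look like cache = TripleCache.new { |path| my_triplizer(path) } followed by cache.triples(path) on each request, where my_triplizer stands in for whatever triplizer applies.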

working within this convention, theres a lot of other things you could do. perhaps define a triplizer for .sparql name-extension, put the contents of the query in it and edit it with syntax highlighting in emacs/vim, and have it rerun whenever it is touched (or edited)

sometimes you really want to break out of this mindset, and define a custom graph but still get the conneg infrastructure and a rendering in the requested format

ive shipped a whole variety of triplizers and customizations by default, visible in this screenshot:

http://i574.photobucket.com/albums/ss187/ix9/hyper/IMG_0015.png

their functionality is discussed in separate posts.

a triplizer can be lifted to the domain of graph generation, from (file -> tripleStream) to (tripleStream -> graph), by graphFromStream :triplizerFn
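
one way such a lifting could work is sketched below. this is a guess at the idea, not the actual graphFromStream: the Resource class, the snake_case macro graph_from_stream, the generated method name words_graph and the toy words triplizer are all invented for the example.

```ruby
class Resource
  attr_reader :uri

  def initialize(uri, body)
    @uri  = uri
    @body = body
  end

  # class-level macro: given the name of a triplizer method that yields
  # (s, p, o), define "<name>_graph", which folds that stream into a Hash
  def self.graph_from_stream(name)
    define_method("#{name}_graph") do |graph = {}|
      send(name) { |s, p, o| ((graph[s] ||= {})[p] ||= []) << o }
      graph
    end
  end

  # a toy triplizer: one triple per whitespace-separated word
  def words
    @body.split.each { |w| yield uri, 'word', w }
  end
  graph_from_stream :words
end
```

the triplizer itself stays a dumb generator of triples, and the macro supplies the accumulation, so Resource.new('/doc', 'a b').words_graph comes back as a Hash keyed on '/doc'.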

to pick a triplizer, first we try extensions; if nothing is suggested, we drop down to file(1). if it reports binary or text/plain and we know it is something else, it can be specified in the querystring.

11 ~ file ./.config/chromium/Default/Bookmarks
./.config/chromium/Default/Bookmarks: ASCII text, with very long lines
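
the lookup order described above might be sketched like this. EXT_TYPES, sniff and mime_for are hypothetical names, the extension table is a tiny made-up subset, and the override here is a plain MIME string rather than whatever the real querystring handling does. the only external dependency is file(1) with its --brief and --mime-type flags.

```ruby
require 'open3'

# made-up subset of an extension table, for illustration only
EXT_TYPES = {
  '.rdf'  => 'application/rdf+xml',
  '.html' => 'text/html',
  '.json' => 'application/json'
}.freeze

# ask file(1) for a MIME type; nil if it fails or is not installed
def sniff(path)
  out, status = Open3.capture2('file', '--brief', '--mime-type', path)
  status.success? ? out.strip : nil
rescue Errno::ENOENT
  nil
end

# explicit override first, then extension, then file(1), then give up
def mime_for(path, override: nil, sniffer: method(:sniff))
  override ||
    EXT_TYPES[File.extname(path)] ||
    sniffer.call(path) ||
    'application/octet-stream'
end
```

the sniffer is passed in as a parameter so the fallback can be swapped out or stubbed where file(1) is missing.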

i know it's JSON so i'll add q=json. this invokes the graph triplizer 

  # triplizer: yields a single "triple" whose object is the whole parsed JSON value
  def json
    yield uri, '/application/json', (JSON.parse read) if e
  end
  graphFromStream :json

very simple. i'm not going to attempt to glean semantic structure from it. and we can pass an arbitrary value right on through in object-position of "triple" like that

http://m/Bookmarks?q=json&view=application/json&sel=roots.bookmark_bar.children

we've explicitly overridden everything from the view to the triplizer, and giving the view an argument (sel) zooms into the data we actually wanted
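
the sel argument reads like a dotted path into the parsed JSON. a minimal sketch of that kind of zoom, assuming Hash-only traversal; select_path is a made-up name and the tiny document below is invented sample data, not a real Bookmarks file.

```ruby
require 'json'

# hypothetical: follow a dotted key path through nested Hashes,
# returning nil as soon as a step is missing or not a Hash
def select_path(value, sel)
  sel.split('.').reduce(value) do |node, key|
    break nil unless node.is_a?(Hash)
    node[key]
  end
end

# invented sample with the same nesting as the sel in the URL above
doc = JSON.parse('{"roots":{"bookmark_bar":{"children":[{"name":"w3"}]}}}')
children = select_path(doc, 'roots.bookmark_bar.children')
```

here children ends up as the Array under roots.bookmark_bar, and a path with a missing key simply gives nil.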

normally views and triplizers are chosen via the mime types

the default view function wont touch a string, so you could make a view called 'tilt' (i vaguely recall a github project called tilt that called into damn near every ruby template library ever and rendered it for you) and make that the default

most of my chat about views is on the public-semweb-ui list.. so i wont go into too much detail here

after graph generation and before rendering, named 'filters' can be specified. the default set has some sort/match type things to clear up a few corner cases where i might have been tempted into wanting SPARQL. you can define custom filters, which can just mutate the request model however they see fit
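
a sketch of what a named-filter registry could look like, with every name invented for illustration: FILTERS, filter and apply_filters are hypothetical, the 'match' filter is only a stand-in for the real defaults, and the model is assumed to be the same { subject => { predicate => [objects] } } Hash used for the graph.

```ruby
FILTERS = {}

# register a named filter; the block receives the model and one argument
def filter(name, &fn)
  FILTERS[name] = fn
end

# stand-in for a 'match' filter: keep only subjects containing the argument
filter('match') { |model, arg| model.keep_if { |s, _| s.include?(arg) } }

# a custom filter: drop every predicate except the one named
filter('only') { |model, arg| model.each_value { |po| po.keep_if { |p, _| p == arg } } }

# run the requested filters in order, mutating the model in place
def apply_filters(model, specs)
  specs.each { |name, arg| FILTERS.fetch(name).call(model, arg) }
  model
end
```

a request asking for match and then only would be applied as apply_filters(model, [['match', '/mail'], ['only', 'dc:title']]).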

bypassing all of whats been mentioned so far can be done. the main benefit over doing so in Rack middleware is getting a resource class and utility functions attached to the request environment that are used everywhere else, so code can be moved around (and so you dont have to fiddle with Rack middleware <http://tenderlovemaking.com/2011/03/03/rack-api-is-awkward/>). two ways:

on a particular URI:

/search/GET for example hooks into Groonga (a plaintext search library + col-store, with ruby bindings), does the search, returns a resourceset, and a normal request ensues from there

or on any URI:

?y=http:404 (yes you can force any URL to 404, not sure why you need to)

or basic path-arithmetic redirects (current day/month's path)

curl -I http://n/m/?y=day
HTTP/1.1 303 See Other
Location: /m/2011/03/13/*?

to be bookmarked on a tablet or reader device..
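
the day redirect above is simple enough to sketch. dated_location and redirect are hypothetical names, and the 'month' format is a guess at what "current day/month's path" means; only the day form is confirmed by the curl output.

```ruby
# hypothetical: build the redirect target for ?y=day or ?y=month
def dated_location(base, unit, now = Time.now)
  fmt = { 'day' => '%Y/%m/%d', 'month' => '%Y/%m' }.fetch(unit)
  "#{base}#{now.strftime(fmt)}/*?"
end

# a bare Rack-style response triple: status, headers, body responding to #each
def redirect(location)
  [303, { 'Location' => location }, []]
end
```

so a bookmark pointing at /m/?y=day keeps landing on whatever today's directory is.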

[1] log-filterer http://blog.whats-your.name/post/2011/02/14/sifting-thru-your-IRC-logs-w/-examine
  exhibit-styles http://lists.w3.org/Archives/Public/public-semweb-ui/2011Feb/0001.html
   video library http://blog.whats-your.name/post/2011/02/06/AV
   schema search http://blog.whats-your.name/post/2011/01/05/finding-linked-data-property-names
    site archive http://blog.whats-your.name/public/bb.html
   audio library http://blog.whats-your.name/post/2011/02/03/find%281%29%2C-a-modern-and-fast-query-engine
    list archive http://lists.w3.org/Archives/Public/semantic-web/2010Dec/0119.html
       mail/news http://lists.w3.org/Archives/Public/public-rdf-ruby/2011Mar/0013.html

an emphasis on just files means you should be able to figure out synch - yeah rsync the crap out of everything, or use Git/Hg/Darcs or maybe Ceph

i do believe filesystems are obsolete, but thats another email, probably on a different list (wheres public-rdf-lang or public-rdf-haskell??)
Received on Sunday, 13 March 2011 01:57:31 UTC