Re: triple Indexing for Apps like Cimba from carmen on 2015-02-03 (public-rww@w3.org from February 2015)

From: carmen <_@whats-your.name>
Date: Tue, 3 Feb 2015 21:08:31 +0000
To: public-rww@w3.org
Message-ID: <20150203210831.GA32717@x.clearwire-wmx.net>

> you're talking about sorting things to put them into an index

i'm talking about indexing strategies using basic tools like filenames and lists of URIs - alternatives to SPARQL in scenarios where your mobile phone or 2009-era notebook PC only has 512 or 1024M of RAM and the Java HTTP+MVC+ORM+SPARQL-engine stack wants 2GB and 2 minutes of CPU-time just to launch and JIT itself.. maybe even a phone that came with busybox (minimalist shell utils) and could fairly easily run a Python script and get things done w/ basic fs std-lib functions, w/o heaping on tons of dependencies which might be hard to shoehorn into a mobile-app packager or github gist, making the whole system more Rube-Goldberg and requiring a package-manager, bindmounts, chroots + containers, and additional auth configuration for 3rd party DBs. so, working within constraints. it looks like maybe UNIX was designed with dumb indexing/sorting use-cases in mind.. just change "one word per line" to "one URI per line":

https://www.youtube.com/watch?v=tc4ROCJYbm0&t=357
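
a rough Python sketch of the "one URI per line" idea, roughly what `sort | uniq -c | sort -rn` would give you over a uri-list. the function name `uri_counts` is made up for illustration, and it assumes the input is plain text with one URI per line:

```python
from collections import Counter

def uri_counts(lines):
    """Count how often each URI occurs in an iterable of text lines.

    Returns (uri, count) pairs, most frequent first, ties broken by URI.
    Blank lines are skipped.
    """
    uris = (line.strip() for line in lines)
    counts = Counter(u for u in uris if u)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

any open file works as `lines`, e.g. `uri_counts(open("some.uris"))`, where the filename is just a placeholder.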

am interested in hearing if you come up with lightweight approaches like one SQLite DB per "named graph" or one appendable uri-list file per predicateURI.. i probably put *too* much on the filesystem, which makes disk-usage hard to assess, as dirs/links sometimes only eat the 'metadata' allocation that shows up in df but not in du for the particular subtree..
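
the "one appendable uri-list file per predicateURI" variant could look something like this. it's only a sketch: the `.uris` suffix, the percent-encoded filename and all three function names are assumptions, not anything that exists yet:

```python
import os
from urllib.parse import quote

def predicate_file(index_dir, predicate):
    # percent-encode the whole predicate URI so it becomes one flat filename
    return os.path.join(index_dir, quote(predicate, safe="") + ".uris")

def append_subject(index_dir, predicate, subject):
    """Append one subject URI to the list kept for this predicate."""
    os.makedirs(index_dir, exist_ok=True)
    with open(predicate_file(index_dir, predicate), "a", encoding="utf-8") as f:
        f.write(subject + "\n")

def subjects_with(index_dir, predicate):
    """Return the sorted, de-duplicated subjects recorded for a predicate."""
    try:
        with open(predicate_file(index_dir, predicate), encoding="utf-8") as f:
            return sorted({line.strip() for line in f if line.strip()})
    except FileNotFoundError:
        return []
```

appending never rewrites the file, so duplicates pile up and get squeezed out at read time. that trades some disk for very cheap writes.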

(s _ _) (s p _) (s _ o) (s p o) can be matched nearly-instantly by reading s's Turtle file into RAM
(_ p o) (_ _ o) (_ p _) need some help
(_ _ _) is select * so knock yourself out
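
to make the split concrete, a small sketch. it assumes the subject's file has already been parsed into (s, p, o) tuples (the Turtle parsing itself is left out), and the names `match` and `needs_index` are hypothetical:

```python
def match(triples, s=None, p=None, o=None):
    """Filter in-memory (s, p, o) tuples; None plays the role of '_'."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def needs_index(s, p, o):
    """True for (_ p o), (_ _ o), (_ p _): no subject to tell us which file
    to read, but a predicate or object to narrow things down.
    (_ _ _) is a full scan either way, so no index helps there."""
    return s is None and (p is not None or o is not None)
```

with a bound subject you only ever load one small file, so the linear scan in `match` is fine. the other patterns are where a filesystem index earns its keep.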

/index/ p / o / s  seems a common layout here, for "find resources with this property and/or value"

shortening p helps, there are qnames/prefixes for that, no reason to create dirs 5 deep:
/index/sioc:reply_of/msg/73f/54CD1B02.70304@whonix.org/msg/1bb/54CEC199.9080206@riseup.net
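
a sketch of how a path like that could be derived. the `PREFIXES` table and both function names are assumptions for illustration, and it assumes o and s are already local paths like /msg/...:

```python
import posixpath

# prefix table is an assumption: only the SIOC namespace is listed here
PREFIXES = {"http://rdfs.org/sioc/ns#": "sioc"}

def qname(uri):
    """Shorten a predicate URI to prefix:local when the namespace is known."""
    for ns, prefix in PREFIXES.items():
        if uri.startswith(ns):
            return prefix + ":" + uri[len(ns):]
    return uri  # unknown namespace: left as-is, which means deeper dirs

def index_path(p, o, s, root="/index"):
    """Build root / p / o / s from a predicate URI and two local paths."""
    return posixpath.join(root, qname(p), o.lstrip("/"), s.lstrip("/"))
```

calling `index_path("http://rdfs.org/sioc/ns#reply_of", "/msg/73f/54CD1B02.70304@whonix.org", "/msg/1bb/54CEC199.9080206@riseup.net")` gives back the example path above.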

you don't get much of a choice w/ the messageID. usually you want a range of results sorted by date or something for pagination, which is where hardlinks to new paths come in..
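
roughly what the hardlink trick could look like. the YYYY/MM/DD/HHMMSS.name layout and the names `link_by_date` and `page` are assumptions, and hardlinks only work within a single filesystem:

```python
import os
import time

def link_by_date(src, root, mtime=None):
    """Hardlink src under root/YYYY/MM/DD/HHMMSS.<name>, using UTC mtime."""
    t = time.gmtime(os.path.getmtime(src) if mtime is None else mtime)
    day_dir = os.path.join(root, time.strftime("%Y/%m/%d", t))
    os.makedirs(day_dir, exist_ok=True)
    dest = os.path.join(day_dir, time.strftime("%H%M%S", t) + "." + os.path.basename(src))
    if not os.path.exists(dest):
        os.link(src, dest)  # same inode, second name: no data is copied
    return dest

def page(root, count, newest_first=True):
    """Return up to count linked paths in date order.

    Zero-padded date components make plain string sorting chronological.
    """
    paths = []
    for d, _, files in os.walk(root):
        paths.extend(os.path.join(d, f) for f in files)
    paths.sort(reverse=newest_first)
    return paths[:count]
```

walking the whole tree for every page is the lazy version. a real pager would probably descend only into the newest year/month/day dirs until it has enough results.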

typically derived index-paths are created on a "first seen" basis for a new file, and updated when its timestamp changes. so some util that has no knowledge of RDF is just dumping emails or blogposts or CSV files somewhere..
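
a minimal sketch of the "first seen, or timestamp changed" check. keeping the state in a JSON file and the name `changed_files` are assumptions, and the state file should live outside the scanned dir so it doesn't report itself:

```python
import json
import os

def changed_files(data_dir, state_file):
    """Return files under data_dir that are new or whose mtime changed
    since the last call, and record the mtimes seen in state_file."""
    try:
        with open(state_file, encoding="utf-8") as f:
            seen = json.load(f)
    except FileNotFoundError:
        seen = {}  # first run: every file counts as "first seen"
    changed = []
    for d, _, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(d, name)
            mtime = os.stat(path).st_mtime_ns
            if seen.get(path) != mtime:
                changed.append(path)
                seen[path] = mtime
    with open(state_file, "w", encoding="utf-8") as f:
        json.dump(seen, f)
    return sorted(changed)
```

whatever comes back from `changed_files` is what gets (re)indexed. the util dumping the files never needs to know the index exists.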

Received on Tuesday, 3 February 2015 21:09:18 UTC