
facet filter n

From: carmen <_@whats-your.name>
Date: Tue, 8 Feb 2011 12:07:32 +0000
To: public-semweb-ui@w3.org
Message-ID: <20110208120732.GA25215@11.Belkin>

_babel
Java exceptions, server down, and currently this:
{
	"items" : [
		{
			"TDATE" :         "0:00:00",
			"MOD" :           "D",
			"STATION" :       [
				"EWTN (WEWN)",
				"WEWN",
				"WEWN EWTN Catholic R.",
				"Radio Free Asia",
				"CNR1 Jammer",
				"IBB",
				"R.FARDA",
				"Radio Farda"
			],

(wrong, as there's only one STATION per row. i suppose it's open source and i could install Babel locally and try to figure it out)

_google-refine
latest snapshot, unreported parse errors, visible as entire lines or even the rest of the document appearing in single facet fieldnames.. 

wrote a TSV parser that works on the xls2txt(http://wizard.ae.krakow.pl/%7Ejb/xls2txt/) output of an XLS file from hfskeds(http://www.hfskeds.com/skeds/)

 def csv
    # each line split into cells; first row is the header
    # (.do is a project-local helper that appears to yield its receiver to the block)
    open(node).readlines.map{|l|l.chomp.split(/,/)}.do{|t|
      t[0].do{|x|                          # x: header row (fieldnames)
        t[1..-1].each_with_index{|r,ow|    # remaining rows
          r.each_with_index{|v,i|          # each cell
            yield '#r'+ow.to_s, x[i], v    # (subject, predicate, object) triple
          }}}}
  end

this is turned into an in-memory RDF/JSON graph:

  # fromStream :: Graph -> tripleSource -> Graph
  def fromStream m,*i
    send(*i) do |s,p,o|     # invoke the triple source, eg fromStream({}, :csv)
      m[s] ||= {'uri'=>s}   # new subject: seed its property hash
      m[s][p] ||= []
      m[s][p].push o        # predicates are multi-valued
    end; m
  end
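to make the accumulation concrete, here is a self-contained sketch of the same pattern (toy triple source; the method names and sample values are illustrative, not the element code):

```ruby
# a triple source yields (subject, predicate, object); the graph maps
# subject -> {'uri' => s, predicate => [objects...]}
def each_triple
  yield '#r0', 'STATION', 'WEWN'
  yield '#r0', 'LANGUAGE', 'English'
  yield '#r1', 'STATION', 'Radio Farda'
end

def from_stream m
  each_triple do |s,p,o|
    m[s] ||= {'uri'=>s}
    (m[s][p] ||= []).push o
  end
  m
end

g = from_stream({})
# g['#r0'] #=> {'uri'=>'#r0', 'STATION'=>['WEWN'], 'LANGUAGE'=>['English']}
```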

and finally to Exhibit JSON via

 fn Render+'application/json+exhibit',->d,e{
  fields=e.q['f'].do{|f|f.split /,/}
  {items: d.values.map{|r|
      r.keys.-(['uri']).map{|k|
        f=k.frag.do{|f|(f.gsub /\W/,'').downcase} # alphanumeric id restriction 
        if !fields || (fields.member? f)
          r[f]=r[k][0].to_s # rename fieldnames, unwrap value
          r.delete k unless f==k # cleanup unless id same as before
        else
          r.delete k
        end}
      r[:label]=r.delete 'uri' # requires label only
      r
    }}.to_json}


the reason we massage the fieldnames is elucidated in this message

http://www.mail-archive.com/general@simile.mit.edu/msg01052.html

all of this is integrated into http://gitorious.org/element: drop a .tsv file in a directory, add ?view=exhibit to the querystring, get an exhibit


that brought me to the next problem: the browser freezing up for 90 seconds while Exhibit did something - DOM generation and facet statistics, i guess

so i forget exactly what happened next, but i was already using dynamic stylesheets in a mail app (each replied-to line wrapped in class=quote, and a span.quote {display:none} rule added to the document to hide them). it was pretty obvious this would be faster than [].forEach.call(document.getElementsByClassName('quote'), function(e){e.style.display='none'})

decided to take the same approach to faceted filtering in the browser. i have no idea if my choices are the fastest, but they work, and i will probably do further experiments (eg, situating common facet values as innermost or outermost, ala the SPARQL trick of matching the smallest pattern first)
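as an illustration of the stylesheet approach, here is a sketch of how the filter rule could be generated (my reconstruction, not the element code; the per-value class names and the AND-across-facets / OR-within-a-facet semantics are assumptions):

```ruby
# sketch: given selected facet values, emit one CSS rule that reveals matching items.
# assumes each item div carries one class per facet value, eg
# class="LANGUAGE_English STATION_BBC", and a base rule hides everything by default.
def facet_rule selected # eg {'LANGUAGE'=>['English'], 'STATION'=>['BBC','VOA']}
  keys = selected.keys
  # OR within a facet, AND across facets -> cross product of value combinations
  combos = selected.values.inject([[]]){|acc,vals|
    acc.flat_map{|c| vals.map{|v| c + [v]}}}
  sels = combos.map{|c|
    'div' + c.each_with_index.map{|v,i| ".#{keys[i]}_#{v}"}.join}
  sels.join(', ') + ' {display:block}'
end
```

inserting that single rule into a style element, on top of a base .items > div {display:none}, replaces thousands of per-node style mutations, which is the point of the approach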

changing the querystring from view=exhibit to view=e

if a= isn't specified (a comma-separated list of predicate URIs), you are presented with a list, like:

http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://rdfs.org/sioc/ns#addressed_to
http://rdfs.org/sioc/ns#has_creator
http://purl.org/rss/1.0/category
[Go]

click the ones you want, [Go]

at which point, left side is filled with facet-selector panes

custom views are selected with ev=board

a convention of view/board/base
                view/board/item

where base is handed a function it calls to emit the items, wrapped in the special divs the CSS will use to filter

in a music player, /item draws a single playlist row:

http://blog.whats-your.name/public/smiths.png

figuring out the result set is only half the battle for the browser; excessive use of floats, relative sizes and so on becomes noticeable on huge data sets

hfskeds is 30K rows, 22 cols, or .66 million triples. roughly the upper bound of what i'd want to use on a Netbook. takes about 5 seconds to load a doc and 0.8 seconds to redraw after a filter change

can squeeze out faster redraw with <pre>, fixed heights/widths, and absolute positioning

shortwave schedules were the main dataset, so let's get into some of those

http://blog.whats-your.name/public/25m.html

#!/bin/sh
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=2200&minP=kc/s&maxP=kc/s&max=2500' > 120m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=3100&minP=kc/s&maxP=kc/s&max=3450' > 90m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=3890&minP=kc/s&maxP=kc/s&max=4000' > 75m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=4740&minP=kc/s&maxP=kc/s&max=5125' > 60m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=5800&minP=kc/s&maxP=kc/s&max=6300' > 49m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=7200&minP=kc/s&maxP=kc/s&max=7600' > 40m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=9400&minP=kc/s&maxP=kc/s&max=9999' > 31m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=11500&minP=kc/s&maxP=kc/s&max=12160' > 25m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=13500&minP=kc/s&maxP=kc/s&max=13900' > 22m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=15100&minP=kc/s&maxP=kc/s&max=15900' > 19m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=17500&minP=kc/s&maxP=kc/s&max=17900' > 16m.html

created an HTML file for each band and uploaded to the webserver..

as you can see in the URLs, a default min/max filter exists (minP/maxP name the predicate to compare; matchP too), which is handy for common uses
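a sketch of what such a min/max filter might do over the in-memory model (parameter semantics inferred from the URLs above; this is my reconstruction, not the actual element filter):

```ruby
# sketch: drop items whose minP/maxP predicate value falls outside [min, max]
# model m: {uri => {'uri'=>.., 'kc/s'=>[5950], ...}} as built by fromStream
def range_filter m, q
  pred = q['minP'] || q['maxP']   # predicate to range-compare, eg 'kc/s'
  return m unless pred
  min = q['min'] ? q['min'].to_f : -Float::INFINITY
  max = q['max'] ? q['max'].to_f :  Float::INFINITY
  m.select{|u,r|
    v = r[pred] && r[pred][0].to_f  # values are stored wrapped in arrays
    v && v >= min && v <= max}
end
```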

custom filters, activated via the querystring (comma-separated list), can be written; eg this excerpt

sort of a natural-language one, realizing any int < 2400 in an email is probably referring to a time, and > 2400 to a frequency (minus a few false positives for phone numbers and years)

           m[u]={'uri' => u,
             'big'=>l.scan(/\b[A-Z][A-Z][A-Z]+\b/), # shouted abbreviations
             Content=>l}
           l.scan(/\d{4,}/){|d| d=d.to_i
             if (d > 2400) && (d < 30000)
               m[u]['kc/s']=[d]                     # frequency
             elsif d <= 2400
               m[u]['BTIM']=[d];m[u]['ETIM']=[d+30] # begin/end time
             end}
           # keep only items that yielded both a time and a frequency
           m.delete u unless m[u]['BTIM'] && m[u]['kc/s']
  
the filter mutates the request-time JSON model however it sees fit, adding new properties and so on..

http://blog.whats-your.name/public/GlenDoes31.html

i did a few more of these, Eibi L and H: http://blog.whats-your.name/public/eibiL.html (this is the largest one up now, data-wise)

http://blog.whats-your.name/public/bbc.html BBC

onto some other examples

/t is a lifestream (http://www.cs.yale.edu/homes/freeman/dissertation/etf.pdf) serving a time-range of resources (with options for start/end, direction (ascending/descending) and count), here filtered by source:

http://i574.photobucket.com/albums/ss187/ix9/hyper/2011-01-16-203039_1366x768_scrot.png

i always add sioc:addressed_to and sioc:has_creator to triple-izers for this usage


a /search examine shows us the top poster is Cory Doctorow (no surprise there)

http://i574.photobucket.com/albums/ss187/ix9/hyper/to.png

i imported all boingboing posts for this one; that's discussed @ http://blog.whats-your.name/public/bb.html

a couple of possibilities:

hash URIs for filters. i will wait for Exhibit 3.0 to come up with their convention and use that, or just something like facet=val,val2&facet2=val3,val4
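for the facet=val,val2&facet2=val3,val4 fallback, a throwaway parser would be simple enough (a sketch; no URI-unescaping, and values containing commas would need an escaping rule):

```ruby
# sketch: parse the facet=val,val2&facet2=val3,val4 convention into {facet => [values]}
def parse_facets qs
  Hash[qs.split('&').map{|kv|
    k, v = kv.split('=', 2)
    [k, (v || '').split(',')]}]
end
```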

visible set - jQuery has a :visible meta-selector, which i have not tried to see how fast it is. it would be useful if you want to reserialize a document, deleting all invisible (filtered) elements.. probably we should make noise about adding this right to CSS, as the browser likely has the feature already, eg Ctrl-F only searches visible els

"just publish RDFa" would be cool, some JS that introspects a DOM and adds the appropriate facet wrappers
-c
Received on Tuesday, 8 February 2011 12:08:34 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 8 February 2011 12:08:34 GMT