- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Thu, 8 Sep 2011 16:34:33 -0400
- To: public-rdb2rdf-wg@w3.org
During the last meeting, we discussed picking a punctuation schema but asking the community for feedback on picking from a set of choices (perfectly legit in an LC document). This can help us pick:

= Problem =

Define rules which create unambiguous identifiers for database rows, columns and references (foreign keys). Extra credit if they are easy to parse by human or machine and easy to express in SPARQL, Turtle, RIF, RDF/XML ("STRR" below). These URIs are composed from table and attribute names, attribute values, and miscellaneous punctuation. This email is about tweaking the punctuation to get the most simplicity in the most use cases.

Rules in <http://www.w3.org/2001/sw/rdb2rdf/directMapping/explicitFK>:

  Row IRI:       base + table + '/' + attr¹ + '-' + val¹ + '.' … attrⁿ + '-' + valⁿ
  Column IRI:    base + table + '#' + attr
  Reference IRI: base + table + '#' + 'ref-' + attr¹ + '.' … attrⁿ

This uses the '.' separator between attributes in both row IRIs and reference IRIs. The attrⁿ/valⁿ separator is '-' (for simplicity in STRR).
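As a rough illustration, the three composition rules above can be sketched in Python. The function names are made up for this sketch, the base IRI is borrowed from the SPARQL examples in this email, and no escaping of values is attempted here.

```python
# Sketch of the explicitFK composition rules quoted above (choice 1 punctuation).
# Function names and BASE are illustrative assumptions, not part of the draft.
BASE = "http://hr.myco.example/2011/schemas/"

def row_iri(base, table, key):
    """key is an ordered list of (attr, value) pairs forming the primary key."""
    pairs = ".".join(f"{attr}-{val}" for attr, val in key)
    return f"{base}{table}/{pairs}"

def column_iri(base, table, attr):
    """One IRI per column of a table."""
    return f"{base}{table}#{attr}"

def reference_iri(base, table, attrs):
    """attrs is the ordered list of foreign-key attribute names."""
    return f"{base}{table}#ref-" + ".".join(attrs)

if __name__ == "__main__":
    print(row_iri(BASE, "People", [("fname", "Bob"), ("lname", "Smith")]))
    print(column_iri(BASE, "People", "fname"))
    print(reference_iri(BASE, "Events", ["orgfn", "orgln"]))
```

Values containing the separator characters would need escaping before being passed in; that is a separate concern from the composition itself.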
Outlining some popular choices:

    row IRI                   ref IRI
  ① attr¹-val¹.attrⁿ-valⁿ     ref-attr¹.attrⁿ
  ② attr¹.val¹-attrⁿ.valⁿ     ref-attr¹-attrⁿ
  ③ attr¹-val¹.attrⁿ-valⁿ     ref-attr¹-attrⁿ
  ④ attr¹=val¹,attrⁿ=valⁿ     ref-attr¹-attrⁿ
  ⑤ attr¹.val¹.attrⁿ.valⁿ     ref.attr¹.attrⁿ

= Examples =

Given some tables with PKs:

┌┤Simple├────┬───────┐
│┌pk┐│       │       │
│ PK │ attrA │ attrB │
│  1 │ valA1 │ valB2 │
│  2 │ valA2 │ valB2 │
└────┴───────┴───────┘

┌┤People├────┬─────────┐
│┌──────────pk────────┐│
│ fname      │ lname   │
│ "Bob"      │ "Smith" │
│ "Madonna"  │ ""      │
│ "T in"     │ "Ya-Li" │
│ "أكرم.عبد" │ "كور"   │
└────────────┴─────────┘

┌┤Events├────┬────────────┬─────────┐
│┌────pk────┐│┌─────↬People.pk─────┐│
│ date       │ orgfn      │ orgln   │
│ 2012-01-01 │ "Bob"      │ "Smith" │
│ 2011-12-25 │ "Madonna"  │ ""      │
│ 2012-04-06 │ "T in"     │ "Ya-Li" │
│ 2011-10-01 │ "أكرم.عبد" │ "كور"   │
└────────────┴────────────┴─────────┘

┤Simple├ has your run-of-the-mill integer primary key and alphanumeric attribute names and values. ┤People├ and ┤Events├ have alphanum attribute names. (Attribute names which are not exclusively alphanumeric are horrible no matter what; they don't help us discriminate between our options.)

== Example Row IRIs ==

We see these Row IRIs (eliding <base + ...>) for the first rows of these tables, given the choices of punctuation listed above.

  ① Simple/PK-1 │ People/fname-Bob.lname-Smith │ Events/date-2012-01-01
  ② Simple/PK.1 │ People/fname.Bob-lname.Smith │ Events/date.2012%2D01%2D01
  ③ Simple/PK-1 │ People/fname-Bob.lname-Smith │ Events/date-2012-01-01
  ④ Simple/PK=1 │ People/fname=Bob,lname=Smith │ Events/date=2012-01-01
  ⑤ Simple/PK.1 │ People/fname.Bob.lname.Smith │ Events/date.2012-01-01

== Reference (predicate) IRIs ==

Reference (predicate) IRIs for ┤Simple├ are simple and boring: table#ref-attr .
┤Events├'s references to ┤People├ take two attributes:

  ① Events#ref-orgfn.orgln
  ② Events#ref-orgfn-orgln
  ③ Events#ref-orgfn-orgln
  ④ Events#ref-orgfn-orgln
  ⑤ Events#ref.orgfn.orgln

= What needs escaping =

The character used to separate attr/value pairs dictates which characters require escaping in values. ② requires escaping '-'s; ①, ③ and ⑤ require escaping '.'s; and ④ requires escaping ','s. Row identifiers for rows 3 and 4 of ┤People├ illustrate this:

  ① People/fname-T%20in.lname-Ya-Li   │ People/fname-أكرم%2Eعبد.lname-كور
  ② People/fname.T%20in-lname.Ya%2DLi │ People/fname.أكرم.عبد-lname.كور
  ③ People/fname-T%20in.lname-Ya-Li   │ People/fname-أكرم%2Eعبد.lname-كور
  ④ People/fname=T%20in,lname=Ya-Li   │ People/fname=أكرم.عبد,lname=كور
  ⑤ People/fname.T%20in.lname.Ya-Li   │ People/fname.أكرم%2Eعبد.lname.كور

(We can also follow the HTML5, WSDL, ... url-encoding spec and turn ' ' into '+' instead of '%20'.)

= SPARQL, Turtle, RIF, RDF/XML =

RDF Rules (RIF BLD, SPARQL CONSTRUCT) generally express patterns over predicates, without having to identify Row IRIs. Queries include Row identifiers a bit more (the savvy user or tool will select an entity by identifier rather than by distinguishing attributes) and Turtle (the data) will of course include both. All of these languages allow the use of relative IRIs and prefixed names.
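Before turning to concrete queries, the escaping rule from "What needs escaping" (the pair separator decides what must be escaped in a value) can be sketched in Python. This is only an illustration: the names SCHEMES, escape_value and row_id are invented here, and the set of always-escaped characters ('%', ' ', '/', '#', '?') is an assumption rather than anything the WG has agreed on. Non-ASCII characters are left alone, as in the IRIs shown above.

```python
# Sketch: (attr/value separator, pair separator) for each row IRI choice.
# All names here are hypothetical; the always-escaped set is an assumption.
SCHEMES = {
    1: ("-", "."),
    2: (".", "-"),
    3: ("-", "."),
    4: ("=", ","),
    5: (".", "."),
}

ALWAYS_ESCAPED = "% /#?"  # assumed minimum; not taken from any spec text

def escape_value(value, pair_sep):
    """Percent-encode the pair separator plus a few structural characters."""
    out = []
    for ch in value:
        if ch in ALWAYS_ESCAPED or ch == pair_sep:
            out.append("%%%02X" % ord(ch))  # e.g. '.' becomes %2E, '-' becomes %2D
        else:
            out.append(ch)
    return "".join(out)

def row_id(table, key, scheme):
    """Build the relative row identifier (no base) for a list of (attr, value)."""
    av_sep, pair_sep = SCHEMES[scheme]
    pairs = [attr + av_sep + escape_value(val, pair_sep) for attr, val in key]
    return table + "/" + pair_sep.join(pairs)

if __name__ == "__main__":
    key = [("fname", "T in"), ("lname", "Ya-Li")]
    for scheme in sorted(SCHEMES):
        print(scheme, row_id("People", key, scheme))
```

Only the pair separator is escaped in values, following the rule stated above; whether the attr/value separator should also be escaped (which matters for ⑤, where both are '.') is left open in this sketch.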
A prefixed query of a People table for ① looks like:

  PREFIX pplinst: <http://hr.myco.example/2011/schemas/People/>
  PREFIX pplschm: <http://hr.myco.example/2011/schemas/People#>
  SELECT ?event WHERE {
    pplinst:fname-Bob.lname-Smith pplschm:atEvent ?event
  }

And the relative IRI query looks like:

  BASE <http://hr.myco.example/2011/schemas/>
  SELECT ?event WHERE {
    <People/fname-Bob.lname-Smith> <People#atEvent> ?event
  }

Extending the use case to gain some SemWeb utility, we join two databases, those of the HR and catering departments:

  PREFIX pplinst: <http://hr.myco.example/2011/schemas/People/>
  PREFIX pplschm: <http://hr.myco.example/2011/schemas/People#>
  PREFIX cater: <http://hr.myco.example/2011/schemas/People#>
  SELECT ?start ?end WHERE {
    pplinst:fname-Bob.lname-Smith pplschm:atEvent ?event .
    ?event cater:start ?start ;
           cater:end ?end
  }

The customary URI escape character, '%', is not permitted in prefixed names (nor are ',' and '='). The various row ID schemas have different impacts on the expressivity in prefixed names given different values:

  row ID                   pos int  neg int  alphanum  date  float
  ① attr¹-val¹.attrⁿ-valⁿ     ✓        ✓        ✓       ✓
  ② attr¹.val¹-attrⁿ.valⁿ     ✓                 ✓              ✓
  ④ attr¹=val¹,attrⁿ=valⁿ
  ⑤ attr¹.val¹.attrⁿ.valⁿ     ✓        ✓        ✓       ✓

(③ varies from ① only in the reference IRIs)

For an example of negative integer primary keys, this table uses -2 and -1 to represent a couple of access control groups common to all Apache servers:

┌┤AccessRoles├───────┐
│┌pk┐│               │
│ ID │ desc          │
│ -2 │ "known users" │
│ -1 │ "world"       │
│  1 │ "marketing"   │
│  2 │ "management"  │
└────┴───────────────┘

= The balance =

I see us as pushing a slider around, balancing readability ("attr¹=val¹,attrⁿ=valⁿ") against usability (being able to write/query the data with prefixed names). As Richard points out, we can write/query the data for an individual database using an @base directive and relative IRIs. This choice helps users write data/queries as prefixed names (e.g. queries connecting multiple databases).
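To make the usability end of that slider concrete, here is a small Python sketch that checks whether a single-attribute row ID survives as the local part of a prefixed name. The regular expression is a simplified approximation of the SPARQL 1.0 PN_LOCAL production (word characters, '-' and interior '.'; no '%', ',' or '='), not the exact grammar. The names PN_LOCAL, SEPARATORS, local_name and fits_prefixed_name, and the sample values (in particular the float "1.5"), are assumptions made for this illustration.

```python
import re

# Simplified stand-in for SPARQL 1.0 PN_LOCAL: starts with a word character,
# may contain word characters, '-' and '.', and must not end with '.'.
PN_LOCAL = re.compile(r"\w(?:[\w.-]*[\w-])?")

# (attr/value separator, pair separator) per row ID choice; 3 matches 1 here.
SEPARATORS = {1: ("-", "."), 2: (".", "-"), 4: ("=", ","), 5: (".", ".")}

def local_name(attr, value, scheme):
    """Row ID local part for one attribute, escaping '%' and the pair separator."""
    av_sep, pair_sep = SEPARATORS[scheme]
    escaped = value.replace("%", "%25").replace(pair_sep, "%%%02X" % ord(pair_sep))
    return attr + av_sep + escaped

def fits_prefixed_name(attr, value, scheme):
    """True if the row ID can be written after a prefix without escapes."""
    return PN_LOCAL.fullmatch(local_name(attr, value, scheme)) is not None

# Hypothetical sample values, one per column of the table above.
SAMPLES = {"pos int": "1", "neg int": "-2", "alphanum": "valA1",
           "date": "2012-01-01", "float": "1.5"}

if __name__ == "__main__":
    for scheme in sorted(SEPARATORS):
        marks = ["Y" if fits_prefixed_name("k", v, scheme) else "-"
                 for v in SAMPLES.values()]
        print(scheme, " ".join(marks))
```

Multi-attribute keys behave the same way per value, since the pair separator itself is legal inside a local name for ①, ② and ⑤.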
IMO, ④ is the most readable and ⑤ is the most usable, with ① being my idea of the sweet spot. ⑤ gives us the simplest encoding rules and ② is less likely to be confused with the '.' addressing scheme used in SQL.

-- 
-ericP
Received on Thursday, 8 September 2011 20:35:05 UTC