separators in IRIs from Eric Prud'hommeaux on 2011-09-12 (public-rdb2rdf-wg@w3.org from September 2011)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Mon, 12 Sep 2011 09:32:44 -0400
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>, timbl@w3.org, Mark Nottingham <mnot@mnot.net>, yves@w3.org, Lee Feigenbaum <lee@cambridgesemantics.com>
Cc: Richard Cyganiak <richard@cyganiak.de>, public-rdb2rdf-wg@w3.org
Message-ID: <20110912133243.GA15873@w3.org>
We're working on a spec which maps relational databases to RDF, assigning IRIs to the rows, e.g.

┌┤People├─────────┬─────────┐
│┌pk┐│            │         │
│ ID │   fname    │  lname  │
│  1 │      "Bob" │ "Smith" │
│  2 │  "Madonna" │      "" │
│  3 │     "T in" │ "Ya-Li" │
│  4 │ "أكرم.عبد" │   "كور" │
└────┴────────────┴─────────┘

The issue is what should be the row identifier for e.g. row 1 and how
much does 3987 constrain us. The most straight-forward ID is probably
<…People/ID=1>, or something like <…People/fname=Bob,lname=Smith> if
the primary key were multi-column (e.g. on fname, lname). The problem
with that is that we can't assign a common prefix for the row IDs,
e.g. xmlns:ppl="…People/", because the right sides have non-localname
characters, e.g. ppl:fname=Bob,lname=Smith . We can get around this
with relative paths and by writing non-abbreviated IRIs, but it's a
pain. The question is: is it a needless pain?

If we choose other separators, e.g. ('.' '-') in place of (',' '=')
and escape only what we need (the next delimiter), we can pick
prefixes which work for much more data, e.g.
ppl:fname-Bob.lname-Smith , ppl:fname-Madonna.lname- , but not
<…People/fname-T+in.lname-Ya-Li> (' ' encodes as '+' (or "%20"))
<…People/fname-أكرم%2Eعبد.lname-كور> ('.' encodes as "%2E").
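A rough Python sketch of why escaping only the next delimiter is enough to keep these IDs unambiguous. It assumes '-' between attribute and value, '.' between pairs, and attribute names that contain neither character; parse_row_id is a made-up name for illustration, not anything from the spec.

```python
from urllib.parse import unquote

def parse_row_id(local, attr_sep="-", pair_sep="."):
    """Split a row ID's local part back into (attribute, value) pairs."""
    pairs = []
    for chunk in local.split(pair_sep):
        # Only the FIRST attr_sep matters, so a value like "Ya-Li" survives.
        attr, _, value = chunk.partition(attr_sep)
        pairs.append((attr, unquote(value)))
    return pairs

print(parse_row_id("fname-Bob.lname-Smith"))
print(parse_row_id("fname-T%20in.lname-Ya-Li"))
```

The '-' inside "Ya-Li" needs no escape because the attribute name has already ended at the first '-'; only a '.' inside a value would confuse the split.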

Some (I'd argue, many) lives are improved by the conservative
encoding, but as Richard points out below, it ignores 3987's
deference to 3986's reserved delimiters:
  http://tools.ietf.org/html/rfc3986#section-2.2

Apart from having a standard url-encoding function (which we can't use
because names like プルドモー don't need encoding), what's the advised
use of these delimiters?
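
To illustrate the point about stock url-encoding functions, here is what Python's standard urllib.parse.quote does. It is used only as one example of such a function; other languages' equivalents may differ in detail.

```python
from urllib.parse import quote, unquote

name = "プルドモー"
encoded = quote(name)  # each of these characters becomes three %xx triplets
print(encoded)
print(unquote(encoded) == name)  # the encoding is reversible, just unreadable

# ASCII sub-delims are escaped, unreserved '-' and '.' are left alone.
print(quote("fname=Bob,lname=Smith"))
print(quote("fname-Bob.lname-Smith"))
```

So the off-the-shelf function treats '=' and ',' as data to be escaped and passes '-' and '.' through, and it always expands the non-ASCII characters that an IRI could carry as-is.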


* Richard Cyganiak <richard@cyganiak.de> [2011-09-12 11:20+0200]
> On 10 Sep 2011, at 18:33, Eric Prud'hommeaux wrote:
> >> The normative reference here is the URI spec (RFC 3986 [1]). It says:
> >> 
> >>   A URI is composed from a limited set of characters consisting of
> >>   digits, letters, and a few graphic symbols.  A reserved subset of
> >>   those characters may be used to delimit syntax components within a
> >>   URI while the remaining characters, including both the unreserved set
> >>   and those reserved characters not acting as delimiters, define each
> >>   component's identifying data.
> > 
> > I believe that we're using IRIs, not URIs (a syntactic subset of
> > IRIs).
> 
> That's correct, and you're right to point out that built-in functions designed for URIs don't handle non-ASCII characters in the right way.
> 
> However, the normative reference here (RFC 3987) says:
> 
>    IRIs are defined similarly to URIs in [RFC3986], but the class of
>    unreserved characters is extended by adding the characters of the
>    UCS beyond U+007F […]
> 
>    Otherwise, the syntax and use of components and reserved characters
>    is the same as that in [RFC3986].
> 
> Also note:
> 
>    Characters outside the US-ASCII repertoire are not reserved and
>    therefore MUST NOT be used for syntactical purposes, such as to
>    delimit components in newly defined schemes. […] This is similar
>    to the fact that it is not possible to use '-' as a delimiter in
>    URIs, because it is in the 'unreserved' category.
> 
> In summary, using non-delimiters as delimiters is just as inappropriate in IRIs as in URIs.
> 
> Best,
> Richard
> 
> 
> 
> > > I don't see any reason to limit ourselves to URIs, and lots
> > > of reasons to use IRIs.
> > 
> >  RDF, SPARQL, OWL, etc, are defined in terms of IRIs.
> > 
> >  URIs are truly unpleasant for non-ascii scripts:
> >    <テーブル名/属性名¹-属性の値¹.属性名ⁿ-属性の値ⁿ>
> >    vs.
> >    <%E3%83%86%E3%83%BC%E3%83%96%E3%83%AB%E5%90%8D%2F%E5%B1%9E%E6%80%A7%E5%90%8D%C2%B9-%E5%B1%9E%E6%80%A7%E3%81%AE%E5%80%A4%C2%B9.%E5%B1%9E%E6%80%A7%E5%90%8D%E2%81%BF-%E5%B1%9E%E6%80%A7%E3%81%AE%E5%80%A4%E2%81%BF>
> > 
> >  The world has gotten used to IRIs in e.g. location bars.
> > 
> > 
> >> It goes on to define:
> >> 
> >>      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
> >> 
> >>      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
> >>                  / "*" / "+" / "," / ";" / "="
> >> 
> >> The separation of characters into general delimiter, sub-delimiters and non-delimiters is a very deliberate design.
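
For reference, a small Python sketch of the three character classes under discussion. The gen-delims and sub-delims sets are copied from the quoted grammar; the unreserved set (ALPHA / DIGIT / "-" / "." / "_" / "~") is taken from RFC 3986 section 2.3. The classify helper is hypothetical.

```python
import string

GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def classify(ch):
    """Name the RFC 3986 class of a single ASCII character."""
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    if ch in UNRESERVED:
        return "unreserved"
    return "other"

for ch in "-.,=/":
    print(repr(ch), classify(ch))
```

This puts the candidate separators side by side: '-' and '.' come out as unreserved, while ',' and '=' are sub-delims.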
> >> 
> >> Every modern programming language has a function that %-escapes the delimiters and leaves the non-delimiters unencoded. These off-the-shelf functions are what developers and toolkits use when escaping a string for use in a part of a URI.
> > 
> > Using IRIs already precludes using a generic URL-encode function as these functions expand all non-ascii chars into 2 to 4 %xx pairs.
> > I believe we already had consensus around a minimally-encoding subset of URL-encode.
> > 
> > 
> >> The functions can't be used with a proposal that abuses a non-delimiter character as a de facto delimiter. This means custom escaping mechanisms have to be built for things to work correctly. This is likely going to cause significant costs and bugs and interoperability failures.
> >> 
> >> Prefixed names in Turtle and SPARQL can't contain delimiters because they were designed to use the last syntactical sub-part only as the local name. I understand that this design decision is inconvenient for your use case, but this ship has sailed years ago. You're asking something from prefixed names that they were not designed for. Trying to work around it by trifling with RFC 3986 and abusing nondelims as delimiters is not an acceptable answer.
> >> 
> >> I therefore strongly suggest that the direct mapping use delimiters from the sub-delims set, both for row IRIs and for reference IRIs.
> >> 
> >> Best,
> >> Richard
> >> 
> >> 
> >> [1] http://tools.ietf.org/html/rfc3986#section-2.2
> >> 
> >> 
> >> 
> >> On 8 Sep 2011, at 22:34, Eric Prud'hommeaux wrote:
> >> 
> >>> During the last meeting, we discussed picking a punctuation schema but
> >>> asking the community for feedback on picking from a set of choices
> >>> (perfectly legit in an LC document). This can help us pick:
> >>> 
> >>> 
> >>> = Problem =
> >>> Define rules which create unambiguous identifiers for database rows,
> >>> columns and references (foreign keys).
> >>> Extra credit if they are easy to parse by human or machine and easy
> >>> to express in SPARQL, Turtle, RIF, RDF/XML ("STRR" below).
> >>> 
> >>> These URIs are composed from table and attribute names, attribute
> >>> values, and miscellaneous punctuation. This email is about tweaking
> >>> the punctuation to get the most simplicity in the most use cases.
> >>> 
> >>> Rules in <http://www.w3.org/2001/sw/rdb2rdf/directMapping/explicitFK>:
> >>> Row IRI: base + table + '/' + attr¹ + '-' + val¹ + '.' … attrⁿ + '-' + valⁿ
> >>> Column IRI: base + table + '#' + attr
> >>> Reference IRI: base + table + '#' + 'ref-' + attr¹ + '.' … attrⁿ
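
The three rules above can be sketched in Python as follows. The function names are invented for illustration, the base IRI is borrowed from the SPARQL examples later in this message, and value escaping is deliberately left out here.

```python
BASE = "http://hr.myco.example/2011/schemas/"

def row_iri(base, table, pairs):
    # base + table + '/' + attr-val pairs joined by '.'
    return base + table + "/" + ".".join(f"{a}-{v}" for a, v in pairs)

def column_iri(base, table, attr):
    # base + table + '#' + attr
    return base + table + "#" + attr

def reference_iri(base, table, attrs):
    # base + table + '#' + 'ref-' + attrs joined by '.'
    return base + table + "#ref-" + ".".join(attrs)

print(row_iri(BASE, "People", [("fname", "Bob"), ("lname", "Smith")]))
print(column_iri(BASE, "People", "fname"))
print(reference_iri(BASE, "Events", ["orgfn", "orgln"]))
```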
> >>> 
> >>> These rules use the '.' separator between attributes in both row IRIs and
> >>> reference IRIs. The attrⁿ/valⁿ separator is '-' (for simplicity in
> >>> STRR). Outlining some popular choices:
> >>> 
> >>>        row IRI              ref IRI
> >>> ① attr¹-val¹.attrⁿ-valⁿ   ref-attr¹.attrⁿ
> >>> ② attr¹.val¹-attrⁿ.valⁿ   ref-attr¹-attrⁿ
> >>> ③ attr¹-val¹.attrⁿ-valⁿ   ref-attr¹-attrⁿ
> >>> ④ attr¹=val¹,attrⁿ=valⁿ   ref-attr¹-attrⁿ
> >>> ⑤ attr¹.val¹.attrⁿ.valⁿ   ref.attr¹.attrⁿ
> >>> 
> >>> 
> >>> = Examples =
> >>> Given some tables with PKs:
> >>> ┌┤Simple├────┬───────┐  ┌┤People├────┬─────────┐  ┌┤Events├────┬────────────┬─────────┐
> >>> │┌pk┐│       │       │  │┌──────────pk────────┐│  │┌────pk────┐│┌─────↬People.pk─────┐│
> >>> │ PK │ attrA │ attrB │  │   fname    │  lname  │  │    date    │    orgfn   │  orgln  │
> >>> │  1 │ valA1 │ valB2 │  │      "Bob" │ "Smith" │  │ 2012-01-01 │      "Bob" │ "Smith" │
> >>> │  2 │ valA2 │ valB2 │  │  "Madonna" │      "" │  │ 2011-12-25 │  "Madonna" │      "" │
> >>> └────┴───────┴───────┘  │     "T in" │ "Ya-Li" │  │ 2012-04-06 │     "T in" │ "Ya-Li" │
> >>>                         │ "أكرم.عبد" │   "كور" │  │ 2011-10-01 │ "أكرم.عبد" │   "كور" │
> >>>                         └────────────┴─────────┘  └────────────┴────────────┴─────────┘
> >>> 
> >>> ┤Simple├ has your run-of-the-mill integer primary key and alphanumeric
> >>> attribute names and values. ┤People├ and ┤Events├ have alphanum attribute
> >>> names. (Attribute names which are not exclusively alpha-numeric are
> >>> horrible no matter what; they don't help us discriminate our options.)
> >>> 
> >>> == Example Row IRIs ==
> >>> We see these Row IRIs (eliding <base + ...>) for the first rows of
> >>> these tables, given the choices of punctuation listed above.
> >>> 
> >>> ①  Simple/PK-1 │ People/fname-Bob.lname-Smith │ Events/date-2012-01-01
> >>> ②  Simple/PK.1 │ People/fname.Bob-lname.Smith │ Events/date.2012%2D01%2D01
> >>> ③  Simple/PK.1 │ People/fname.Bob-lname.Smith │ Events/date.2012%2D01%2D01
> >>> ④  Simple/PK=1 │ People/fname=Bob,lname=Smith │ Events/date=2012-01-01
> >>> ⑤  Simple/PK.1 │ People/fname.Bob.lname.Smith │ Events/date.2012-01-01
> >>> 
> >>> == Reference (predicate) IRIs ==
> >>> Reference (predicate) IRIs for ┤Simple├ are simple and boring: table#ref-attr .
> >>> ┤Events├'s references to ┤People├ take two attributes:
> >>> 
> >>> ①  Events#ref-orgfn.orgln
> >>> ②  Events#ref-orgfn-orgln
> >>> ③  Events#ref-orgfn-orgln
> >>> ④  Events#ref-orgfn-orgln
> >>> ⑤  Events#ref.orgfn.orgln
> >>> 
> >>> 
> >>> = What needs escaping =
> >>> The character used to separate attr/value pairs dictates which
> >>> characters require escaping in values. ②③ require escaping '-'s;
> >>> ①⑤ requires escaping '.'s and ④ requires escaping ','s. Row
> >>> identifiers for rows 3 and 4 of ┤People├ illustrate this:
> >>> 
> >>> ①  People/fname-T%20in.lname-Ya-Li   │ People/fname-أكرم%2Eعبد.lname-كور
> >>> ②  People/fname.T%20in-lname.Ya%2DLi │ People/fname.أكرم.عبد-lname.كور
> >>> ③  People/fname.T%20in-lname.Ya%2DLi │ People/fname.أكرم.عبد-lname.كور
> >>> ④  People/fname=T%20in,lname=Ya-Li   │ People/fname=أكرم.عبد,lname=كور
> >>> ⑤  People/fname.T%20in.lname.Ya-Li   │ People/fname.أكرم%2Eعبد.lname.كور
> >>> 
> >>> (We can also follow the HTML5, WSDL, ... url-encoding spec and
> >>> turn ' ' into '+' instead of '%20'.)
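
A hedged Python sketch of the minimal escaping described in this section, parameterized by punctuation choice. ③ is left out because the message says it differs from ① only in reference IRIs. SCHEMES, pct and row_path are illustrative names, and escaping '%' itself is an added assumption so that the escapes stay reversible.

```python
# option -> (attr/value separator, pair separator) for row IRIs
SCHEMES = {"1": ("-", "."), "2": (".", "-"), "4": ("=", ","), "5": (".", ".")}

def pct(ch):
    """Percent-encode one character as its UTF-8 bytes."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))

def row_path(table, pairs, scheme):
    attr_sep, pair_sep = SCHEMES[scheme]
    # Escape the space, the pair separator, and (assumption) '%' itself.
    must_escape = {"%", " ", pair_sep}

    def esc(value):
        return "".join(pct(c) if c in must_escape else c for c in value)

    return table + "/" + pair_sep.join(a + attr_sep + esc(v) for a, v in pairs)

row3 = [("fname", "T in"), ("lname", "Ya-Li")]
for scheme in SCHEMES:
    print(scheme, row_path("People", row3, scheme))
```

Non-ASCII characters pass through untouched, which is the difference from a generic URL-encode function.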
> >>> 
> >>> 
> >>> = SPARQL, Turtle, RIF, RDF/XML =
> >>> RDF Rules (RIF BLD, SPARQL CONSTRUCT) generally express patterns over
> >>> predicates, without having to identify Row IRIs. Queries include Row
> >>> identifiers a bit more (the savvy user or tool will select an entity
> >>> by identifier rather than distinguishing attributes) and Turtle (the
> >>> data) will of course include both.
> >>> 
> >>> All of these languages allow the use of relative IRIs and prefixed
> >>> names. A prefixed query of a People table for ① looks like:
> >>> 
> >>> PREFIX pplinst: <http://hr.myco.example/2011/schemas/People/>
> >>> PREFIX pplschm: <http://hr.myco.example/2011/schemas/People#>
> >>> SELECT ?event
> >>>  WHERE {
> >>>    pplinst:fname-Bob.lname-Smith pplschm:atEvent ?event
> >>>  }
> >>> 
> >>> And the relative IRI query looks like:
> >>> 
> >>> BASE <http://hr.myco.example/2011/schemas/>
> >>> SELECT ?event
> >>>  WHERE {
> >>>    <People/fname-Bob.lname-Smith> <People#atEvent> ?event
> >>>  }
> >>> 
> >>> Extending the use case to gain some SemWeb utility, we join two
> >>> databases, those of the HR and catering departments:
> >>> 
> >>> PREFIX pplinst: <http://hr.myco.example/2011/schemas/People/>
> >>> PREFIX pplschm: <http://hr.myco.example/2011/schemas/People#>
> >>> PREFIX cater: <http://hr.myco.example/2011/schemas/People#>
> >>> SELECT ?start ?end
> >>>  WHERE {
> >>>    pplinst:fname-Bob.lname-Smith pplschm:atEvent ?event
> >>>    ?event cater:start ?start ; cater:end ?end
> >>>  }
> >>> 
> >>> The customary URI escape character, '%', is not permitted in prefixed
> >>> names (nor are ',' and '='). The various row ID schemas have different
> >>> impacts on the expressivity in prefixed names given different values:
> >>> 
> >>>        row ID            pos int   neg int   alphanum   date   float
> >>> ① attr¹-val¹.attrⁿ-valⁿ       ✓         ✓         ✓        ✓
> >>> ② attr¹.val¹-attrⁿ.valⁿ       ✓                   ✓                ✓
> >>> ④ attr¹=val¹,attrⁿ=valⁿ
> >>> ⑤ attr¹.val¹.attrⁿ.valⁿ       ✓         ✓         ✓        ✓
> >>> 
> >>> (③ varies from ① only in the reference IRIs)
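
A rough way to test whether a given row ID can be written as the local part of a prefixed name. The regular expression is only an approximation of the SPARQL 1.0 PN_LOCAL production, with Python's \w standing in for the grammar's character classes; fits_prefixed_name is a hypothetical helper.

```python
import re

# First char: letter, digit or '_'; middle: those plus '.' and '-';
# last char: anything allowed in the middle except '.'.
PN_LOCAL_APPROX = re.compile(r"\w(?:[\w.\-]*[\w\-])?")

def fits_prefixed_name(local):
    return PN_LOCAL_APPROX.fullmatch(local) is not None

for candidate in ["fname-Bob.lname-Smith",
                  "fname-Madonna.lname-",
                  "fname=Bob,lname=Smith",
                  "fname-T%20in.lname-Ya-Li"]:
    print(candidate, fits_prefixed_name(candidate))
```

Under this approximation the '-' and '.' forms pass as long as nothing in them needed a '%' escape, while any ID containing '=', ',' or '%' fails.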
> >>> 
> >>> For an example of negative integer primary keys, this table uses -2
> >>> and -1 to represent a couple access control groups common to all
> >>> apache servers:
> >>> 
> >>> ┌┤AccessRoles├───────┐
> >>> │┌pk┐│               │
> >>> │ ID │  desc         │
> >>> │ -2 │ "known users" │
> >>> │ -1 │       "world" │
> >>> │  1 │   "marketing" │
> >>> │  2 │  "management" │
> >>> └────┴───────────────┘
> >>> 
> >>> 
> >>> = The balance =
> >>> I see us as pushing a slider between optimizing for
> >>> readability ("attr¹=val¹,attrⁿ=valⁿ") and for usability (being able to
> >>> write/query the data with prefixed names). As Richard points out, we
> >>> can write/query the data for an individual database using an @base
> >>> directive and relative IRIs. This choice helps users write
> >>> data/queries as prefixed names (e.g. queries connecting multiple
> >>> databases).
> >>> 
> >>> IMO, ④ is the most readable and ⑤ is the most usable, with ① being my
> >>> idea of the sweet spot. ⑤ gives us the simplest encoding rules and ②
> >>> is less likely to be confused with the '.' addressing scheme used in
> >>> SQL.
> >>> 
> >>> -- 
> >>> -ericP
> >>> 
> >> 
> > 
> > -- 
> > -ericP
> > 
> 

-- 
-ericP
Received on Monday, 12 September 2011 13:33:26 UTC