Re: Addressing ISSUE-64 and ISSUE-65 from Richard Cyganiak on 2011-08-19 (public-rdb2rdf-wg@w3.org from August 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Fri, 19 Aug 2011 13:28:37 +0100
To: Eric Prud'hommeaux <eric@w3.org>
Cc: Juan Sequeda <juanfederico@gmail.com>, W3C RDB2RDF <public-rdb2rdf-wg@w3.org>
Message-Id: <011ED234-5AD5-4B30-A8C0-DA37159D8E96@cyganiak.de>

On 17 Aug 2011, at 23:48, Eric Prud'hommeaux wrote:
>> You say that some proposals don't play well with namespace prefixes. You use this as an argument against these proposals. I think that's an invalid argument because namespaces are *already* entirely useless with the DM.
>> 
>> 1. Each table requires its own namespace, leading to an abundance of namespaces
> 
> In the use cases I've dealt with, this has been a feature rather than a bug. That is people:ID and addrs:ID are conveniently distinguished. Writing rules or queries is very intuitive with this partitioning:
> 
>    PREFIX ppl: <People#>
>    PREFIX adr: <Addresses#>
>    SELECT ?city WHERE { ?who ppl:fname "Bob" ;
>                              ppl:addr ?addr .
>                        ?addr adr:city ?city }

Most databases don't have neat and intuitive table names like that. They have "OBX_MODEL_PPL2" and "OBX_SHP_ADR_MAIN". Once you look beyond the MySQL webapp market and look at enterprisey stuff, many database schemas aren't even hand-designed, but look like they dropped out of some CASE tool or other monstrosity. Actually coming up with a neat intuitive three-letter abbreviation for each of these tables is *hard*. It is extra work. Most users won't bother, because they can get the job done without inventing prefixes, and for fear that their neat prefix doesn't quite capture the meaning of the table (which they probably didn't design themselves and only half-understand).

>> 2. The DM is not written by humans but by machines. The machine has to generate the namespace prefix. The only thing it can really do is either use the table name, or use the (unreadable) ns0, ns1, ns2 pattern.
> 
> The DM will be queried by humans. It will also be transformed to common ontologies by rules written by humans.

So you assume that the prefixes will be written by humans?

I don't believe that.

No one wants to enter several lines of boilerplate before they can run a query. Either the processor will pre-configure the prefixes (which again raises the problem of machine-generated prefixes), or users will just make do without prefixes.

Writing the prefixes manually, on the other hand, requires an understanding of the URI scheme used by the DM. Once one has acquired that understanding, one can just as well forget about prefixes and use the URIs straight.

>> 3. Generating prefixes automatically from the table name leads to all sorts of Fun with special characters. Basically, it is impossible because there are no escape mechanisms inside the *prefix*.
> 
> Most databases have only unary foreign keys.

I doubt that. Many databases don't have any foreign keys at all. And many non-toy databases *do have* foreign keys with multiple columns.

> Most of these can be transformed to common vocabularies with rules which don't mention node identifiers, e.g.
> 
>    PREFIX ppl: <People#>
>    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
>    CONSTRUCT {
>        ?who a foaf:Person ; foaf:givenName ?fname
>    } WHERE { ?who ppl:fname ?fname }

Have you tried a non-toy example? How did that go?

>> 4. The DM already produces many IRIs that cannot be abbreviated into prefixed names because they contain commas or equal signs in the local part.
> 
> Many queries and rules don't include specific node identifiers.

It is very common for queries to ask for a specific identifier.

> For databases with only unary foreign keys, these queries can be tersely and conveniently expressed with the current algorithm.

Bullshit. Using namespaces in DM queries makes them more verbose, not more terse. Rewriting any of your examples here with relative URIs makes the queries more compact.

>> 5. Even if that's not the case, special characters in table and column names will often prevent abbreviation.
> 
> But again, most column names are BORING UPPER-CASE STRINGS (which fit the PN_LOCAL lexical pattern which SPARQL and Turtle use).

Sure, BORING_UPPER_CASE_STRING works in 80% of all cases. Nevertheless, *every* user who does any real work with the DM will be confronted with situations where prefixed names don't work, so they will:

- have to understand the relative URI approach anyway
- be confused about why sometimes the one and sometimes the other is used
- be confronted with unexpected errors when they try to use prefixed names but it doesn't work because there's some weird character in a column name
- have to learn which characters are allowed in local names, so that they know whether to use prefixed name or relative URI when writing their queries

And this is *in addition* to dealing with percent-encoding, which is confusing enough!

I repeat this for you: When working with any non-toy database, you'll have to use the relative URI approach *anyway* in at least a few instances, so users have to learn that approach *anyway*, and have to learn what the heck the difference is, and when to use which.

The relative URI approach works *always*, is *more terse*, and removes an entire layer of complexity.
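
As a rough illustration of the "which characters are allowed" problem (a sketch only, not taken from any spec text or implementation discussed in this thread): the Python below checks a table or column name against an ASCII-only approximation of SPARQL's PN_LOCAL pattern, and builds the relative-IRI form, which is always available. The function names, the ASCII-only pattern, and the choice to percent-encode everything outside the unreserved set are assumptions made for the example.

```python
import re
from urllib.parse import quote

# ASCII-only approximation of SPARQL's PN_LOCAL production. Assumption: the
# real grammar also admits many non-ASCII characters, which are ignored here.
# First char: letter, digit or underscore; later chars may also be '-' or '.',
# but the name must not end in '.'.
_PN_LOCAL_ASCII = re.compile(r"[A-Za-z0-9_](?:[A-Za-z0-9_.\-]*[A-Za-z0-9_\-])?")


def fits_prefixed_name(name: str) -> bool:
    """True if `name` could be used as the local part of a prefixed name."""
    return _PN_LOCAL_ASCII.fullmatch(name) is not None


def relative_iri(table: str, column: str) -> str:
    """Build the <Table#column> form, percent-encoding anything that is not
    an unreserved character (hypothetical encoding choice for this sketch)."""
    return f"<{quote(table, safe='')}#{quote(column, safe='')}>"


if __name__ == "__main__":
    for column in ["FNAME", "ZIP CODE", "addr,ID"]:
        print(column, fits_prefixed_name(column), relative_iri("People", column))
```

The prefixed-name check fails for some inputs; the relative-IRI form never does, which is the asymmetry described above.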

>> 6. All RDF syntaxes that support prefixes, also support relative IRIs. Using the table name as a prefix is just as long as using a relative URI: People:addr vs. <People#addr>.
> 
> Perhaps it's a fault of the educators,

Perhaps.

> but I've seen a surprisingly small number of SPARQL queries using base (like < 3%).

Well, the DM is quite different from your average RDF graph, so it's not surprising that queries against the DM will look different.

BASE isn't even necessary. The processor can specify a default base.

>> 7. *All* URIs that can possibly occur in a DM graph can be nicely abbreviated with a single base URI.
>> 
>> My conclusion is that using prefixes with the DM is impossible to implement in any way that works and makes sense. Implementations that are interested in producing readable RDF should just use relative IRIs.
>> 
>> Therefore, I think you have not presented any valid arguments against using property IRIs such as these:
>> 
>>  <People,Addresses#addr,ID>
>>  <People#addr,Addresses,ID>
>>  <ref/People#addr>
>> 
>> Personally, I like the last option.
> 
> So worse than each table having its own namespace, each foreign key has again a novel namespace.

Eh. My point is that *none* of them has a namespace declaration. You know, there is no law that states you can't use <IRIs> as property names in SPARQL.

> 1 and 3 ensure that no foreign key will be in the same namespace as the other properties of the table. 2 renders many common queries like
>    PREFIX ppl: <People#>
>    PREFIX adr: <Addresses#>
>    SELECT ?city WHERE { ?who ppl:fname "Bob" ;
>                              ppl:addr ?addr .
>                        ?addr adr:city ?city }
> harder to write.

How is the above harder to write than this?

   SELECT ?city WHERE { ?who <People#fname> "Bob" ;
                             <People#addr> ?addr .
                       ?addr <Addresses#city> ?city }
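
For readers unsure how the relative forms in that query turn into full IRIs, here is a minimal Python sketch using the standard library's `urljoin`. The base `http://example.com/db/` and the helper name `resolve` are hypothetical; a real processor would supply its own default base.

```python
from urllib.parse import urljoin

# Hypothetical default base, standing in for whatever the processor configures.
BASE = "http://example.com/db/"


def resolve(relative_iri: str, base: str = BASE) -> str:
    """Resolve a relative IRI such as <People#fname> against the base."""
    return urljoin(base, relative_iri.strip("<>"))


if __name__ == "__main__":
    for term in ["<People#fname>", "<People#addr>", "<Addresses#city>"]:
        print(term, "->", resolve(term))
```

Every term resolves against the same single base, so no per-table declaration is needed.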

> What is the justification for complicating these common cases?

I think of it as simplifying them. One less layer of complexity. More predictability. Less to learn.

> If it's just 'cause there's an exception in the spec, I don't see the trade-off justified at all. If it's to save an arc traversal { ?who ppl:addr ?addr . ?addr adr:ID ?id }, most graph patterns won't even touch db-encoding artifacts like the ID. If it's to save a SQL join, the relational schema already provides the DM processor with sufficient info to not do the join (i.e. FOREIGN KEY (addr) REFERENCES Addresses (ID) ).

None of the above.

It's to make the DM more predictable, easier to use, easier to teach and easier to read for real-world applications.

Your entire argument hinges on the use of namespace prefixes, and since I believe that use of namespace prefixes with the DM is a bad idea, I simply don't find your argument compelling at all.

You're optimizing for the “hello, world” case at the expense of real-world usability. You're pretending that funky characters in identifiers are a rare corner case that doesn't really happen and that you don't need to worry about. I'm sorry but that doesn't work. Believe me, I've tried that approach in D2RQ and it doesn't work. Our second-most frequent class of bugs over the years has been the result of me assuming, “oh no one would ever be so stupid as to put *that* character into a column name, right?”

Best,
Richard
Received on Friday, 19 August 2011 12:29:07 UTC