Re: Addressing ISSUE-64 and ISSUE-65 from Eric Prud'hommeaux on 2011-08-23 (public-rdb2rdf-wg@w3.org from August 2011)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Tue, 23 Aug 2011 11:55:43 -0400
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Juan Sequeda <juanfederico@gmail.com>, W3C RDB2RDF <public-rdb2rdf-wg@w3.org>
Message-ID: <20110823155542.GS24684@w3.org>
oops, didn't see this one.

* Richard Cyganiak <richard@cyganiak.de> [2011-08-19 13:28+0100]
> On 17 Aug 2011, at 23:48, Eric Prud'hommeaux wrote:
> >> You say that some proposals don't play well with namespace prefixes. You use this as an argument against these proposals. I think that's an invalid argument because namespaces are *already* entirely useless with the DM.
> >> 
> >> 1. Each table requires its own namespace, leading to an abundance of namespaces
> > 
> > In the use cases I've dealt with, this has been a feature rather than a bug. That is people:ID and addrs:ID are conveniently distinguished. Writing rules or queries is very intuitive with this partitioning:
> > 
> >    PREFIX ppl: <People#>
> >    PREFIX adr: <Addresses#>
> >    SELECT ?city WHERE { ?who ppl:fname "Bob" ;
> >                              ppl:addr ?addr .
> >                        ?addr adr:city ?city }
> 
> Most databases don't have neat and intuitive table names like that. They have "OBX_MODEL_PPL2" and "OBX_SHP_ADR_MAIN". Once you look beyond the MySQL webapp market and look at enterprisey stuff, many database schemas aren't even hand-designed, but look like they dropped out of some CASE tool or other monstrosity. Actually coming up with a neat intuitive three-letter abbreviation for each of these tables is *hard*. It is extra work. Most users won't bother, because they can get the job done without inventing prefixes, and for fear that their neat prefix doesn't quite capture the meaning of the table (which they probably didn't design themselves and only half-understand).

Could you give real-world examples which do break namespaces? In a fair amount of health care and life sciences modeling, I have seen everyone query databases with prefixes to make the query comprehensible. I do see namespace-poisoned graph nodes (with e.g. '%' or '$'), but very rarely on predicates.


> >> 2. The DM is not written by humans but by machines. The machine has to generate the namespace prefix. The only thing it can really do is either use the table name, or use the (unreadable) ns0, ns1, ns2 pattern.
> > 
> > The DM will be queried by humans. It will also be transformed to common ontologies by rules written by humans.
> 
> So you assume that the prefixes will be written by humans?
> 
> I don't believe that.
> 
> No one wants to enter several lines of boilerplate before they can run a query. Either the processor will pre-configure the prefixes (which again raises the problem of machine-generated prefixes), or users will just make do without prefixes.
> 
> Writing the prefixes manually, on the other hand, requires an understanding of the URI scheme used by the DM. Once one has acquired that understanding, one can just as well forget about prefixes and use the URIs straight.

Perhaps some day they won't be, but in every compliant SPARQL query I've seen, people use namespaces.


> >> 3. Generating prefixes automatically from the table name leads to all sorts of Fun with special characters. Basically, it is impossible because there are no escape mechanisms inside the *prefix*.
> > 
> > Most databases have only unary foreign keys.
> 
> I doubt that. Many databases don't have any foreign keys at all. And many non-toy databases *do have* foreign keys with multiple columns.

I guess we've seen different databases. Here's an example I see lots of people using:
  mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
  use go;
  use uniProt;


> > Most of these can be transformed to common vocabularies with rules which don't mention node identifiers, e.g.
> > 
> >    PREFIX ppl: <People#>
> >    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
> >    CONSTRUCT {
> >        ?who a foaf:Person ; foaf:givenName ?fname
> >    } WHERE { ?who ppl:fname ?fname }
> 
> Have you tried a non-toy example? How did that go?

Sure, clinical encounter data, genome browser, lots of queries that people can run today if we don't make it too painful.


> >> 4. The DM already produces many IRIs that cannot be abbreviated into prefixed names because they contain commas or equal signs in the local part.
> > 
> > Many queries and rules don't include specific node identifiers.
> 
> It is very common for queries to ask for a specific identifier.
> 
> > For databases with only unary foreign keys, these queries can be tersely and conveniently expressed with the current algorithm.
> 
> Bullshit. Using namespaces in DM queries makes them more verbose, not more terse. Rewriting any of your examples here with relative URIs makes the queries more compact.

I'm glad to see I'm working in an environment of professional respect.

For simple, one-database queries, either relative IRIs or namespaces work. In these cases, I could just as well use SQL. In queries which actually connect different data sources on the SemWeb, I'd need to extend SPARQL, N3, Turtle, TriG, etc. to handle scoped bases.


> >> 5. Even if that's not the case, special characters in table and column names will often prevent abbreviation.
> > 
> > But again, most column names are BORING UPPER-CASE STRINGS (which fit the PN_LOCAL lexical pattern which SPARQL and Turtle use).
> 
> Sure, BORING_UPPER_CASE_STRING works in 80% of all cases. Nevertheless, *every* user who does any real work with the DM will be confronted with situations where prefixed names don't work, so they will:
> 
> - have to understand the relative URI approach anyway
> - be confused about why sometimes the one and sometimes the other is used
> - be confronted with unexpected errors when they try to use prefixed names but it doesn't work because there's some weird character in a column name
> - have to learn which characters are allowed in local names, so that they know whether to use prefixed name or relative URI when writing their queries
> 
> And this is *in addition* to dealing with percent-encoding, which is confusing enough!
> 
> I repeat this for you: When working with any non-toy database, you'll have to use the relative URI approach *anyway* in at least a few instances, so users have to learn that approach *anyway*, and have to learn what the heck the difference is, and when to use which.
> 
> The relative URI approach works *always*, is *more terse*, and removes an entire layer of complexity.

Yes, they will sometimes have to write difficult queries, but why take away the way they've already learned to simplify their queries? Also, why require schema documenters to write multiple documents to describe their data, and put users through the cognitive dissonance of having some n+1 schemas for a table with n foreign keys? I don't see this as simplifying the user experience at all.
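
For anyone who wants to see which column names fall on which side of the PN_LOCAL line discussed above, here is a rough sketch in Python. The function name `can_abbreviate` and the regular expression are my own assumptions: the pattern only approximates the SPARQL PN_LOCAL production (Unicode letters, digits and '_' anywhere, '-' and '.' after the first character, no trailing '.'), so treat it as an illustration rather than a conformance check.

```python
import re

# Approximation of SPARQL's PN_LOCAL (assumption, not the exact grammar):
# first char is a letter, digit or '_'; later chars may also be '-' or '.';
# the last char must not be '.'. Python's \w is Unicode-aware for str.
_PN_LOCAL_APPROX = re.compile(r"\w(?:[\w.\-]*[\w\-])?")


def can_abbreviate(name: str) -> bool:
    """Return True if `name` could plausibly be the local part of a prefixed name."""
    return _PN_LOCAL_APPROX.fullmatch(name) is not None


if __name__ == "__main__":
    for column in ["BORING_UPPER_CASE_STRING", "Кириллица", "漢字",
                   "addr,ID", "PRICE$", "50%"]:
        verdict = "prefixed name ok" if can_abbreviate(column) else "needs <IRI> form"
        print(column, "->", verdict)
```

Names that fail the check are the ones a user would have to write as a full or relative `<IRI>` instead of `prefix:name`.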


> >> 6. All RDF syntaxes that support prefixes, also support relative IRIs. Using the table name as a prefix is just as long as using a relative URI: People:addr vs. <People#addr>.
> > 
> > Perhaps it's a fault of the educators,
> 
> Perhaps.
> 
> > but I've seen a surprisingly small number of SPARQL queries using base (like < 3%).
> 
> Well, the DM is quite different from your average RDF graph, so it's not surprising that queries against the DM will look different.

But the users are the same. They have a set of expectations about schemas that are the standard practice on the SemWeb.


> BASE isn't even necessary. The processor can specify a default base.
> 
> >> 7. *All* URIs that can possibly occur in a DM graph can be nicely abbreviated with a single base URI.
> >> 
> >> My conclusion is that using prefixes with the DM is impossible to implement in any way that works and makes sense. Implementations that are interested in producing readable RDF should just use relative IRIs.
> >> 
> >> Therefore, I think you have not presented any valid arguments against using property IRIs such as these:
> >> 
> >>  <People,Addresses#addr,ID>
> >>  <People#addr,Addresses,ID>
> >>  <ref/People#addr>
> >> 
> >> Personally, I like the last option.
> > 
> > So worse than each table having its own namespace, each foreign key has again a novel namespace.
> 
> Eh. My point is that *none* of them has a namespace declaration. You know, there is no law that states you can't use <IRIs> as property names in SPARQL.

You can, but it's a lot of noise and much more error prone, and operating on such queries is like eating spaghetti with chopsticks.


> > 1 and 3 ensure that no foreign key will be in the same namespace as the other properties of the table. 2 renders many common queries like
> >    PREFIX ppl: <People#>
> >    PREFIX adr: <Addresses#>
> >    SELECT ?city WHERE { ?who ppl:fname "Bob" ;
> >                              ppl:addr ?addr .
> >                        ?addr adr:city ?city }
> > harder to write.
> 
> How is the above harder to write than this?
> 
>    SELECT ?city WHERE { ?who <People#fname> "Bob" ;
>                              <People#addr> ?addr .
>                        ?addr <Addresses#city> ?city }
> 
> > What is the justification for complicating these common cases?
> 
> I think of it as simplifying them. One less layer of complexity. More predictability. Less to learn.
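
The two spellings of the query compared above denote the same graph pattern; one is a mechanical rewrite of the other. A naive sketch of that rewrite in Python follows. The helper `expand_prefixed_names` and the `PREFIXES` table are hypothetical names of mine, and the regex does not understand string literals, comments or existing `<IRI>`s, so it is only meant to show the correspondence, not to be a SPARQL parser.

```python
import re

# Prefix table taken from the example query in this thread.
PREFIXES = {"ppl": "People#", "adr": "Addresses#"}


def expand_prefixed_names(pattern_text: str, prefixes: dict) -> str:
    """Rewrite prefix:local tokens into <relative IRI> form (naive, regex-based)."""
    alternatives = "|".join(re.escape(p) for p in prefixes)
    token = re.compile(r"\b(" + alternatives + r"):(\w+)")
    # Replace each match with the prefix's IRI followed by the local name.
    return token.sub(lambda m: "<" + prefixes[m.group(1)] + m.group(2) + ">",
                     pattern_text)


if __name__ == "__main__":
    body = '?who ppl:fname "Bob" ; ppl:addr ?addr . ?addr adr:city ?city'
    print(expand_prefixed_names(body, PREFIXES))
```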

But you have to complicate not only the schema, but the DM spec itself. Compare the unary foreign key exception

"and where that column is NOT the sole column in any foreign key" http://www.w3.org/2001/sw/rdb2rdf/directMapping/EGP#defn-row%20graph

against having two types of property identifiers:

http://www.w3.org/2001/sw/rdb2rdf/directMapping/explicitFK#defn-literal%20property%20IRI
http://www.w3.org/2001/sw/rdb2rdf/directMapping/explicitFK#defn-reference%20property%20IRI

> > If it's just 'cause there's an exception in the spec, I don't see the trade-off justified at all. If it's to save an arc traversal { ?who ppl:addr ?addr . ?addr adr:ID ?id }, most graph patterns won't even touch db-encoding artifacts like the ID. If it's to save a SQL join, the relational schema already provides the DM processor with sufficient info to not do the join (i.e. FOREIGN KEY (addr) REFERENCES Addresses (ID) ).
> 
> None of the above.
> 
> It's to make the DM more predictable, easier to use, easier to teach and easier to read for real-world applications.

But what are the use cases which are going to trip up users? In particular, what leads the user to consider key values as properties of the referring row instead of the referred row?

> Your entire argument hinges on the use of namespace prefixes, and since I believe that use of namespace prefixes with the DM is a bad idea, I simply don't find your argument compelling at all.
> 
> You're optimizing for the “hello, world” case at the expense of real-world usability. You're pretending that funky characters in identifiers are a rare corner case that doesn't really happen and that you don't need to worry about. I'm sorry but that doesn't work. Believe me, I've tried that approach in D2RQ and it doesn't work. Our second-most frequent class of bugs over the years has been the result of me assuming, “oh no one would ever be so stupid to put *that* character into a column name, right?”

Sure, this happens, '$'s and '%'s are a total pain for me as well, but for most of the intentionally-designed relational schemas, even ones with Кириллица or 漢字, the current scheme is simple and consistent with user expectations. I see this proposal making the spec more complicated and the schemas inconsistent with SemWeb practices. I don't want to write some data format which needs to be operated with chopsticks when conventional silverware will do.


> Best,
> Richard

-- 
-ericP
Received on Tuesday, 23 August 2011 15:56:15 UTC