Re: Target audience of the Direct Mapping document? from Richard Cyganiak on 2011-07-11 (public-rdb2rdf-wg@w3.org from July 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Mon, 11 Jul 2011 12:46:40 +0100
To: Alexandre Bertails <bertails@w3.org>
Cc: Eric Prud'hommeaux <eric@w3.org>, W3C RDB2RDF <public-rdb2rdf-wg@w3.org>
Message-Id: <AE417C10-E995-4E09-BDAE-1F1F87967B84@cyganiak.de>

Alexandre,

On 10 Jul 2011, at 20:18, Alexandre Bertails wrote:
> For example, if you remove the dependent part in "dereference", you
> obtain the following:
> [[
> dereference: Row → ForeignKey → Row
> ]]
> 
> What I proposed once to Eric was to do that, and to specify the
> constraints in plain English.

So you mean, the English would say something like: “The row and foreign key arguments must be from the same table”?

I'd like that.
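
To illustrate, here is a minimal sketch in Python of what that could look like. The `Row` and `ForeignKey` classes and the `database` dictionary are illustrative assumptions, not definitions taken from the Direct Mapping document; the point is only that the signature stays simple and the plain-English constraint becomes one explicit check:

```python
from dataclasses import dataclass


@dataclass
class ForeignKey:
    table: str             # table the foreign key is declared on
    columns: tuple         # referencing columns in that table
    target_table: str      # referenced table
    target_columns: tuple  # referenced columns, same order as `columns`


@dataclass
class Row:
    table: str    # name of the table the row belongs to
    values: dict  # column name -> value (None stands for NULL)


def dereference(row, fk, database):
    """Return the row that `fk` points to from `row`, or None if there is none.

    `database` is a hypothetical dict: table name -> list of Row objects.
    """
    # The constraint stated in plain English, enforced at run time:
    # the row and the foreign key must be from the same table.
    if row.table != fk.table:
        raise ValueError("row and foreign key must be from the same table")
    key = tuple(row.values[c] for c in fk.columns)
    if any(v is None for v in key):
        return None  # a NULL in the key references nothing
    for candidate in database.get(fk.target_table, []):
        if tuple(candidate.values[c] for c in fk.target_columns) == key:
            return candidate
    return None
```

Calling `dereference` with a row from one table and a foreign key declared on another raises an error, which is all the dependent type was expressing.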

>> Well, I'm not a mathematician but I'm a programmer, and in all my years I cannot remember having come across a notation that looked like "⟦ ⟧φtable" to define a function.
> 
> It means that you don't know what defining a semantics for a language
> really means.

That may well be true, but should it be required for being able to read the direct mapping?

>> A function, both in mathematics and in programming, usually looks like "functionname(argument,argument)".
> 
> Yes, usually. But sometimes, in some context, we change this convention
> for some reasons. That's the case here, and I'm not the one to blame.

Well, you use a notation that is appropriate when communicating to an audience of scientists interested in the formal semantics of programming languages. The notation is not appropriate when communicating to an audience of first-year CS students or domain experts without CS background.

(snip discussion of SQL datatypes -- you're right, that's a separate topic and the entire WG has some work to do there first)

> We couldn't find two consistent definitions of relational database on
> the Web.

This is why we use the standard: ISO/IEC 9075.

> I believe it was part of the job of this WG to define it clearly.

What part of the job do you not consider done?

>>> What I wanted to say is that the Direct Mapping was really designed
>>> and written with simplicity in mind. But it still has to be:
>>> 
>>> robust,
>> 
>> I don't know what that means in this context.
> 
> That it models real-life relational databases.

This is why it should use terminology from the SQL standard wherever possible.

I appreciate that it already uses SQL terminology and not relational algebra terminology like some early drafts.

>> Most of the English rules in 3.4 do not say the same as their math counterparts.
> 
> Do you want us to define the whole function? This can of course be done,
> and I'm not against that.

I'll give you an example. When you say:

[[
    ⟦r⟧φrow = let s = φ(r) in
    { (s, p, o) | (p, o) ∈ ⟦r, fk⟧φref | noNULLs(r, fk) | fk ∈ foreignKeys(table(r)) }
    ⋃ { (s, p, o) | (p, o) ∈ ⟦r, c⟧φlex | value(r, c) ≠ NULL | c ∈ lexicals(r) }
    ⋃ { (s, rdf:type, ue(tablename(table(r)))) }
]]

then what I want to see is:

[[
The ROW GRAPH for a row is an RDF graph containing the following triples:
* one FOREIGN KEY TRIPLE for each foreign key of the table, if the row's fields corresponding to the foreign key's columns are all non-null
* one DATA TRIPLE for each non-null field of the row, except for fields of a column that has a single-column foreign key defined over it
* a ROW TYPE TRIPLE for the row
]]

Where the all-caps terms expand again to definitions like:

[[
The ROW TYPE TRIPLE for a row is an RDF triple with the following components:
* subject: the ROW NODE for the row
* predicate: rdf:type
* object: the TABLE CLASS IRI for the row's table
]]

And so on. That's easy to read without having to learn or decipher any special notation, and it's precise (or can be made so). Admittedly it's more verbose by a factor of two or three.
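
As a rough check that the English definitions are precise enough to implement, here is a sketch that follows them one definition at a time. Everything about the identifiers is an assumption made for illustration: the `<Table/col=value>` row nodes, the `<Table#col>` predicates and the dictionary-based foreign key description are placeholders, not the IRI scheme of the Direct Mapping document. The sketch also assumes that a foreign key references the primary key of its target table, so the target's ROW NODE can be built without a lookup.

```python
RDF_TYPE = "rdf:type"


def row_node(table, row, primary_key):
    # ROW NODE: placeholder identifier built from the primary key values.
    key = ",".join(f"{c}={row[c]}" for c in primary_key)
    return f"<{table}/{key}>"


def table_class_iri(table):
    # TABLE CLASS IRI: placeholder, just the table name in angle brackets.
    return f"<{table}>"


def row_type_triple(table, row, primary_key):
    # ROW TYPE TRIPLE: (ROW NODE, rdf:type, TABLE CLASS IRI)
    return (row_node(table, row, primary_key), RDF_TYPE, table_class_iri(table))


def data_triples(table, row, primary_key, foreign_keys):
    # One DATA TRIPLE per non-null field, except fields of a column that
    # has a single-column foreign key defined over it.
    single_fk_columns = {
        fk["columns"][0] for fk in foreign_keys if len(fk["columns"]) == 1
    }
    subject = row_node(table, row, primary_key)
    return {
        (subject, f"<{table}#{column}>", value)
        for column, value in row.items()
        if value is not None and column not in single_fk_columns
    }


def foreign_key_triples(table, row, primary_key, foreign_keys):
    # One FOREIGN KEY TRIPLE per foreign key whose fields are all non-null.
    subject = row_node(table, row, primary_key)
    triples = set()
    for fk in foreign_keys:
        values = [row[c] for c in fk["columns"]]
        if any(v is None for v in values):
            continue
        # Assumption: the referenced columns are the target's primary key.
        target = dict(zip(fk["target_columns"], values))
        obj = row_node(fk["target_table"], target, fk["target_columns"])
        predicate = f"<{table}#ref-{'.'.join(fk['columns'])}>"
        triples.add((subject, predicate, obj))
    return triples


def row_graph(table, row, primary_key, foreign_keys):
    # ROW GRAPH: the union of the three kinds of triples.
    return (
        foreign_key_triples(table, row, primary_key, foreign_keys)
        | data_triples(table, row, primary_key, foreign_keys)
        | {row_type_triple(table, row, primary_key)}
    )
```

Each function carries the name of the definition it implements, so someone looking for how rows are typed only has to find `row_type_triple`.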

The English version has another big advantage: One can easily look up a specific part of the mapping. Want to know how the typing of row resources worked again? Just scan the text for a definition that sounds like it could be relevant... "Row type triple" sounds good ... ok, you have your answer. In the maths version, it is absolutely not obvious that one would have to decipher the "⟦r⟧φrow" function to find the answer.

Another advantage of the English version is that it can be searched using Ctrl+F.

>> What is the advantage of having two normative versions?
> 
> They are the same thing.

No, they are not. You *intend* them to be the same thing, but that doesn't make it so. As long as they don't say *exactly* the same thing, it's broken. Currently, both are incomplete because both are missing details.

>> For example, rule [44]. In pure imperative style, i'd expect something like:
>> 
>>    [[r, fk]]ref is defined as:
>>        let p = ⟦table(r), fk⟧col
>>        let targetRow = dereference(r, fk)
>>        let o = φ(targetRow)
>>        return (p, o)
>> 
>> In pure functional style, I'd expect something like:
>> 
>>    [[r, fk]]ref = (p, o), where:
>>        p = ⟦table(r), fk⟧col,
>>        o = φ(targetRow),
>>        targetRow = dereference(r, fk)
> 
> There is no such thing as "pure functional style" (you can write pure
> functions in C). And making the order of evaluation explicit or not
> doesn't matter.
> 
> We don't have to write in Haskell to pretend being functional.

Why are you talking about programming languages? We are talking about notation for communicating to a human reader.

The "let abc in xyz" notation is unlike anything I've ever seen, which is why I gave you two examples of notations that I consider more standard and familiar. The first one is plain old pseudocode. The second one is pretty much just high school mathematics (except for the angle bracket notation and Greek letters).

(But this is a bit of a tangent; if accompanied by a solid English version, then it's ok if the maths version is a bit more obscure.)

>> TableName is a subtype of String. Datatype is *not* a subtype or supertype of String. It's supposed to be an enumeration type that includes "String" as one of its possible values, as far as I can tell. This is not clear from the formal version of rule [9].
> 
> Oh I see. They are not of the same kind so this can be misleading. What
> about using INT, STRING, etc. for SQL datatypes? Or another convention?

I don't have a strong opinion except that they should be different from each other.
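
A small sketch of the distinction, using Python's type machinery as a stand-in for the spec's notation. The member names `INT` and `STRING` follow the suggestion quoted above; `DATE` and the rest of the layout are assumptions for illustration only:

```python
from enum import Enum
from typing import NewType

# TableName is a subtype of string: every table name is a string value.
TableName = NewType("TableName", str)


class Datatype(Enum):
    # Datatype is an enumeration; STRING is one of its members, not the
    # string type itself. The member list here is illustrative only.
    INT = "INT"
    STRING = "STRING"
    DATE = "DATE"


name = TableName("People")
print(isinstance(name, str))             # a table name is a string
print(isinstance(Datatype.STRING, str))  # a datatype is not
```

Spelling the enumeration members differently from the type names keeps the two kinds of thing visibly apart.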

> That's a good question: can RDB capture a view instead of a physical
> table? I believe this is the case. That's an implementation detail in
> the sense that you obtain an RDB instance in both cases. But I can be
> wrong.
...
> It just tells you that a Database is a bunch of Tables, Table being
> itself defined, and so on. How you provide these tables is up to you.

There would be no problem with treating views just like tables.

If you define the direct mapping as applying to any random collection of tables, then you also need to define how to deal with dangling foreign keys, where the other table is not included in the random collection.
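
To show what "dangling" means here, a minimal sketch that finds such foreign keys in an arbitrary collection of tables. The `schema` layout (table name mapped to a list of column/target pairs) and the example table names are assumptions made up for this illustration:

```python
def dangling_foreign_keys(schema):
    """Return foreign keys whose target table is missing from the collection.

    `schema` is a hypothetical dict: table name -> list of
    (referencing_columns, target_table) pairs.
    """
    return [
        (table, columns, target)
        for table, foreign_keys in schema.items()
        for columns, target in foreign_keys
        if target not in schema
    ]


# A collection that includes People but leaves out Addresses.
partial = {
    "People": [(("addr",), "Addresses"), (("dept",), "Department")],
    "Department": [],
}
print(dangling_foreign_keys(partial))
```

For every key this reports, the mapping would need a rule saying what the foreign key triple's object should be, or whether the triple is dropped.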

I don't think that's a good idea, and I would rather see the direct mapping defined as applying to an entire schema or catalog, as this also makes more sense from an implementation and user point of view.

>> I can answer that one. "a+b" is a notation that kids learn in school, somewhere around first or second grade, and that can be used without explanation because it is so elementary. "plus(a,b)" is less familiar, more verbose, and would actually require explanation. Therefore, obviously "a+b" is the way to go.
>> 
>> The notation you have chosen is concise, but it is unfamiliar and it requires explanation.
> 
> You don't get it. The choice for a notation is done in a particular
> context. You don't really explain the reason why we chose to teach + as
> an infix operator instead of a common prefix.

It doesn't matter why 1+1 is taught in school rather than plus(1,1). What matters is that when you write 1+1, you reach almost every person who is alive; when you write plus(1,1), you don't. What more do you need to know?

> And you don't really explain why the former is easier to understand than the latter. It really depends on your public.

Look at the title of this thread. The whole reason I started this thread is because I allege that all four editors of the direct mapping document are writing for the wrong audience.

> The thing is, people working on defining semantics for programming
> languages use the square-bracket notation.

I don't care if the angle bracket notation is preferred in your circle of academic peers in some subfield of a subfield. THEY ARE NOT THE TARGET AUDIENCE OF THIS SPECIFICATION! We are writing standards for the open web, which means we have a responsibility to communicate to as wide an audience as possible.

>> It has been eight months or so since then, and to the best of my knowledge the editors have done NOTHING towards resolving the main problem of their document: that it has *four* descriptions of the same thing. That's *still* three too many.
> 
> I don't know how you count 4 :-)

One in Section 2, two in Section 3, one in Section 4.

> But I can't let you say that the "editors have done NOTHING towards
> resolving the main problem of their document [...]".

Ok, fair enough, I overstated my case here and apologize for that.

> Usually, the only answers I got is to say more or
> less implicitly that I don't understand Datalog nor FOL...

Yeah, I know how that works. You don't understand Datalog and they don't understand Denotational Semantics, and at that point you all just stopped talking to each other :-(

Can we please settle on the lowest common denominator? Plain English?

Best,
Richard
Received on Monday, 11 July 2011 11:47:10 UTC