Re: Target audience of the Direct Mapping document? from Alexandre Bertails on 2011-07-10 (public-rdb2rdf-wg@w3.org from July 2011)

From: Alexandre Bertails <bertails@w3.org>
Date: Sun, 10 Jul 2011 15:18:26 -0400
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Eric Prud'hommeaux <eric@w3.org>, W3C RDB2RDF <public-rdb2rdf-wg@w3.org>
Message-ID: <1310325506.14568.114.camel@simplet>
Richard,

> 3.4 is defined in terms of, quote, “higher-order functions parameterized by a function φ”. This goes a bit beyond simple “functions and datatypes”, and I don't understand why it's necessary to specify the direct mapping.
> 
> > * syntax for the _dependent types_: this may be the tricky part and I
> > once proposed to Eric that we erase the dependent part, keep the raw type and
> > put the extra information in the English text after each definition. We
> > agreed that we should wait for the WG to read and make a decision
> > instead.
> 
> I'm not sure what exactly dependent types are or what the difference between the two options would be. Could you give an example or two for both options?

First of all, the Wikipedia definition (which is good):
[[
a dependent type is a type that depends on a value.
]]

For example, if you remove the dependent part in "dereference", you
obtain the following:
[[
dereference: Row → ForeignKey → Row
]]

What I proposed once to Eric was to do that, and to specify the
constraints in plain English. We've just decided to wait for someone to
review that stuff.
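
To make the two options concrete, here is a minimal Python sketch of the
"raw type plus English constraint" option. It is not taken from the
Direct Mapping document: the Row and ForeignKey classes, the extra
database argument, and the exact constraint being checked (that the
foreign key belongs to the row's table) are assumptions made for the
example. The raw type says nothing about that constraint, so it becomes
a run-time check next to a comment.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


# Hypothetical, simplified model; not the definitions from the spec.
@dataclass(frozen=True)
class ForeignKey:
    source_table: str
    columns: Tuple[str, ...]
    target_table: str
    target_columns: Tuple[str, ...]


@dataclass(frozen=True)
class Row:
    table: str
    values: Tuple[Tuple[str, object], ...]  # (column name, value) pairs

    def get(self, column: str) -> object:
        return dict(self.values)[column]


def dereference(row: Row, fk: ForeignKey, database: Dict[str, List[Row]]) -> Row:
    """Raw type: Row -> ForeignKey -> Row.

    Assumed constraint, stated in English rather than in the type:
    fk must be a foreign key of the table that row belongs to.
    """
    if fk.source_table != row.table:
        raise ValueError("foreign key does not belong to the row's table")
    key = tuple(row.get(c) for c in fk.columns)
    for candidate in database[fk.target_table]:
        if tuple(candidate.get(c) for c in fk.target_columns) == key:
            return candidate
    raise LookupError("no referenced row found")
```

With a dependent type, the ValueError branch would be ruled out by the
type itself instead of being checked when the function runs.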

> 
> > * syntax for _functions_: any mathematician or programmer can read
> > functions.
> 
> Well, I'm not a mathematician but I'm a programmer, and in all my years I cannot remember having come across a notation that looked like "⟦ ⟧φtable" to define a function.

It means that you don't know what defining a semantics for a language
really means. There is no way you would have missed the ⟦ ⟧ notation
otherwise. It's even on the Wikipedia page for denotational semantics,
and I don't know of any formal semantics I've been given to read that
doesn't use it.

Also, I've never said it was a good idea to read Section 3 without the
maths version. This is probably a bad idea anyway.

>  A function, both in mathematics and in programming, usually looks like "functionname(argument,argument)".

Yes, usually. But sometimes, in some contexts, we change this convention
for specific reasons. That's the case here, and I'm not the one to blame.

> 
> > We've followed several pieces of advice to use the set notation (à la
> > Python and other languages with list comprehensions) to generate values, while
> > the first versions were using a monadic notation. As it's all about
> > generating values, I refused to use any iterator-based approach to define
> > this part.
> 
> What was the reason against using plain English?

I've never said that. The iterator approach is usually used by people
using a pseudo-language while defining an algorithm, usually with a
for-loop.
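
As a rough illustration of the difference, here is a small Python sketch
that builds the same set twice, once in set notation and once with an
iterator. The row layout and the shape of the generated tuples are
invented for the example and are not the Direct Mapping's triples.

```python
# Hypothetical toy data: each row is a dict from column name to value.
rows = [
    {"ID": 7, "fname": "Bob"},
    {"ID": 8, "fname": "Sue"},
]


def triples_comprehension(rows):
    # Set notation: the result is described as one value.
    return {(row["ID"], col, val) for row in rows for col, val in row.items()}


def triples_loop(rows):
    # Iterator style: the same set, built step by step by mutation.
    result = set()
    for row in rows:
        for col, val in row.items():
            result.add((row["ID"], col, val))
    return result
```

Both functions return the same set; only the first reads as a
definition of a value rather than as a procedure.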

> 
> >> What is a "common SQL datatype"?
> > 
> > I'm not sure where you found this text.
> 
> in rule [9].

It's anything you want, as long as you are able to map that correctly. I
totally agree, this is not perfect, but we've not yet decided what to do
there.

> 
> > Currently, the "common SQL
> > datatype"s are defined in [1]:
> > [[
> > Datatype    ::=   Int  |  Float  |  Date  |  …
> > ]]
> 
> To be precise, this defines the term "Datatype" in the formal notation. In the equivalent English description, it defines "Datatype" as "a common SQL datatype". I do not know what constitutes a "common SQL datatype", and your document doesn't tell me.

Ah, see above.

> 
> > You started an interesting thread on the subject. I don't think that
> > being exhaustive is achievable for this question (because of all the
> > different implementations). I actually don't think that we want that
> > either, but I'll be happy to update the related stuff once the WG has
> > decided what to do there.
> 
> The goal is to enable interoperable implementations. The handwaving you do at the moment in [9] is not good enough for that.

I'm just saying that interoperability will be hard to achieve on this
matter, if it is possible at all. That's why I'm a bit reluctant to go beyond that.
But I've already said I'll be happy to change that with whatever the WG
decides.

> 
> >> What is a "lexical value"?
> > 
> > Where do you find this text? I don't understand the exact context.
> 
> Rule [6]. It is used to define CellValue and has no definition of its own.

Ahhhh, my bad. I rely too much on xmlspec.xsl to show me the errors...
Will be fixed.

> 
> (You do know how to search for some text in a web page?)
> 
> >> What is a "candidate key"?
> > 
> > Formally defined at [2]:
> > [[
> > CandidateKey    ::=   List(ColumnName)
> > ]]
> > 
> > The corresponding English text is:
> > [[
> > A candidate key is made of a list of columns (their order matters).
> > ]]
> 
> But that is not the definition of a candidate key! Not any old list of column names is a candidate key!
> 
> Quoting Wikipedia: “In the relational model of databases, a candidate key of a relation is a minimal superkey for that relation; that is, …”

We couldn't find two consistent definitions of relational database on
the Web. I believe it was part of the job of this WG to define it
clearly. Please check with Eric about the choices we made.

> 
> >>> and you clearly don't need a PhD to understand them.
> >> 
> >> Alexandre, I didn't ask about PhDs.
> >> 
> >> I asked about first-year students
> > 
> > It's for anybody who can read English. I believe it's enough to
> > understand the whole section, without reading the maths at all.
> 
> That is such blatant nonsense that it's close to being insulting.
> 
> Hide the maths and read Section 3.4. It doesn't make any sense at all. It's just a string of phrases that don't connect.

We can do better, for sure. But the goal remains the same.

> 
> > The "maths" just makes it easier to proofread and is for people who
> > understand what a function is. I believe this is the case for most
> > "first-year students".
> 
> See my comments above regarding notation.
> 
> > What I wanted to say is that the Direct Mapping was really designed
> > and written with simplicity in mind. But it still has to be:
> > 
> > robust,
> 
> I don't know what that means in this context.

That it models real-life relational databases.

> 
> > correct,
> 
> It is not correct at the moment. For example, it's unable to produce RDF graphs conforming to the RDF specification.

It depends on what you call correct. I'll come back to that below.

> 
> > exhaustive,
> 
> I don't know what that means in this context.

That the Direct Mapping is defined for any RDB instance.

> 
> > understandable,
> 
> A plain English expression would do a much better job at that.
> 
> > practical,
> 
> Ditto.
> 
> > usable.
> 
> Ditto.

Be my guest.

> 
> > The math and the English text are made equivalent *on purpose*
> > everywhere. I'd rather keep it this way for consistency.
> 
> Read the bloody thing! They are not equivalent at all!
> 
> Are these the same? No -- the formal version doesn't even mean *anything*.
> 
>   ue: String → String
>   An URL encoding per WSDL urlEncoded.
> 
> Are these the same? No -- one requires an "SQL string" (what's that?), the other doesn't.
> 
>   lexicalForm ::= a Unicode String
>   SQL string representing a value.
> 
> Most of the English rules in 3.4 do not say the same as their math counterparts.

Do you want us to define the whole function? This can of course be done,
and I'm not against that.

> 
> > I'd prefer that we had a discussion about showing the English version by
> > default or not.
> > 
> > Both versions are intended to be normative, with the same level of
> > importance.
> 
> > 
> > Eric and I still disagree on what to display by default and we hope that
> > the WG will take the action to decide what to do there.
> 
> What is the advantage of having two normative versions?

They are the same thing. I don't think that the explanatory text can be
considered as another version, even if it says the same thing.

> 
> It makes the document harder to review, and introduces plenty of opportunities for inconsistencies to creep into the normative part.
> 
> I would prefer to make the English version complete, and remove the maths version.

If I have to choose (and forgetting what I wrote above), I'd keep the
maths version.

> 
> >> 5. A proper reference for "IWD 9075-14:2011(E)", Google can't find it
> > 
> > Eric told me this wasn't that easy. Eric, the ball is in your court :-)
> 
> Maybe a reference to ISO/IEC 9075-14:2008 will do? We already reference parts 1 and 2 of SQL 2008 in R2RML.
> 
> >> 6. An account of how row IRIs and row blank nodes are created (maybe I'm just stupid but I can't find it anywhere in Section 3)
> > 
> > This is not said on purpose, but I agree that we should say why.
> > 
> > Depending on who you are, you can do different things with this
> > function. For example, 1. a database vendor has access to more
> > information (eg. the row-id) and can do a more effective mapping. 2.
> > Some SQL implementations provide functions to do the same. 3. Some SQL
> > implementations don't give access to this information.
> 
> It needs to be defined, because otherwise how would we ever get two interoperable implementations that generate the same graph for the same database? How would you write even a single test case?

I see your point and agree with you.

Eric has a very simple answer to that: don't use anything that is not
given to the user and reject database-specific functions.

> 
> I think the way this is handled in Section 2 is perfectly appropriate.
> 
> See also the RESOLUTION on row identifiers here:
> http://www.w3.org/2011/03/01-rdb2rdf-minutes.html
> 
> My understanding is that working group resolutions are binding for the editors.
> 
> (There are more open issues raised against the direct mapping that need to be resolved before last call.)

+1

> 
> > It doesn't change the spirit of the Direct Mapping.
> 
> Appeals to some “spirit” are misplaced. Implementations either conform or they don't; they either are compatible or not; there is no compatibility “in spirit”. The document has to define the criteria for a conforming implementation.

Gotcha.

> 
> >> 7. An account of what the syntax "let abc = xyz in" means. I can't figure it out. Is this Scala syntax? If so, can you please add a normative reference to Scala and state in the Introduction that knowledge of Scala syntax is required?
> > 
> > I hope I have removed any mention of Scala in the document.
> > 
> > But I don't understand what you want me to do here. Do you think that
> > not everybody will be able to read this construct? What if I write "abc
> > = xyz" or "abc := xyz" or "abc ← xyz" instead? I believe we can take
> > "variable binding" for granted at this point, and "let ... = ... in" has
> > enough meaning in English and mathematics to go with it.
> 
> I find it weird because I see it as a mix of functional and imperative notation.
> 
> For example, rule [44]. In pure imperative style, i'd expect something like:
> 
>     [[r, fk]]ref is defined as:
>         let p = ⟦table(r), fk⟧col
>         let targetRow = dereference(r, fk)
>         let o = φ(targetRow)
>         return (p, o)
> 
> In pure functional style, I'd expect something like:
> 
>     [[r, fk]]ref = (p, o), where:
>         p = ⟦table(r), fk⟧col,
>         o = φ(targetRow)
>         targetRow = dereference(r, fk)

There is no such thing as "pure functional style" (you can write pure
functions in C). And making the order of evaluation explicit or not
doesn't matter.

We don't have to write in Haskell to claim to be functional.
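
A small Python sketch may help show why the two presentations of rule
[44] denote the same thing. It assumes toy stand-ins for ⟦ ⟧col,
dereference and φ (the helper names, the dict-based rows and the IRI
format are all hypothetical). The first function names each
intermediate value, as "let ... in" does; the second is a single
expression. With pure helpers, both return the same pair.

```python
# Hypothetical helpers, for illustration only: a row is a dict with a
# "table" entry, and a foreign key is a tuple of column names.
def col_iri(table, fk):
    return table + "#" + ",".join(fk)


def make_dereference(target_rows):
    # target_rows maps a tuple of key values to the referenced row.
    def dereference(row, fk):
        key = tuple(row[c] for c in fk)
        return target_rows[key]
    return dereference


def ref_with_let(row, fk, phi, dereference):
    # let p = ... in let targetRow = ... in let o = ... in (p, o)
    p = col_iri(row["table"], fk)
    target_row = dereference(row, fk)
    o = phi(target_row)
    return (p, o)


def ref_as_expression(row, fk, phi, dereference):
    # The same definition with the bindings inlined.
    return (col_iri(row["table"], fk), phi(dereference(row, fk)))
```

Note that phi is passed in as an argument, which is all that
"parameterized by a function φ" amounts to in this sketch.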

> 
> >> 8. Explain to me why there's a "table(r)" in "⟦table(r), c⟧col" and "⟦table(r), fk⟧col". Shouldn't the static type checking catch errors of this kind?
> > 
> > I guess you wanted to point at c and fk, as they appear to be of
> > different types.
> 
> No, that's ok with me. I wanted to know why it's "table(r)" and not "r".

I wish I could export the definition from Scala... Bad refactoring
during simplification.

> 
> >> 9. State that Datatype in [9] includes String (you explicitly check for String later)
> > 
> > This is not true in the current version [4]. I'll be happy to add String
> > once the WG has made a general decision about the SQL Datatype
> > question.
> 
> Rule [46] checks for d = String, so it's weird that String isn't explicitly listed in [9].

And it should.

> 
> >> 10. Do something about the fact that String in [9] and String in [10] are something different.
> > 
> > TableName and ColumnName are sub-types of String, but are not
> > compatible.
> 
> That's not what I meant. TableName is a subtype of String. Datatype is *not* a subtype or supertype of String. It's supposed to be an enumeration type that includes "String" as one of its possible values, as far as I can tell. This is not clear from the formal version of rule [9].

Oh I see. They are not of the same kind so this can be misleading. What
about using INT, STRING, etc. for SQL datatypes? Or another convention?
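
To show what "not of the same kind" could look like in code, here is a
hedged Python sketch. TableName and ColumnName are modelled as distinct
subtypes of str, while Datatype is an enumeration whose members use the
upper-case convention suggested above. The member list is an assumption:
it holds only the datatypes named in this thread plus STRING, and the
real list in [9] is open-ended.

```python
from enum import Enum
from typing import NewType

# Names are strings underneath, but a type checker keeps them apart.
TableName = NewType("TableName", str)
ColumnName = NewType("ColumnName", str)


class Datatype(Enum):
    # Assumed members only; rule [9] ends with an ellipsis.
    INT = "Int"
    FLOAT = "Float"
    DATE = "Date"
    STRING = "String"


def is_string_datatype(d: Datatype) -> bool:
    # The kind of check rule [46] is said to perform (d = String).
    return d is Datatype.STRING
```

Here Datatype.STRING is a value of an enumeration and is not itself a
string, whereas a TableName is one.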

> 
> >> 13. Clarify whether tables includes views or not
> > 
> > It's an implementation detail
> 
> Nonsense. It is part of the interface, not the implementation. There won't be interoperable implementations if this is not specified, and users of the direct mapping will not know whether they can use views or not. How would you write a test case for a database that contains a view definition?
> 
> > and we don't want to go further than the
> > ADT. For example, there is notion of parser there.
> 
> I don't know what this means.

That's a good question: can RDB capture a view instead of a physical
table? I believe this is the case. That's an implementation detail in
the sense that you obtain an RDB instance in both cases. But I could be
wrong.

> 
> >> 14. Clarify whether a Database is the set of tables in a Schema or the set of tables in a Catalog (or some other set of tables)
> > 
> > Not sure what you mean here.
> 
> Do you know the difference between a Schema and a Catalog? If not, google it. Again, it's a question that any implementer needs an answer to, and many users will want an answer to, and I would prefer if they don't have to flip a coin.

It just tells you that a Database is a bunch of Tables, Table being
itself defined, and so on. How you provide these tables is up to you.

> 
> >> 17. Explain the significance of difference between the "foo(bar)" notation that is sometimes used, and the "[[bar]]foo" notation that is used at other times
> > 
> > To be honest, I'm not sure I'm able to explain this convention used in
> > all the semantics definition I've been given to read so far. It's just
> > like asking why people write "a + b" instead of "plus(a,b)".
> 
> I can answer that one. "a+b" is a notation that kids learn in school, somewhere around first or second grade, and that can be used without explanation because it is so elementary. "plus(a,b)" is less familiar, more verbose, and would actually require explanation. Therefore, obviously "a+b" is the way to go.
> 
> The notation you have chosen is concise, but it is unfamiliar and it requires explanation.

You don't get it. The choice of a notation is made in a particular
context. You don't really explain the reason why we chose to teach + as
an infix operator instead of a common prefix. And you don't really
explain why the former is easier to understand than the latter. It really
depends on your audience.

The thing is, people working on defining semantics for programming
languages use the square-bracket notation.

> 
> > Do you have
> > something better to propose? Like "databaseSemantics(bar, foo) = ..."?
> > That would be very ugly :-)
> 
> We're writing a specification, not an essay. Beauty is optional. Clarity is required.
> 
> > and so unnatural to so many people!
> 
> What is your basis for the claim that
> 
>     ⟦db⟧φdatabase = ...
> 
> is more natural than
> 
>     directMapping(database) = ...
> 
> ?

It's very natural for people who know a bit of formal semantics. I have
*already* proposed to Eric that we change that, in order to reach a broader
audience. We've just decided to wait for someone to bring that up. Don't
blame us for not having asked people to review that stuff.

I'm personally *not against* writing "directMapping". This doesn't
change anything in the DM.

> 
> >> 18. Change the column IRIs so that it produces valid RDF IRIs, rather than relative IRIs
> > 
> > This would need us to change the definition of the Direct Mapping itself
> > so that it depends on a "stem URI"
> 
> base IRI, not stem URI.
> 
> > that is passed around everywhere, for no added value.
> 
> Oh come on.
> 
> If you had just written the damn thing in plain English, then it would be trivial to say: 
> 
>    The input to the direct mapping is an input database and a base IRI.
>    ...
>    The column IRI of a column is the concatenation of the base IRI and
>    the percent-encoded column name of the column.
> 
> But you've decided to use some overly restrictive formalism that requires awkward contortions to do something as trivial as this. That is a problem with the *formalism*, and no excuse for writing a mapping that claims to map RDB to RDF, but in reality maps RDB to something that isn't RDF.
> 
> I strongly disagree with the notion that fixing internal inconsistencies (the DM claims to produce absolute IRIs, but actually doesn't; it claims to produce conforming RDF graphs, but actually doesn't) has "no added value".

That is indeed an inconsistency with the RDF spec. The stem URI remains
orthogonal to the definitions themselves.
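
For reference, the two sentences Richard proposes can be sketched in a
few lines of Python. This is only an illustration: the function name and
the example base IRI are invented, urllib.parse.quote stands in for
whatever encoding the document finally specifies (it is not claimed to
match WSDL urlEncoded), and the real column IRIs may contain more than
the column name.

```python
from urllib.parse import quote


def column_iri(base_iri: str, column_name: str) -> str:
    # Concatenate the base IRI and the percent-encoded column name.
    # safe="" makes quote() encode "/" as well.
    return base_iri + quote(column_name, safe="")
```

The base IRI is simply one more argument, so in this sketch it is
passed once at the top and reused wherever an IRI is built.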

> 
> > I wonder if there is a discussion in the RDF community to add relative
> > URIs. At least, we have a perfect use-case here with the Direct Mapping
> > to consider this option, and I suggest that the RDB2RDF WG speak with the
> > RDF WG on this subject.
> 
> Given that the RDF spec has been around for seven years and has tens of thousands of man-hours invested in implementations, and the WG has a narrowly defined charter, such a proposal is not very likely to go anywhere.
> 
> Making this change in the Direct Mapping, on the other hand, could be as easy as adding the two sentences above in the right place, and scrapping (or fixing) the maths…

I will let Eric argue that point.

> 
> >> I'll stop here and withdraw my earlier assertion that Section 3 may be ready for Last Call.
> > 
> > I was so desperate to have someone else to look at that stuff that I
> > used the "ready for Last Call" trick.
> 
> :-)
> 
> Well, it worked.
> 
> > Richard, I *highly* appreciate
> > your review and the time you spent on it. Be sure that I will fix that
> > stuff when I'm back from vacation (unless Eric does it before me).
> 
> Thanks, that's appreciated.
> 
> > <rant>
> 
> <counter-rant>
> 
> > Anyway, I'm really disappointed that it took so many months to have
> > another pair of eyes reading this document (more than a year actually,
> > if you consider Eric's first proposal).
> 
> I can't speak for others, but what stopped me from looking at it is the fact that the document has four editors who don't seem to speak with each other and seem to be more interested in stylistic exercises than in clear and simple communication of a technical artefact.
> 
> Quoting myself, from November last year:
> 
> “And that's the last thing I intend to say about the direct mapping  
> thingy until the three editors have managed to present the WG with a  
> single version of the document endorsed by all of them.”
> http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2010Nov/0052.html

I don't think that the single version is endorsed by anybody. We don't
even know if they are compatible. And they both claim to be the
normative version.

> 
> It has been eight months or so since then, and to the best of my knowledge the editors have done NOTHING towards resolving the main problem of their document: that it has *four* descriptions of the same thing. That's *still* three too many.

I don't know how you count 4 :-)

But I can't let you say that the "editors have done NOTHING towards
resolving the main problem of their document [...]".

In http://www.w3.org/mid/1289680475.9296.1.camel@simplet I proposed a
way to unify both versions. In the same thread, I also raised several
technical and theoretical issues, some of which I believe are not yet
answered. The WG has never asked Marcelo and Juan to go through an
example, just to see how this stuff works. I still need to have a clear
answer to http://www.w3.org/mid/1295382323.21454.27.camel@simplet .

And by searching a bit more, you can find other threads with other
unanswered questions. Usually, the only answer I got was to say more or
less implicitly that I don't understand Datalog or FOL... For some
people I'm too academic, but for others I'm not...

> 
> You can expect to get some reviews when you have made up your mind which version is the normative one.
> 
> > I'll be even more disappointed
> > if the WG decides to change fundamental things like the definition of
> > RDB (I'm still amazed how it was possible for a group called RDB2RDF to
> > not define RDB once and for all, before doing anything else),
> 
> We established early on that we use the 2008 version of ISO/IEC 9075 as the starting point for both the direct mapping and for R2RML.
> 
> > the
> > dependent types notation or even the syntax, as this has been in place
> > for a long time, with repeated calls to read this stuff in order to move
> > forward.
> 
> My position has been, since day one, that a plain English version is to be preferred over any formalism, and I have loudly expressed this more than once over the lifetime of this group.
> 
> > I just hope that Richard won't be the only one, and that others will get
> > their hands (and eyes) dirty as well.
> 
> And I hope that you, as well as the three other editors, stop fiddling around with your own little section and take responsibility for the document as a whole.

I'm ready for that. Just show me the rules in action, from RDB to RDF. I
propose to begin with
http://www.w3.org/2001/sw/rdb2rdf/directMapping/#lead-ex .

Alexandre.

> 
> Best,
> Richard
Received on Sunday, 10 July 2011 19:18:35 UTC