Re: Target audience of the Direct Mapping document?

Alexandre,

On 10 Jul 2011, at 01:33, Alexandre Bertails wrote:
> first of all, I want to make sure we both speak about the same stuff,
> that is the Section 3. I don't say *anything* about the rule approach.

That's understood. I talked about Section 3 only.

> About the formalism: it's only about functions and datatypes. You
> don't need to understand any formal logic or anything else.

Section 3.4 is defined in terms of, quote, “higher-order functions parameterized by a function φ”. This goes a bit beyond simple “functions and datatypes”, and I don't understand why it's necessary for specifying the direct mapping.

> * syntax for the _dependent types_: this may be the tricky part and I
> once proposed Eric to erase the dependent part, keep the raw type and
> put the extra information in the English text after each definition. We
> agreed that we should wait for the WG to read and make a decision
> instead.

I'm not sure what exactly dependent types are or what the difference between the two options would be. Could you give an example or two for both options?

> * syntax for _functions_: any mathematician or programmer can read
> functions.

Well, I'm not a mathematician but I'm a programmer, and in all my years I cannot remember having come across a notation that looked like "⟦ ⟧φtable" to define a function. A function, both in mathematics and in programming, usually looks like "functionname(argument,argument)".

> We've followed several advices to use the set-notation (ala
> Python and other languages with list-comprehension) to generates, while
> the first versions were using a monadic notation. As it's all about
> generating value, I refused to use any iterator-based approach to define
> this part.

What was the reason against using plain English?

>> What is a "common SQL datatype"?
> 
> I'm not sure where you found this text.

In rule [9].

> Currently, the "common SQL
> datatype"s are defined in [1]:
> [[
> Datatype    ::=   Int  |  Float  |  Date  |  …
> ]]

To be precise, this defines the term "Datatype" in the formal notation. In the equivalent English description, it defines "Datatype" as "a common SQL datatype". I do not know what constitutes a "common SQL datatype", and your document doesn't tell me.

> You started an interesting thread on the subject. I don't think that
> being exhaustive is achievable for this question (because of all the
> different implementations). I actually don't think that we want that
> either, but I'll be happy to update the related stuff when the WG will
> have decided what to do there.

The goal is to enable interoperable implementations. The handwaving you do at the moment in [9] is not good enough for that.

>> What is a "lexical value"?
> 
> Where do you find this text? I don't understand the exact context.

Rule [6]. It is used to define CellValue and has no definition of its own.

(You do know how to search for some text in a web page?)

>> What is a "candidate key"?
> 
> Formally defined at [2]:
> [[
> CandidateKey    ::=   List(ColumnName)
> ]]
> 
> The corresponding English text is:
> [[
> A candidate key is made of a list of columns (their order matters).
> ]]

But that is not the definition of a candidate key! Not any old list of column names is a candidate key!

Quoting Wikipedia: “In the relational model of databases, a candidate key of a relation is a minimal superkey for that relation; that is, …”
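
To make the difference concrete, here is a minimal Python sketch of the "minimal superkey" condition, checked against one concrete set of rows. The function names and the sample table are made up for illustration and are not taken from the Direct Mapping document. Strictly speaking, a candidate key is a constraint on the schema, so a check over sample rows can only approximate it.

```python
from itertools import combinations


def is_superkey(rows, columns):
    """True if no two rows agree on all of the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def is_candidate_key(rows, columns):
    """True if the columns form a superkey and no proper subset does."""
    if not is_superkey(rows, columns):
        return False
    for size in range(len(columns)):
        for subset in combinations(columns, size):
            if is_superkey(rows, subset):
                return False
    return True


# Hypothetical sample data.
people = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Oslo"},
    {"id": 3, "name": "Ann", "city": "Rome"},
]
```

With this data, `["id"]` passes the check, while `["id", "name"]` is a list of column names that is unique but not minimal, so it is not a candidate key.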

>>> and you clearly don't need a PhD to understand them.
>> 
>> Alexandre, I didn't ask about PhDs.
>> 
>> I asked about first-year students
> 
> It's for anybody who can read English. I believe it's enough to
> understand the whole section, without reading the maths at all.

That is such blatant nonsense that it's close to being insulting.

Hide the maths and read Section 3.4. It doesn't make any sense at all. It's just a string of phrases that don't connect.

> The "maths" just makes it's easier to proof-read and is for people who
> understands what a function is. I believe this is the case for most
> "first-year students".

See my comments above regarding notation.

> What I wanted to say is that the Direct Mapping was really designed
> and written with simplicity in mind. But it still has to be:
> 
> robust,

I don't know what that means in this context.

> correct,

It is not correct at the moment. For example, it's unable to produce RDF graphs conforming to the RDF specification.

> exhaustive,

I don't know what that means in this context.

> understandable,

A plain English expression would do a much better job at that.

> practical,

Ditto.

> usable.

Ditto.

> The math and the English text are made equivalent *on purpose*
> everywhere. I'd rather like to keep it this way for consistency.

Read the bloody thing! They are not equivalent at all!

Are these the same? No -- the formal version doesn't even mean *anything*.

  ue: String → String
  An URL encoding per WSDL urlEncoded.

Are these the same? No -- one requires an "SQL string" (what's that?), the other doesn't.

  lexicalForm ::= a Unicode String
  SQL string representing a value.

Most of the English rules in 3.4 do not say the same thing as their math counterparts.

> I'd prefer that we had a discussion about showing the English version by
> default or not.
> 
> Both version are intended to be normative, with the same level of
> importance.

> 
> Eric and I still disagree on what to display by default and we hope that
> the WG will take the action to decide what to do there.

What is the advantage of having two normative versions?

It makes the document harder to review, and introduces plenty of opportunities for inconsistencies to creep into the normative part.

I would prefer to make the English version complete, and remove the maths version.

>> 5. A proper reference for "IWD 9075-14:2011(E)", Google can't find it
> 
> Eric told me this wasn't that easy. Eric, the ball is in your camp :-)

Maybe a reference to ISO/IEC 9075-14:2008 will do? We already reference parts 1 and 2 of SQL 2008 in R2RML.

>> 6. An account of how row IRIs and row blank nodes are created (maybe I'm just stupid but I can't find it anywhere in Section 3)
> 
> This is not said on purpose, but I agree that we should say why.
> 
> Depending on who you are, you can do different things with this
> function. For example, 1. an database-vendor has access to more
> information (eg. the row-id) and can do a more effective mapping. 2.
> Some SQL implementations provide functions to do the same. 3. Some SQL
> implementations don't give access to this information.

It needs to be defined, because otherwise how would we ever get two interoperable implementations that generate the same graph for the same database? How would you write even a single test case?

I think the way this is handled in Section 2 is perfectly appropriate.

See also the RESOLUTION on row identifiers here:
http://www.w3.org/2011/03/01-rdb2rdf-minutes.html

My understanding is that working group resolutions are binding for the editors.

(There are more open issues raised against the direct mapping that need to be resolved before last call.)

> It doesn't change the spirit of the Direct Mapping.

Appeals to some “spirit” are misplaced. Implementations either conform or they don't; they either are compatible or not; there is no compatibility “in spirit”. The document has to define the criteria for a conforming implementation.

>> 7. An account of what the syntax "let abc = xyz in" means. I can't figure it out. Is this Scala syntax? If so, can you please add a normative reference to Scala and state in the Introduction that knowledge of Scala syntax is required?
> 
> I hope I have removed any mention of Scala in the document.
> 
> But I don't understand what you want me to do here. Do you think that
> not everybody will be able to read this construct? What if I write "abc
> = xyz" or "abc := xyz" or "abc ← xyz" instead? I believe we can take
> "variable binding" for granted as this point, and "let ... = ... in" has
> enough meaning in English and mathematics to go with it.

I find it weird because I see it as a mix of functional and imperative notation.

For example, rule [44]. In pure imperative style, I'd expect something like:

    [[r, fk]]ref is defined as:
        let p = ⟦table(r), fk⟧col
        let targetRow = dereference(r, fk)
        let o = φ(targetRow)
        return (p, o)

In pure functional style, I'd expect something like:

    [[r, fk]]ref = (p, o), where:
        p = ⟦table(r), fk⟧col,
        o = φ(targetRow),
        targetRow = dereference(r, fk)
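
For comparison, rule [44] in the ordinary functionname(argument) style could look roughly like the Python sketch below. The row representation, the extra `database` argument, and the helpers `col_iri` and `dereference` are placeholders of my own, not definitions from the document; φ is simply passed in as a function argument.

```python
def col_iri(table_name, column_names):
    """Stand-in for ⟦table(r), fk⟧col; the output format is made up."""
    return table_name + "#" + ",".join(column_names)


def dereference(row, fk, database):
    """Return the row of the target table that the foreign key points to."""
    wanted = [row["values"][c] for c in fk["columns"]]
    for candidate in database[fk["target_table"]]:
        if [candidate["values"][c] for c in fk["target_columns"]] == wanted:
            return candidate
    raise LookupError("dangling foreign key")


def reference_triple(row, fk, database, phi):
    """Rule [44] as an ordinary function: returns the pair (p, o)."""
    p = col_iri(row["table"], fk["columns"])
    target_row = dereference(row, fk, database)
    o = phi(target_row)
    return (p, o)


# Hypothetical data and a hypothetical phi (row -> node label).
database = {
    "Addresses": [{"table": "Addresses", "values": {"ID": 18, "city": "Cambridge"}}],
    "People": [{"table": "People", "values": {"ID": 7, "addr": 18}}],
}
fk = {"columns": ["addr"], "target_table": "Addresses", "target_columns": ["ID"]}
phi = lambda r: "_:" + r["table"] + str(r["values"]["ID"])

result = reference_triple(database["People"][0], fk, database, phi)
```

Nothing here needs special brackets or a subscripted φ; the dependency on φ is visible in the argument list.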

>> 8. Explain to me why there's a "table(r)" in "⟦table(r), c⟧col" and "⟦table(r), fk⟧col". Shouldn't the static type checking catch errors of this kind?
> 
> I guess you wanted to point at c and fk, as they appear to be of
> different types.

No, that's ok with me. I wanted to know why it's "table(r)" and not "r".

>> 9. State that Datatype in [9] includes String (you explicitly check for String later)
> 
> This is not true in the current version [4]. I'll be happy to add String
> after the WG will have made a general decision about the SQL Datatype
> question.

Rule [46] checks for d = String, so it's weird that String isn't explicitly listed in [9].

>> 10. Do something about the fact that String in [9] and String in [10] are something different.
> 
> TableName and ColumnName are sub-types of String, but are not
> compatible.

That's not what I meant. TableName is a subtype of String. Datatype is *not* a subtype or supertype of String. It's supposed to be an enumeration type that includes "String" as one of its possible values, as far as I can tell. This is not clear from the formal version of rule [9].
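
A rough Python sketch of the distinction, as I read it: TableName and ColumnName are strings with a narrower type, whereas Datatype is an enumeration whose members merely have names. The class layout is my own illustration, and the STRING member is an assumption based on the check in rule [46]; it is not listed in rule [9].

```python
from enum import Enum


class Datatype(Enum):
    # Members visible in rule [9]; the "…" in the rule is left open here.
    INT = "Int"
    FLOAT = "Float"
    DATE = "Date"
    STRING = "String"  # assumption: implied by the d = String check in rule [46]


class TableName(str):
    """A table name is a string with a more specific type."""


class ColumnName(str):
    """A column name is also a string, but not interchangeable with TableName."""


t = TableName("People")
d = Datatype.STRING

print(isinstance(t, str))         # True: TableName is a subtype of String
print(isinstance(d, str))         # False: Datatype.STRING is an enum member
print(isinstance(t, ColumnName))  # False: the two name types are incompatible
```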

>> 13. Clarify whether tables includes views or not
> 
> It's an implementation detail

Nonsense. It is part of the interface, not the implementation. There won't be interoperable implementations if this is not specified, and users of the direct mapping will not know whether they can use views or not. How would you write a test case for a database that contains a view definition?

> and we don't want to go further than the
> ADT. For example, there is notion of parser there.

I don't know what this means.

>> 14. Clarify whether a Database is the set of tables in a Schema or the set of tables in a Catalog (or some other set of tables)
> 
> Not sure what you mean here.

Do you know the difference between a Schema and a Catalog? If not, google it. Again, it's a question that any implementer needs an answer to, and many users will want an answer to, and I would prefer if they don't have to flip a coin.

>> 17. Explain the significance of difference between the "foo(bar)" notation that is sometimes used, and the "[[bar]]foo" notation that is used at other times
> 
> To be honest, I'm not sure I'm able to explain this convention used in
> all the semantics definition I've been given to read so far. It's just
> like asking why people write "a + b" instead of "plus(a,b)".

I can answer that one. "a+b" is a notation that kids learn in school, somewhere around first or second grade, and that can be used without explanation because it is so elementary. "plus(a,b)" is less familiar, more verbose, and would actually require explanation. Therefore, obviously "a+b" is the way to go.

The notation you have chosen is concise, but it is unfamiliar and it requires explanation.

> Do you have
> something better to propose? Like "databaseSemantics(bar, foo) = ..."?
> That would be very ugly :-)

We're writing a specification, not an essay. Beauty is optional. Clarity is required.

> and so unnatural to so many people!

What is your basis for the claim that

    ⟦db⟧φdatabase = ...

is more natural than

    directMapping(database) = ...

?

>> 18. Change the column IRIs so that it produces valid RDF IRIs, rather than relative IRIs
> 
> This would need us to change the definition of the Direct Mapping itself
> so that it depends of a "stem URI"

base IRI, not stem URI.

> that is passed around everywhere, for no added value.

Oh come on.

If you had just written the damn thing in plain English, then it would be trivial to say: 

   The input to the direct mapping is an input database and a base IRI.
   ...
   The column IRI of a column is the concatenation of the base IRI and
   the percent-encoded column name of the column.

But you've decided to use some overly restrictive formalism that requires awkward contortions to do something as trivial as this. That is a problem with the *formalism*, and no excuse for writing a mapping that claims to map RDB to RDF, but in reality maps RDB to something that isn't RDF.

I strongly disagree with the notion that fixing internal inconsistencies (the DM claims to produce absolute IRIs, but actually doesn't; it claims to produce conforming RDF graphs, but actually doesn't) has "no added value".
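
To show how little is involved, here is a minimal Python sketch of the column IRI sentence above. The function name and the example base IRI are made up, and the exact set of characters to percent-encode is an assumption; the document would have to pin that down.

```python
from urllib.parse import quote


def column_iri(base_iri, column_name):
    # Percent-encode everything except unreserved characters,
    # then append the result to the base IRI.
    return base_iri + quote(column_name, safe="")


example = column_iri("http://example.com/base/", "first name")
# example == "http://example.com/base/first%20name"
```

The base IRI is just one more argument; the result is always absolute as long as the base is.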

> I wonder if there is a discussion in the RDF community to add relative
> URI. At least, we have a perfect use-case here with the Direct Mapping
> to consider this option, and I suggest the RDB2RDF WG to speak with the
> RDF WG on this subject.

Given that the RDF spec has been around for seven years and has tens of thousands of man-hours invested in implementations, and the WG has a narrowly defined charter, such a proposal is not very likely to go anywhere.

Making this change in the Direct Mapping, on the other hand, could be as easy as adding the two sentences above in the right place, and scrapping (or fixing) the maths…

>> I'll stop here and withdraw my earlier assertion that Section 3 may be ready for Last Call.
> 
> I was so desperate to have someone else to look at that stuff that I
> used the "ready for Last Call" trick.

:-)

Well, it worked.

> Richard, I *highly* appreciate
> your review and the time you spent on it. Be sure that I will fix that
> stuff when I'll be back from vacation (unless Eric does it before me).

Thanks, that's appreciated.

> <rant>

<counter-rant>

> Anyway, I'm really disappointed that it took so many months to have
> another pair of eyes reading this document (more than a year actually,
> if you consider Eric's first proposal).

I can't speak for others, but what stopped me from looking at it is the fact that the document has four editors who don't seem to speak with each other and seem to be more interested in stylistic exercises than in clear and simple communication of a technical artefact.

Quoting myself, from November last year:

“And that's the last thing I intend to say about the direct mapping  
thingy until the three editors have managed to present the WG with a  
single version of the document endorsed by all of them.”
http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2010Nov/0052.html

It has been eight months or so since then, and to the best of my knowledge the editors have done NOTHING towards resolving the main problem of their document: that it has *four* descriptions of the same thing. That's *still* three too many.

You can expect to get some reviews when you have made up your mind which version is the normative one.

> I'll be even more disappointed
> if the WG decides to change fundamental things like the definition of
> RDB (I'm still amazed how it was possible for a group called RDB2RDF to
> not define RDB once and for all, before doing anything else),

We established early on that we use the 2008 version of ISO/IEC 9075 as the starting point for both the direct mapping and for R2RML.

> the
> dependent types notation or even the syntax, as this has been in place
> for a long time, with repeated calls to read this stuff in order to move
> forward.

My position has been, since day one, that a plain English version is to be preferred over any formalism, and I have loudly expressed this more than once over the lifetime of this group.

> I just hope that Richard won't be the only one, and that others will get
> their hands (and eyes) dirty as well.

And I hope that you, as well as the three other editors, stop fiddling around with your own little section and take responsibility for the document as a whole.

Best,
Richard

Received on Sunday, 10 July 2011 14:47:29 UTC