Re: New merged consolidated Direct Mapping version

Hi Alexandre,

Thanks again for your comments.

On Tue, Nov 16, 2010 at 1:06 PM, Alexandre Bertails <bertails@w3.org> wrote:
> On Tue, 2010-11-16 at 06:19 -0300, Marcelo Arenas wrote:
>> Hi Alexandre,
>>
>> Thank you very much for your comments.
>
> You're welcome, I'm just sharing my feedback.
>
>>
>> On Mon, Nov 15, 2010 at 8:28 PM, Alexandre Bertails <bertails@w3.org> wrote:
>> > Sorry for not answering earlier, RDB2RDF is not my real job at W3C :-)
>> >
>> > On Sat, 2010-11-13 at 15:46 -0600, Juan Sequeda wrote:
>> >> Alexandre
>> >>
>> >>
>> >> you make good points which I need to read thoroughly but I don't want
>> >> to do over the weekend ;)
>> >>
>> >>
>> >> However, quick comments inline
>> >>
>> >> On Sat, Nov 13, 2010 at 3:37 PM, Alexandre Bertails <bertails@w3.org>
>> >> wrote:
>> >>         On Sat, 2010-11-13 at 14:47 -0600, Juan Sequeda wrote:
>> >>         > I'd like to go through this thoroughly but I believe this
>> >>         > looks a lot like:
>> >>         >
>> >>         >
>> >>         > http://www.w3.org/2001/sw/rdb2rdf/wiki/Database-Instance-Only_and_Database-Instances-and-Schema_Mapping
>> >>         >
>> >>         > This was Marcelo and my proposal a longggg time ago.
>> >>
>> >>
>> >>         Yes, Eric made me read it a longggg time ago :-) But this is
>> >>         not the same approach (and I prefer the one you took in the
>> >>         merged document).
>> >>
>> >>         In the merged spec, you say things like [[ Assume that r(a,
>> >>         b1, ..., bn) is a table with columns a, b1, ..., bn ... ]].
>> >>         It's not clear if it means "I have a function from a relation
>> >>         r in RDB to a Datalog rule", or if you are giving an axiomatic
>> >>         description of the truth in a particular case.
>> >>
>> >>         I understood it as an axiomatic description with universal
>> >>         quantification (the universe of discourse, which is also
>> >>         missing in your rules) because as there is no reason to keep
>> >>         two models of computation in the same spec, I assumed you were
>> >>         not competing with the mapping (which I recall is by definition
>> >>         a function) itself by proposing a new one. And if this was
>> >>         actually a function from RDB to Datalog, I would have expected
>> >>         to see the formal definition of a function with a clear domain
>> >>         and codomain.
>> >>
>> >>
>> >> there is no function from RDB to Datalog.
>> >> Datalog can be considered syntax for relational algebra. You can say
>> >> the same thing. IMO, I prefer reading Datalog to relational algebra.
>> >> So r is the name of the table, e.g. project the attribute name from
>> >> the table Student:
>> >>
>> >>
>> >> Ans(name) <- Student(_, name, _, _)
>> >
>> > In Datalog, you cannot reason on the relation r itself. So you need
>> > something external to go from the relation r to the relation name "r".
>> > Said differently, as long as you'll put an "r" in a Predicate, this is
>> > not FOL.
>> >
>> > How do you make the distinction between the relation and its name? Eric
>> > showed me a scheme but he called it "perverse" :-) And he still needs
>> > higher-order.
>> >
>> >>         So to be sure I was understanding your rules, I spontaneously
>> >>         started to annotate the variables and then, to get rid of the
>> >>         English (I always have a problem to consider descriptions in
>> >>         English as they escape from the formalism and hide the
>> >>         difficulty), I pushed the plain-text constraints into the
>> >>         rules, one after another. I found it very pleasant to see that
>> >>         you actually use Higher Order Logic (the [[ Assume that ]] were
>> >>         the clue but I did not get it right away). By putting more
>> >>         formalism into the rules, I really understood you were giving a
>> >>         nice semantical framework for the Direct Mapping, more than
>> >>         giving a way to compute it. The icing on the cake is that you
>> >>         never have to say *how* you compute an IRI, for example. You
>> >>         just have to say that it exists!
>> >>
>> >>
>> >>
>> >>
>> >> If you are combining the instances of the database AND schema elements
>> >> (Student is a table, id is a PK of the student table), then it becomes
>> >> higher order logic. Hence we had a schema+instance mapping. But
>> >> Marcelo and I came to the conclusion that it was too complicated.
>> >> Hence we only wanted Instance Mapping.
>> >
>> > Yes I agree it's complicated.
>> >
>> > Mixing schema and data is *a* way to get higher-order. But as long as
>> > you have the table name outside of a predicate position, this is gonna
>> > be higher-order.
>> >
>> >
>> >>
>> >>         The algebra tells you the "what" (the Abstract Models) and the
>> >>         "how" (the mapping functions), whereas your Axiomatic Semantics
>> >>         tells you the truth in the model.
>> >>
>> >>         May I suggest the editors (Eric, that includes you) to make
>> >>         clear the relation between the Direct Mapping (the algebra) and
>> >>         its Axiomatic Semantics?
>> >>
>> >>
>> >> Yes we need to do that.
>> >
>> > Editorial proposal to put somewhere in the introduction:
>> > [[
>> > The Direct Mapping is an algebra defining the mapping from RDB to RDF,
>> > expressed in Type Theory. The Axiomatic Semantics defines the set of
>> > laws which the Direct Mapping must respect.
>> > ]]
>>
>> I don't agree with including this paragraph in the introduction. We
>> want people to read the document, so I like the idea of having
>> alternative formalizations of the direct mapping, each one with its
>> own perspective. One of them is Eric's proposal, for which your
>> paragraph is appropriate. The other one is based on Datalog, which is
>> a familiar notation for database people, but which follows a different
>> approach.
>
> I'm not sure who "database people" are. If you mean "people in research"
> and/or "people speaking Datalog", it means we have to change the
> targeted audience again and the terminology we use.
>
> I believe that the vast majority of people who will be interested in
> RDB2RDF won't know anything about Datalog -- and its semantics -- and
> are not interested in it. But they all have a pretty good understanding
> of functions (maps).

I disagree with this. Datalog is just a name for a rule language that
is widely used in the database area. In fact, you can find lots of
reincarnations of this language in the database area (not necessarily
with the name Datalog). Let me give just a few examples: it has been
used as a dependency language (tuple-generating dependencies), as a
view definition language, as a language for expressing conjunctive
queries and some of its extensions, as a language to express the
relationship between global and local schemas in data integration
systems (in particular, in the GAV setting), as a mapping language in
data exchange, and as a unified language for ontologies and integrity
constraints. Actually, even the key work on optimizing conjunctive
queries under multiset semantics ("real" conjunctive queries) was
done by using a Datalog syntax for conjunctive queries:

Surajit Chaudhuri, Moshe Y. Vardi: Optimization of Real Conjunctive
Queries. PODS 1993: 59-70

Notice that the above paper does not mention the word Datalog, but it
uses a Datalog syntax (and semantics) because this is a simple rule
language with a simple semantics.
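
A minimal sketch of the difference between the two readings, using the
rule Ans(name) <- Student(_, name, _, _) quoted earlier in this thread.
The Student rows, the column layout and the function names are
assumptions made up for the illustration; nothing here is taken from
the paper above.

```python
from collections import Counter

# Hypothetical Student(id, name, age, dept) rows; two students share a name.
STUDENT = [
    (1, "Ana", 20, "CS"),
    (2, "Luis", 22, "Math"),
    (3, "Ana", 21, "Physics"),
]

def ans_set(student):
    """Ans(name) <- Student(_, name, _, _) under set semantics."""
    return {name for (_, name, _, _) in student}

def ans_bag(student):
    """The same rule under multiset semantics: duplicates are counted."""
    return Counter(name for (_, name, _, _) in student)

if __name__ == "__main__":
    print(sorted(ans_set(STUDENT)))
    print(sorted(ans_bag(STUDENT).items()))
```

The rule is written once; only the way answers are collected changes.
Under set semantics "Ana" appears once, under multiset semantics it is
counted twice, which is closer to what an SQL engine returns without
DISTINCT.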


>> Actually, I would like to point out here that the way we are
>> representing the direct mapping in Datalog is pretty standard in
>> database theory.
>
> I agree with the general statement, especially for relational database
> theory.
>
>> In fact, I have the impression that some of your
>> concerns about this representation are coming from the fact that you
>> are not familiar with the language.
>
> I thought Datalog was just first-order predicate calculus, relying on
> sets. Don't worry about that, I've manipulated much more difficult
> formalisms in the past. And I've read the articles you shared with us.

Sorry, I didn't make myself clear. I was not trying to say that you
cannot deal with Datalog (I am pretty sure that you can), I was just
saying that you are still not completely familiar with the way the
syntax and semantics of Datalog are defined (see, as an example, my
comments about types at the end of this email).

> You must understand we are facing some real-world situations here. For
> example, RDB implementations work on top of multisets, not sets. And
> Datalog is not as accessible to non-researchers as a simple
> well-defined function.
>
>> Just as an example, we are not
>> missing the universal quantifiers in our rules. All the variables in a
>> Datalog rule are universally quantified, so the universal quantifiers
>> are omitted (Datalog is a fragment of first-order logic that uses some
>> non-first-order notation).
>
> Have you read my rewritten rules? Do you think they are wrong or that
> they don't comply with yours? They just show this is not Datalog as you
> implicitly use predicates in variable position.
>
> Before answering that, please specify the domain of your predicates
> (which you are supposed to do in Datalog anyway). For example, I still
> don't know where you encode types for values.

Types are not needed to give semantics to a Datalog program. In fact,
types are not needed to define the semantics of first-order logic.
First-order formulas are evaluated over structures, which consist of a
domain and an interpretation of each symbol in the vocabulary
(relations, functions and constants). In particular, the universal and
existential quantifiers of first-order formulas range over all the
elements of the domain (without making any distinction between these
elements). The semantics of Datalog is defined in the same way, but
given that Datalog rules are safe (and, thus, domain independent), one
does not need to refer to the underlying domain to define the
semantics of these rules (the elements in the relations, that is, the
active domain, are enough).
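
The point about safety and the active domain can be sketched in a few
lines of Python. The relations R and S, the rule Ans(x) <- R(x, y), S(y)
and all function names below are assumptions chosen for the
illustration. The second rule is deliberately unsafe (so it is not a
legal Datalog rule) and is shown only for contrast. Note that the
values are untyped: numbers and strings live in the same domain.

```python
from itertools import product

# Hypothetical untyped relations: values of different kinds share one domain.
R = {(1, "a"), (2, "b"), ("x", "a")}
S = {("a",)}

def active_domain(*relations):
    """All values that occur in some tuple of the given relations."""
    return {v for rel in relations for tup in rel for v in tup}

def safe_rule(domain):
    """Ans(x) <- R(x, y), S(y), with x and y ranging over `domain`."""
    return {x for x, y in product(domain, repeat=2)
            if (x, y) in R and (y,) in S}

def unsafe_rule(domain):
    """Ans(x) <- not S(x): x is not bound by any positive atom."""
    return {x for x in domain if (x,) not in S}

adom = active_domain(R, S)
bigger = adom | {"extra", 42}  # a larger domain containing the active one

if __name__ == "__main__":
    print(safe_rule(adom) == safe_rule(bigger))
    print(unsafe_rule(adom) == unsafe_rule(bigger))
```

The safe rule gives the same answer over the active domain and over
the enlarged one, so the choice of domain does not matter. The unsafe
rule picks up the extra elements, which is exactly the dependence on
the domain that safety rules out.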

All the best,

Marcelo

Received on Wednesday, 17 November 2010 15:48:21 UTC