Re: [XML-Data] review of current draft from Christian De Sainte Marie on 2010-06-15 (public-rif-wg@w3.org from June 2010)

From: Christian De Sainte Marie <csma@fr.ibm.com>
Date: Tue, 15 Jun 2010 22:32:20 +0200
To: Jos de Bruijn <jos.debruijn@gmail.com>
Cc: RIF <public-rif-wg@w3.org>
Message-ID: <OFED3DFC45.A63F6E5D-ONC1257743.004F7619-C1257743.0070D4CD@fr.ibm.com>
Jos, all,

Jos wrote on 15/06/2010 14:08:09:
> 
> Herewith my review of the XML-Data document as of 2010-06-15T09:25 CEST.

Thanx for the thorough review (we had, already, some discussions, off 
line, in addition).
 
> Overall, I think the document is going in the right direction. I
> believe it is in line with earlier discussions we had in the group
> concerning RIF+XML combinations. There are, however, several issues
> (mainly the comments 10-23) that I think should be resolved before
> publication of the document as public working draft. Detailed comments
> are below.

So, let us try to resolve those issues quickly, so we can publish the WD 
on June 22 :-)

> I will start with some issues which I believe require discussion in the
> group:
> 
> [update: in the current version of the document, issue 1 has been
> resolved by implementing solution a)]
> 
> 1- The document assumes that the location argument in an Import
> directive in Core is optional (e.g., in the definition just before
> section 4.1). This is not the case; in Core, the location argument is
> mandatory. Thus, the document implicitly assumes an extension of Core.
> I think it is not desirable to define such an extension, since it will
> make the whole RIF landscape even more complex than it currently is.
> Furthermore, this extension is problematic, since in the presentation
> syntax it is not possible to distinguish between an Import statement
> having only a location and one having only a profile.
> Now, the reason for having this extension in the first place is to be
> able to use an XML Schema as the data model of a ruleset without
> having to specify where the XML instance data comes from. Two obvious
> solutions that are in Core come to mind:
> a) use a dummy URI to denote an empty XML instance document (e.g.,
> rif:emptyXML)
> b) put the XML Schema in the location field and define a profile for
> XML Schema (e.g., rif:xml-schema)

The reason I prefer solution (a) is that, with solution (b), the schema 
would be in different locations with the same semantics, depending on 
whether or not there is a link to an XML data document to be imported.

In the updated version, I use the IRI: 
http://www.w3.org/2007/rif-import-location#no-data 

> 2- I find it slightly awkward to have strings as attributes in
> frameformulas.
> I mean as attributes in frame formulas. The way the semantics is
> defined, element and attribute names are represented as strings in the
> attribute position of frame formulas.
> e.g., if you have <A B=""><C></C></A>
> 
> this corresponds (roughly) to the RIF formula
> ?x["attribute(B)"->"" "C" -> ""]
> 
> I think it would be natural to require all elements in an XML document
> to have namespaces (default namespaces are easy to add). However,
> attributes are a slightly more complicated issue, since the default
> namespace does not apply to them. Therefore, I don't really have an
> elegant solution in mind at the moment.
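
To make the rough correspondence in the quoted comment concrete, here is a minimal Python sketch. It is only an illustration, not the mapping defined in the draft: the function name frame_slots and the choice to use the concatenated text content as the slot value are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def frame_slots(element):
    """Return (slot name, value) pairs roughly mirroring the quoted frame formula.

    Hypothetical helper: attribute names become the string "attribute(NAME)",
    child element names are used as plain strings.
    """
    slots = []
    # Attributes become slots keyed by the string "attribute(NAME)".
    for name, value in element.attrib.items():
        slots.append((f"attribute({name})", value))
    # Child elements become slots keyed by the element name; the value is
    # the concatenated text content (an assumption for this sketch).
    for child in element:
        slots.append((child.tag, "".join(child.itertext())))
    return slots


root = ET.fromstring('<A B=""><C></C></A>')
print(frame_slots(root))
```

Both slot names are plain strings here, which is exactly what the comment finds awkward.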

I agree that we could limit ourselves to XML documents where all the 
elements are in a namespace. That would be a restriction, but the use of 
namespaces is gaining, if not already prevalent. But the default is that 
attributes are not qualified, so, even if the elements are in a namespace, 
the attribute will not, in most cases (e.g., attributes are not qualified, 
in the RIF schemas).

The rec on namespaces considers that attributes belong, de facto, in the 
namespace of the owner element, but the XML schema spec does not say 
anything about that AFAIK; so, we cannot just use the namespace of the 
owner element.

And the lexical space of rif:iri is that of absolute IRIs, so, we cannot 
have a rif:iri with only a local name :-(

That is why I included xs:NCName. But if somebody has a better solution...

One question is: is it possible, for an element, to have two attributes 
with the same local name, one being in the same namespace as the element, 
the other being in no namespace? I see nothing that would forbid that 
case, but if there is, then we could follow the namespace rec and 
associate namespace-less attributes with the namespace of the owner 
element.
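
A quick way to probe that question is to feed such an element to a namespace-aware parser. The sketch below uses Python's standard xml.etree.ElementTree; the namespace IRI urn:example:ns and the names e and a are made up for the example. It only shows that the two attributes have different expanded names at the namespace level; whether a given schema would allow both is a separate matter.

```python
import xml.etree.ElementTree as ET

# One attribute is qualified with the element's own namespace,
# the other has the same local name but no prefix (hence no namespace).
doc = '<p:e xmlns:p="urn:example:ns" p:a="qualified" a="unqualified"/>'

elem = ET.fromstring(doc)
attrs = dict(elem.attrib)

# ElementTree writes expanded names as "{namespace}local".
print(elem.tag)
print(attrs)
```

The parser accepts the document and keeps the two attributes apart, which suggests that the case cannot simply be ruled out.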

> Further substantive comments:
> 
> 10- why give separate definitions for the semantics of Core+XML and
> BLD+XML combinations? The semantics of RIF Core is the same as that of
> BLD; the only difference between the two dialects is the syntax. I
> would suggest to remove section 4.2 and say that the semantics in
> section 4.1 applies to both dialects.

Your argument about saying that the semantics defined for the one applies 
for the other one as well works both ways: I essentially used it the other 
way round, which seems more natural to me.

Core is the core dialect, so, it seemed to make sense to specify the 
semantics of the combinations for Core, and, then, extend it to BLD and 
PRD, which, from the user point of view, are extensions of Core.

I am not convinced that we should do otherwise, but if there is 
overwhelming support to rewrite everything the other way round, I will do 
it.

> 11- as discussed (privately), all element information items in an
> instance of the data model are meant to be distinct. This must be
> mentioned in the definition.

Actually, that one has already been taken care of. I added the following 
paragraph/sentence, just before the definition of Core+schemaless XML 
interpretations:
"Finally, in the remainder of this document, the notation {I_DM}  will be 
used to denote the set of all the element information items in I_DM, after 
the references have been resolved. Notice that, after the references have 
been resolved, all the elements in {I_DM}  are distinct."

I will add (see 17, below), that everything else is unchanged, when 
replacing RIF Core/BLD semantic structures with RIF Core/BLD+XML data 
combined interpretations in the definitions. Is that ok?

> 12- Is there a difference between QName and expanded QName? If so,
> what is the difference?

A QName is a string made of an optional prefix, a colon and a local name. 
An expanded QName is a triple that contains an optional prefix, an 
optional IRI and a local name, where the IRI is the IRI associated to the 
prefix.

Shall I copy the XDM definition in the document? I thought I would put it 
in the glossary (to come in a future WD).
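
In the meantime, the difference can be sketched in a few lines of Python. This is only an illustration of the description above, not the XDM definition itself: the names ExpandedQName and expand_qname are hypothetical, and so is the convention of storing a default namespace under the empty-string key of the bindings.

```python
from typing import Dict, NamedTuple, Optional


class ExpandedQName(NamedTuple):
    """Triple of optional prefix, optional namespace IRI and local name."""
    prefix: Optional[str]
    namespace: Optional[str]
    local_name: str


def expand_qname(qname: str, bindings: Dict[str, str]) -> ExpandedQName:
    """Expand the string form of a QName using prefix-to-IRI bindings."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        # No colon: the whole string is the local name. A default
        # namespace, if any, is looked up under "" (assumed convention).
        return ExpandedQName(None, bindings.get(""), qname)
    # Prefixed name: the IRI is the one associated with the prefix.
    return ExpandedQName(prefix, bindings[prefix], local)


bindings = {"xs": "http://www.w3.org/2001/XMLSchema"}
print(expand_qname("xs:NCName", bindings))
print(expand_qname("C", bindings))
```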

> 13- section 3.2, 8. [typed value], first bullet: why do you deviate
> from the XQuery data model?

Because we need a handle to the element information itself, when it is 
object-like (that is, element-only children), so we can dig into it. And 
XDM, in that case, defines the typed value as being undefined, which is 
useless in our case...

> 14- section 4: what are XML instance and data documents, and what
> is the difference with XML documents? Both notions should be defined.

XML instance, or data, document as opposed to an XML schema (which is also 
an XML document; and, btw, could very well play the role of the data 
document in a combination).

I think that XML instance document is the usual term for XML documents that 
are instances of a schema.

I used XML data document, or XML data, when talking of the XML data with 
which the RIF doc is combined.

Do you really think that requires an explanation? Would an entry in the 
glossary be enough, or does it need be more prominently in the spec?

Anyway, I fear that, most of the time I used "instance" and "data" 
interchangeably, so I have to check that, as well.

> 15- section 4: why limit yourself to combination with only one XML
> document? In fact, the Core syntax does not have this limitation, so
> it is unclear how

You are absolutely right, the intent is not to limit to one document. And, 
since, as you rightly pointed out in a private discussion it should, the 
definition uses, now, the set {I_DM} instead of the sequence I_DM, it is 
pretty easy to correct that. I will do it before tomorrow noon, my time.

> 16- a RIF document is interpreted using a semantic multi-structure,
> not a semantic structure. This needs to be taken into account in the
> definitions in section 4.

The spec says explicitly that, apart from the additional constraints in 
the definition of a semantic structure, the semantics of RIF Core/BLD+XML 
data combinations is exactly unchanged from the semantics of RIF Core/BLD.

Is not that sufficient? Or do you think there are also differences in the 
handling of multi-structures? I did not check, to tell the truth :-(

I will, but I am not sure that I can do anything in time for a publication on 
June 22: if changes need be made, can we do with an editor's note for that 
round?

> 17- notions of consistency and entailment, based on combined
> interpretations, need to be defined for RIF+XML combinations. Stating
> that these notions remain unchanged from Core does not work, since you
> do not have Core structures, but combined interpretations here.

Well, combined interpretations are semantic structures for RIF+XML data 
combinations, aren't they?

Anyway, would "the definitions of [these notions] remain unchanged, except 
that every reference to a semantic structure I is replaced by a reference 
to a combined interpretation <I, I_DM>" do?

Or do you think that the definitions should be repeated? But that would 
complexify the spec unnecessarily (or, rather, give it the appearance of 
complexity), I think.

> 18- section 4.1, 4th paragraph: constants are not "in" any lexical
> space. Constants have the form l^^s, where l is a string and s an IRI
> denoting a symbol space.

I will correct the terminology. By tomorrow noon.

> 19- section 4.1.1, first bullet: the definition of string-matches is a
> bit hard to read and overly restrictive (e.g., it does not account for
> rdf:PlainLiterals without language tags). I would suggest to either
> match L_dt(c) (here, L_dt is the lexical-to-value mapping of the
> datatype of c) with [string value] or, better yet, just give a
> semantic definition: a string s string-matches i iff s=[string value]
> after white space normalization [of both s and [string value], I
> presume]. Similar for the second bullet.

That is what I thought it said (after I changed the definition after our 
earlier discussion on the subject)!

:-)

But I will revise, using your suggested wording. By tomorrow noon.
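
For what it is worth, the semantic definition suggested in the comment can be sketched as follows in Python. This is a sketch under assumptions, not the wording that will go in the draft: I take "white space normalization" to mean collapsing runs of XML white space (space, tab, CR, LF) and trimming both ends, and the function names are made up.

```python
import re

XML_WHITESPACE = " \t\r\n"


def normalize_space(text: str) -> str:
    """Trim XML white space at both ends and collapse inner runs to one space."""
    return re.sub(r"[ \t\r\n]+", " ", text.strip(XML_WHITESPACE))


def string_matches(s: str, string_value: str) -> bool:
    """s string-matches an item iff s equals its [string value],
    after white space normalization of both."""
    return normalize_space(s) == normalize_space(string_value)


print(string_matches("  hello \n world ", "hello world"))
print(string_matches("helloworld", "hello world"))
```

Working directly on strings this way avoids any case analysis on the datatype of the constant, which is the point of the suggestion.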

> 20- definition in sec 4.1.1, 2.: the condition does not take frame
> formulas with multiple attributes, nor equality between IRIs into
> account. I would suggest to work on the semantic level, giving the
> definition in terms of domain elements and the I_frame mapping. Also,
> when speaking about domain values, you can speak directly of strings,
> rather than strings obtained from constants. Similar for bullet 3 and
> the corresponding bullets in the definition in sec 4.1.2. In addition,
> when using a semantic definition in sec 4.1.2, you no longer need to
> do type matching; all you need to do is require that the value on the
> RIF side is equal to [typed value], when discarding the type label.

Ok. I did not think I could do it, but, now, I think I understand how...

I will try to do that by tomorrow noon. 

> 21- section 4.1.3: what is the operational semantics of Core? It's not
> in the Core spec.

Well, the spec says [1]: "RIF-Core is [also] a syntactic subset of 
RIF-PRD, and the semantics of RIF-Core is [also] identical to the 
semantics of RIF-PRD for that subset." And the primary semantics of PRD is 
the operational one.

[1] http://www.w3.org/TR/rif-core/#RIF-Core_Semantics

> 22- definition in section 4.1.2: the first condition in both 3a and 3b
> (the existence of a corresponding element in the XSD) seems redundant,
> since I_DM is based on a PSVI, and so must be schema-valid. Is that
> true?

The condition is needed to take substitution groups into account: you can 
have a substitution group where the head never occurs in the XML data, but 
the rule is written against the head element.

> 23- definition in section 4.1.2: right now I cannot foresee the
> consequences of condition 4. It seems that including all possible XML
> datatypes is a problem, for example we already identified that the
> duration datatype poses a problem for RIF. The question is whether
> there are possible other datatypes that pose problems. Datatypes that
> are derived from types that are in RIF do not need to be included in
> DTS, since their value spaces are necessarily subsets of D_Ind and
> there are syntactic representations of all the values.
> For this round of publication, I would suggest to add at least an
> editor's note saying that the condition will be further refined in
> future versions.

Condition 4 is not about including all possible XML datatypes, but the ones 
that are used in the XML data doc or the associated XML schema.

The datatypes that were problematic for DTB were problematic because they 
were not usually implemented, or consistent with the ones implemented, in 
most or mainstream rule engines.

But if a data doc or a schema uses a datatype that your implementation 
does not support, you are in trouble if you want to use it anyway, so I do 
not think this is a problem...

Anyway, I certainly have nothing against an editor's note to call 
attention to, and ask feedback on, possibly unforeseen consequences.

> Editorial comments:
> 
> 101- Sec 3.1, 4th paragraph: references should be included that
> explain what general and external parsed entities are and how they are
> expanded

Yes, many references need be added. I will add that one by tomorrow noon.

> 102- There is a definition of an "instance of the data model", but not
> of the data model. Given that there is no such definition, I think it
> unwise to speak about instances of it, since this only makes the spec
> harder to understand

Hmmm, I thought that most of section 3 was about the definition of the 
data model...

Sorry, I think that I do not understand your comment: can you reformulate 
it, please? Or give an example where the use of "instance of the data 
model" makes the spec harder to understand?

> 103- Section 4, first paragraph: why introduce the additional term
> "interpretation" here? I would suggest to stick with the term
> "structure", as in the other RIF specs.

Semantic structures are often called interpretations. And I am more 
familiar with that term.

I will remove the introduction of the term there, but I hope that you will 
allow me its use wherever else it is used :-)
 
> 104- editor's note just above sec 4.1.1: yes, I think it should be
> said explicitly

Ok. I will do the change.

> 105- definition in section 4.1.1: the notation {I_DM} is somewhat
> redundant with the requirement in the definition that all references
> in I_DM have been resolved

Well, you prompted me to introduce that notation, when you remarked that 
it was the set of the elements in I_DM that had to be included in D_ind, 
not I_DM itself, which is a sequence with possibly duplicated element 
information items...

> Further questions:
> 
> 1001- Is it true that it is guaranteed that every element and every
> attribute has a type in a PSVI infoset? In a schema it is possible to
> write such vague things as xs:any, thereby not actually specifying the
> type of a particular element.

See http://www.w3.org/TR/xpath-datamodel/#PSVI2NodeTypes :-)

Sorry, it is a bit late for me to think that clearly at this time of day. 
I will try to respond to that tomorrow.

Thanx again for the comments. The version of the draft updated to take 
them into account should be ready tomorrow by noon.

Cheers,

Christian

IBM
9 rue de Verdun
94253 - Gentilly cedex - FRANCE
Tel. +33 1 49 08 35 00
Fax +33 1 49 08 35 10


Received on Tuesday, 15 June 2010 20:33:06 UTC