Re: [XML-Data] review of current draft

Replies inline.

>> Herewith my review of the XML-Data document as of 2010-06-15T09:25 CEST.
>
> Thanx for the thorough review (we had, already, some discussions, off line,
> in addition).
>
>> Overall, I think the document is going in the right direction. I
>> believe it is in line with earlier discussions we had in the group
>> concerning RIF+XML combinations. There are, however, several issues
>> (mainly the comments 10-23) that I think should be resolved before
>> publication of the document as public working draft. Detailed comments
>> are below.
>
> So, let us try to resolve those issues quickly, so we can publish the WD on
> June 22 :-)
>
>> I will start with some issues which I believe require discussion in the
>> group:
>>
>> [update: in the current version of the document, issue 1 has been
>> resolved by implementing solution a)]
>>
>> 1- The document assumes that the location argument in an Import
>> directive in Core is optional (e.g., in the definition just before
>> section 4.1). This is not the case; in Core, the location argument is
>> mandatory. Thus, the document implicitly assumes an extension of Core.
>> I think it is not desirable to define such an extension, since it will
>> make the whole RIF landscape even more complex than it currently is.
>> Furthermore, this extension is problematic, since in the presentation
>> syntax it is not possible to distinguish between an Import statement
>> having only a location and one having only a profile.
>> Now, the reason for having this extension in the first place is to be
>> able to use an XML Schema as the data model of a ruleset without
>> having to specify where the XML instance data comes from. Two obvious
>> solutions that are in Core come to mind:
>> a) use a dummy URI to denote an empty XML instance document (e.g.,
>> rif:emptyXML)
>> b) put the XML Schema in the location field and define a profile for
>> XML Schema (e.g., rif:xml-schema)
>
> The reason I prefer solution (a) is that, with solution (b), the schema would
> be in different locations with the same semantics, depending on whether or
> not there is a link to an XML data document to be imported.
>
> In the updated version, I use the IRI:
> http://www.w3.org/2007/rif-import-location#no-data
>
>> 2- I find it slightly awkward to have strings as attributes in
>> frame formulas. The way the semantics is defined, element and
>> attribute names are represented as strings in the attribute position
>> of frame formulas.
>> e.g., if you have <A B=""><C></C></A>
>>
>> this corresponds (roughly) to the RIF formula
>> ?x["attribute(B)"->"" "C" -> ""]
>>
>> I think it would be natural to require all elements in an XML document
>> to have namespaces (default namespaces are easy to add). However,
>> attributes are a slightly more complicated issue, since the default
>> namespace does not apply to them. Therefore, I don't really have an
>> elegant solution in mind at the moment.
>
> I agree that we could limit ourselves to XML documents where all the
> elements are in a namespace. That would be a restriction, but the use of
> namespaces is gaining, if not already prevalent. But the default is that
> attributes are not qualified, so, even if the elements are in a namespace,
> the attribute will not, in most cases (e.g., attributes are not qualified,
> in the RIF schemas).
>
> The rec on namespaces considers that attributes belong, de facto, in the
> namespace of the owner element, but the XML schema spec does not say
> anything about that AFAIK; so, we cannot just use the namespace of the owner
> element.
>
> And the lexical space of rif:iri is that of absolute IRIs, so, we cannot
> have a rif:iri with only a local name :-(
>
> That is why I included xs:NCName. But if somebody has a better solution...
>
> One question is: is it possible, for an element, to have two attributes with
> the same local name, one being in the same namespace as the element, the
> other being in no namespace? I see nothing that would forbid that case, but
> if there is, then we could follow the namespace rec and associate
> namespace-less attributes with the namespace of the owner element.

I believe it is not forbidden.
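
A quick way to check this, as a minimal sketch using Python's standard
xml.etree.ElementTree (the namespace IRI http://example.org/ns and the
attribute values are made up for illustration): the parser accepts an
element that carries both a qualified and an unqualified attribute B,
and it keeps the two apart.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the namespace IRI and the values are illustrative only.
DOC = '<p:A xmlns:p="http://example.org/ns" p:B="qualified" B="unqualified"/>'


def attribute_names(xml_text):
    """Return the root element's attribute names in Clark notation, sorted."""
    root = ET.fromstring(xml_text)
    return sorted(root.attrib)


names = attribute_names(DOC)
print(names)
```

If that holds in general, associating namespace-less attributes with
the namespace of the owner element would make the two attributes
collide.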

>
>> Further substantive comments:
>>
>> 10- why give separate definitions for the semantics of Core+XML and
>> BLD+XML combinations? The semantics of RIF Core is the same as that of
>> BLD; the only difference between the two dialects is the syntax. I
>> would suggest to remove section 4.2 and say that the semantics in
>> section 4.1 applies to both dialects.
>
> Your argument about saying that the semantics defined for the one applies
> for the other one as well works both ways: I essentially used it the other
> way round, which seems more natural to me.
>
> Core is the core dialect, so, it seemed to make sense to specify the
> semantics of the combinations for Core, and, then, extend it to BLD and PRD,
> which, from the user point of view, are extensions of Core.

BLD is a syntactic extension, but is semantically the same. So, right
now there is a lot of duplication, in particular the definition of
combined interpretation. It is not necessary to define the semantics
twice, so don't do it.

>
> I am not convinced why we should do otherwise, but if there is overwhelming
> support to rewrite everything the other way round, I will do it.

I'm not talking about rewriting anything, only about removing the duplication.

>
>> 11- as discussed (privately), all element information items in an
>> instance of the data model are meant to be distinct. This must be
>> mentioned in the definition.
>
> Actually, that one has already been taken care of. I added the following
> paragraph/sentence, just before the definition of Core+schemaless XML
> interpretations:
> "Finally, in the remainder of this document, the notation {I_DM}  will be
> used to denote the set of all the element information items in I_DM, after
> the references have been resolved. Notice that, after the references have
> been resolved, all the elements in {I_DM} are distinct."

Ok, I missed that. This solution works for me.

>
>> 12- Is there a difference between QName and expanded QName? If so,
>> what is the difference?
>
> A QName is a string made of an optional prefix, a colon and a local name. An
> expanded QName is a triple that contains an optional prefix, an optional IRI
> and a local name, where the IRI is the IRI associated to the prefix.
>
> Shall I copy the XDM definition in the document? I thought I would put it in
> the glossary (to come in a future WD).

I actually thought a QName is a (namespace,localname) pair. It turns
out that "A qualified name is a name subject to namespace
interpretation." [1], which is a very weird definition. So I guess it
makes sense to define the concept of an expanded QName as you do.

[1] http://www.w3.org/TR/REC-xml-names/#dt-qualname
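
To make the distinction concrete, here is a minimal sketch of the
expansion step, following the triple described above (prefix, IRI,
local name). The names ExpandedQName and expand_qname are my own, not
taken from XDM or from the draft, and the prefix bindings are assumed
to be supplied by the caller.

```python
from typing import Dict, NamedTuple, Optional


class ExpandedQName(NamedTuple):
    """Hypothetical triple: optional prefix, optional namespace IRI, local name."""
    prefix: Optional[str]
    namespace: Optional[str]
    local_name: str


def expand_qname(qname: str,
                 bindings: Dict[str, str],
                 default_namespace: Optional[str] = None) -> ExpandedQName:
    """Expand a lexical QName using in-scope prefix bindings.

    An unprefixed name gets default_namespace, which the caller should
    leave as None for attribute names, since the default namespace does
    not apply to attributes. An unbound prefix raises KeyError.
    """
    prefix, sep, local = qname.partition(":")
    if not sep:
        return ExpandedQName(None, default_namespace, qname)
    return ExpandedQName(prefix, bindings[prefix], local)


print(expand_qname("rif:iri", {"rif": "http://www.w3.org/2007/rif#"}))
print(expand_qname("B", {}))
```

The second call shows the case discussed under comment 2: an
unqualified attribute name expands to a triple with no IRI at all.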

>
>> 13- section 3.2, 8. [typed value], first bullet: why do you deviate
>> from the XQuery data model?
>
> Because we need a handle to the element information itself, when it is
> object-like (that is, element-only children), so we can dig into it. And
> XDM, in that case, defines the typed value as being undefined, which is
> useless in our case...

Ok. Perhaps add this explanation to the document.

>
>> 14- section 4: what are XML instance and data documents, and what
>> is the difference with XML documents? Both notions should be defined.
>
> XML instance, or data, document as opposed to an XML schema (which is also
> an XML document; and, btw, could very well play the role of the data
> document in a combination).
>
> I think that XML instance document is the usual term for XML documents
> that are instances of a schema.
>
> I used XML data document, or XML data, when talking of the XML data with
> which the RIF doc is combined.
>
> Do you really think that requires an explanation? Would an entry in the
> glossary be enough, or does it need to be more prominent in the spec?
>
> Anyway, I fear that, most of the time I used "instance" and "data"
> interchangeably, so I have to check that, as well.

Yes, you used them interchangeably.
In addition, I think you should include explicit definitions of things
like XML instance and XML data document, since they are mentioned in
the definitions.
For example, in the definition of RIF+XML data combination, D is an
XML instance document. Does this mean the document has to be an
"instance" of a schema? What does it mean for a document to be an
instance of a schema? Is an XML schema document also an XML instance
document?
The notion of "XML instance document" needs to be defined such that it
is apparent for which class of XML documents the RIF+XML data
combinations are defined.

>
>> 15- section 4: why limit yourself to combination with only one XML
>> document? In fact, the Core syntax does not have this limitation, so
>> it is unclear how
>
> You are absolutely right, the intent is not to limit to one document. And,
> since, as you rightly pointed out in a private discussion it should, the
> definition uses, now, the set {I_DM} instead of the sequence I_DM, it is
> pretty easy to correct that. I will do it before tomorrow noon, my time.

Ok, great.

>
>> 16- a RIF document is interpreted using a semantic multi-structure,
>> not a semantic structure. This needs to be taken into account in the
>> definitions in section 4.
>
> The spec says explicitly that, apart from the additional constraints in the
> definition of a semantic structure, the semantics of RIF Core/BLD+XML data
> combinations is exactly unchanged from the semantics of RIF Core/BLD.
>
> Is not that sufficient? Or do you think there are also differences in the
> handling of multi-structures? I did not check, to tell the truth :-(

It is definitely not sufficient. There needs to be a connection
between the RIF+XML combinations, on the one hand, and the combined
interpretations, on the other. You need to define how this connection
between syntactic and semantic entities is made.

>
> I will add (see 17, below), that everything else is unchanged, when
> replacing RIF Core/BLD semantic structures with RIF Core/BLD+XML data
> combined interpretations in the definitions. Is that ok?

I'm afraid that does not work. If you were to do that, the definitions
would be incoherent.

>
> I will, but I am not sure that I can do anything in time for a publication on
> June 22: if changes need to be made, can we make do with an editor's note for
> that round?

I guess it should be fine to say in an editor's note that the
semantics is broadly in line with the Core semantics and that the
definitions of satisfaction, consistency, and entailment will be
included in the next version.

>
>> 17- notions of consistency and entailment, based on combined
>> interpretations, need to be defined for RIF+XML combinations. Stating
>> that these notions remain unchanged from Core does not work, since you
>> do not have Core structures, but combined interpretations here.
>
> Well, combined interpretations are semantic structures for RIF+XML data
> combinations, aren't they?
>
> Anyway, would "the definitions of [these notions] remain unchanged, except
> that every reference to a semantic structure I is replaced by a reference to
> a combined interpretation <I, I_DM>" do?
>
> Or do you think that the definitions should be repeated? But that would
> complexify the spec unnecessarily (or, rather, give it the appearance of
> complexity), I think.

There is certainly a change compared with the Core semantics, so there
would be no repetition. The definitions might look similar, but they
are not the same. You might even get away with just defining the
modifications of the respective definitions in Core, but I'm not sure
at the moment.

As I mentioned above, including an editor's note along those lines
should be fine for this publication.

>
>> 18- section 4.1, 4th paragraph: constants are not "in" any lexical
>> space. Constants have the form l^^s, where l is a string and s an IRI
>> denoting a symbol space.
>
> I will correct the terminology. By tomorrow noon.

Ok, great

>
>> 19- section 4.1.1, first bullet: the definition of string-matches is a
>> bit hard to read and overly restrictive (e.g., it does not account for
>> rdf:PlainLiterals without language tags). I would suggest to either
>> match L_dt(c) (here, L_dt is the lexical-to-value mapping of the
>> datatype of c) with [string value] or, better yet, just give a
>> semantic definition: a string s string-matches i iff s=[string value]
>> after white space normalization [of both s and [string value], I
>> presume]. Similar for the second bullet.
>
> That is what I thought it said (after I changed the definition after our
> earlier discussion on the subject)!
>
> :-)
>
> But I will revise, using your suggested wording. By tomorrow noon.
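
For what it is worth, the semantic definition I have in mind is tiny.
A minimal sketch, assuming that "white space normalization" means the
XML Schema whiteSpace "collapse" behaviour (that reading is my
assumption, and the function names are made up for illustration):

```python
import re

# XML white space characters only: space, tab, carriage return, line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def collapse(text):
    """Collapse runs of XML white space to one space and trim both ends."""
    return _XML_WS.sub(" ", text).strip(" ")


def string_matches(s, string_value):
    """s string-matches an item iff s equals its [string value] after
    white space normalization of both sides."""
    return collapse(s) == collapse(string_value)


print(string_matches("  foo\n bar ", "foo bar"))
print(string_matches("foobar", "foo bar"))
```

Working on values rather than on constants means no case distinction
is needed for rdf:PlainLiterals with or without language tags.
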
>
>> 20- definition in sec 4.1.1, 2.: the condition does not take frame
>> formulas with multiple attributes, nor equality between IRIs into
>> account. I would suggest to work on the semantic level, giving the
>> definition in terms of domain elements and the I_frame mapping. Also,
>> when speaking about domain values, you can speak directly of strings,
>> rather than strings obtained from constants. Similar for bullet 3 and
>> the corresponding bullets in the definition in sec 4.1.2. In addition,
>> when using a semantic definition in sec 4.1.2, you no longer need to
>> do type matching; all you need to do is require that the value on the
>> RIF side is equal to [typed value], when discarding the type label.
>
> Ok. I did not think I could do it, but, now, I think I understand how...
>
> I will try to do that by tomorrow noon.

If you are having some trouble with the definition, I might be able to
help. But I'm afraid I do not have time to work on it before June
22nd, so you could leave things as they are and include an editor's
note saying that the definition will be changed along the lines of my
comment for the next version of the document.

>
>> 21- section 4.1.3: what is the operational semantics of Core? It's not
>> in the Core spec.
>
> Well, the spec says [1]: "RIF-Core is [also] a syntactic subset of RIF-PRD,
> and the semantics of RIF-Core is [also] identical to the semantics of
> RIF-PRD for that subset." And the primary semantics of PRD is the
> operational one.

Sure, but the operational semantics of Core is not explicitly
mentioned in the Core spec, so it is awkward to mention it here. A
reader who is only interested in Core and combination with XML might
not have read the PRD spec and is not aware of this operational
semantics.

>
> [1] http://www.w3.org/TR/rif-core/#RIF-Core_Semantics
>
>> 22- definition in section 4.1.2: the first condition in both 3a and 3b
>> (the existence of a corresponding element in the XSD) seems redundant,
>> since I_DM is based on a PSVI, and so must be schema-valid. Is that
>> true?
>
> The condition is needed to take substitution groups into account: you can
> have a substitution group where the head never occurs in the XML data, but
> the rule is written against the head element.

I missed the substitution. Now I'm actually a bit confused. Is the
substitution mentioned in 3a related to the XSD element that matches
with C? If so, this connection should be made explicit. Otherwise, I
think it should be made clear what the substitution is, because if you
can just use any substitution, you always have a match.
Then, in 3b you do not speak about substitutions, but C must match the
name of e directly. So, here the condition does seem redundant. Or did
you just forget the substitution here?

>
>> 23- definition in section 4.1.2: right now I cannot foresee the
>> consequences of condition 4. It seems that including all possible XML
>> datatypes is a problem, for example we already identified that the
>> duration datatype poses a problem for RIF. The question is whether
>> there are possible other datatypes that pose problems. Datatypes that
>> are derived from types that are in RIF do not need to be included in
>> DTS, since their value spaces are necessarily subsets of D_Ind and
>> there are syntactic representations of all the values.
>> For this round of publication, I would suggest to add at least an
>> editor's note saying that the condition will be further refined in
>> future versions.
>
> The condition is not about including all possible XML datatypes, but the ones
> that are used in the XML data doc or the associated XML schema.
>
> The datatypes that were problematic for DTB, were problematic because they
> were not usually implemented, or consistent with the ones implemented, in most
> or mainstream rule engines.

There is also the semantic problem of duration: the definition of the
datatype makes things ambiguous, so you, in the end, do not know what
the entailments are.

>
> But if a data doc or a schema uses a datatype that your implementation does
> not support, you're in trouble if you want to use it anyway, so I do not think
> this is a problem...
>
> Anyway, I certainly have nothing against an editor's note to call attention
> to, and ask feedback on, possibly unforeseen consequences.

Ok, good.
Then, you did not respond to the second part of my comment:

>> Datatypes that
>> are derived from types that are in RIF do not need to be included in
>> DTS, since their value spaces are necessarily subsets of D_Ind and
>> there are syntactic representations of all the values.


>
>> Editorial comments:
>>
>> 101- Sec 3.1, 4th paragraph: references should be included that
>> explain what general and external parsed entities are and how they are
>> expanded
>
> Yes, many references need be added. I will add that one by tomorrow noon.
>
>> 102- There is a definition of an "instance of the data model", but not
>> of the data model. Given that there is no such definition, I think it
>> unwise to speak about instances of it, since this only makes the spec
>> harder to understand
>
> Hmmm, I thought that most of section 3 was about the definition of the data
> model...

Aha! I did not realize that, because there is no definition of it. In
fact, you don't refer to it other than in the phrase "instance of the
data model". It is actually not clear how it is an instance (e.g., a
sequence of attribute information items does not appear to be an
instance); it is simply a sequence of element information items. So
why mention "the data model" at all?

>
> Sorry, I think that I do not understand your comment: can you reformulate
> it, please? Or give an example where the use of "instance of the data model"
> makes the spec harder to understand?

In the phrase, "the data model" does not give any added information to
the reader; it only serves to distract from the content.

>
>> 103- Section 4, first paragraph: why introduce the additional term
>> "interpretation" here? I would suggest to stick with the term
>> "structure", as in the other RIF specs.
>
> Semantic structures are often called interpretations. And I am more familiar
> with that term.
>
> I will remove the introduction of the term there, but I hope that you will
> allow me its use wherever else it is used :-)

Actually, not in the spec :)
I also prefer the term "interpretation", but RIF interpretations are
simply called "semantic structures", so we need to stick with the
terminology when referring to those. Of course, "combined
interpretations" are fine since they are not RIF structures, but they
/are/ pairs of RIF structures and instances.

>
>> 104- editor's note just above sec 4.1.1: yes, I think it should be
>> said explicitly
>
> Ok. I will do the change.
>
>> 105- definition in section 4.1.1: the notation {I_DM} is somewhat
>> redundant with the requirement in the definition that all references
>> in I_DM have been resolved
>
> Well, you prompted me to introduce that notation, when you remarked that it
> was the set of the elements in I_DM that had to be included in D_ind, not
> I_DM itself, which is a sequence with possibly duplicated element
> information items...

The notation is fine. The point is that you say now in two places that
all the references have been resolved; you need to say it only in one
place.

>
>> Further questions:
>>
>> 1001- Is it true that it is guaranteed that every element and every
>> attribute has a type in a PSVI infoset? In a schema it is possible to
>> write such vague things as xs:any, thereby not actually specifying the
>> type of a particular element.
>
> See http://www.w3.org/TR/xpath-datamodel/#PSVI2NodeTypes :-)
>
> Sorry, it is a bit late for me to think that clearly at this time of day. I
> will try to respond to that tomorrow.

Looking forward to the response :)


Cheers, Jos

>
> Thanx again for the comments. The version of the draft updated to take them
> into account should be ready tomorrow by noon.
>
> Cheers,
>
> Christian
>
> IBM



-- 
Jos de Bruijn
  Web:          http://www.debruijn.net/
  LinkedIn:     http://at.linkedin.com/in/josdebruijn

Received on Wednesday, 16 June 2010 08:05:51 UTC