[XML-Data] review of current draft from Jos de Bruijn on 2010-06-15 (public-rif-wg@w3.org from June 2010)

From: Jos de Bruijn <jos.debruijn@gmail.com>
Date: Tue, 15 Jun 2010 14:08:09 +0200
To: RIF <public-rif-wg@w3.org>
Message-ID: <AANLkTikJ--Xl6mpCi_52r9xFYa-LakJudoik7qtFESJS@mail.gmail.com>
Christian, all,

Herewith my review of the XML-Data document as of 2010-06-15T09:25 CEST.

Overall, I think the document is going in the right direction. I
believe it is in line with earlier discussions we had in the group
concerning RIF+XML combinations. There are, however, several issues
(mainly comments 10-23) that I think should be resolved before
publication of the document as a public working draft. Detailed comments
are below.



I will start with some issues which I believe require discussion in the group:

[update: in the current version of the document, issue 1 has been
resolved by implementing solution a)]

1- The document assumes that the location argument in an Import
directive in Core is optional (e.g., in the definition just before
section 4.1). This is not the case; in Core, the location argument is
mandatory. Thus, the document implicitly assumes an extension of Core.
I think it is not desirable to define such an extension, since it will
make the whole RIF landscape even more complex than it currently is.
Furthermore, this extension is problematic, since in the presentation
syntax it is not possible to distinguish between an Import statement
having only a location and one having only a profile.
Now, the reason for having this extension in the first place is to be
able to use an XML Schema as the data model of a ruleset without
having to specify where the XML instance data comes from. Two obvious
solutions that are in Core come to mind:
a) use a dummy URI to denote an empty XML instance document (e.g., rif:emptyXML)
b) put the XML Schema in the location field and define a profile for
XML Schema (e.g., rif:xml-schema)


2- I find it slightly awkward to have strings in the attribute
position of frame formulas. The way the semantics is defined, element
and attribute names are represented as strings in that position.
e.g., if you have <A B=""><C></C></A>

this corresponds (roughly) to the RIF formula
?x["attribute(B)"->"" "C" -> ""]

I think it would be natural to require all elements in an XML document
to have namespaces (default namespaces are easy to add). However,
attributes are a slightly more complicated issue, since the default
namespace does not apply to them. Therefore, I don't really have an
elegant solution in mind at the moment.
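
To make the rough correspondence in comment 2 concrete, here is a
minimal Python sketch. The function name frame_for and the default
variable ?x are illustrative assumptions, not taken from the XML-Data
document. It handles only the attributes and direct children of the
root element and uses the string value of each child as the slot value.

```python
import xml.etree.ElementTree as ET


def frame_for(xml_text, var="?x"):
    """Render the root element of xml_text as a rough RIF frame formula.

    Attribute names become "attribute(NAME)" strings and child element
    names become plain strings, both in the attribute position.
    """
    root = ET.fromstring(xml_text)
    slots = []
    # Attributes of the root element: "attribute(B)"->"value"
    for name, value in root.attrib.items():
        slots.append(f'"attribute({name})"->"{value}"')
    # Child elements: "C"->"string value of C"
    for child in root:
        text = "".join(child.itertext())
        slots.append(f'"{child.tag}"->"{text}"')
    return f'{var}[{" ".join(slots)}]'


print(frame_for('<A B=""><C></C></A>'))
# prints: ?x["attribute(B)"->"" "C"->""]
```

Note that both slot names are plain strings here, which is exactly the
awkwardness described above: nothing distinguishes them from ordinary
string data.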


Further substantive comments:

10- why give separate definitions for the semantics of Core+XML and
BLD+XML combinations? The semantics of RIF Core is the same as that of
BLD; the only difference between the two dialects is the syntax. I
would suggest to remove section 4.2 and say that the semantics in
section 4.1 applies to both dialects.
11- as discussed (privately), all element information items in an
instance of the data model are meant to be distinct. This must be
mentioned in the definition.
12- Is there a difference between QName and expanded QName? If so,
what is the difference?
13- section 3.2, 8. [typed value], first bullet: why do you deviate
from the XQuery data model?
14- section 4: what are XML instance and data documents, and how do
they differ from XML documents? Both notions should be defined.
15- section 4: why limit yourself to combination with only one XML
document? In fact, the Core syntax does not have this limitation, so
it is unclear how
16- a RIF document is interpreted using a semantic multi-structure,
not a semantic structure. This needs to be taken into account in the
definitions in section 4.
17- notions of consistency and entailment, based on combined
interpretations, need to be defined for RIF+XML combinations. Stating
that these notions remain unchanged from Core does not work, since you
do not have Core structures, but combined interpretations here.
18- section 4.1, 4th paragraph: constants are not "in" any lexical
space. Constants have the form l^^s, where l is a string and s an IRI
denoting a symbol space.
19- section 4.1.1, first bullet: the definition of string-matches is a
bit hard to read and overly restrictive (e.g., it does not account for
rdf:PlainLiterals without language tags). I would suggest to either
match L_dt(c) (here, L_dt is the lexical-to-value mapping of the
datatype of c) with [string value] or, better yet, just give a
semantic definition: a string s string-matches i iff s=[string value]
after white space normalization [of both s and [string value], I
presume]. Similar for the second bullet.
20- definition in sec 4.1.1, 2.: the condition does not take frame
formulas with multiple attributes, nor equality between IRIs into
account. I would suggest to work on the semantic level, giving the
definition in terms of domain elements and the I_frame mapping. Also,
when speaking about domain values, you can speak directly of strings,
rather than strings obtained from constants. Similar for bullet 3 and
the corresponding bullets in the definition in sec 4.1.2. In addition,
when using a semantic definition in sec 4.1.2, you no longer need to
do type matching; all you need to do is require that the value on the
RIF side is equal to [typed value], when discarding the type label.
21- section 4.1.3: what is the operational semantics of Core? It's not
in the Core spec.
22- definition in section 4.1.2: the first condition in both 3a and 3b
(the existence of a corresponding element in the XSD) seems redundant,
since I_DM is based on a PSVI, and so must be schema-valid. Is that
true?
23- definition in section 4.1.2: right now I cannot foresee the
consequences of condition 4. It seems that including all possible XML
datatypes is a problem, for example we already identified that the
duration datatype poses a problem for RIF. The question is whether
there are possible other datatypes that pose problems. Datatypes that
are derived from types that are in RIF do not need to be included in
DTS, since their value spaces are necessarily subsets of D_Ind and
there are syntactic representations of all the values.
For this round of publication, I would suggest to add at least an
editor's note saying that the condition will be further refined in
future versions.
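
As an illustration of the semantic definition suggested in comment 19,
here is a minimal Python sketch. It assumes that white space
normalization means collapsing runs of XML white space (space, tab,
carriage return, line feed) into a single space and trimming both ends,
applied to both s and [string value]. The names collapse_ws and
string_matches are hypothetical and do not come from the draft.

```python
import re

# XML white space characters: space, tab, carriage return, line feed.
_XML_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse_ws(s):
    """Collapse runs of XML white space to one space and trim the ends."""
    return _XML_WS_RUN.sub(" ", s).strip(" ")


def string_matches(s, string_value):
    """True iff s equals string_value after normalizing both sides."""
    return collapse_ws(s) == collapse_ws(string_value)


print(string_matches("  hello \n world ", "hello world"))  # True
print(string_matches("hello", "hello world"))              # False
```

Because the comparison is between strings in the domain rather than
between constants, no case distinction on the datatype of the constant
is needed.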

Editorial comments:

101- Sec 3.1, 4th paragraph: references should be included that
explain what general and external parsed entities are and how they are
expanded
102- There is a definition of an "instance of the data model", but not
of the data model. Given that there is no such definition, I think it
unwise to speak about instances of it, since this only makes the spec
harder to understand
103- Section 4, first paragraph: why introduce the additional term
"interpretation" here? I would suggest to stick with the term
"structure", as in the other RIF specs.
104- editor's note just above sec 4.1.1: yes, I think it should be
said explicitly
105- definition in section 4.1.1: the notation {I_DM} is somewhat
redundant with the requirement in the definition that all references
in I_DM have been resolved

Further questions:

1001- Is it true that it is guaranteed that every element and every
attribute has a type in a PSVI infoset? In a schema it is possible to
write such vague things as xs:any, thereby not actually specifying the
type of a particular element.

-- 
Jos de Bruijn
  Web:          http://www.debruijn.net/
  LinkedIn:     http://at.linkedin.com/in/josdebruijn
Received on Tuesday, 15 June 2010 12:09:03 UTC