XML Query WG Review of XML Infoset Last Call from Michael Rys on 2001-02-22 (www-xml-infoset-comments@w3.org from January to March 2001)

From: Michael Rys <mrys@microsoft.com>
Date: Thu, 22 Feb 2001 10:11:40 -0800
To: "'www-xml-infoset-comments@w3.org'" <www-xml-infoset-comments@w3.org>
Cc: "W3C XML Query WG (E-mail) (E-mail)" <w3c-xml-query-wg@w3.org>
Message-ID: <EC67B042372C27429014D4FB06AC9FAF016BBF98@red-msg-29.redmond.corp.microsoft.com>
Last Call Review of the XML Information Set
===========================================

The following represents the feedback of the XML Query working group on the
XML Information Set Last Call version [1].

In general, the information set needs to strike a balance between describing
too much detailed but mostly irrelevant information and providing too much
abstraction. The current working draft has mostly achieved a good balance.
However, it is the opinion of the reviewers that it has preserved too much
information in certain places.

The following issues are ordered approximately according to the perceived
importance of the issues to the XQuery working group.

Issue 1: Namespace prefix
--------

According to the namespace specification [2], namespace prefixes have no
semantic meaning. As such there should be no requirement by the
information set to preserve the prefix used on element or attribute
information items. For namespace information items, it may be useful to
preserve the prefix, so that other processors can interpret values inside
an attribute or element as namespace references.
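
As a rough illustration of the first point, the sketch below parses two
serializations that differ only in the prefix bound to one namespace name and
compares the resulting expanded names. It uses Python's
xml.etree.ElementTree, which is not mentioned in this review, and the
namespace name urn:example:books is made up for the example.

```python
import xml.etree.ElementTree as ET

# Two serializations that differ only in the prefix bound to the same
# namespace name. "urn:example:books" is a made-up name for illustration.
DOC_A = '<a:book xmlns:a="urn:example:books" a:id="1"/>'
DOC_B = '<b:book xmlns:b="urn:example:books" b:id="1"/>'


def expanded_names(xml_text):
    """Return the element's expanded name and its attributes keyed by expanded name."""
    root = ET.fromstring(xml_text)
    return root.tag, dict(root.attrib)


print(expanded_names(DOC_A))
print(expanded_names(DOC_A) == expanded_names(DOC_B))
```

This particular parser drops the prefixes entirely, which is also why it could
not, on its own, interpret a prefixed name that appears inside an attribute
value or element content.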

In addition, the namespace prefix property should be absent rather than
empty, as many systems already report it today (see also section 2.15).

Finally, why should it be the prefix part of the element-type in the
attribute information item?

Scope: Section 2.2 point 3, Section 2.3 point 3, Section 2.15 point 1.

Issue 2: In-scope namespaces
--------

It would probably be better if only the newly declared namespace information
items were provided on an element information item. This would guarantee
locality and would allow operations that change an infoset to have only
local impact. The in-scope namespaces as currently defined can easily be
inferred from that information.
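
The inference could look roughly like the following sketch. The Element class
and its field names are hypothetical and are not taken from the Infoset
draft; the sketch also ignores the implicit xml prefix and the xmlns=""
undeclaration raised in issue 3.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Element:
    """Hypothetical element item that stores only its own namespace declarations."""
    name: str
    declared: Dict[str, str] = field(default_factory=dict)  # prefix -> namespace name
    parent: Optional["Element"] = None


def in_scope_namespaces(element):
    """Derive the in-scope bindings by walking from the element up to the root."""
    chain = []
    node = element
    while node is not None:
        chain.append(node)
        node = node.parent
    bindings = {}
    # Apply declarations from the root downwards so inner ones override outer ones.
    for node in reversed(chain):
        bindings.update(node.declared)
    return bindings


root = Element("root", {"a": "urn:example:a", "b": "urn:example:b"})
child = Element("child", {"a": "urn:example:a2"}, parent=root)

print(in_scope_namespaces(child))
```

With this layout, rebinding a prefix on root touches only root's own
declarations; the bindings seen by child are recomputed on demand.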

Scope: Section 2.2 point 7.

Issue 3: namespace attributes and xmlns=""
--------

Is xmlns="" represented as a namespace attribute, or is it absent? Since
xmlns="" is technically not a namespace declaration but an undeclaration,
this needs to be clarified.
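
For reference, the sketch below shows how one existing parser (Python's
xml.etree.ElementTree, chosen here only as an example) treats the
undeclaration: the inner element ends up in no namespace and no attribute is
reported for xmlns="". The namespace name is made up.

```python
import xml.etree.ElementTree as ET

# The outer element declares a default namespace; the inner one undeclares it.
DOC = '<outer xmlns="urn:example:default"><inner xmlns=""/></outer>'

root = ET.fromstring(DOC)
inner = root[0]

print(root.tag)      # expanded name carries the default namespace
print(inner.tag)     # no namespace after the undeclaration
print(inner.attrib)  # this parser reports no attribute for xmlns=""
```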

Scope: Section 2.2 point 6.

Issue 4: Character entities
--------

All single characters should be represented as character information items.
Thus, the predefined entities &lt;, &gt;, &amp;, &apos;, and &quot; do not
need to be represented with entity information items. They are simply used
to encode the corresponding character information item and should have no
special semantic standing in the infoset. Neither should any of the numeric
character references. Thus, the internal entity information items should
exclude information items for character entities.
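
A small sketch of the equivalence being argued for, using Python's
xml.etree.ElementTree as an example processor (an assumption of this
illustration, not something the draft prescribes): the predefined entities
and the matching numeric character references yield the same characters.

```python
import xml.etree.ElementTree as ET

# The same five characters, written once with the predefined entities and
# once with decimal or hexadecimal character references.
predefined = ET.fromstring("<t>&lt;&gt;&amp;&apos;&quot;</t>").text
numeric = ET.fromstring("<t>&#60;&#62;&#38;&#x27;&#x22;</t>").text

print(repr(predefined))
print(predefined == numeric)
```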

Scope: Section 2.1 point 5, Section 2.9

Issue 5: Representation of missing information in the Infoset
--------

The specification currently uses NULL to indicate missing properties. Since
the infoset can make use of a semi-structured data description, there is no
need to make use of a storage representation that is foreign to the world
of XML. The information set specification should make use of absence of a
property in such cases.

Thus, we have the following proposal:

Replace Null section with:

Missing and absent information

Some properties may sometimes be absent because they have no defined value
or are not applicable. This will be expressed by not providing the property
on the information item.

And replace all: 
if condition(x), then this property is null
with:
if condition(x), then this property is absent.
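
To make the distinction concrete, here is a minimal sketch that models an
information item as a Python dictionary. The property names and the helper
drop_null_properties are hypothetical and serve only this illustration.

```python
def drop_null_properties(item):
    """Return a copy of the item in which null-valued properties are simply absent."""
    return {name: value for name, value in item.items() if value is not None}


# Hypothetical property names, chosen only for this illustration.
null_style = {"local name": "book", "prefix": None, "base URI": None}
absent_style = drop_null_properties(null_style)

print(absent_style)
print("prefix" in absent_style)
```

A consumer of the second form tests for the presence of a property instead of
comparing its value against a special null marker.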

Scope: Section on Null in intro and all optional properties.

Issue 6: CDATA start and end markers
--------

CDATA sections are a purely syntactic device that makes it easier to write
character data that would otherwise need to be escaped with entity
references. As such, the infoset should not preserve CDATA section
boundaries.

Basically, <![CDATA[AB]]><![CDATA[C]]> should be equivalent to
<![CDATA[ABC]]>. This is important since CDATA sections may have to be
broken into two for purely syntactic reasons (whenever a ]]> occurs).
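
The following sketch illustrates both points with Python's
xml.etree.ElementTree, used here only as an example processor. The helpers
to_cdata and parse_text are hypothetical names introduced for the example.

```python
import xml.etree.ElementTree as ET


def to_cdata(text):
    """Wrap text in CDATA, starting a new section wherever ']]>' occurs in the text."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def parse_text(fragment):
    """Parse the fragment as the content of a wrapper element and return its text."""
    return ET.fromstring("<t>" + fragment + "</t>").text


# Adjacent sections and a single section carry the same character data.
print(parse_text("<![CDATA[AB]]><![CDATA[C]]>") == parse_text("<![CDATA[ABC]]>"))

# Text containing ']]>' cannot be written as one section, so it is split in two.
payload = "a[b[0]]>c"
print(to_cdata(payload))
print(parse_text(to_cdata(payload)) == payload)
```

If section boundaries were significant, the forced split in the second case
would change the infoset even though the character data is identical.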

Scope: Section 2.2 point 4, Section 2.16 and 2.17

Issue 7: normalized attribute value
---------

It may be more useful to provide the value and an indicator whether it was
normalized or not. It is also not clear how the infoset deals with entities
in attribute values. See also issue 11 below. Please clarify.
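
As background for this issue, the sketch below shows attribute-value
normalization as performed by one parser (Python's xml.etree.ElementTree,
chosen only as an example) for an attribute without a declared type. Literal
white space in the source is replaced by spaces, while a newline written as a
character reference survives, so the reported value alone does not reveal
whether normalization changed anything.

```python
import xml.etree.ElementTree as ET

# The attribute value contains a literal tab, a literal newline, and a
# newline written as the character reference &#10;.
DOC = '<t a="x\ty\nz&#10;w"/>'

value = ET.fromstring(DOC).get("a")
print(repr(value))
```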

Scope: Section 2.3 point 4

Issue 8: Attribute types and strings
--------

Aren't entities resolved to strings (at the moment)?

Shouldn't the default type be CDATA instead of missing? (If the answer is
"missing", use absence here.)

Scope: Section 2.3 point 6

Issue 9: unexpanded entity in attribute values
--------

Can entities appear in attribute values? If so, the unexpanded entity
reference information item needs to indicate that. The same applies to
entity start and end markers, if they are preserved (see issue 10).
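
For context, XML 1.0 does allow references to internal entities in attribute
values. The sketch below, which uses Python's xml.etree.ElementTree purely as
an example processor and a made-up entity named who, shows such a reference
being expanded in both an attribute value and element content, with nothing
in the result marking where the entity was used.

```python
import xml.etree.ElementTree as ET

# An internal entity referenced from an attribute value and from content.
DOC = (
    '<!DOCTYPE t [<!ENTITY who "World">]>'
    '<t greeting="Hello, &who;!">Hello, &who;!</t>'
)

root = ET.fromstring(DOC)
print(root.get("greeting"))
print(root.text)
```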

Scope: Section 2.5 point 3, Section 2.13, Section 2.14

Issue 10: Entity start and end markers
--------

We would consider the entity start and end markers to be more information
than the infoset should preserve, assuming that resolved entities are used
only for syntactic purposes. If they are to be preserved, a good use case
should be provided in the introduction.

Scope: Section 2.13 and 2.14

Issue 11: Character information item and attribute values
---------

Why are attributes and elements treated differently with respect to
character information items? Please clarify. See also issue 7.

Scope: Section 2.3 point 4, Section 2.6 point 3.

Issue 12: Document Information Item and document type declaration info item
--------

It is not clear if a document type declaration information item has to be
present if there is a document type declaration or if it may be present.
Please clarify the wording.

Scope: Section 2.1 point 1

Issue 13: Standalone Indicator
--------

The standalone indicator should make use of Boolean values instead of yes
and no. Again, the description of the infoset should not rely on concepts
that are too concrete or foreign to XML.

Scope: Section 2.1 point 7 and RDF description

Issue 14: base URI on element information item
--------

It is not clear why this property is needed or useful. Please clarify.

Scope: Section 2.2 point 8.

Issue 15: Attribute types
---------

Given that this property exists and has no meaning in the context of the PSV
Infoset, can the Infoset working group coordinate with the Schema working
group to explain why there are now two properties for types in the
PSV-Infoset?

Scope: Section 2.3 point 6

Issue 16: owner element
---------

There is no reason for a separate name. Rename it to parent or motivate the
different property name.

Scope: Section 2.3 point 7

Issue 17: RDF Schema
---------

In addition to the RDF schema, an XML Schema based non-normative description
of the infoset would be useful.

Scope: New appendix

Issue 18: Use cases and examples
---------

The spec could benefit from at least one example per information item
section and, for some of the less obvious information items, also some use
cases that show why such an information item (or property) is worth
preserving.

Scope: All of section 2.

Issue 19: Grammatical changes
---------

In the first paragraph of Section 2.2, change 'children' to 'descendants' in:
    '...and all other element information items are children of the 
    document element, either directly or indirectly.'
 
In Section 2.2, item 8, change 'may be' to 'is' in:
    '...entity is not known, this property may be null. '


[1] http://www.w3.org/TR/2001/WD-xml-infoset-20010202
[2] http://www.w3.org/TR/REC-xml-names/

--
Program Manager, SQL Server XML Technologies
mrys@microsoft.com, rys@acm.org
We store the Web and more...
Received on Thursday, 22 February 2001 13:25:09 UTC