SPath and Identity Constraints from Ashok Malhotra on 2000-12-29 (www-xml-schema-comments@w3.org from October to December 2000)

From: Ashok Malhotra <petsa@us.ibm.com>
Date: Fri, 29 Dec 2000 08:29:52 -0500
To: www-xml-schema-comments@w3.org
Cc: XMLStds@us.ibm.com
Message-ID: <OFE7E97497.70647D8D-ON852569C4.0048D59D@pok.ibm.com>
I am forwarding extracts from a note written by Andy Clark, one of our
developers,
that argues that the full power of XPath cannot be supported by streaming
Schema
processors.  He also argues that the full power of XPath is overkill for
identity
constraints and proposes a subset.  Although Andy wrote the note, this is
not one
man's opinion.  A number of us in IBM feel that allowing the full power of
XPath in
identity constraints is counterproductive.

- It makes schema processors unnecessarily complex.
- Cannot be supported by SAX based processors
- Most important, the full power of XPath is not necessary
   for identifying keys.  Only a subset is useful.

All the best, Ashok

---------------------- Forwarded by Ashok Malhotra/Watson/IBM on 12/29/2000
08:18 AM ---------------------------

Andy Clark
12/29/2000 02:19 AM

To:   XMLStds
cc:
From: Andy Clark/Santa Teresa/IBM@IBMUS
Subject:  SPath and Identity Constraints


Subsetting XPath for a Streaming Parser

A streaming (or serial) parser does not buffer any data; the document
information is communicated via document handler callbacks such as
"startElement", "characters", or "endElement" and is then discarded. In
this model, an XPath Selector only has access to the element children, the
element's attributes, and character data. The XPath Selector cannot
traverse the various document axes unless the matcher buffers the document
-- the very situation we are trying to avoid. Therefore it is necessary to
describe a subset of XPath that can be used in a serial manner. We shall
call this subset Serial XPath (SPath).

In short, SPath disallows all axes, predicates, and functions that can not
be supported without buffering a portion of the document. Therefore, the
only axes supported by SPath are attribute, child, and self. Also, the use
of path expressions in predicates is limited to a single step on the
attribute axis. Functions are permitted but restricted. Some functions,
like "id()", can not be supported in SPath.

Some examples of illegal SPaths (but legal XPaths):

  ..
  a//b
  a[b]
  a[position()=last()]

Some examples of legal SPaths:

  .
  /a
  a/b
  @a
  a[@b]
  a[position()=2]

Productions for SPath and the allowed function set are included at the end
of this document.

XML Schema Identity Constraints

XML Schema Identity constraints provide an identity declaration and
referencing mechanism that is more powerful than the ID/IDREF attribute
types provided in DTDs. The identity constraints rely on XPath for
describing the location of the data values in the instance document that
should be considered uniquely or as keys and key references.

Identity constraints are associated with an element declared in the grammar
and are comprised of a "selector" and one or more "fields". Selectors are
relative paths from the declared element and must match a node-set of
elements. Fields are relative paths from the elements matched by the
selector and must match an attribute or an element of simple type.
Uniqueness is determined by comparing the data values matched by the fields
in the value space. Therefore, "1.0" and "1" are equal for decimal data
values.

An abbreviated example from the XML Schema Primer (CR):

  <element name='purchaseReport'>
    <key name='pNumKey'>
      <selector xpath='parts/part'/>
      <field xpath='@number'/>
    </key>
  </element>

In order to provide identity constraint support in an efficient manner,
SPath will be used instead of the full XPath. However, SPath requires
further restrictions to be suitable for use in XML Schema identity
constraints because of the selection requirements of selectors and fields.
We shall call this subset Restricted-SPath.

As previously stated, the selector SPath must select elements and the field
SPath must select attributes or elements of simple type. In addition,
because selectors and fields are relative paths, no SPath used for either a
selector or a field may select the root.

Productions for Restricted-SPath and the allowed function set are included
at the end of this document.

Performance Benefits and Considerations

Only supporting a subset of XPath for XML Schema identity constraints
greatly enhances the performance of the parser. Not buffering document
information reduces the amount of memory required and allows identity
constraints to be supported by streaming parsers implementing interfaces
such as XNI and SAX. Also, the limited number of allowed axes in SPath
reduces the computation time for selecting identity constraint values. So
an implementation of identity constraints using SPath can be very efficient
in both memory usage and computation time.

However, even using SPath for identity constraints degrades parser
performance. The cost is directly related to the number of identity
constraint selectors and fields that are active while parsing the document
and the depth of the element context subtree. This cost comes mainly from
the method invocation overhead for notifying selector and field of document
information.

SPath

Productions

The following list of productions are modifications to those productions
defined in the XPath Specification. Modified productions are denoted with a
prime (') and deleted productions are marked with an 'x'. I dislike the
productions in this grammar, however, because the grammar allows
nonsensical constructions such as "@a/b" and "/..", although the latter is
hard to catch anyway.

While this is a full set of productions of everything that can be supported
by a Serial XPath implementation, there are additional constraints on the
grammar. For example, only the attribute and self axes are allowed within
predicate expressions. However, these restrictions are not introduced into
this modified grammar.

[1]    LocationPath    ::=    RelativeLocationPath     |
AbsoluteLocationPath
[2']    AbsoluteLocationPath    ::=    '/' RelativeLocationPath?
[3']    RelativeLocationPath    ::=    Step     | RelativeLocationPath '/'
Step
[4]    Step    ::=    AxisSpecifier NodeTest Predicate*     |
AbbreviatedStep
[5]    AxisSpecifier    ::=    AxisName '::'     | AbbreviatedAxisSpecifier
[6']    AxisName    ::=    'attribute'     | 'child'     | 'self'
[7]    NodeTest    ::=    NameTest     | NodeType '(' ')'     |
'processing-instruction' '(' Literal ')'
[8]    Predicate    ::=    '[' PredicateExpr ']'
[9]    PredicateExpr    ::=    Expr
[10x]
[11x]
[12']    AbbreviatedStep    ::=    '.'
[13]    AbbreviatedAxisSpecifier    ::=    '@'?
[14]    Expr    ::=    OrExpr
[15']    PrimaryExpr    ::=    '(' Expr ')'     | Literal     | Number
| FunctionCall
[16]    FunctionCall    ::=    FunctionName '(' ( Argument ( ',' Argument )
* )? ')'
[17]    Argument    ::=    Expr
[18]    UnionExpr    ::=    PathExpr     | UnionExpr '|' PathExpr
[19']    PathExpr    ::=    LocationPath     | FilterExpr     | FilterExpr
'/' RelativeLocationPath
[20]    FilterExpr    ::=    PrimaryExpr     | FilterExpr Predicate
[21]    OrExpr    ::=    AndExpr     | OrExpr 'or' AndExpr
[22]    AndExpr    ::=    EqualityExpr     | AndExpr 'and' EqualityExpr
[23]    EqualityExpr    ::=    RelationalExpr     | EqualityExpr '='
RelationalExpr     | EqualityExpr '!=' RelationalExpr
[24]    RelationalExpr    ::=    AdditiveExpr     | RelationalExpr '<'
AdditiveExpr     | RelationalExpr '>' AdditiveExpr     | RelationalExpr '
<=' AdditiveExpr     | RelationalExpr '>=' AdditiveExpr
[25]    AdditiveExpr    ::=    MultiplicativeExpr     | AdditiveExpr '+'
MultiplicativeExpr     | AdditiveExpr '-' MultiplicativeExpr
[26]    MultiplicativeExpr    ::=    UnaryExpr     | MultiplicativeExpr
MultiplyOperator UnaryExpr     | MultiplicativeExpr 'div' UnaryExpr     |
MultiplicativeExpr 'mod' UnaryExpr
[27]    UnaryExpr    ::=    UnionExpr     | '-' UnaryExpr
[28']    ExprToken    ::=    '(' | ')' | '[' | ']' | '.' | '@' | ',' | '::'
| NameTest     | NodeType     | Operator     | FunctionName     | AxisName
| Literal     | Number
[29]    Literal    ::=    '"' [^"]* '"'     | "'" [^']* "'"
[30]    Number    ::=    Digits ('.' Digits?)?     | '.' Digits
[31]    Digits    ::=    [0-9]+
[32']    Operator    ::=    OperatorName     | MultiplyOperator     | '/' |
'|' | '+' | '-' | '=' | '!=' | '<' | '<=' | '>' | '>='
[33]    OperatorName    ::=    'and' | 'or' | 'mod' | 'div'
[34]    MultiplyOperator    ::=    '*'
[35]    FunctionName    ::=    QName - NodeType
[36x]
[37]    NameTest    ::=    '*'     | NCName ':' '*'     | QName
[38]    NodeType    ::=    'comment'     | 'text'     |
'processing-instruction'     | 'node'
[39]    ExprWhitespace    ::=    S

Restricted-SPath

Productions

Since the rules governing what identity constraint limit what selectors and
field expressions are allowed to select and for simplicity, I have
rewritten the productions suitable for Restricted-SPath. I have not made an
effort to keep the names of the productions in line with the standard XPath
grammar definition.

These productions are even further simplified from what is possible to
implement for the reason of further simplicity and ease of understanding.

[1]  Selector ::= SelectorPath  |  SelectorUnion
[2]  Field ::= FieldPath | FieldUnion

[3]  SelectorPath ::= ElementPath
[4]  SelectorUnion ::= SelectorPath '|' SelectorPath

[5]  FieldPath ::= ElementPath | AttributePath
[6]  FieldUnion ::= FieldPath '|' FieldPath

[7]  ElementPath ::= ElementFinalStep  |  Path '/' ElementFinalStep
[8]  ElementUnion ::= ElementPath '|' ElementPath
[9]  ElementStep ::= ( ElementOrSelfAxis Name | ContextNode ) Predicate*
[10]  ElementFinalStep ::= ( ElementOrSelfAxis QName | ContextNode )
Predicate*
[11]  ChildOrSelfAxis ::= ChildAxis?  |  SelfAxis
[12]  ChildAxis ::= 'child' '::'
[13]  SelfAxis ::= 'self' '::'
[14]  ContextNode ::= '.'

[15]  AttributePath ::= AttributeFinalStep  |  Path '/' AttributeFinalStep
[16]  AttributeUnion ::= AttributePath '|' AttributePath
[17]  AttributeStep ::= AttributeAxis Name
[18]  AttributeFinalStep ::= AttributeAxis QName
[19]  AttributeAxis ::= 'attribute' '::'  |  '@'

[20]  Path ::= ElementStep Predicate*  |  Path '/' ElementStep
[21]  Name ::= '*'  |  NCName ':' '*'  |  QName

[22]  Predicate ::= '[' PredicateExpr ']'
[23]  PredicateExpr ::= PositionExpr  |  LiteralExpr
[24]  PositionExpr ::= Number ( '=' 'position()' )?  |  'position()' '='
Number
[25]  LiteralExpr ::= AttributeStep  ( '=' Literal )?  |  Literal '='
AttributeStep
[26]  Number ::= Digit+
[27]  Literal ::= '"' [^"]* '"'  |  "'" [^']* "'"
Received on Friday, 29 December 2000 08:30:08 UTC