[BLD] PS specs amendments

Hello,

This is an update on my ongoing efforts to produce a working parser and
XML serializer for the BLD Presentation Syntax: Action 564, due on
October 31, 2008 (http://www.w3.org/2005/rules/wg/track/actions/564).

I had already produced such a thing for the original specs - i.e., before
several changes were made that have had the effect of introducing several
rather nasty ambiguities and context sensitivity, both at the lexical and
syntactic levels - even just for the canonical PS (i.e., even w/o the DTB
shortcuts and Adrian's Abridged PS).

I have been working around whatever snags have popped up whenever I could
find a way. However, some tricky situations remain that require our
attention (at least so that we produce specs that are not needlessly
complicated to implement without ad hoc hacks).

It would be good for the PS Task Force to convene sometime soon to discuss
these issues and how to resolve them.

Here are some examples of what I have puzzled over (the list is not
exhaustive):

1) Tokenizing the argument of the Prefix and Base directives is made
   needlessly complex by not enclosing the IRI in double quotes. It
   forces a lexer to *parse* IRIs - as opposed to just reading them off -
   for no purpose whatsoever, making the lexical nature of some characters
   context-sensitive (for example, ':' is used as a delimiter for CURIEs
   but not within IRIs; '#' is used for class membership, but not
   within IRIs; etc.).

   A possible workaround is simply to double-quote them in the directives.
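
   As a rough illustration of that workaround (a Python sketch, not the
   parser discussed in this message): once the IRI is quoted, one regular
   expression can read it off without inspecting ':' or '#'. The directive
   shapes Prefix(name "iri") and Base("iri"), and the names DIRECTIVE and
   read_directive, are assumptions made for this example only.

```python
import re

# Hypothetical pattern: keyword, optional prefix name, then a quoted IRI.
# Everything between the double quotes is taken verbatim, so ':' and '#'
# need no context-sensitive treatment.
DIRECTIVE = re.compile(r'\s*(Prefix|Base)\s*\(\s*(?:(\w+)\s+)?"([^"]*)"\s*\)')

def read_directive(text):
    """Return (keyword, prefix name or None, IRI) for a quoted-IRI directive."""
    m = DIRECTIVE.match(text)
    if m is None:
        raise ValueError("not a directive: %r" % text)
    return m.group(1), m.group(2), m.group(3)

print(read_directive('Prefix(ex "http://example.org/ns#")'))
print(read_directive('Base("http://example.org/base")'))
```

   The IRI is never validated here; any check against RFC 3987 could be
   done later, outside the lexer.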

2) The minus sign ('-') now appears in some identifiers (e.g., ?diffdays =
   External(func:days-from-duration(?diffduration))). This would be no
   problem if we just considered '-' to be part of identifiers like '_',
   but it must also be seen as a literal character in order to recognize
   tokens such as "->" and ":-". While this is not a major hitch, it is
   unnecessary. (Not to mention the fact that '-' is the subtraction
   operator in the APS.)
  
   A possible workaround is simply to disallow '-' in identifiers (say,
   using '_' instead) - as is the case in most programming languages.
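
   To make the clash concrete, here is a small Python sketch of a
   longest-match tokenizer built twice, once with '-' allowed in
   identifiers and once without. The token names (OP, ID, NUM, CHAR) and
   the helper make_lexer are invented for this example and are not taken
   from the specs.

```python
import re

def make_lexer(ident_pattern):
    """Build a tiny regex tokenizer around the given identifier pattern."""
    token = re.compile(
        r'\s*(?:(?P<OP>->|:-)|(?P<ID>' + ident_pattern +
        r')|(?P<NUM>\d+)|(?P<CHAR>.))')

    def lex(text):
        text, pos, out = text.rstrip(), 0, []
        while pos < len(text):
            m = token.match(text, pos)
            out.append((m.lastgroup, m.group(m.lastgroup)))
            pos = m.end()
        return out
    return lex

with_hyphen = make_lexer(r'[A-Za-z_][A-Za-z0-9_-]*')
without_hyphen = make_lexer(r'[A-Za-z_][A-Za-z0-9_]*')

print(with_hyphen('days-from-duration'))   # a single ID token
print(with_hyphen('x->1'))                 # the ID swallows '-', '->' is lost
print(without_hyphen('x->1'))              # ID, OP '->', NUM
```

   With '-' in identifiers, "x->1" only lexes as intended if whitespace
   is required before "->", which is the kind of context sensitivity
   described above.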

3) The ANGLEBRACKIRI notation can be dealt with by declaring '<' and '>'
   as quote chars, but this precludes them from being used as operators
   or punctuation.

4) UNITERM's are defined to be either positional or attributed, but not
   both:

       UNITERM ::= Const '(' (TERM* | (Name '->' TERM)*) ')'

   This creates an inherently unliftable reduce/reduce syntactic ambiguity:

       =============================
       STATE NUMBER: 54
       =============================
       This state has conflicts:

       Unresolved R/R conflict: choosing R82	over R84, 	on input 'IDENTIFIER'
       Unresolved R/R conflict: choosing R82	over R84, 	on input 'CLOSEPAR'
       -----------------------------
       [45] UniTerm --> Const 'OPENPAR' . UniTermBody 'CLOSEPAR'
	       Preceding states: {22, 51, 95, 120, 127, 148}
	       Follow set: {'CLOSEPAR'}
       [66] UniTermBody --> . Term_star
	       Preceding states: {54}
       [67] UniTermBody --> . TermAttribute_star
	       Preceding states: {54}
       [82] Term_star --> .
	       Preceding states: {54}
	       Lookahead set: {'EXTERNAL', 'NUMBER', 'LOCALNAME', 'VARIABLE', 'STRING', 'IDENTIFIER', 'ANGLEBRACKIRI', 'CLOSEPAR', 'OPENMETA', 'COLON'}
       [83] Term_star --> . Term_star Term
	       Preceding states: {54}
       [84] TermAttribute_star --> .
	       Preceding states: {54}
	       Lookahead set: {'IDENTIFIER', 'CLOSEPAR'}
       [85] TermAttribute_star --> . TermAttribute_star TermAttribute
	       Preceding states: {54}
       -----------------------------
       With UniTermBody, go to state 55
       With Term_star, go to state 56
       With TermAttribute_star, go to state 57

    A possible workaround is to modify the rule to:

       UNITERM ::= Const '(' (TERM | (Name '->' TERM))* ')'

    (i.e., accepting mixed positional and attributed term bodies), and
    perform a check ex post facto.
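
    A minimal Python sketch of that idea, under simplifying assumptions:
    the input is a flat list of already-lexed tokens, every TERM is a
    single token, and the function name parse_uniterm is made up for this
    example. The body is parsed as a mixed sequence and the positional
    vs. named check happens only afterwards.

```python
def parse_uniterm(tokens):
    """Parse Const '(' (TERM | Name '->' TERM)* ')' and check the body later."""
    if len(tokens) < 3 or tokens[1] != '(' or tokens[-1] != ')':
        raise SyntaxError("expected Const '(' ... ')'")
    const, body = tokens[0], tokens[2:-1]
    args, i = [], 0
    while i < len(body):
        # One token of lookahead separates Name '->' TERM from a plain TERM.
        if i + 1 < len(body) and body[i + 1] == '->':
            if i + 2 >= len(body):
                raise SyntaxError("missing term after '->'")
            args.append((body[i], body[i + 2]))
            i += 3
        else:
            args.append((None, body[i]))
            i += 1
    # Ex post facto check: the body must be all positional or all named.
    named = [name is not None for name, _ in args]
    if any(named) and not all(named):
        raise SyntaxError("mixed positional and named arguments in %s(...)" % const)
    return const, args

print(parse_uniterm(['f', '(', 'a', 'b', ')']))
print(parse_uniterm(['f', '(', 'n', '->', 'a', ')']))
print(parse_uniterm(['f', '(', ')']))
```

    The empty body f() goes through without any choice to make, whereas
    the original rule must decide between an empty positional list and an
    empty attribute list before seeing anything.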

5) According to http://www.w3.org/TR/rif-bld/#sec-ebnf-condition-language:

      An IRICONST is the special case of a Const with the symbol
      space rif:iri, again permitting the shortcut forms defined in
      http://www.w3.org/TR/rif-bld/#ref-rif-dtb. One such specialization
      is '"' IRI '"^^' 'rif:iri' from the Const production, where IRI is a
      sequence of Unicode characters that forms an internationalized
      resource identifier as defined by http://www.w3.org/TR/rif-bld/#ref-rfc-3987.
      
    However, this definition complicates tokenizing as it becomes
    impossible to distinguish the special case from the general one.

    A possible workaround is to see an IRICONST as just a fully qualified
    constant; i.e., accepting even symbol spaces other than rif:iri and
    performing the check ex post facto.
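
    Sketched in Python under the same caveats (a simplified constant
    syntax, hypothetical names CONST, read_const and as_iriconst, and
    xs:string used only as an example of another symbol space): the lexer
    recognizes one general form, and the rif:iri restriction is applied
    afterwards.

```python
import re

# Simplified general form: "lexical"^^prefix:local (no escapes handled).
CONST = re.compile(r'"([^"]*)"\^\^([A-Za-z_]\w*:\w+)')

def read_const(text):
    """Return (lexical form, symbol space) for any fully qualified constant."""
    m = CONST.fullmatch(text)
    if m is None:
        raise ValueError("not a constant: %r" % text)
    return m.group(1), m.group(2)

def as_iriconst(text):
    """Accept the general form, then require the rif:iri symbol space."""
    lexical, space = read_const(text)
    if space != 'rif:iri':
        raise ValueError("IRICONST requires rif:iri, got %s" % space)
    return lexical

print(read_const('"abc"^^xs:string'))
print(as_iriconst('"http://example.org/a#b"^^rif:iri'))
```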

Again, this is not an exhaustive list of issues. Be that as it may, I
will continue trying to produce a working [A]PS parser as my time
permits while on the road (I have been traveling and will be until
Nov. 11).

It would be good for the PS Task Force to discuss and find resolutions to
all such issues.

Regards,

-hak
--
Hassan Aït-Kaci  *  ILOG, Inc. - Product Division R&D
http://koala.ilog.fr/wiki/bin/view/Main/HassanAitKaci

Received on Thursday, 30 October 2008 17:28:04 UTC