- From: Hassan Ait-Kaci <hak@ilog.com>
- Date: Thu, 30 Oct 2008 10:27:21 -0700
- To: <public-rif-wg@w3.org>
- Message-ID: <9FC9C6B2EA71ED4B826F55AC7C8B9AAB0C3E6F67@mvmbx01.ilog.biz>
Hello,

This is an update on my on-going efforts to produce a working parser and XML serializer for the BLD Presentation Syntax: Action 564, due on October 31, 2008 (http://www.w3.org/2005/rules/wg/track/actions/564).

I had already produced such a thing for the original specs - i.e., before several changes were made that have had the effect of introducing several rather nasty ambiguities and context sensitivity, both at the lexical and syntactic levels - even just for the canonical PS (i.e., even w/o the DTB shortcuts and Adrian's Abridged PS).

I have been struggling to find workarounds for whatever snags have popped up, whenever I could figure one out. However, there still remain some tricky situations that require our attention (at least so that we produce specs that are not so uselessly complicated to implement without ad hoc hacks). It would be good for the PS Task Force to convene sometime soon to discuss these issues and how to resolve them.

Here are some examples of what I have puzzled over (this list is not exhaustive):

1) Tokenizing the argument of the Prefix and Base directives is made uselessly complex by not enclosing the IRI in double quotes (viz., it forces a lexer to *parse* IRI's - as opposed to just read them off - for no purpose whatsoever, making the lexical nature of some characters context-sensitive; for example, ':' is used as a delimiter for CURIE's but not within IRI's, and '#' is used for class membership but not within IRI's). A possible workaround is simply to double-quote them in the directives.

2) The minus sign ('-') now appears in some identifiers (e.g., ?diffdays = External(func:days-from-duration(?diffduration))). This would be no problem if we just considered '-' to be part of identifiers like '_', but it must also be seen as a literal character in order to recognize tokens such as "->" and ":-". While this is not a major hitch, it is unnecessary. (Not to mention the fact that '-' is the subtraction operator in the APS.)
A possible workaround is simply to disallow '-' in identifiers (say, using '_' instead) - as is the case in most programming languages.

3) The ANGLEBRACKIRI notation can be dealt with by declaring '<' and '>' as quote chars, but this precludes them from being used as operators or punctuation.

4) UNITERM's are defined to be either positional or attributed, but not both:

    UNITERM ::= Const '(' (TERM* | (Name '->' TERM)*) ')'

This creates an inherently unliftable reduce/reduce syntactic ambiguity:

    =============================
    STATE NUMBER: 54
    =============================
    This state has conflicts:
      Unresolved R/R conflict: choosing R82 over R84, on input 'IDENTIFIER'
      Unresolved R/R conflict: choosing R82 over R84, on input 'CLOSEPAR'
    -----------------------------
    [45] UniTerm --> Const 'OPENPAR' . UniTermBody 'CLOSEPAR'
         Preceding states: {22, 51, 95, 120, 127, 148}
         Follow set: {'CLOSEPAR'}
    [66] UniTermBody --> . Term_star
         Preceding states: {54}
    [67] UniTermBody --> . TermAttribute_star
         Preceding states: {54}
    [82] Term_star --> .
         Preceding states: {54}
         Lookahead set: {'EXTERNAL', 'NUMBER', 'LOCALNAME', 'VARIABLE', 'STRING',
                         'IDENTIFIER', 'ANGLEBRACKIRI', 'CLOSEPAR', 'OPENMETA', 'COLON'}
    [83] Term_star --> . Term_star Term
         Preceding states: {54}
    [84] TermAttribute_star --> .
         Preceding states: {54}
         Lookahead set: {'IDENTIFIER', 'CLOSEPAR'}
    [85] TermAttribute_star --> . TermAttribute_star TermAttribute
         Preceding states: {54}
    -----------------------------
    With UniTermBody, go to state 55
    With Term_star, go to state 56
    With TermAttribute_star, go to state 57

A possible workaround is to modify the rule to:

    UNITERM ::= Const '(' (TERM | (Name '->' TERM))* ')'

(i.e., accepting mixed positional and attributed term bodies), and perform a check ex post facto.
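As an illustration of the workaround proposed in item 4, here is a minimal Python sketch that accepts a mixed body and only afterwards rejects it if positional and attributed arguments are combined. It is not taken from the parser discussed in this message: the names Positional, Named, parse_body and check_body are hypothetical, and each TERM is assumed to be a single pre-lexed token, which is a strong simplification of the real grammar.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Positional:
    term: str


@dataclass
class Named:
    name: str
    term: str


Arg = Union[Positional, Named]


def parse_body(tokens: List[str]) -> List[Arg]:
    """Parse (TERM | Name '->' TERM)* from a flat token list.

    One token of lookahead decides whether the current token starts a
    named argument; no choice between two empty lists is ever needed.
    Assumption: every TERM is exactly one token.
    """
    args: List[Arg] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i + 1] == "->":
            if i + 2 >= len(tokens):
                raise ValueError("missing term after '->'")
            args.append(Named(tokens[i], tokens[i + 2]))
            i += 3
        else:
            args.append(Positional(tokens[i]))
            i += 1
    return args


def check_body(args: List[Arg]) -> str:
    """Ex post facto check: a body must not mix the two argument kinds."""
    kinds = {type(a) for a in args}
    if len(kinds) > 1:
        raise ValueError("UNITERM mixes positional and named arguments")
    if not kinds:
        return "empty"
    return "named" if Named in kinds else "positional"
```

With this shape, the empty body '()' is simply an empty argument list, so the parser never has to choose between an empty Term_star and an empty TermAttribute_star, which is where the reported conflict arises.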
5) According to http://www.w3.org/TR/rif-bld/#sec-ebnf-condition-language:

    An IRICONST is the special case of a Const with the symbol space rif:iri, again permitting the shortcut forms defined in http://www.w3.org/TR/rif-bld/#ref-rif-dtb. One such specialization is '"' IRI '"^^' 'rif:iri' from the Const production, where IRI is a sequence of Unicode characters that forms an internationalized resource identifier as defined by http://www.w3.org/TR/rif-bld/#ref-rfc-3987.

However, this definition complicates tokenizing, as it becomes impossible to distinguish the special case from the general one. A possible workaround is to see an IRICONST as just a fully qualified constant; i.e., accepting even symbol spaces other than "rif:iri" and performing the check ex post facto.

Again, this is not an exhaustive list of issues. Be those as they may, I will continue working on trying to produce a working [A]PS parser as my time permits while on the road (I have been traveling and will be until Nov. 11). It would be good for the PS Task Force to discuss and find resolutions to all such issues.

Regards,

-hak
--
Hassan Aït-Kaci * ILOG, Inc. - Product Division R&D
http://koala.ilog.fr/wiki/bin/view/Main/HassanAitKaci
Received on Thursday, 30 October 2008 17:28:04 UTC