Re: IRIs from C. M. Sperberg-McQueen on 2018-03-12 (public-xformsusers@w3.org from March 2018)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Mon, 12 Mar 2018 11:00:37 -0600
To: Steven Pemberton <steven.pemberton@cwi.nl>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, public-xformsusers@w3.org
Message-Id: <BB18E33C-382A-46C8-A519-0977953463B8@blackmesatech.com>

> On Mar 11, 2018, at 1:51 PM, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
> 
> Thanks for this. A solid and thorough piece of work.
> 
> However, I think our problem is a little simpler; we don't need to parse the URI, only recognise if it is correct or not. This means we can greatly simplify the syntax.

The only thing the types defined in that document do is 
recognize whether the input value is correct or not.  By
‘correct’ I mean (and I assume you also mean) ‘recognized
by the grammar in the spec’.  

There are plenty of simplifications of the syntax around, but 
they don’t recognize the set of strings generated by the grammar.  
The regular expression in Appendix B of RFC 3986, for example, can
be used to recognize the gross structure of a string known to be
a correct URI (or perhaps IRI), but on examination it turns
out to accept any string of characters, so it does not distinguish
correct from incorrect.
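
To make that point concrete, here is a minimal sketch in Python (my
choice of language for illustration; nothing in the Note depends on
it).  The pattern is the expression from RFC 3986 as I transcribe it,
and the helper name appendix_b_accepts and the sample strings are
invented for this example.  The samples contain no line breaks, since
"." in Python does not match a newline by default.

```python
import re

# Regular expression from RFC 3986, transcribed into Python syntax.
# Every group is optional and the path group may be empty, so the
# expression as a whole places almost no constraint on its input.
APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def appendix_b_accepts(text: str) -> bool:
    """Return True if the expression matches the entire string."""
    return APPENDIX_B.fullmatch(text) is not None


# One plausible URI followed by strings that are clearly not URIs.
samples = ["http://example.org/a?b#c", "", "::::", "a b<>{}|", "%zz%", "??##"]

for s in samples:
    print(repr(s), appendix_b_accepts(s))
```

Every sample is reported as accepted, including the ones containing
spaces, unescaped angle brackets and malformed percent-escapes, which
is why the expression is useful for splitting a known-good URI into
components but not for validation.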

> 
> As far as I can see, the basis of the regex needed is:
> 
>   IRI-reference  = (scheme ":" [hier] | [hier-nc]) [ "?" ipch-q* ] [ "#" ipch-f* ]
> 
> where the only difference between 'hier' and 'hier-nc' is that hier-nc may have no colons before the first (if any) "/" character.
> 
> (The only difference between the characters represented by 'ipch-q' and 'ipch-f' is that ipch-q can contain characters from the private use areas.)
> 
> As I see it, XForms needs two IRI types. For the case where a user is required to type in a full web-address:
> 
>   IRI            = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ]
> 
> and where data could hold either a full IRI or a relative IRI:
> 
>   IRI-reference  = (scheme ":" [hier] | [hier-nc]) [ "?" ipch-q* ] [ "#" ipch-f* ]

It’s not clear to me whether you are proposing to simplify the 
grammar (1) by relaxing some of its constraints or (2) by omitting 
non-terminals (as in the treatment of iquery and ifragment in your 
definition of IRI-reference) and replacing complex expressions 
with simpler expressions which recognize exactly the same 
languages.

In the first case, the result will not in fact be checking URIs 
or IRIs for correctness, so on reflection I assume that that cannot
be what you have in mind.  

In the second case, you have the burden of proving the equivalence 
between the grammars in the RFCs and the expressions you are 
constructing, but you may be able to produce final regular expressions
which are simpler than those in the unpublished WG Note.
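
A cheap first check before attempting any proof is to compare the two
expressions on every string up to some small length over a small
alphabet.  The sketch below does that in Python; the function name
first_disagreement, the alphabet and the length bound are all
assumptions made for this example, and the 'scheme' production from
RFC 3986 stands in for the much larger IRI grammar.  A search like this
can only turn up counterexamples; finding none is not a proof of
equivalence.

```python
import re
from itertools import product


def first_disagreement(pattern_a, pattern_b, alphabet, max_len):
    """Return the first string (shortest first) on which the two
    patterns disagree, or None if they agree up to max_len."""
    a, b = re.compile(pattern_a), re.compile(pattern_b)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if (a.fullmatch(s) is None) != (b.fullmatch(s) is None):
                return s
    return None


# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986)
SCHEME = r"[A-Za-z][A-Za-z0-9+.-]*"
# Case (2): a restructuring intended to recognize the same language.
SCHEME_REWRITTEN = r"[A-Za-z]([A-Za-z]|[0-9]|[+.-])*"
# Case (1): a relaxation that drops the "must start with a letter" rule.
SCHEME_RELAXED = r"[A-Za-z0-9+.-]+"

print(first_disagreement(SCHEME, SCHEME_REWRITTEN, "a1+:", 5))  # prints None
print(first_disagreement(SCHEME, SCHEME_RELAXED, "a1+:", 5))    # prints 1
```

The relaxed expression is caught immediately because it accepts "1",
which the grammar rejects; the restructured one survives the bounded
search, which is encouraging but still leaves the equivalence to be
argued.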

The Note performs a few simplifications here and there but does
not attempt any broad restructuring of the grammar, since one of
its purposes is to make it easy to confirm that the type defined is
correct and accepts the same strings as the grammars in the RFCs.

It might be possible to simplify things a great deal by restructuring
the grammar, though I believe the largest contribution to the
complexity of the grammar is currently made by the definition of
ihost, which I don’t see a particularly good way to simplify.

Bear in mind that the entity names used to construct the regex
disappear without a trace when the entities are expanded; simplifying
the grammar by eliminating non-terminals like ifragment and iquery
will thus have no effect on the complexity of the final expression.
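
The following Python sketch illustrates the same effect, with string
variables standing in for the entity declarations.  The character
class IPCHAR_SKETCH is a deliberately simplified, hypothetical
stand-in and not the real ipchar of RFC 3987, and the definitions of
iquery and ifragment are simplified in the same way; only the
mechanism of expansion is being shown.

```python
import re

# Hypothetical, heavily simplified stand-in for ipchar (ASCII only).
IPCHAR_SKETCH = r"[A-Za-z0-9._~!$&'()*+,;=:@-]"

# Version 1: keep named non-terminals, as the grammar does.
iquery = f"({IPCHAR_SKETCH}|[/?])*"
ifragment = f"({IPCHAR_SKETCH}|[/?])*"
tail_named = rf"(\?{iquery})?(#{ifragment})?"

# Version 2: eliminate the non-terminals and write the tail directly.
tail_inline = rf"(\?({IPCHAR_SKETCH}|[/?])*)?(#({IPCHAR_SKETCH}|[/?])*)?"

# After expansion the two are the same string, character for character.
print(tail_named == tail_inline)
print(len(tail_named), len(tail_inline))

# Both compile to the same recognizer.
print(re.fullmatch(tail_named, "?a=b#top") is not None)
```

The names iquery and ifragment make the construction easier to read
and to check against the RFC, but they contribute nothing to the size
of the expanded expression.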

Michael

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************

Received on Monday, 12 March 2018 17:01:05 UTC