Re: IRIs from Steven Pemberton on 2018-03-14 (public-xformsusers@w3.org from March 2018)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Wed, 14 Mar 2018 11:17:13 +0100
To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
Cc: public-xformsusers@w3.org
Message-ID: <op.zfu2uzuosmjzpq@steven-xps>
I guess what I meant, though I was probably being too optimistic, is that
the syntax in RFC 3987 is about parsing, and therefore contains several
rules that are only there for semantic reasons (and a couple that are
apparently there for no reason at all; for instance 'ipath' doesn't seem to
be used anywhere). That led me to suspect that the rules could be combined
so that the same strings were still recognised while losing the semantic
information, thus reducing the total size of the necessary regexp.

But on further study, I think I was indeed overly optimistic. I agree that
'ihost' is a dragon (and actually ought to be larger, since they seemed to
drop the ball on 'ireg-name'), and the huge size of any rule for the
character sets doesn't help either.

However, despite the large size of the necessary regexp, I still think it  
is a necessary and useful addition to XForms to define proper validity  
checks for URIs/IRIs.

Steven

On Mon, 12 Mar 2018 18:00:37 +0100, C. M. Sperberg-McQueen  
<cmsmcq@blackmesatech.com> wrote:

>
>> On Mar 11, 2018, at 1:51 PM, Steven Pemberton <steven.pemberton@cwi.nl>  
>> wrote:
>>
>> Thanks for this. A solid and thorough piece of work.
>>
>> However, I think our problem is a little simpler; we don't need to  
>> parse the URI, only recognise if it is correct or not. This means we  
>> can greatly simplify the syntax.
>
> The only thing the types defined in that document do is
> recognize whether the input value is correct or not.  By
> ‘correct’ I mean (and I assume you also mean) ‘recognized
> by the grammar in the spec’.
>
> There are plenty of simplifications of the syntax around, but
> they don’t recognize the set of strings generated by the grammar.
> The regular expression in Appendix B of RFC 3986, for example, can
> be used to recognize the gross structure of a string known to be
> a correct URI (or perhaps IRI), but on examination it turns
> out to accept any string of characters, so it does not distinguish
> correct from incorrect.
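
[Illustration of the point above: a minimal Python sketch, assuming the
regular expression printed in Appendix B of RFC 3986. The name
split_components and the sample strings are hypothetical, chosen only for
this example. re.DOTALL is an added assumption so that "." also covers
line breaks.]

```python
import re

# The component-splitting expression from RFC 3986, Appendix B.
# Every part is optional, so it never rejects anything.
APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)


def split_components(text):
    """Return (scheme, authority, path, query, fragment) for any string."""
    match = APPENDIX_B.fullmatch(text)
    return match.group(2, 4, 5, 7, 9)


# A well-formed URI is split into its gross structure ...
print(split_components("http://example.com/a?b#c"))

# ... but clearly malformed strings are accepted just the same.
for candidate in ["", "::::", "not an IRI <>{}", "http://[bad host/ %zz"]:
    assert APPENDIX_B.fullmatch(candidate) is not None
```

[So the expression is useful for splitting a string already known to be
correct, but not as a validity check.]
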
>
>>
>> As far as I can see, the basis of the regex needed is:
>>
>>   IRI-reference  = (scheme ":" [hier] | [hier-nc]) [ "?" ipch-q* ] [ "#" ipch-f* ]
>>
>> where the only difference between 'hier' and 'hier-nc' is that hier-nc  
>> may have no colons before the first (if any) "/" character.
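
[Illustration of the 'hier-nc' restriction only: a rough Python sketch
under the assumption that "no colons before the first slash" is the whole
constraint. It does not check which characters are legal, and the names
NO_COLON_FIRST_SEGMENT and first_segment_has_no_colon are hypothetical.]

```python
import re

# First segment: any run of characters other than ":", "/", "?" and "#".
# After the first "/", colons are allowed again.
NO_COLON_FIRST_SEGMENT = re.compile(r"[^:/?#]*(?:/[^?#]*)?")


def first_segment_has_no_colon(path):
    """True if no colon appears before the first "/" (if any)."""
    return NO_COLON_FIRST_SEGMENT.fullmatch(path) is not None


assert first_segment_has_no_colon("a/b:c")      # colon only after a slash
assert first_segment_has_no_colon("/a:b")       # empty first segment
assert not first_segment_has_no_colon("a:b/c")  # would read as a scheme
```
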
>>
>> (The only difference between the characters represented by 'ipch-q' and  
>> 'ipch-f' is that ipch-q can contain characters from the private use  
>> areas.)
>>
>> As I see it, XForms needs two IRI types. For the case where a user is  
>> required to type in a full web-address:
>>
>>   IRI            = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ]
>>
>> and where data could hold either a full IRI or a relative IRI:
>>
>>   IRI-reference  = (scheme ":" [hier] | [hier-nc]) [ "?" ipch-q* ] [ "#" ipch-f* ]
>
>
> It’s not clear to me whether you are proposing to simplify the
> grammar (1) by relaxing some of its constraints or (2) by omitting
> non-terminals (as in the treatment of iquery and ifragment in your
> definition of IRI-reference) and replacing complex expressions
> with simpler expressions which recognize exactly the same
> languages.
>
> In the first case, the result will not in fact be checking URIs
> or IRIs for correctness, so on reflection I assume that that cannot
> be what you have in mind.
>
> In the second case, you have the burden of proving the equivalence
> between the grammars in the RFCs and the expressions you are
> constructing, but you may be able to produce final regular expressions
> which are simpler than those in the unpublished WG Note.
>
> The Note performs a few simplifications here and there but does
> not attempt any broad restructuring of the grammar, since one of
> its purposes is to make it easy to confirm that the type defined is
> correct and accepts the same strings as the grammars in the RFCs.
>
> It might be possible to simplify things a great deal by restructuring
> the grammar, though I believe the largest contribution to the
> complexity of the grammar is currently made by the definition of
> ihost, which I don’t see a particularly good way to simplify.
>
> Bear in mind that the entity names used to construct the regex
> disappear without a trace; simplification of the grammar by eliminating
> non-terminals like ifragment and iquery will thus have no effect on
> the complexity of the final expression.
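
[Illustration of the point above: a small Python sketch in which string
variables stand in for the entity names. It uses the ASCII-only rules of
RFC 3986 (unreserved, sub-delims, pct-encoded, pchar, query, fragment)
rather than the full RFC 3987 character sets, purely to keep it short;
the variable names are hypothetical.]

```python
import re

# Building blocks, playing the role of the named entities.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"

# Variant 1: keep 'query' and 'fragment' as named non-terminals.
QUERY = rf"(?:{PCHAR}|[/?])*"
FRAGMENT = rf"(?:{PCHAR}|[/?])*"
tail_named = rf"(?:\?{QUERY})?(?:#{FRAGMENT})?"

# Variant 2: drop the two names and write their bodies in place.
tail_inlined = rf"(?:\?(?:{PCHAR}|[/?])*)?(?:#(?:{PCHAR}|[/?])*)?"

# After expansion the names are gone: both variants are the same string,
# so removing the non-terminals saves nothing in the final expression.
assert tail_named == tail_inlined
print(len(tail_named))
```
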
>
> Michael
>
> ********************************************
> C. M. Sperberg-McQueen
> Black Mesa Technologies LLC
> cmsmcq@blackmesatech.com
> http://www.blackmesatech.com
> ********************************************
>
Received on Wednesday, 14 March 2018 10:17:47 UTC