Re: IRIs

On Wed, 14 Mar 2018 11:17:13 +0100, Steven Pemberton  
<steven.pemberton@cwi.nl> wrote:

> I guess what I meant, and was probably being too optimistic, is that the  
> syntax in RFC 3987 is about parsing, and therefore contains several  
> rules that are only there for semantic reasons (and a couple that are  
> apparently there for no reason at all; for instance 'ipath' doesn't seem  
> to be used anywhere), and that led me to suspect that they could be  
> combined so that the same strings were still recognised, while losing  
> the semantic information, and thus reduce the total size of the  
> necessary regexp.

Here is a small example of that:

   ihost          = IP-literal / IPv4address / ireg-name

The string 192.168.0.1 matches both IPv4address and ireg-name, so the
IPv4address alternative can be dropped from ihost without changing the
set of strings recognised.
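
A rough sketch of that check in Python, for anyone who wants to try it. The
regular expressions are my own hand transcription of the ABNF for dec-octet
and IPv4address (RFC 3986) and ireg-name (RFC 3987); the names DEC_OCTET,
IPV4ADDRESS, IREG_NAME, is_ipv4address, is_ireg_name and ipv4_samples are
made up for the example, and ucschar is cut down to its BMP ranges. It only
samples boundary values, so it illustrates the claim rather than proving it;
the actual argument is simply that DIGIT and "." are both in iunreserved.

```python
import re
from itertools import product

# dec-octet and IPv4address, transcribed from the ABNF in RFC 3986.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

# ireg-name = *( iunreserved / pct-encoded / sub-delims ), from RFC 3987.
# Assumption: ucschar is limited to its BMP ranges to keep the sketch short.
IUNRESERVED = r"[A-Za-z0-9\-._~\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]"
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
SUB_DELIMS = r"[!$&'()*+,;=]"
IREG_NAME = re.compile(rf"(?:{IUNRESERVED}|{PCT_ENCODED}|{SUB_DELIMS})*")


def is_ipv4address(s: str) -> bool:
    """True if the whole string matches the IPv4address rule."""
    return IPV4ADDRESS.fullmatch(s) is not None


def is_ireg_name(s: str) -> bool:
    """True if the whole string matches the (BMP-only) ireg-name rule."""
    return IREG_NAME.fullmatch(s) is not None


def ipv4_samples():
    """Yield dotted quads built from boundary octet values (a sample, not all)."""
    octets = ["0", "1", "9", "10", "99", "100", "199", "200", "249", "250", "255"]
    for quad in product(octets, repeat=4):
        yield ".".join(quad)


# Every sampled IPv4address is also an ireg-name ...
assert all(is_ipv4address(s) and is_ireg_name(s) for s in ipv4_samples())
# ... while the converse fails, so ireg-name is the strictly larger set.
assert is_ireg_name("256.1.1.1") and not is_ipv4address("256.1.1.1")
```

Dropping the alternative loses the information that the host was an IPv4
address, which is fine for validation but not for parsing.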

Steven


>
> But on further study, I think I was indeed overly optimistic. I agree  
> that 'ihost' is a dragon (and actually ought to be larger, since they  
> seemed to drop the ball on 'ireg-name'), and the huge size of any rule  
> for the character sets doesn't help either.
>
> However, despite the large size of the necessary regexp, I still think  
> it is a necessary and useful addition to XForms to define proper  
> validity checks for URIs/IRIs.
>
> Steven
>
> On Mon, 12 Mar 2018 18:00:37 +0100, C. M. Sperberg-McQueen  
> <cmsmcq@blackmesatech.com> wrote:
>
>>
>>> On Mar 11, 2018, at 1:51 PM, Steven Pemberton  
>>> <steven.pemberton@cwi.nl> wrote:
>>>
>>> Thanks for this. A solid and thorough piece of work.
>>>
>>> However, I think our problem is a little simpler; we don't need to  
>>> parse the URI, only recognise if it is correct or not. This means we  
>>> can greatly simplify the syntax.
>>
>> The only thing the types defined in that document do is
>> recognize whether the input value is correct or not.  By
>> ‘correct’ I mean (and I assume you also mean) ‘recognized
>> by the grammar in the spec’.
>>
>> There are plenty of simplifications of the syntax around, but
>> they don’t recognize the set of strings generated by the grammar.
>> The regular expression in Appendix B of RFC 3986, for example, can
>> be used to recognize the gross structure of a string known to be
>> a correct URI (or perhaps IRI), but on examination it turns
>> out to accept any string of characters, so it does not distinguish
>> correct from incorrect.
>>
>>>
>>> As far as I can see, the basis of the regex needed is:
>>>
>>>   IRI-reference  = (scheme ":" [hier] | [hier-nc]) [ "?" ipch-q* ]
>>>                    [ "#" ipch-f* ]
>>>
>>> where the only difference between 'hier' and 'hier-nc' is that hier-nc  
>>> may have no colons before the first (if any) "/" character.
>>>
>>> (The only difference between the characters represented by 'ipch-q'  
>>> and 'ipch-f' is that ipch-q can contain characters from the private  
>>> use areas.)
>>>
>>> As I see it, XForms needs two IRI types. For the case where a user is  
>>> required to type in a full web-address:
>>>
>>>   IRI            = scheme ":" ihier-part [ "?" iquery ]
>>>                    [ "#" ifragment ]
>>>
>>> and where data could hold either a full IRI or a relative IRI:
>>>
>>>   IRI-reference  = (scheme ":" [hier] | [hier-nc]) [ "?" ipch-q* ]
>>>                    [ "#" ipch-f* ]
>>
>>
>> It’s not clear to me whether you are proposing to simplify the
>> grammar (1) by relaxing some of its constraints or (2) by omitting
>> non-terminals (as in the treatment of iquery and ifragment in your
>> definition of IRI-reference) and replacing complex expressions
>> with simpler expressions which recognize exactly the same
>> languages.
>>
>> In the first case, the result will not in fact be checking URIs
>> or IRIs for correctness, so on reflection I assume that that cannot
>> be what you have in mind.
>>
>> In the second case, you have the burden of proving the equivalence
>> between the grammars in the RFCs and the expressions you are
>> constructing, but you may be able to produce final regular expressions
>> which are simpler than those in the unpublished WG Note.
>>
>> The Note performs a few simplifications here and there but does
>> not attempt any broad restructuring of the grammar, since one of
>> its purposes is to make it easy to confirm that the type defined is
>> correct and accepts the same strings as the grammars in the RFCs.
>>
>> It might be possible to simplify things a great deal by restructuring
>> the grammar, though I believe the largest contribution to the
>> complexity of the grammar is currently made by the definition of
>> ihost, which I don’t see a particularly good way to simplify.
>>
>> Bear in mind that the entity names used to construct the regex
>> disappear without a trace; simplification of the grammar by eliminating
>> non-terminals like ifragment and iquery will thus have no effect on
>> the complexity of the final expression.
>>
>> Michael
>>
>> ********************************************
>> C. M. Sperberg-McQueen
>> Black Mesa Technologies LLC
>> cmsmcq@blackmesatech.com
>> http://www.blackmesatech.com
>> ********************************************
>>

Received on Wednesday, 14 March 2018 11:10:09 UTC