- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Wed, 14 Mar 2018 12:09:22 +0100
- To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, "Steven Pemberton" <steven.pemberton@cwi.nl>
- Cc: public-xformsusers@w3.org
On Wed, 14 Mar 2018 11:17:13 +0100, Steven Pemberton <steven.pemberton@cwi.nl> wrote:

> I guess what I meant, and was probably being too optimistic, is that the
> syntax in RFC 3987 is about parsing, and therefore contains several
> rules that are only there for semantic reasons (and a couple that are
> apparently there for no reason at all; for instance 'ipath' doesn't seem
> to be used anywhere), and that led me to suspect that they could be
> combined so that the same strings were still recognised, while losing
> the semantic information, and thus reduce the total size of the
> necessary regexp.

Here is a small example of that:

    ihost = IP-literal / IPv4address / ireg-name

The string 192.168.0.1 matches both IPv4address and ireg-name, so there is no need for the rule for IPv4address.

Steven

>
> But on further study, I think I was indeed overly optimistic. I agree
> that 'ihost' is a dragon (and actually ought to be larger, since they
> seemed to drop the ball on 'ireg-name'), and the huge size of any rule
> for the character sets don't help either.
>
> However, despite the large size of the necessary regexp, I still think
> it is a necessary and useful addition to XForms to define proper
> validity checks for URIs/IRIs.
>
> Steven
>
> On Mon, 12 Mar 2018 18:00:37 +0100, C. M. Sperberg-McQueen
> <cmsmcq@blackmesatech.com> wrote:
>
>>
>>> On Mar 11, 2018, at 1:51 PM, Steven Pemberton
>>> <steven.pemberton@cwi.nl> wrote:
>>>
>>> Thanks for this. A solid and thorough piece of work.
>>>
>>> However, I think our problem is a little simpler; we don't need to
>>> parse the URI, only recognise if it is correct or not. This means we
>>> can greatly simplify the syntax.
>>
>> The only thing the types defined in that document do is
>> recognize whether the input value is correct or not. By
>> ‘correct’ I mean (and I assume you also mean) ‘recognized
>> by the grammar in the spec’.
>>
>> There are plenty of simplifications of the syntax around, but
>> they don’t recognize the set of strings generated by the grammar.
>> The regular expression in Annex B of 3986, for example, can
>> be used to recognize the gross structure of a string known to be
>> a correct URI (or perhaps IRI), but on examination it turns
>> out to accept any string of characters, so it does not distinguish
>> correct from incorrect.
>>
>>>
>>> As far as I can see, the basis of the regex needed is:
>>>
>>> IRI-reference = (scheme ":" [hier] | [hier-nc]) [ "?" ipch-q* ] [ "#" ipch-f* ]
>>>
>>> where the only difference between 'hier' and 'hier-nc' is that hier-nc
>>> may have no colons before the first (if any) "/" character.
>>>
>>> (The only difference between the characters represented by 'ipch-q'
>>> and 'ipch-f' is that ipch-q can contain characters from the private
>>> use areas.)
>>>
>>> As I see it, XForms needs two IRI types. For the case where a user is
>>> required to type in a full web-address:
>>>
>>> IRI = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ]
>>>
>>> and where data could hold either a full IRI or a relative IRI:
>>>
>>> IRI-reference = (scheme ":" [hier] | [hier-nc]) [ "?" ipch-q* ] [ "#" ipch-f* ]
>>
>>
>> It’s not clear to me whether you are proposing to simplify the
>> grammar (1) by relaxing some of its constraints or (2) by omitting
>> non-terminals (as in the treatment of iquery and ifragment in your
>> definition of IRI-reference) and replacing complex expressions
>> with simpler expressions which recognize exactly the same
>> languages.
>>
>> In the first case, the result will not in fact be checking URIs
>> or IRIs for correctness, so on reflection I assume that that cannot
>> be what you have in mind.
>>
>> In the second case, you have the burden of proving the equivalence
>> between the grammars in the RFCs and the expressions you are
>> constructing, but you may be able to produce final regular expressions
>> which are simpler than those in the unpublished WG Note.
>>
>> The Note performs a few simplifications here and there but does
>> not attempt any broad restructuring of the grammar, since one of
>> its purposes is to make it easy to confirm that the type defined is
>> correct and accepts the same strings as the grammars in the RFCs.
>>
>> It might be possible to simplify things a great deal by restructuring
>> the grammar, though I believe the largest contribution to the
>> complexity of the grammar is currently made by the definition of
>> ihost, which I don’t see a particularly good way to simplify.
>>
>> Bear in mind that the entity names used to construct the regex
>> disappear without a trace; simplification of the grammar by eliminating
>> non-terminals like ifragment and iquery will thus have no effect on
>> the complexity of the final expression.
>>
>> Michael
>>
>> ********************************************
>> C. M. Sperberg-McQueen
>> Black Mesa Technologies LLC
>> cmsmcq@blackmesatech.com
>> http://www.blackmesatech.com
>> ********************************************
>>
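
Two claims in this thread can be spot-checked mechanically: that 192.168.0.1 matches both IPv4address and ireg-name, and that the regular expression in Appendix B of RFC 3986 (called Annex B above) accepts arbitrary strings. What follows is only a rough sketch in Python, not the regexes from the WG Note. It assumes the ASCII-only part of ireg-name is enough for the comparison (ucschar is left out), and the names IREG_NAME_ASCII, is_ipv4address, is_ireg_name_ascii and annex_b_accepts are made up for illustration.

```python
import re

# dec-octet and IPv4address, transcribed from the ABNF in RFC 3986, section 3.2.2.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

# ASCII-only subset of ireg-name: unreserved / pct-encoded / sub-delims.
# Assumption: ucschar is omitted, so this accepts fewer strings than the
# full ireg-name of RFC 3987, never more.
IREG_NAME_ASCII = re.compile(
    r"(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2}|[!$&'()*+,;=])*"
)

# The component-splitting regular expression from Appendix B of RFC 3986.
ANNEX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def is_ipv4address(s: str) -> bool:
    """True if the whole string matches the IPv4address rule."""
    return IPV4ADDRESS.fullmatch(s) is not None


def is_ireg_name_ascii(s: str) -> bool:
    """True if the whole string matches the ASCII subset of ireg-name."""
    return IREG_NAME_ASCII.fullmatch(s) is not None


def annex_b_accepts(s: str) -> bool:
    """True if the Appendix B expression matches the whole string."""
    return ANNEX_B.fullmatch(s) is not None


if __name__ == "__main__":
    # Columns: string, matches IPv4address, matches ireg-name (ASCII subset).
    for host in ["192.168.0.1", "0.0.0.0", "255.255.255.255", "999.1.1.1"]:
        print(host, is_ipv4address(host), is_ireg_name_ascii(host))

    # Columns: string, accepted by the Appendix B expression.
    for text in ["http://example.org/a?b#c", "::: not an IRI {}", ""]:
        print(repr(text), annex_b_accepts(text))
```

In this sketch 999.1.1.1 is rejected by IPv4address but still accepted by ireg-name. Every IPv4address string consists only of digits and dots, all of which ireg-name allows, so removing IPv4address from the ihost alternation should leave the set of recognised strings unchanged.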
Received on Wednesday, 14 March 2018 11:10:09 UTC