- From: Steven Pemberton <steven.pemberton@cwi.nl>
- Date: Sun, 11 Mar 2018 20:51:44 +0100
- To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
- Cc: public-xformsusers@w3.org
Thanks for this. A solid and thorough piece of work.

However, I think our problem is a little simpler; we don't need to parse
the URI, only recognise if it is correct or not. This means we can greatly
simplify the syntax.

As far as I can see, the basis of the regex needed is:

    IRI-reference = (scheme ":" [hier] | [hier-nc]) [ "?" ipch-q* ] [ "#" ipch-f* ]

where the only difference between 'hier' and 'hier-nc' is that hier-nc may
have no colons before the first (if any) "/" character.

(The only difference between the characters represented by 'ipch-q' and
'ipch-f' is that ipch-q can contain characters from the private use areas.)

As I see it, XForms needs two IRI types. For the case where a user is
required to type in a full web-address:

    IRI = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ]

and where data could hold either a full IRI or a relative IRI:

    IRI-reference = (scheme ":" [hier] | [hier-nc]) [ "?" ipch-q* ] [ "#" ipch-f* ]

In passing I note one great failing of the syntax in RFC 3987: while it goes
to great lengths to define an IP address as exactly 4 numbers, separated by
dots, where the numbers are between 0 and 255, and similarly for an IPv6
address, when it comes to what it calls an 'ireg-name' (representing things
such as www.w3.org), it allows any old rubbish through, including things
like "..." "}{" and "+-.._".

Adopting a type in XForms, I would be inclined to use the definition used
for email, which is one or more 'subdomains' separated by ".".

Steven

On Thu, 08 Mar 2018 18:56:28 +0100, C. M. Sperberg-McQueen
<cmsmcq@blackmesatech.com> wrote:

> The tighter types defined in [1] may be what you want.
>
> [1]
> https://www.w3.org/XML/Group/2004/06/exacturi/xsd-rfc-3986-uri-3986-iri.html
>
> Actually, though, since it appears the access privileges for /XML/Group
> have been made more restrictive, it's not clear that the document at [1]
> is available to anyone outside the Team anymore.
> So I attach a copy,
> which I have munged to try to make it display plausibly from the
> lists.w3.org archives.
>
> The actual definitions of the types appear to be accessible in [2], [3],
> and [4].
>
> [2] http://www.w3.org/2011/04/XMLSchema/TypeLibrary-IRI-URI-driver.xsd
> [3] http://www.w3.org/2011/04/XMLSchema/TypeLibrary-URI-RFC3986.xsd
> [4] http://www.w3.org/2011/04/XMLSchema/TypeLibrary-IRI-RFC3987.xsd
>
> Note that these are marked as drafts, and contain text saying that the
> current version of the types is in the schema datatype library at [5],
> which is not true: since the draft note was never published by the Schema WG,
> the types were never added to the type library. If XForms wants to use
> them, you should probably re-issue them by revising the schema documents
> and publishing them in an appropriate location. (And if you want to
> publish [1] as a group document, that would probably be useful for
> those who need to understand how the schema documents are
> constructed.)
>
> Michael
>
>
> On Mar 8, 2018, at 7:06 AM, Steven Pemberton wrote:
>
>> But that said, the anyURI type is extremely liberal, accepting
>> literally *any* string of characters. The only purpose of the type
>> seems to be mandating transforming the characters into something
>> acceptable to a URI when necessary.
>>
>> It would still be useful to have a type that validates according to
>> http://www.ietf.org/rfc/rfc3987.txt.
>>
>> Steven
>>
>>
>> On Thu, 08 Mar 2018 13:46:33 +0100, Steven Pemberton
>> <steven.pemberton@cwi.nl> wrote:
>>
>>> You are absolutely right, and I am absolutely wrong.
>>>
>>> What led me to the conclusion was writing the test suite for anyURI,
>>> and IRIs showing up as invalid, and me then following the wrong link.
>>>
>>> So all is well, I can breathe a sigh of relief, and carry on with the
>>> test suite.
>>>
>>> I'm happy that you are reading the XForms mailing list :-)
>>>
>>> Steven
>>>
>>> On Wed, 07 Mar 2018 19:37:20 +0100, C. M.
>>> Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
>>>
>>>>
>>>>> On Mar 7, 2018, at 10:39 AM, Steven Pemberton
>>>>> <steven.pemberton@cwi.nl> wrote:
>>>>>
>>>>> The definition of anyURI doesn't allow IRIs, such as
>>>>> https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en.
>>>>>
>>>>> Just as we added an iemail address type to match modern email
>>>>> addresses, it seems to me that we ought to also add an anyIRI type
>>>>> that accepts IRIs like the above.
>>>>
>>>> I am puzzled; what leads you to the conclusion that xsd:anyURI
>>>> does not accept IRIs?
>>>>
>>>> In XSD 1.0 [1], the value space is described as that of
>>>> RFC 2396, as modified by RFC 2732, and the lexical space
>>>> is described (roughly) as the set of strings, which after
>>>> escaping, turn into URIs as defined by those specs. The
>>>> escaping in question is the then current algorithm for IRIs,
>>>> as published in the XLink spec. I believe that later revisions
>>>> of the concept of IRI changed the rules for whitespace, but
>>>> I don’t recall any other changes likely to be noticeable to
>>>> users of the datatype. Certainly the intent of XSD 1.0
>>>> was to accept IRIs in the lexical space of the type anyURI.
>>>>
>>>> The spec says "This [the mapping from lexical space to value
>>>> space] means that a wide range of internationalized resource
>>>> identifiers can be specified when an anyURI is called for”.
>>>>
>>>> In XSD 1.1 [2], the spec is a little more explicit, since the
>>>> IRI concept was a little more clearly developed by that time:
>>>> "anyURI represents an Internationalized Resource Identifier
>>>> Reference (IRI). An anyURI value can be absolute or relative,
>>>> and may have an optional fragment identifier (i.e., it may be
>>>> an IRI Reference).
>>>> This type should be used when the value
>>>> fulfills the role of an IRI, as defined in [RFC 3987] or its
>>>> successor(s) in the IETF Standards Track.”
>>>>
>>>> During the development of XSD 1.1 the WG responded to
>>>> inconsistencies in the 1.0 implementations of the anyURI
>>>> type (and, perhaps, to fears that future revisions of the RFCs
>>>> for URIs and IRIs would continue to change the set of legal
>>>> values) by seeking to simplify and future-proof the rules used
>>>> for checking schema-validity of IRIs. For reasons I do not think
>>>> I can successfully reconstruct (at least, not without falling
>>>> into depression), it chose to do so by stating clearly that the
>>>> grammar rules specified by the relevant RFCs are effectively
>>>> only advisory, and that for purposes of schema validation,
>>>> any sequence of XML characters constitutes a value of the
>>>> type.
>>>>
>>>> So in XSD 1.1 it is doubly untrue to say that IRIs are not
>>>> accepted as lexical representations of xsd:anyURI: not only
>>>> is it clearly stated that IRIs are to be accepted, but strings
>>>> that do not match the current definition of IRIs will *also*
>>>> be accepted as schema-valid.
>>>>
>>>> XForms needs its own IRI type only if stricter validation of the
>>>> grammar of URIs and IRIs is needed.
>>>>
>>>> If in fact stricter validation is needed, the XForms group may
>>>> wish to consider using the datatypes defined in “XSD datatypes
>>>> for strict validation of IRIs and URIs” [3].
>>>>
>>>> It would be very disappointing if the amount of work that went
>>>> into making xsd:anyURI accept IRIs turned out to be for
>>>> naught.
>>>>
>>>> [1] https://www.w3.org/TR/xmlschema-2/#anyURI
>>>> [2] https://www.w3.org/TR/xmlschema11-2/#anyURI
>>>> [3]
>>>> https://www.w3.org/XML/Group/2004/06/exacturi/xsd-rfc-3986-uri-3986-iri.html
>>>>
>>>> N.B. I am unable to verify URI [3], since my access privileges
>>>> no longer seem sufficient to retrieve the document.
>>>> [3] was
>>>> prepared for publication as a WG note by the then XML Schema
>>>> WG but never published, since the WG ran out of resources and
>>>> time. When the XML Core WG took over responsibility for
>>>> XSD, they decided they didn’t have the necessary resources, either.
>>>> I would be glad if the work were finally published.
>>>>
>>>> ********************************************
>>>> C. M. Sperberg-McQueen
>>>> Black Mesa Technologies LLC
>>>> cmsmcq@blackmesatech.com
>>>> http://www.blackmesatech.com
>>>> ********************************************
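
Illustrative note: the simplified IRI-reference production proposed at the top of this message (recognise, don't parse) could be sketched as a single regular expression. The Python below is only a rough sketch under stated assumptions, not the regex Steven or the XForms group adopted. The names `UCS`, `PRIVATE`, `ASCII_PCH`, `ASCII_NC` and `is_iri_reference` are hypothetical, the Unicode ranges are a coarse approximation of RFC 3987's 'ucschar' and 'iprivate', and bracketed IP literals such as "[::1]" are deliberately left out.

```python
import re

# Assumption: coarse approximations of RFC 3987 'ucschar' and 'iprivate'.
UCS = r"\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF\U00010000-\U000EFFFD"
PRIVATE = r"\uE000-\uF8FF\U000F0000-\U0010FFFD"

# ASCII unreserved + sub-delims + ":" "@" "/" (the 'ipch' characters plus "/").
ASCII_PCH = r"A-Za-z0-9\-._~!$&'()*+,;=:@/"
# Same set without ":" and "/", for the first segment of 'hier-nc'.
ASCII_NC = r"A-Za-z0-9\-._~!$&'()*+,;=@"
PCT = r"%[0-9A-Fa-f]{2}"          # percent-encoded octet

SCHEME = r"[A-Za-z][A-Za-z0-9+\-.]*"
HIER = rf"(?:[{ASCII_PCH}{UCS}]|{PCT})*"
# hier-nc: no colon before the first "/" (if any).
HIER_NC = rf"(?:[{ASCII_NC}{UCS}]|{PCT})*(?:/{HIER})?"
QUERY = rf"(?:[{ASCII_PCH}?{UCS}{PRIVATE}]|{PCT})*"   # ipch-q: private use allowed
FRAGMENT = rf"(?:[{ASCII_PCH}?{UCS}]|{PCT})*"         # ipch-f: private use excluded

IRI_REFERENCE = re.compile(
    rf"(?:{SCHEME}:{HIER}|{HIER_NC})(?:\?{QUERY})?(?:#{FRAGMENT})?"
)


def is_iri_reference(text: str) -> bool:
    """Return True if the whole string matches the simplified production."""
    return IRI_REFERENCE.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in [
        "https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en",
        "../page#sec",
        "1a:b/c",      # colon before the first "/" and no valid scheme
        "http://...",  # accepted: no host check is attempted here
    ]:
        print(sample, is_iri_reference(sample))
```

Because this sketch does no host validation at all, "http://..." is accepted, which is the 'ireg-name' weakness described above; a stricter type would add an email-style subdomain check for the host part.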
Received on Sunday, 11 March 2018 19:52:19 UTC