XSD datatypes for strict validation of IRIs and URIs

1 Introduction

This document defines two families of datatypes, one designed for strict checking of strings for conformance to the grammar for Internationalized Resource Identifiers (IRIs) defined in [RFC 3987], and the other for checking against the grammar for Uniform Resource Identifiers (URIs) defined in [RFC 3986]. These datatypes can be used by any conforming XSD 1.0 or XSD 1.1 processor.

Values of the anyURI datatype defined by [XSD 1.0 Part 2: Datatypes] and [XSD 1.1 Part 2: Datatypes] carry the semantic information that they are intended to be IRIs, but the anyURI datatype does not provide firm assurance that they are in fact semantically or syntactically correct. In [XSD 1.0 Part 2: Datatypes], the type's lexical space is defined indirectly as the set of strings which, taken as input to an algorithm defined in [XML Linking Language], produce output strings which are "legal URIs" according to [RFC 2396]. Empirical studies show variation in the strictness with which XSD 1.0 processors enforce the syntactic constraints thus described, and in any case [RFC 2396] has since been made obsolete and been replaced by other specifications of URI syntax. In [XSD 1.1 Part 2: Datatypes], the anyURI datatype is loosely, not tightly, coupled to the defining documents for IRIs and URIs (which were [RFC 3987] and [RFC 3986] at the time this document was published). No syntactic checks on values of anyURI are prescribed, and the value space is described as the set of finite-length sequences of XML characters.

So while declaring an element or attribute as having type anyURI can provide a useful clue as to the meaning of the element or attribute, it does not provide any guarantees of semantic or syntactic correctness.

Checking that IRIs and IRI references are semantically correct is beyond the capacity of current automated systems. But in some contexts, it is likely to be helpful to check to see that anyURI values are in fact syntactically acceptable IRIs. There are limits to what is practical in this area: the syntactic rules for URIs (and thus for IRIs) depend on the URI scheme, and the set of recognized URI schemes is subject to change, so it is impractical to define a stable, unchanging type which checks candidate values against all the relevant rules. But values can be checked against the generic syntax for URIs and IRIs specified in [RFC 3986] and [RFC 3987]; such checks will not detect all errors in all ill-formed strings, but they will detect many. This document defines a number of IRI- and URI-related datatypes by systematically translating the augmented Backus-Naur Form (ABNF) grammar used in the RFCs into the regular-expression notation used in the XSD pattern facet.

Some applications and some XML vocabularies may impose further constraints on IRI usage: in some contexts (for example in setting a base IRI for resolution of relative references) it may be a requirement that the IRI provided be an absolute IRI, not a relative reference. This requires checking not against the generic syntax for IRI references (which is what is usually wanted for values intended to be IRIs) but against the more restrictive grammar of absolute-IRI.

This document defines several XSD datatypes corresponding to various subsets of IRIs. Most XML vocabularies, whether intended to encode information for consumption by humans or by machines, should use either anyURI or an appropriate IRI-based datatype. For completeness, however, and for use in the specialized situations where they are appropriate, analogous datatypes for URIs are also defined. The URI-based datatypes should be used only where there are compelling technical considerations that require the use of URIs and not IRIs.

The primary purpose of this document is to provide the formal definitions, in XSD notation, for the IRI- and URI-related datatypes mentioned above, in such a way as to enable interested readers to verify the equivalence between the regular expressions used to define them and the ABNF grammars used in [RFC 3986] and [RFC 3987]. This document does not attempt to describe the purpose and correct use of IRIs or URIs, or to address any of the issues relating to the internationalization of resource identifiers (or to internationalization in general). Readers seeking such guidance should consult other sources of information. The W3C Internationalization Activity has an extensive set of documents with information about internationalization.

2 Construction of patterns for syntax-checking

The datatypes defined here use the XSD pattern facet to constrain the lexical space to strings matching the appropriate construct in the ABNF grammars of [RFC 3987] and [RFC 3986]. (This is not possible for arbitrary ABNF grammars, because XSD patterns use regular expressions and thus define regular languages, while in the general case ABNF grammars define context-free languages. In the case of [RFC 3986] and [RFC 3987], however, the languages defined happen to be regular, and can therefore be represented by XSD patterns without loss of any constraints.)

The translation of ABNF constructs (as defined in [RFC 2234] and [RFC 5234] and used in [RFC 3987] and [RFC 3986]) into XSD regular expressions is largely mechanical, but can be tedious and error-prone, and the resulting regular expressions are very long. To make it easier to verify the regular expressions against the ABNF grammar, this document builds up the regular expressions piece by piece, defining an XML entity for each non-terminal symbol in the ABNF grammar. The simple correspondence between entity declarations and ABNF productions makes it easier to check that the translation is correct. Both the ABNF productions and the entity declarations are presented in small blocks of code that can be compared individually. (For a brief description of the notation and display style used, see The literate-programming notation used here (§B).)

For example, [RFC 3987] and [RFC 3986] both use the non-terminals ALPHA, DIGIT, and HEXDIG, defined in [RFC 2234] thus:

        ALPHA          =  %x41-5A / %x61-7A   ; A-Z / a-z
        DIGIT          =  %x30-39
                               ; 0-9
        HEXDIG         =  DIGIT / "A" / "B" / "C" / "D" / "E" / "F"

In [RFC 2234], each of these non-terminals denotes not a set of printable symbols but a set of integers. Section 2.3 of [RFC 2234] specifies: "Rules resolve into a string of terminal values, sometimes called characters. In ABNF a character is merely a non-negative integer."

Both [RFC 3987] and [RFC 3986] re-define the terminal symbols of ABNF as denoting characters, not integers (using the integer code points of the ISO 10646 / Unicode Universal Character Set to perform the integer → character mapping). So ALPHA, DIGIT, and HEXDIG can be translated into the regular expressions captured in the following entity declarations:

1 Definition of entities ALPHA, DIGIT, and HEXDIG

<!ENTITY ALPHA "([A-Za-z])">
<!ENTITY DIGIT "[0-9]">
<!ENTITY HEXDIG "[0-9A-Fa-f]">
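The integer-to-character mapping can be sketched in a few lines of Python. This is only an illustration: the names `ALPHA_RANGES`, `DIGIT_RANGES`, and `char_class` are hypothetical and do not appear in the schema document, and Python's `re` module stands in for an XSD pattern engine. Building the class directly from `chr` works here because none of the range endpoints is a regular-expression metacharacter.

```python
import re

# ABNF terminal ranges from RFC 2234, written as (low, high) code points.
ALPHA_RANGES = [(0x41, 0x5A), (0x61, 0x7A)]   # A-Z / a-z
DIGIT_RANGES = [(0x30, 0x39)]                 # 0-9


def char_class(ranges):
    """Map integer ranges to characters and wrap them in a character class."""
    body = "".join(f"{chr(lo)}-{chr(hi)}" for lo, hi in ranges)
    return f"[{body}]"


ALPHA = char_class(ALPHA_RANGES)
DIGIT = char_class(DIGIT_RANGES)

print(ALPHA, DIGIT)
```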

These low-level constructs are used in defining higher-level constructs. In [RFC 3986] and [RFC 3987], the non-terminal pct-encoded is defined thus:

   pct-encoded    = "%" HEXDIG HEXDIG

This can be translated into an XSD regular expression using a reference to the HEXDIG entity defined elsewhere:

2 Definition of pct-encoded

<!ENTITY pct-encoded "(&#37;&HEXDIG;&HEXDIG;)">

The entity reference to &HEXDIG; here corresponds directly to the use of the non-terminal HEXDIG in the ABNF; the entity declaration is slightly easier to verify in this form than an equivalent declaration with the entity reference already expanded:

3 Alternate definition of pct-encoded

<!ENTITY pct-encoded "(&#37;[0-9A-Fa-f][0-9A-Fa-f])">

The greater ease of verification is particularly valuable for higher level constructs. The full regular expression pattern for the non-terminal IRI is over three thousand characters long, and would be very tedious to verify in that form.
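As a rough illustration of how entity references compose, the following Python sketch expands references by plain string substitution and checks the result with `re.fullmatch`, which approximates the implicit anchoring of XSD patterns. The `ENTITIES` dictionary and the `expand` helper are hypothetical stand-ins for what an XML parser does with the DTD, and the percent sign is written literally rather than as the character reference used in the DTD.

```python
import re

# Hypothetical stand-in for the DTD: entity name -> replacement text.
ENTITIES = {
    "HEXDIG": "[0-9A-Fa-f]",
    "pct-encoded": "(%&HEXDIG;&HEXDIG;)",
}

# Matches a general entity reference such as &HEXDIG; or &pct-encoded;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9_.\-]*);")


def expand(text, entities):
    """Recursively replace each &name; with its (expanded) replacement text."""
    return ENTITY_REF.sub(
        lambda m: expand(entities[m.group(1)], entities), text
    )


PCT_ENCODED = expand("&pct-encoded;", ENTITIES)

print(PCT_ENCODED)
print(bool(re.fullmatch(PCT_ENCODED, "%2F")))
```

The expanded string is the same regular expression as the alternate definition with the entity reference already expanded; only the unexpanded form retains the visible correspondence to the ABNF production.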

Following the pattern of [RFC 3987] and [RFC 3986], this document will discuss the grammar in a generally top-down sequence. The schema document being defined follows a different order; it defines the entities bottom-up, to work around bugs in some widely used XML parsers.

Note that in the ABNF grammars of [RFC 3987] and [RFC 3986], some productions are ambiguous. The "first-match-wins" (or "greedy") matching algorithm applies. For details, see [RFC 3986]. The greedy-match rule does not affect the translation of the grammar into regular expressions for purposes of validating strings. If a string matches the ABNF grammar in more than one way, the greedy-match rule determines which internal structure to assign to the string, but it does not affect the membership of any string in the language defined by the grammar.

3 Datatypes for IRIs

3.1 Value Space

The value space of each of the types defined in this section is the set of strings recognized by the corresponding grammatical production in [RFC 3987]; the production used for each type is identified in the section on that type.

3.2 Lexical Mapping

The lexical mapping for these types, as for all datatypes derived from anyURI by restriction, is the identity mapping.

3.3 Facets

All of the IRI datatypes described here have the following constraining facet with a fixed value; this facet cannot be changed from the value shown:

whiteSpace = collapse (fixed)

Datatypes derived by restriction from any of these datatypes may specify values for the following constraining facets:

The fundamental facets of these datatypes have the following values, inherited from anyURI.

ordered = false
bounded = false
cardinality = countably infinite
numeric = false

3.4 The `IRI-reference-3987` datatype

The IRI-reference-3987 datatype includes all those strings which match the non-terminal IRI-reference in the ABNF grammar of [RFC 3987]; this includes both absolute and relative IRIs, with and without fragment identifiers. This is the datatype appropriate when it is desired to require that a string be a (potentially) legal resource identifier without further restrictions.

The ABNF grammar of IRI references in [RFC 3987] is:

IRI-reference  = IRI / irelative-ref

That is, an IRI reference is either an IRI or an internationalized relative reference. The grammar rule can be translated into a regular expression; the corresponding entity declaration is:

4 Definition of IRI-reference entity

<!ENTITY IRI-reference "(&IRI;|&irelative-ref;)">

The simple type definition for IRI-reference-3987, however, does not use the entity so defined; instead, it defines the datatype as the union of two separately defined types, IRI-3987 and relative-reference-3987. The lexical and value spaces so identified are the same, but defining the type as a union makes more explicit the relation between the class of IRI references and the two subclasses which make it up.

5 Simple type definition for IRI-reference-3987

  <xs:simpleType name="IRI-reference-3987">
    <xs:annotation>
      <xs:documentation xmlns="http://www.w3.org/1999/xhtml">
        <p>
          The <tt>IRI-reference-3987</tt> type checks the
          string against the regex grammar for IRI references
          in RFC 3987 Section 2.2.  This is the one most 
          users are likely to want when they say they want
          a generic URI or IRI type.
        </p>
        <p>The rule in the grammar is:</p>
        <pre>
          IRI-reference = IRI / irelative-ref
        </pre>
        <p>Rather than write this as a single pattern,
        however, we will just take a union of the two
        types already defined.</p>
      </xs:documentation>  
    </xs:annotation>
    <xs:union memberTypes="lib:IRI-3987 
                           lib:relative-reference-3987"/>
  </xs:simpleType>

3.5 The `IRI-3987` datatype

The IRI-3987 datatype includes all those strings which match the non-terminal IRI in the ABNF grammar of [RFC 3987]; this includes absolute IRIs with and without fragment identifiers. It excludes relative references and is thus appropriate only under special circumstances.

The ABNF grammar of IRIs in [RFC 3987] is:

IRI            = scheme ":" ihier-part [ "?" iquery ]
                      [ "#" ifragment ]

An IRI consists of a scheme, a colon, and an internationalized hierarchical part, optionally followed by a literal question mark and an internationalized query, and then (again optionally) by a literal hash mark and an internationalized fragment. The equivalent regular expression is used as the replacement text for the entity IRI:

6 Definition of IRI entity

<!ENTITY IRI 
  "(&scheme;:&ihier-part;((\?&iquery;))?((#&ifragment;))?)">

The simple type definition for the IRI-3987 datatype restricts the built-in anyURI type by requiring that values conform to the pattern defined by the regular expression in the replacement text of the entity IRI.

7 Simple type definition for IRI-3987

  <xs:simpleType name="IRI-3987">
    <xs:annotation>
        <xs:documentation xmlns="http://www.w3.org/1999/xhtml">
          <p>
            The IRI-3987 type checks the string against the
            regex grammar for IRI in RFC 3987 Section 2.2.
          </p>
          <p>
            Note that the grammar for IRI is essentially the
            same as that for absolute IRIs, with the
            addition of an optional hash mark (#) and
            fragment identifier:
          </p>
          <pre>
            IRI = scheme 
                  ":" ihier-part 
                  [ "?" iquery ] 
                  [ "#" ifragment ]
          </pre>
        </xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:anyURI">
      <xs:pattern value="&IRI;"/>
    </xs:restriction>
  </xs:simpleType>

The hierarchical part, query, and fragment can also occur in other top-level constructs; they are described in later sections (The hierarchical part (§3.8.2), The query (§3.8.6), and The fragment identifier (§3.8.7), respectively).

3.6 The `absolute-IRI-3987` datatype

The datatype absolute-IRI-3987 includes all and only those strings which match the absolute-IRI grammar production of [RFC 3987].

The ABNF grammar of absolute IRIs in [RFC 3987] is:

absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

This differs from the IRI construct only in omitting the optional hash mark and fragment identifier. The corresponding entity declaration is:

8 Definition of absolute-IRI

<!ENTITY absolute-IRI "(&scheme;:&ihier-part;((\?&iquery;))?)">

The simple type definition defines absolute-IRI-3987 as a restriction of anyURI to the strings matching the pattern.

9 Simple type definition for absolute-IRI-3987

  <xs:simpleType name="absolute-IRI-3987">
    <xs:annotation>
      <xs:documentation xmlns="http://www.w3.org/1999/xhtml">
        <p>
          The <tt>absolute-IRI-3987</tt> type checks
          the string against the regex grammar for
          absolute IRIs in RFC 3987 Section 2.2.
        </p>
        <p>The grammar is very like that for IRI, but it does
          not allow a fragment identifier.</p>
      </xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:anyURI">
      <xs:pattern value="&absolute-IRI;"/>         
    </xs:restriction>
  </xs:simpleType>

3.7 The `relative-reference-3987` datatype

The datatype relative-reference-3987 includes the set of internationalized relative references, which are all and only those strings which match the irelative-ref production of [RFC 3987].

The ABNF grammar of internationalized relative references in [RFC 3987] is:

irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

The corresponding entity declaration is:

10 Definition of irelative-ref

<!ENTITY irelative-ref 
  "(&irelative-part;((\?&iquery;))?((#&ifragment;))?)">

The datatype relative-reference-3987 is unlikely to be of general utility, as it includes only IRI references relative to the base IRI of a given resource. The type is defined and given a name here primarily to simplify the definition of the IRI-reference-3987 datatype (defined above, The IRI-reference-3987 datatype (§3.4)). As with the other datatypes defined here, it restricts anyURI by restricting the lexical space to those strings matching the pattern.

11 Simple type definition for relative-reference-3987

  <xs:simpleType name="relative-reference-3987">
    <xs:annotation>
        <xs:documentation xmlns="http://www.w3.org/1999/xhtml">
          <p>
            The <tt>relative-reference-3987</tt> type
            checks the string against the regex grammar for
            relative references in RFC 3987 Section 2.2.
          </p>
          <p>The top-level rules in the grammar are:</p>
          <pre>
            irelative-ref  = irelative-part 
                             [ "?" iquery ] 
                             [ "#" ifragment ]
            
            irelative-part = "//" iauthority ipath-abempty
                           / ipath-absolute
                           / ipath-noscheme
                           / ipath-empty
          </pre>
        </xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:anyURI">
      <xs:pattern value="&irelative-ref;"/>
    </xs:restriction>
  </xs:simpleType>

3.8 Common constructs in the IRI grammars

        3.8.1 The IRI scheme
        3.8.2 The hierarchical part
        3.8.3 The relative part
        3.8.4 Authority information: user info, host, and port
        3.8.5 Paths and segments
        3.8.6 The query
        3.8.7 The fragment identifier
        3.8.8 Reserved, unreserved, and other character classes

This section outlines the ABNF rules and corresponding entity declarations for the constructs referred to by more than one of the constructs IRI, IRI-reference, irelative-ref, or absolute-IRI.

3.8.1 The IRI scheme

Both [RFC 3987] and [RFC 3986] define scheme the same way:

scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

A literal translation of this production would be

<!ENTITY scheme "(&ALPHA;((&ALPHA;|&DIGIT;|\+|-|\.))*)">

In the interests of more compact regular expressions, however, the entity scheme is defined in an equivalent but terser way:

12 Definition of scheme

<!--* The URI (or IRI) scheme *-->
<!--* Same rules in RFC 3986 and RFC 3987 *-->

<!ENTITY scheme "(&ALPHA;[A-Za-z0-9+\-\.]*)">
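The claim that the two forms are equivalent can be spot-checked outside a schema processor. The sketch below uses Python's `re` module as a stand-in for an XSD pattern engine, with the ALPHA and DIGIT entities already expanded by hand; the names `SCHEME_LITERAL`, `SCHEME_COMPACT`, `is_scheme`, and `SAMPLES` are illustrative only, and a handful of samples is of course not a proof of equivalence.

```python
import re

# Literal translation of the ABNF, with ALPHA and DIGIT expanded by hand.
SCHEME_LITERAL = r"([A-Za-z](([A-Za-z]|[0-9]|\+|-|\.))*)"
# The terser form, with the alternatives merged into one character class.
SCHEME_COMPACT = r"([A-Za-z][A-Za-z0-9+\-\.]*)"


def is_scheme(s, pattern=SCHEME_COMPACT):
    """True if the whole string matches the scheme pattern."""
    return re.fullmatch(pattern, s) is not None


SAMPLES = ["http", "urn", "svn+ssh", "a.b-c", "", "1http", "ht tp", "http:"]

for s in SAMPLES:
    # Both renderings should agree on every sample.
    print(repr(s), is_scheme(s, SCHEME_LITERAL), is_scheme(s, SCHEME_COMPACT))
```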

3.8.2 The hierarchical part

The non-terminal symbol ihier-part describes the hierarchical part of an IRI. Its ABNF definition is:

ihier-part     = "//" iauthority ipath-abempty
               / ipath-absolute
               / ipath-rootless
               / ipath-empty

For legibility (in particular, to keep the line length from growing too large), the entity declaration for ihier-part breaks this declaration up into four parts, one for each line of the ABNF. A straightforward translation would be as follows.

<!ENTITY ihp-1 "(//&iauthority;&ipath-abempty;)">
<!ENTITY ihp-2 "&ipath-absolute;">
<!ENTITY ihp-3 "&ipath-rootless;">
<!ENTITY ihp-4 "&ipath-empty;">

<!ENTITY ihier-part "(&ihp-1;|&ihp-2;|&ihp-3;|&ihp-4;)">

However, the last disjunct in the production for ihier-part expands to the empty string.

ipath-empty    = 0<ipchar>

This can be rendered as follows:

<!ENTITY ipath-empty "">
<!--* ... *-->
<!ENTITY ihp-4 "&ipath-empty;">

Because ipath-empty expands to the empty string, however (as does, in consequence, also ihp-4), this is effectively the same as the following construct:

<!ENTITY ihier-part
  "(&ihp-1;|&ihp-2;|&ihp-3;|)">

The empty branch is legal in XSD regular expressions, but at least one widely used XSD validator has, in some versions, an error which causes it not to interpret the trailing empty branch correctly. The definition of ihier-part works around this problem by using an alternative formulation which omits the empty branch and makes the entire construct optional.

13 Definition of ihier-part

<!ENTITY ihp-1 "(//&iauthority;&ipath-abempty;)">
<!ENTITY ihp-2 "&ipath-absolute;">
<!ENTITY ihp-3 "&ipath-rootless;">
<!ENTITY ihp-4 "&ipath-empty;">

<!ENTITY ihier-part "(&ihp-1;|&ihp-2;|&ihp-3;)?">
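The equivalence between a trailing empty branch and an optional group can be illustrated with a small brute-force comparison. In this sketch the three non-empty branches are replaced by the placeholder expressions `a+`, `b+`, and `c+` (an assumption made purely to keep the example short), Python's `re` module stands in for an XSD pattern engine, and `same_language` is a hypothetical helper that compares two patterns on every string up to a given length.

```python
import re
from itertools import product

# Placeholders a+, b+, c+ stand in for the three non-empty branches.
WITH_EMPTY_BRANCH = r"(a+|b+|c+|)"
OPTIONAL_GROUP = r"(a+|b+|c+)?"


def same_language(p, q, alphabet="abcd", max_len=4):
    """Compare two patterns on every string over alphabet up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p, s)) != bool(re.fullmatch(q, s)):
                return False
    return True


print(same_language(WITH_EMPTY_BRANCH, OPTIONAL_GROUP))
```

Dropping the empty branch without adding the optionality marker would change the language, since the empty string would no longer match.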

The various declarations relating to the hierarchical part are gathered together in the following fragment:

14 Definition of hierarchical part of IRI

<!--* The hierarchical part of the IRI:  authority and path *-->
<!--* Authority:  user info, host, port number *-->

《 17 Definition of ihost, etc. 》

《 16 Definition of iauthority and port 》

《 23 Definition of internationalized paths 》

《 13 Definition of ihier-part 》

<!--* end of hier-part *-->

3.8.3 The relative part

The non-terminal irelative-part is almost identical to ihier-part, but it excludes the non-terminal ipath-rootless and adds ipath-noscheme.

irelative-part = "//" iauthority ipath-abempty
               / ipath-absolute
               / ipath-noscheme
               / ipath-empty

Like the translation of ihier-part, the rendering of this rule breaks up the right-hand side into parts, to keep the line-length manageable. Again, the empty branch is represented by an optionality marker on the expression as a whole, rather than as a separate branch.

15 Definition of irelative-part entity

<!ENTITY irp-1 "(//&iauthority;&ipath-abempty;)">
<!ENTITY irp-2 "&ipath-absolute;">
<!ENTITY irp-3 "&ipath-noscheme;">
<!ENTITY irp-4 "&ipath-empty;">
<!ENTITY irelative-part "(&irp-1;|&irp-2;|&irp-3;)?">


<!--* Some regexp handlers turn out to have 
* problems with the trailing empty branch, 
* so we delete it and make the entire 
* expression optional instead. The bug has been
* reported, but in the meantime let's work around it.  
*-->

3.8.4 Authority information: user info, host, and port

The authority portion of an IRI identifies the authoritative host for a given resource, along with optional user and port information. The top-level construct, along with user and port information, is defined as follows in ABNF:

iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
iuserinfo      = *( iunreserved / pct-encoded / sub-delims / ":" )
port           = *DIGIT

The equivalent regular expressions and entities are these.

16 Definition of iauthority and port

<!ENTITY port "(&DIGIT;)*">

<!ENTITY iuserinfo 
  "([&pcg-iunreserved;&pcg-sub-delims;:]|&pct-encoded;)*">

<!ENTITY iauthority "(((&iuserinfo;@))?&ihost;((:&port;))?)">

A more mechanical translation would render iuserinfo this way:

<!ENTITY iuserinfo 
  "((&iunreserved;|&pct-encoded;|&sub-delims;|:))*">

Here, as in some other places, the regular expressions merge a disjunction of character classes into a single character class. So instead of separate references to iunreserved and sub-delims, the definition of iuserinfo makes a single character class, with references to the positive character groups for those non-terminals. (For any non-terminal N which is logically a character class, an entity named pcg-N denotes the positive character group used to define N; in these cases the positive character group is, informally, the character class without the enclosing square brackets.)
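The merging of character classes can be sketched as follows. To keep the example short it uses the ASCII-only analogue from [RFC 3986] (userinfo, built from unreserved and sub-delims) rather than the internationalized iuserinfo; that substitution, the Python variable names, and the use of Python's `re` module in place of an XSD pattern engine are all assumptions of this sketch.

```python
import re

# Positive character groups: the class contents without the square brackets.
PCG_UNRESERVED = r"A-Za-z0-9\-._~"
PCG_SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f][0-9A-Fa-f]"

# Merged form: one character class built from the two groups plus ":".
USERINFO_MERGED = rf"([{PCG_UNRESERVED}{PCG_SUB_DELIMS}:]|{PCT_ENCODED})*"
# Mechanical form: one alternative per ABNF disjunct.
USERINFO_LITERAL = (
    rf"(([{PCG_UNRESERVED}]|{PCT_ENCODED}|[{PCG_SUB_DELIMS}]|:))*"
)

SAMPLES = ["user:pass", "a%20b", "", "us er", "a%2", "a@b", "x;y=z"]

for s in SAMPLES:
    merged = re.fullmatch(USERINFO_MERGED, s) is not None
    literal = re.fullmatch(USERINFO_LITERAL, s) is not None
    print(repr(s), merged, literal)
```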

The identification of the host involves a more elaborate set of grammatical rules than any other part of the grammar, primarily to account for syntactic variations introduced by IPv6. The relevant ABNF productions, and the corresponding entity declarations, are these.

ihost          = IP-literal / IPv4address / ireg-name

ireg-name      = *( iunreserved / pct-encoded / sub-delims )


IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"

IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

An internationalized host name is an IP literal, an IPv4 address, or an ireg-name (internationalized registered name). An internationalized registered name is a sequence of zero or more internationalized unreserved characters, sub-delimiters, or percent-encoded characters. An IP literal is an IPv6 or an IPvFuture address enclosed in square brackets. An IPvFuture address consists of "v", one or more hex digits, and a full stop, followed by a sequence of one or more unreserved characters, sub-delimiters, or colons.

The corresponding entity declarations are these.

17 Definition of ihost, etc.

<!--* Host:  the most elaborate part of the grammar.
    * reg-name, IPv4, IPv6, and IPvFuture.
    *-->

《 19 Definition of dec-octet 》

《 18 Definition of IPv4 and IPv6 》

<!ENTITY ireg-name 
  "((&iunreserved;|&pct-encoded;|&sub-delims;))*">

<!ENTITY ihost 
  "(&IP-literal;|&IPv4address;|&ireg-name;)">

IPv4 and IPv6 addresses are defined this way in the ABNF:

IPv6address    =                            6( h16 ":" ) ls32
               /                       "::" 5( h16 ":" ) ls32
               / [               h16 ] "::" 4( h16 ":" ) ls32
               / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
               / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
               / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
               / [ *4( h16 ":" ) h16 ] "::"              ls32
               / [ *5( h16 ":" ) h16 ] "::"              h16
               / [ *6( h16 ":" ) h16 ] "::"

h16            = 1*4HEXDIG
ls32           = ( h16 ":" h16 ) / IPv4address

IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

The corresponding entities are these. For legibility (shorter line length), each line of the rule for IPv6 is translated into a separate entity, and these entities are then aggregated. (For the same reason, the entity octet is introduced to give a shorter name for the content of dec-octet.)

18 Definition of IPv4 and IPv6

<!ENTITY octet "&dec-octet;">
<!ENTITY IPv4address "(&octet;\.&octet;\.&octet;\.&octet;)">

<!ENTITY h16 "(&HEXDIG;){1,4}">

<!ENTITY ls32 "((&h16;:&h16;)|&IPv4address;)">

<!ENTITY IPv6-1 "(((&h16;:)){6}&ls32;)">
<!ENTITY IPv6-2 "(::((&h16;:)){5}&ls32;)">
<!ENTITY IPv6-3 "((&h16;)?::((&h16;:)){4}&ls32;)">
<!ENTITY IPv6-4 "(((((&h16;:))?&h16;))?::((&h16;:)){3}&ls32;)">
<!ENTITY IPv6-5 "(((((&h16;:)){0,2}&h16;))?::((&h16;:)){2}&ls32;)">
<!ENTITY IPv6-6 "(((((&h16;:)){0,3}&h16;))?::&h16;:&ls32;)">
<!ENTITY IPv6-7 "(((((&h16;:)){0,4}&h16;))?::&ls32;)">
<!ENTITY IPv6-8 "(((((&h16;:)){0,5}&h16;))?::&h16;)">
<!ENTITY IPv6-9 "(((((&h16;:)){0,6}&h16;))?::)">

<!ENTITY IPv6-1-3 "&IPv6-1;|&IPv6-2;|&IPv6-3;">
<!ENTITY IPv6-4-6 "&IPv6-4;|&IPv6-5;|&IPv6-6;">
<!ENTITY IPv6-7-9 "&IPv6-7;|&IPv6-8;|&IPv6-9;">
<!ENTITY IPv6address "&IPv6-1-3;|&IPv6-4-6;|&IPv6-7-9;">


<!ENTITY IPvFuture 
  "(v&HEXDIG;+\.[&pcg-unreserved;&pcg-sub-delims;:]+)">

<!ENTITY IP-literal "(\[(&IPv6address;|&IPvFuture;)\])">

The declaration for IPvFuture combines multiple non-terminals into a single character class in the fashion described above.
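The nine-branch structure of the IPv6address rule can be exercised outside a schema processor. The following Python sketch rebuilds the branches from the ABNF with h16 as one to four hex digits; the helper names (`lead`, `BRANCHES`, `is_ipv6`) are hypothetical, Python's `re` module stands in for an XSD pattern engine, and the test strings are illustrative rather than exhaustive.

```python
import re

HEXDIG = "[0-9A-Fa-f]"
H16 = f"({HEXDIG}){{1,4}}"                      # h16 = 1*4HEXDIG
OCTET = "([0-9]|([1-9][0-9])|(1[0-9]{2})|(2[0-4][0-9])|(25[0-5]))"
IPV4 = rf"({OCTET}\.{OCTET}\.{OCTET}\.{OCTET})"
LS32 = f"(({H16}:{H16})|{IPV4})"                # ls32 = h16 ":" h16 / IPv4


def lead(n):
    """Optional prefix [ *n( h16 ":" ) h16 ] from the ABNF."""
    return f"((({H16}:)){{0,{n}}}{H16})?"


# One entry per line of the IPv6address production, in order.
BRANCHES = [
    f"(({H16}:)){{6}}{LS32}",
    f"::(({H16}:)){{5}}{LS32}",
    f"({H16})?::(({H16}:)){{4}}{LS32}",
    f"{lead(1)}::(({H16}:)){{3}}{LS32}",
    f"{lead(2)}::(({H16}:)){{2}}{LS32}",
    f"{lead(3)}::{H16}:{LS32}",
    f"{lead(4)}::{LS32}",
    f"{lead(5)}::{H16}",
    f"{lead(6)}::",
]
IPV6 = "(" + "|".join(f"({b})" for b in BRANCHES) + ")"


def is_ipv6(s):
    """True if the whole string matches the IPv6address pattern."""
    return re.fullmatch(IPV6, s) is not None


for s in ["::", "::1", "2001:db8::1", "::ffff:192.0.2.1", "1::2::3"]:
    print(s, is_ipv6(s))
```

Note that the pattern checks syntax only: every branch contains at most one "::", and each h16 is limited to four hex digits, so strings such as "1::2::3" and "12345::" are rejected.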

The final business remaining in the definition of authority information is to define dec-octet formally. The ABNF allows numerals for any integer between 0 and 255, inclusive, and forbids unnecessary leading zeros.

dec-octet      = DIGIT                 ; 0-9
               / %x31-39 DIGIT         ; 10-99
               / "1" 2DIGIT            ; 100-199
               / "2" %x30-34 DIGIT     ; 200-249
               / "25" %x30-35          ; 250-255

The equivalent regular expressions are these.

19 Definition of dec-octet

<!ENTITY dec-0xx "&DIGIT;|([1-9]&DIGIT;)">
<!ENTITY dec-1xx "(1(&DIGIT;){2})">
<!ENTITY dec-2xx "(2[0-4]&DIGIT;)|(25[0-5])">

<!ENTITY dec-octet "(&dec-0xx;|&dec-1xx;|&dec-2xx;)">
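Because the set of valid octets is small, the dec-octet pattern can be checked exhaustively. The sketch below mirrors the three-part structure of the entities in Python, using the `re` module as a stand-in for an XSD pattern engine; the variable names and the `is_dec_octet` helper are illustrative assumptions.

```python
import re

DIGIT = "[0-9]"
DEC_0XX = f"{DIGIT}|([1-9]{DIGIT})"            # 0-9 and 10-99
DEC_1XX = f"(1({DIGIT}){{2}})"                 # 100-199
DEC_2XX = f"(2[0-4]{DIGIT})|(25[0-5])"         # 200-249 and 250-255
DEC_OCTET = f"({DEC_0XX}|{DEC_1XX}|{DEC_2XX})"


def is_dec_octet(s):
    """True if the whole string is a decimal octet without leading zeros."""
    return re.fullmatch(DEC_OCTET, s) is not None


# Every integer from 0 to 255 should be accepted in its canonical form.
print(all(is_dec_octet(str(n)) for n in range(256)))
print(is_dec_octet("256"), is_dec_octet("01"))
```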

3.8.5 Paths and segments

There are several varieties of internationalized path, in a hierarchical or relative part of an IRI or relative reference. In ABNF:

ipath          = ipath-abempty   ; begins with "/" or is empty
               / ipath-absolute  ; begins with "/" but not "//"
               / ipath-noscheme  ; begins with a non-colon segment
               / ipath-rootless  ; begins with a segment
               / ipath-empty     ; zero characters

The translation into entity notation makes separate entities for each line of the ABNF rule, solely for legibility reasons.

20 Definition of ipath entity

<!ENTITY ip-1 "&ipath-abempty;">
<!ENTITY ip-2 "&ipath-absolute;">
<!ENTITY ip-3 "&ipath-noscheme;">
<!ENTITY ip-4 "&ipath-rootless;">
<!ENTITY ip-5 "&ipath-empty;">
<!ENTITY ipath "(&ip-1;|&ip-2;|&ip-3;|&ip-4;|&ip-5;)">

The individual forms of path are defined thus:

 
ipath-abempty  = *( "/" isegment )
ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
ipath-noscheme = isegment-nz-nc *( "/" isegment )
ipath-rootless = isegment-nz *( "/" isegment )
ipath-empty    = 0<ipchar>

The translation into entity notation is straightforward.

21 Kinds of ipath

<!ENTITY ipath-abempty "((/&isegment;))*">

<!ENTITY ipath-absolute "(/((&isegment-nz;((/&isegment;))*))?)">

<!ENTITY ipath-noscheme "(&isegment-nz-nc;((/&isegment;))*)">

<!ENTITY ipath-rootless "(&isegment-nz;((/&isegment;))*)">

<!ENTITY ipath-empty "">
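The differences among the kinds of path are easiest to see on concrete strings. The following Python sketch uses the ASCII-only analogues from [RFC 3986] (pchar rather than ipchar) so that the character classes stay short; that simplification, the variable names, and the `matches` helper are assumptions of this sketch, and Python's `re` module stands in for an XSD pattern engine.

```python
import re

# ASCII-only path character: unreserved, sub-delims, ":", "@", or pct-encoded.
PCHAR = r"([A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"
SEGMENT = f"({PCHAR})*"
SEGMENT_NZ = f"({PCHAR})+"
# Same as SEGMENT_NZ but without ":" in the character class.
SEGMENT_NZ_NC = r"([A-Za-z0-9\-._~!$&'()*+,;=@]|%[0-9A-Fa-f]{2})+"

PATH_ABEMPTY = f"((/{SEGMENT}))*"
PATH_ABSOLUTE = f"(/(({SEGMENT_NZ}((/{SEGMENT}))*))?)"
PATH_NOSCHEME = f"({SEGMENT_NZ_NC}((/{SEGMENT}))*)"
PATH_ROOTLESS = f"({SEGMENT_NZ}((/{SEGMENT}))*)"


def matches(pattern, s):
    """True if the whole string matches the pattern."""
    return re.fullmatch(pattern, s) is not None


for s in ["", "/a/b", "//a", "a:b/c", "a/b:c"]:
    print(repr(s),
          matches(PATH_ABEMPTY, s), matches(PATH_ABSOLUTE, s),
          matches(PATH_NOSCHEME, s), matches(PATH_ROOTLESS, s))
```

A string such as "/a/b" matches both the abempty and the absolute forms; this overlap does not affect validation, which depends only on whether some form matches.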

Individual segments of a path are made up of (internationalized) path characters:

isegment       = *ipchar
isegment-nz    = 1*ipchar
isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims
                     / "@" )
               ; non-zero-length segment without any colon ":"

The translation into entity notation is again straightforward.

22 Definition of isegment entity, etc.

<!ENTITY isegment "(&ipchar;)*">

<!ENTITY isegment-nz "(&ipchar;)+">

<!ENTITY isegment-nz-nc 
  "([&pcg-iunreserved;&pcg-sub-delims;@]|&pct-encoded;)+">
<!--* literal rendering:
<!ENTITY isegment-nz-nc 
  "((&iunreserved;|&pct-encoded;|&sub-delims;|@))+">
*-->

The declarations relating to paths are pulled together in the following fragment:

23 Definition of internationalized paths

<!--* Path (second major part of hier-part):  
       * first segments, then various kinds of path *-->

《 22 Definition of isegment entity, etc. 》

《 21 Kinds of ipath 》

《 20 Definition of ipath entity 》

3.8.6 The query

iquery         = *( ipchar / iprivate / "/" / "?" )

24 Definition of iquery

<!--* Query part *-->

《 29 Definition of iprivate entity 》

<!ENTITY iquery "(&ipchar;|[&pcg-iprivate;/?])*">
<!--* Literal translation of ABNF:
<!ENTITY iquery "((&ipchar;|&iprivate;|/|\?))*">
*-->

3.8.7 The fragment identifier

ifragment      = *( ipchar / "/" / "?" )

25 Definition of ifragment

<!--* Fragment part *-->

<!ENTITY ifragment "((&ipchar;|/|\?))*">

3.8.8 Reserved, unreserved, and other character classes

The lowest-level constructs in the grammar are the definitions of reserved character, unreserved character, and other character classes. This section presents the ABNF definitions of the classes and their regular-expression equivalents.

The non-terminal ipchar describes the characters usable in internationalized path expressions.

ipchar         = iunreserved / pct-encoded / sub-delims / ":"
               / "@"

The definition pulls the literals ":" and "@" into the same character class expression as the sub-delimiters; otherwise it's a literal translation of the ABNF.

26 Definition of ipchar

<!ENTITY ipchar 
  "(&iunreserved;|&pct-encoded;|[&pcg-sub-delims;:@])">
<!--* Literal translation of ABNF:
<!ENTITY ipchar 
  "(&iunreserved;|&pct-encoded;|&sub-delims;|:|@)">
*-->

The iunreserved class of characters extends the unreserved class of [RFC 3986] by adding the set of legal UCS characters.

iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

The translation groups all the characters in the class into a single character group, instead of translating the ABNF disjunction into a disjunction.

27 Definition of iunreserved character class

<!ENTITY pcg-iunreserved 
  "&pcg-unreserved;&UCS_0;&UCS_4;&UCS_8;&UCS_C;">
<!ENTITY iunreserved "[&pcg-iunreserved;]">
<!--* literal translation of ABNF 
<!ENTITY iunreserved
  "(&ALPHA;|&DIGIT;|-|\.|_|~|&ucschar;)">
*-->

The character class ucschar contains all the legal code points of UCS except those in the 7-bit ASCII / ISO 646 range, which are not all allowed and which have in any case already been dealt with.

ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
               / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
               / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
               / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
               / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
               / %xD0000-DFFFD / %xE1000-EFFFD

Note that this definition excludes both the private use areas and the surrogate code points in the range %xD800-DFFF; it includes the characters %x10000-EFFFD which lie outside the basic multilingual plane.

The translation uses several levels of entity indirection in an effort to keep the DTD more legible.

28 Definition of UCS character class

<!ENTITY ucs_01 "&#xA0;-&#xD7FF;" > 
<!ENTITY ucs_02 "&#xF900;-&#xFDCF;" >
<!ENTITY ucs_03 "&#xFDF0;-&#xFFEF;" >
<!ENTITY ucs_10 "&#x10000;-&#x1FFFD;" >
<!ENTITY ucs_20 "&#x20000;-&#x2FFFD;" >  
<!ENTITY ucs_30 "&#x30000;-&#x3FFFD;" >
<!ENTITY ucs_40 "&#x40000;-&#x4FFFD;" >  
<!ENTITY ucs_50 "&#x50000;-&#x5FFFD;" >  
<!ENTITY ucs_60 "&#x60000;-&#x6FFFD;" >
<!ENTITY ucs_70 "&#x70000;-&#x7FFFD;" >  
<!ENTITY ucs_80 "&#x80000;-&#x8FFFD;" >  
<!ENTITY ucs_90 "&#x90000;-&#x9FFFD;" >
<!ENTITY ucs_A0 "&#xA0000;-&#xAFFFD;" >  
<!ENTITY ucs_B0 "&#xB0000;-&#xBFFFD;" >  
<!ENTITY ucs_C0 "&#xC0000;-&#xCFFFD;" >
<!ENTITY ucs_D0 "&#xD0000;-&#xDFFFD;" >  
<!ENTITY ucs_E0 "&#xE1000;-&#xEFFFD;" >
<!ENTITY UCS_0 "&ucs_01;&ucs_02;&ucs_03;&ucs_10;&ucs_20;&ucs_30;">
<!ENTITY UCS_4 "&ucs_40;&ucs_50;&ucs_60;&ucs_70;">
<!ENTITY UCS_8 "&ucs_80;&ucs_90;&ucs_A0;&ucs_B0;">
<!ENTITY UCS_C "&ucs_C0;&ucs_D0;&ucs_E0;">
<!ENTITY ucschar "[&UCS_0;&UCS_4;&UCS_8;&UCS_C;]">
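The way the ranges are assembled into a single character class can be sketched outside the DTD as well. The following Python fragment is only an illustration, not part of the schema document: the names UCSCHAR_RANGES, positive_char_group, and is_ucschar are invented here, and Python's re module is not an XSD regular-expression engine (fullmatch is used to imitate the implicit anchoring of the pattern facet).

```python
import re

# Ranges copied from the ucschar production of RFC 3987.
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]


def positive_char_group(ranges):
    """Join (low, high) code point pairs into the body of a character class."""
    return "".join(f"{chr(low)}-{chr(high)}" for low, high in ranges)


# Plays the role of the ucschar entity: one bracketed class over all ranges.
UCSCHAR = re.compile("[" + positive_char_group(UCSCHAR_RANGES) + "]")


def is_ucschar(ch):
    """True if the single character ch falls in one of the ucschar ranges."""
    return UCSCHAR.fullmatch(ch) is not None
```

Keeping the ranges as data and joining them mechanically mirrors the purpose of the ucs_NN entities: each range is written once and can be checked against the ABNF line by line.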

The non-terminal iprivate recognizes the characters in the private use areas of UCS. It is used only by iquery, but conceptually it seems better to deal with it here together with the other UCS-based classes.

iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

The translation is straightforward, though it uses one level of indirection through a pcg- entity, and another level of indirection for the individual ranges.

29 Definition of iprivate entity

<!ENTITY pcg-range1 "&#xE000;-&#xF8FF;" >
<!ENTITY pcg-range2 "&#xF0000;-&#xFFFFD;" >
<!ENTITY pcg-range3 "&#x100000;-&#x10FFFD;" >
<!ENTITY pcg-iprivate "&pcg-range1;&pcg-range2;&pcg-range3;" >
<!--* literal translation: 
<!ENTITY pcg-iprivate  
"&#xE000;-&#xF8FF;&#xF0000;-&#xFFFFD;&#x100000;-&#x10FFFD;"> 
*-->
<!ENTITY iprivate "[&pcg-iprivate;]" >
<!--* literal translation: 
<!ENTITY iprivate "([&#xE000;-&#xF8FF;]|[&#xF0000;-&#xFFFFD;]|[&#x100000;-&#x10FFFD;])">
*-->
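As a rough cross-check of the ranges, the same character class can be written in Python. This is a sketch under the assumption that Python's re module treats these ranges the way an XSD processor would; PCG_IPRIVATE and is_iprivate are names chosen here for illustration only.

```python
import re

# The three private-use ranges of the iprivate production, written as the
# body of a character class (the counterpart of the pcg-iprivate entity).
PCG_IPRIVATE = "\uE000-\uF8FF\U000F0000-\U000FFFFD\U00100000-\U0010FFFD"

# Counterpart of the iprivate entity: the group wrapped in brackets.
IPRIVATE = re.compile("[" + PCG_IPRIVATE + "]")


def is_iprivate(ch):
    """True if the single character ch lies in a private use area."""
    return IPRIVATE.fullmatch(ch) is not None
```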

The unreserved character class in [RFC 3987] is taken over without change from [RFC 3986]:

unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"

For unreserved, a separate entity is defined for the positive character group, to allow it to be combined with other positive character groups in other entity declarations.

30 Definition of unreserved

<!ENTITY pcg-unreserved "A-Za-z0-9\-\._~">
<!ENTITY unreserved "[&pcg-unreserved;]">
<!--* literal translation of the ABNF:
<!ENTITY unreserved "(&ALPHA;|&DIGIT;|-|\.|_|~)">
*-->
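The benefit of naming the positive character group separately is easiest to see when two groups are combined. The sketch below shows the idea in Python; the constant names are illustrative, and the sub-delimiter group is copied from the definition of sub-delims given elsewhere in this section (without the XML escaping of the ampersand, which Python does not need).

```python
import re

# Positive character groups: the text that goes between "[" and "]".
PCG_UNRESERVED = r"A-Za-z0-9\-\._~"
PCG_SUB_DELIMS = r"!$&'()*+,;="

# A class built from one group ...
UNRESERVED = re.compile("[" + PCG_UNRESERVED + "]")

# ... and a class built by concatenating two groups inside one pair of
# brackets, which is not possible if only the bracketed form has a name.
UNRESERVED_OR_SUB_DELIM = re.compile(
    "[" + PCG_UNRESERVED + PCG_SUB_DELIMS + "]"
)
```

A disjunction of two bracketed classes would accept the same characters; the single combined class is simply shorter and gives the regular-expression engine less to do.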

The reserved characters are just the general delimiters and the sub-delimiters. [RFC 3986] notes that these "may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm."

reserved       = gen-delims / sub-delims

The definition of reserved can use the pcg- entities defined below for the two delimiter classes.

31 Definition of reserved

<!ENTITY reserved "[&pcg-gen-delims;&pcg-sub-delims;]">
<!--* literal translation of the ABNF:
<!ENTITY reserved "(&gen-delims;|&sub-delims;)">
*-->

The general delimiters are those which (in the words of [RFC 3986]) are "used as delimiters of the generic URI components" defined by that specification.

gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"

We define gen-delims indirectly, via pcg-gen-delims; this allows the class to be combined with other classes in simpler regular expressions.

32 Definition of gen-delim

<!ENTITY pcg-gen-delims ":/?#\[\]@">
<!ENTITY gen-delims "[&pcg-gen-delims;]">
<!--* literal translation of the ABNF:
<!ENTITY gen-delims "(:|/|\?|#|\[|\]|@)">
*-->

The sub-delimiters are reserved for use to delimit subcomponents within the larger-level generic components of the URI or IRI.

sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
               / "*" / "+" / "," / ";" / "="

Since sub-delims is often combined with other classes of characters in disjunctions, it's helpful to define an entity for the positive character group used in its character-class expression. So we define pcg-sub-delims for that positive character group, and define sub-delims as "[&pcg-sub-delims;]".

33 Definition of sub-delims

<!--* pcg-sub-delims: the 'positive character group' in
    * sub-delims
    * (We give it a name to make it more easily reusable.)
    *-->
<!ENTITY pcg-sub-delims "!$&amp;'()*+,;=">
<!ENTITY sub-delims "[&pcg-sub-delims;]">
<!--* literal translation:
<!ENTITY sub-delims "(!|$|&amp;|'|\(|\)|\*|\+|,|;|=)">
*-->
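The character-class form and the literal disjunction are meant to accept exactly the same characters. A small Python check of that claim might look as follows. It is a sketch only: Python's regular-expression syntax differs from XSD's in places (for example, "$" is an anchor in Python outside a character class and has to be escaped there), and the names used are invented for this illustration.

```python
import re

# Character-class form: no escaping is needed inside the brackets.
SUB_DELIMS_CLASS = re.compile(r"[!$&'()*+,;=]")

# Disjunction form: metacharacters must be escaped individually.
SUB_DELIMS_ALT = re.compile(r"(?:!|\$|&|'|\(|\)|\*|\+|,|;|=)")


def agree_on_ascii():
    """True if both forms accept the same set of 7-bit characters."""
    return all(
        bool(SUB_DELIMS_CLASS.fullmatch(chr(code)))
        == bool(SUB_DELIMS_ALT.fullmatch(chr(code)))
        for code in range(128)
    )


def accepted_ascii():
    """The 7-bit characters accepted by the class form, in code order."""
    return "".join(
        chr(code) for code in range(128)
        if SUB_DELIMS_CLASS.fullmatch(chr(code))
    )
```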

All these are pulled together by the following DTD fragment.

34 Definitions of character classes

<!--* Character classes, groups, what have you *-->
<!--* These are all unchanged from RFC 3986,
    * except for ipchar and iunreserved, which are 
    * internationalized versions of pchar and unreserved.
    *-->

《 1 Definition of entities ALPHA, DIGIT, and HEXDIG 》
《 33 Definition of sub-delims 》
《 32 Definition of gen-delim 》
《 31 Definition of reserved 》
《 30 Definition of unreserved 》

《 28 Definition of UCS character class 》
《 27 Definition of iunreserved character class 》

<!--* pct-encoded isn't really a character class, but
    * it needs to fit in here before ipchar
    *-->
《 2 Definition of pct-encoded 》
《 26 Definition of ipchar 》

3.9 The `IRI-RFC3987` schema document

        3.9.1 Overall structure
        3.9.2 The initial annotation
        3.9.3 The DTD internal subset
        3.9.4 Versioning policy

The IRI-related types defined in this document are all formally defined by the schema document at http://www.w3.org/2001/03/XMLSchema/TypeLibrary-IRI-3987.xsd, which gathers together the code fragments given above in a suitable order.

3.9.1 Overall structure

The overall structure of the schema document is as follows:

35 The IRI-RFC3987 schema document

<?xml version="1.0"?>
《 36 XML stylesheet instruction 》
《 37 Document type declaration 》
<xs:schema 
  xmlns:xs="http://www.w3.org/2001/XMLSchema" 
  xmlns:lib = 
    "http://www.w3.org/2001/03/XMLSchema/TypeLibrary" 
  version="1.0" 
  elementFormDefault="qualified" 
  xml:lang="en" 
  targetNamespace =
    "http://www.w3.org/2001/03/XMLSchema/TypeLibrary">

  《 38 Description of the schema document 》 
  《 7 Simple type definition for IRI-3987 》
  《 9 Simple type definition for absolute-IRI-3987 》
  《 11 Simple type definition for rel-ref-3987 》
  《 5 Simple type definition for IRI-reference-3987 》
  《 42 Versioning policy for IRI-related types 》

</xs:schema>

To make the XSD schema document display more legibly in Web browsers, we specify an XML stylesheet instruction pointing to an XSLT stylesheet for XSD schema documents.

36 XML stylesheet instruction

<?xml-stylesheet href="http://www.w3.org/2008/09/xsd.xsl" 
    type="text/xsl"?>

The document-type declaration refers to the normative DTD for XSD schema documents, and includes a fairly extensive internal DTD subset (described more fully below, The DTD internal subset (§3.9.3)).

37 Document type declaration

<!DOCTYPE xs:schema 
          PUBLIC "-//W3C//DTD XMLSchema 200102//EN" 
                 "http://www.w3.org/2001/XMLSchema.dtd" [

《 39 Internal DTD subset 》

]>

3.9.2 The initial annotation

The first xs:annotation element in the schema document provides a general description of the contents and origin of the document.

38 Description of the schema document

  <xs:annotation>
   <xs:documentation xmlns="http://www.w3.org/1999/xhtml">

     <h3>Introduction</h3>

     <p>This schema document describes a [draft]
     component of the XML Schema type library: datatypes for
     IRIs as defined by RFC 3987.</p>
    
     <p>The types defined here check the conformance of
     literal strings against the grammar given in section
     2.2 of <a href =
     "http://www.ietf.org/rfc/rfc3987.txt">RFC 3987</a>,
     translated into XSD notation.  See also the <a
     href="TypeLibrary-URI-3986.xsd">schema document for
     URIs</a> located in the same directory as this
     document.
     </p>

     <p>Please send suggestions for improvements to
     www-xml-schema-comments@w3.org.  Mention the URI of
     this document: <code><a href=
     "http://www.w3.org/2012/01/XMLSchema/TypeLibrary-IRI-3987.xsd">
     http://www.w3.org/2012/01/XMLSchema/TypeLibrary-IRI-3987.xsd
     </a></code></p>

     <p>See below (at the bottom of this document) for
     information about the revision and namespace-versioning
     policy governing this schema document.</p>
     
   </xs:documentation>
  </xs:annotation>

3.9.3 The DTD internal subset

The internal subset of the DTD includes the entity declarations shown elsewhere in this document, in a suitable sequence.

39 Internal DTD subset

《 40 Miscellaneous element and attribute declarations 》

《 41 Initial explanatory comment 》

《 34 Definitions of character classes 》

《 12 Definition of scheme 》

《 52 Definition of hierarchical part of URI 》

《 24 Definition of iquery 》

《 25 Definition of ifragment 》


<!--* Relative references *-->

《 15 Definition of irelative-part entity 》

《 10 Definition of irelative-ref 》

<!--* IRIs, relative references, IRI references *-->

《 6 Definition of IRI entity 》

《 4 Definition of IRI-reference entity 》

《 8 Definition of absolute-IRI 》

Because the DTD for schema documents does not include elements suitable for use within the xs:documentation element, we define p here. We also declare the namespace attribute xmlns for the xs:documentation element.

40 Miscellaneous element and attribute declarations

<!ATTLIST xs:documentation xmlns CDATA #IMPLIED>
<!ELEMENT p (#PCDATA)>

Editorial Note: Are these declarations necessary? useful? They look a bit like an early effort to make the document suitable for editing in a DTD-driven editor, which may have been abandoned before completion. We should either make them complete (which means bringing in suitable XHTML modules) or suppress them.

The long internal DTD subset is likely to confuse some readers unless we explain what we are doing and why there are so many entity declarations. (From the XML specification's point of view, we could isolate the complex sequence of entity declarations in a separate DTD file, but in that case many Web browsers would fail to display the document usefully.)

41 Initial explanatory comment

<!--* The regex patterns will be rather complicated, and
    * will be hard to verify and debug if we're not careful.
    * So we build the regexes systematically by transforming 
    * the ABNF grammar of the RFC into entity declarations:
    * references to literals turn into literals, and
    * references to non-terminals turn into entity
    * references.  (Don't try this with a context-free
    * grammar; you'll get circular entity references.)
    *-->

<!--* We give the entity declarations in a bottom-up
    * order, because some XML parsers make the mistake of
    * trying to expand the entities when reading the entity
    * declaration, and want declaration before use.  (DV,
    * listen to me when I am talking to you.)
    *
    * When multiple entity declarations are given, the last
    * one shown is the one created by the mechanical
    * translation.  The earlier ones are manual
    * reformulations of the expression mostly for
    * compactness and clarity, and occasionally to fix
    * problems with character escaping.
    *-->
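The mechanical translation described in the comment can be imitated with a few lines of Python. The sketch below is an assumption-laden miniature: the table holds only four entities, the replacement texts for DIGIT, HEXDIG, and pct-encoded are guesses at plausible values rather than copies of the declarations, and expand is a name invented here. It also shows why a recursive grammar cannot be handled this way.

```python
import re

# A hypothetical miniature internal subset: entity name -> replacement text.
ENTITIES = {
    "DIGIT": "[0-9]",
    "HEXDIG": "[0-9A-Fa-f]",
    "pct-encoded": "(%&HEXDIG;&HEXDIG;)",
    "port": "(&DIGIT;)*",
}

# General entity references only; character references such as &#xA0;
# do not match because "#" cannot start a name.
ENTITY_REF = re.compile(r"&([A-Za-z_][A-Za-z0-9_.\-]*);")


def expand(text, entities, active=()):
    """Recursively replace entity references by their replacement text."""
    def replace(match):
        name = match.group(1)
        if name in active:
            # A non-terminal that refers to itself can never be expanded.
            raise ValueError("circular entity reference: " + name)
        return expand(entities[name], entities, active + (name,))
    return ENTITY_REF.sub(replace, text)
```

For example, expand("&port;", ENTITIES) yields the regular expression "([0-9])*". An undeclared name raises KeyError in this sketch, and predefined entities such as &amp;amp; are not handled.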

3.9.4 Versioning policy

The versioning policy for this schema document is the same as for documents in the W3C Technical Reports area: there is a single standard location for the schema document, which will also contain the most recent version of the document approved by the Working Group, and for each revision of the document there is a dated version, which will not change in any substantive way.

42 Versioning policy for IRI-related types

  <xs:annotation>
  <xs:documentation>

    <h3>Versioning policy for this document</h3>

    
    <p>
      In keeping with the XML Schema WG's standard
      versioning policy, this schema document will 
      persist at the URI
      &lt; http://www.w3.org/2012/01/XMLSchema/TypeLibrary-IRI-3987.xsd &gt;.
    </p>
    <p>
      At the date of issue it can also be found at
      http://www.w3.org/2001/03/XMLSchema/TypeLibrary-IRI-3987.xsd.
      The schema document at that URI may however change in
      the future, in order to remain compatible with the
      latest version of XML Schema itself.  In other words,
      if the XML Schema namespace changes, the version of
      this document at &lt; 
      http://www.w3.org/2001/03/XMLSchema/TypeLibrary-IRI-3987.xsd 
      &gt; will change accordingly; the version at &lt; 
      http://www.w3.org/2012/01/XMLSchema/TypeLibrary-IRI-3987.xsd 
      &gt; will not change.
    </p>
    <p>
      Previous dated (and unchanging) versions of this
      schema document include:
     </p>
     <ul>
       <li>
       http://www.w3.org/2011/04/XMLSchema/TypeLibrary-IRI-3987.xsd 
       </li>
     </ul>
    
  </xs:documentation>
  </xs:annotation>

4 Datatypes for URIs

This section of this document defines a set of datatypes based on [RFC 3986], which accept only values which match the syntactic definition of URIs in that document. Because these datatypes do not accept characters outside the 7-bit range of ASCII and ISO 646, they are not suitable for general-purpose use in human-readable documents. They should be used only in contexts where it is necessary to require that the mechanical transformation from IRI to URI has already been performed.

4.1 Value Space

The value space of each of the types defined in this section is the set of strings recognized by the corresponding grammatical production in [RFC 3986]; the production used for each type is identified in the section on that type.

4.2 Lexical Mapping and Facets

The lexical mapping and facet information for these types is the same as described above for the IRI types in Lexical Mapping (§3.2) and Facets (§3.3).

4.3 The `URI-reference-3986` datatype

The URI-reference-3986 datatype includes all those strings which match the non-terminal URI-reference in the ABNF grammar of [RFC 3986]; this includes both absolute and relative URIs, with and without fragment identifiers.

The ABNF grammar of URI references in [RFC 3986] is:

URI-reference = URI / relative-ref

That is, a URI reference is either a URI or a relative reference. The grammar rule can be translated into a regular expression; the corresponding entity declaration is:

43 Definition of URI-reference entity

<!ENTITY URI-reference "(&URI;|&relative-ref;)">

Like the analogous IRI-based type, however, the simple type definition for URI-reference-3986 does not use this entity so defined; it defines the datatype as the union of the separately defined types URI-3986 and relative-reference-3986.

44 Simple type definition for URI-reference-3986

  <xs:simpleType name="URI-reference-3986">
    <xs:annotation>
      <xs:documentation xmlns="http://www.w3.org/1999/xhtml">
        <p>
          The <tt>URI-reference-3986</tt> type checks
          the string against the regex grammar for URI
          references in RFC 3986 Appendix A.  This is the
          one most users are likely to say they want when
          they are looking for a generic URI type and have
          thought about it for a bit.  But it's not: what
          they are most likely to want in reality is the
          IRI-reference type defined in the schema document
          for IRIs, in this directory.
        </p>
        <p>The rule in the grammar is:</p>
        <pre>
          URI-reference = URI / relative-ref
        </pre>
        <p>Rather than write this as a single pattern,
        however, we will just take a union of the two
        types already defined.</p>
      </xs:documentation>  
    </xs:annotation>
    <xs:union memberTypes="lib:URI-3986 
                           lib:relative-reference-3986"/>
  </xs:simpleType>

4.4 The `URI-3986` datatype

The URI-3986 datatype includes all those strings which match the non-terminal URI in the ABNF grammar of [RFC 3986]; this includes absolute URIs with and without fragment identifiers. It excludes relative references and is thus appropriate only under special circumstances.

The ABNF grammar of URIs in [RFC 3986] is:

URI           = scheme ":" hier-part 
                 [ "?" query ] 
                 [ "#" fragment ]

A URI consists of a scheme, a colon, and a hierarchical part, optionally followed by a literal question mark and a query, and then (again optionally) by a literal hash mark and a fragment identifier. The equivalent regular expression is used as the replacement text for the entity URI:

45 Definition of URI entity

<!ENTITY URI 
"(&scheme;:&hier-part;((\?&query;))?((#&fragment;))?)">

The simple type definition for the URI-3986 datatype restricts the built-in anyURI type by requiring that values conform to the pattern defined by the regular expression in the replacement text of the entity URI.

46 Simple type definition for URI-3986

  <xs:simpleType name="URI-3986">
    <xs:annotation>
        <xs:documentation xmlns="http://www.w3.org/1999/xhtml">
	  <p>RFC 3986 says:</p>
	  <blockquote>
	    <p>
	      A URI is an identifier consisting of a
	      sequence of characters matching the syntax
	      rule named &lt;URI&gt; in Section 3.
	      It enables uniform identification of resources
	      via a separately defined extensible set of
	      naming schemes (Section 3.1).  How that
	      identification is accomplished, assigned, or
	      enabled is delegated to each scheme
	      specification.
	    </p>
	  </blockquote>

	  <p>
            The URI-3986 type checks the string against the
            regex grammar for URI in RFC 3986 Appendix A.
            (The regex in Appendix B would be simpler, but
            it accepts any string of Basic Latin characters,
            whether they satisfy the grammar for URIs or
            not.  So for validation, it's useless.)
	  </p>
	  <p>
	    Note that the grammar for URI is essentially the
	    same as that for absolute URIs, with the
	    addition of an optional hash mark (#) and
	    fragment identifier:
	  </p>
	  <pre>
	    URI = scheme ":" hier-part 
                  [ "?" query ] 
                  [ "#" fragment ]
	  </pre>
        </xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:anyURI">
      <xs:pattern value="&URI;"/>

      
    </xs:restriction>
  </xs:simpleType>

The hierarchical part, query, and fragment can also occur in other top-level constructs; they are described in later sections (The hierarchical part (§4.7.1), The query (§4.7.5), and The fragment identifier (§4.7.6), respectively).

4.5 The `absolute-URI-3986` datatype

The datatype absolute-URI-3986 includes all and only those strings which match the absolute-URI grammar production of [RFC 3986].

The ABNF grammar of absolute URIs in [RFC 3986] is:

absolute-URI  = scheme ":" hier-part [ "?" query ]

This differs from the URI construct only in omitting the optional hash mark and fragment identifier. The corresponding entity declaration is:

47 Definition of absolute-URI

<!ENTITY absolute-URI 
  "(&scheme;:&hier-part;((\?&query;))?)">

The simple type definition defines absolute-URI-3986 as a restriction of anyURI to the strings matching the pattern.

48 Simple type definition for absolute-URI-3986

  <xs:simpleType name="absolute-URI-3986">
    <xs:annotation>
        <xs:documentation xmlns="http://www.w3.org/1999/xhtml">
	  <p>
            The <tt>absolute-URI-3986</tt> type checks the
            string against the regex grammar for absolute URIs
            in RFC 3986 Appendix A.
	  </p>
	  <p>The grammar is very like that for URI, but it does
	  not allow a fragment identifier.</p>
        </xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:anyURI">
      <xs:pattern value="&absolute-URI;"/>	  
    </xs:restriction>
  </xs:simpleType>

4.6 The `relative-reference-3986` datatype

The datatype relative-reference-3986 includes the set of (uninternationalized) relative references, which are all and only those strings which match the relative-ref production of [RFC 3986].

The ABNF grammar of relative references in [RFC 3986] is:

relative-ref  = relative-part 
                [ "?" query ] 
                [ "#" fragment ]

The corresponding entity declaration is:

49 Definition of relative-ref

<!ENTITY rref-1 "&relative-part;">
<!ENTITY rref-2 "(\?&query;)">
<!ENTITY rref-3 "(#&fragment;)">
<!ENTITY relative-ref "(&rref-1;(&rref-2;)?(&rref-3;)?)">

The datatype relative-reference-3986 is unlikely to be of general utility, as it includes only URI references relative to the base URI of a given resource. The type is defined and given a name here primarily to simplify the definition of the URI-reference datatype (defined above, The URI-reference-3986 datatype (§4.3)). As with the other datatypes defined here, it restricts anyURI by restricting the lexical space to those strings matching the pattern.

50 Simple type definition for relative-reference-3986

 <xs:simpleType name="relative-reference-3986">
    <xs:annotation>
        <xs:documentation xmlns="http://www.w3.org/1999/xhtml">
	  <p>
            The <tt>relative-reference-3986</tt> type
            checks the string against the regex grammar for
            relative references in RFC 3986 Appendix A.
	  </p>
	  <p>The top-level rules in the grammar are:</p>
	  <pre>
	    relative-ref  = relative-part 
                            [ "?" query ] 
                            [ "#" fragment ]

	    relative-part = "//" authority path-abempty
                          / path-absolute
                          / path-noscheme
                          / path-empty
	  </pre>
        </xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:anyURI">
      <xs:pattern value="&relative-ref;"/>
    </xs:restriction>
  </xs:simpleType>
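For readers who want to experiment with the four URI-related types outside an XSD processor, the grammar of [RFC 3986] can be transcribed into Python in the same bottom-up way. What follows is a sketch, not the schema document: every name is chosen here for illustration, Python's re module is assumed to behave like an XSD regular-expression engine for these patterns, and fullmatch stands in for the implicit anchoring of the pattern facet.

```python
import re

ALPHA, DIGIT, HEXDIG = "[A-Za-z]", "[0-9]", "[0-9A-Fa-f]"
PCT = "%" + HEXDIG + HEXDIG
UNRES = r"A-Za-z0-9\-._~"   # positive character group for unreserved
SUBD = r"!$&'()*+,;="       # positive character group for sub-delims

# Path characters, segments, and the kinds of path
pchar = "(?:[" + UNRES + SUBD + ":@]|" + PCT + ")"
segment_nz = pchar + "+"
segment_nz_nc = "(?:[" + UNRES + SUBD + "@]|" + PCT + ")+"
path_abempty = "(?:/" + pchar + "*)*"
path_absolute = "/(?:" + segment_nz + path_abempty + ")?"
path_noscheme = segment_nz_nc + path_abempty
path_rootless = segment_nz + path_abempty

# Host: IPv4, IPv6, IPvFuture, reg-name
dec_octet = "(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"
IPv4 = dec_octet + r"(?:\." + dec_octet + "){3}"
h16 = HEXDIG + "{1,4}"
h16c = "(?:" + h16 + ":)"
ls32 = "(?:" + h16 + ":" + h16 + "|" + IPv4 + ")"


def lead(n):
    """Optional prefix [ *n( h16 ":" ) h16 ] of the IPv6address rule."""
    return "(?:" + h16c + "{0," + str(n) + "}" + h16 + ")?"


ipv6_branches = [h16c + "{6}" + ls32, "::" + h16c + "{5}" + ls32]
ipv6_branches += [lead(n) + "::" + h16c + "{" + str(4 - n) + "}" + ls32
                  for n in range(4)]
ipv6_branches += [lead(4) + "::" + ls32, lead(5) + "::" + h16, lead(6) + "::"]
IPv6 = "(?:" + "|".join(ipv6_branches) + ")"
IPvFuture = "v" + HEXDIG + r"+\.[" + UNRES + SUBD + ":]+"
IP_literal = r"\[(?:" + IPv6 + "|" + IPvFuture + r")\]"
reg_name = "(?:[" + UNRES + SUBD + "]|" + PCT + ")*"
host = "(?:" + IP_literal + "|" + IPv4 + "|" + reg_name + ")"

# Authority, scheme, query, fragment
userinfo = "(?:[" + UNRES + SUBD + ":]|" + PCT + ")*"
authority = "(?:" + userinfo + "@)?" + host + "(?::" + DIGIT + "*)?"
scheme = ALPHA + r"[A-Za-z0-9+\-.]*"
query_part = r"(?:\?(?:" + pchar + "|[/?])*)?"
fragment_part = "(?:#(?:" + pchar + "|[/?])*)?"

# The empty path is rendered by making the whole group optional.
net_path = "//" + authority + path_abempty
hier_part = "(?:" + net_path + "|" + path_absolute + "|" + path_rootless + ")?"
relative_part = "(?:" + net_path + "|" + path_absolute + "|" + path_noscheme + ")?"

URI = re.compile(scheme + ":" + hier_part + query_part + fragment_part)
ABSOLUTE_URI = re.compile(scheme + ":" + hier_part + query_part)
RELATIVE_REF = re.compile(relative_part + query_part + fragment_part)


def is_uri(s):
    return URI.fullmatch(s) is not None


def is_absolute_uri(s):
    return ABSOLUTE_URI.fullmatch(s) is not None


def is_relative_ref(s):
    return RELATIVE_REF.fullmatch(s) is not None


def is_uri_reference(s):
    """Union of the two member types, as in URI-reference-3986."""
    return is_uri(s) or is_relative_ref(s)
```

As in the simple type definition for URI-reference-3986, the last function takes the union of two separately defined checks instead of building one larger pattern.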

4.7 Common constructs in the URI grammars

        4.7.1 The hierarchical part
        4.7.2 The relative part
        4.7.3 Authority information: user info, host, and port
        4.7.4 Paths and segments
        4.7.5 The query
        4.7.6 The fragment identifier
        4.7.7 Reserved, unreserved, and other character classes

This section outlines the ABNF rules and corresponding entity declarations for the constructs referred to by more than one of the constructs URI, URI-reference, relative-reference, or absolute-URI, in so far as these are different from the corresponding definitions used for IRIs.

The non-terminal scheme and several of the character classes used in [RFC 3986] are the same as those used in [RFC 3987] and have already been treated above, in the discussion of the common constructs in the IRI grammars (§3.8).

4.7.1 The hierarchical part

The non-terminal symbol hier-part describes the hierarchical part of a URI. Its ABNF definition is:

hier-part     = "//" authority path-abempty
              / path-absolute
              / path-rootless
              / path-empty

As with the corresponding IRI-related construct, the translation into regular-expression notation breaks the right-hand side into several smaller pieces and renders the empty string in the last branch of the disjunction as an optionality indicator for the whole construct.

51 Definition of hier-part

<!ENTITY hier-part-1
  "(//&authority;&path-abempty;)">

<!ENTITY hier-part-2
  "&path-absolute;|&path-rootless;">

<!ENTITY hier-part 
  "(&hier-part-1;|&hier-part-2;)?">

The various declarations relating to the hierarchical part are gathered together in the following fragment:

52 Definition of hierarchical part of URI

<!--* The hierarchical part of the URI:  authority and path *-->
<!--* Authority:  user info, host, port number *-->

《 55 Definition of host, etc. 》

《 54 Definition of authority, user info, and port 》

《 59 Definition of uninternationalized paths 》

《 51 Definition of hier-part 》

<!--* end of hier-part *-->

4.7.2 The relative part

The non-terminal relative-part is almost identical to hier-part, but it excludes the non-terminal path-rootless and adds path-noscheme.

The ABNF definition is:

relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty

The translation is similar to that for hier-part.

53 Definition of relative-part entity

<!ENTITY relative-part-1
"(//&authority;&path-abempty;)">
<!ENTITY relative-part-2
"&path-absolute;|&path-noscheme;">
<!ENTITY relative-part 
"(&relative-part-1;|&relative-part-2;)?">

<!--* Some regexp handlers turn out to have problems with
    * the trailing empty branch, so delete it and make the
    * entire expression optional instead. The bug has been
    * reported, but in the meantime let's work around it.
    *-->
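The workaround rests on the fact that a disjunction with a trailing empty branch and an optional group without that branch accept the same strings. The Python sketch below illustrates the equivalence with two stand-in sub-patterns; A and B are placeholders invented for the purpose and are much simpler than the entities they stand for.

```python
import re

# Stand-in sub-patterns, purely illustrative.
A = "//[a-z]+"
B = "/[a-z]*"

# Form with a trailing empty branch, as a literal translation would give.
WITH_EMPTY_BRANCH = re.compile("(" + A + "|" + B + "|)")

# Form used in the workaround: drop the empty branch, add "?".
WITH_OPTIONAL = re.compile("(" + A + "|" + B + ")?")


def same_language(samples):
    """True if both forms agree on every sample string."""
    return all(
        bool(WITH_EMPTY_BRANCH.fullmatch(s)) == bool(WITH_OPTIONAL.fullmatch(s))
        for s in samples
    )
```

A check over a handful of samples is of course no proof, but it is a cheap way to notice a slip when a pattern is rewritten by hand.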

4.7.3 Authority information: user info, host, and port

The authority portion of a URI identifies the authoritative host for a given resource, along with optional user and port information. The top-level construct, along with user and port information, is defined as follows in ABNF:

authority     = [ userinfo "@" ] host [ ":" port ]
userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
port          = *DIGIT

Note that port is the same in [RFC 3986] and [RFC 3987].

The equivalent regular expressions and entities are these.

54 Definition of authority, user info, and port

<!ENTITY port "(&DIGIT;)*">

<!ENTITY userinfo 
  "([A-Za-z0-9\-\._~!$&amp;'()*+,;=:]|&pct-encoded;)*">
<!--* literal translation:
<!ENTITY userinfo 
  "((&unreserved;|&pct-encoded;|&sub-delims;|:))*">
*-->

<!ENTITY authority "(((&userinfo;@))?&host;((:&port;))?)">

The identification of the host in URIs is substantially the same as that in IRIs, except that instead of internationalized registered names, it accepts only uninternationalized registered names. IP literals and IPv4 values are the same in URIs and IRIs. The relevant ABNF productions are these.

host          = IP-literal / IPv4address / reg-name
reg-name      = *( unreserved / pct-encoded / sub-delims )
IP-literal    = "[" ( IPv6address / IPvFuture  ) "]"
IPvFuture     = "v" 1*HEXDIG "." 
                1*( unreserved / sub-delims / ":" )

The corresponding entity declarations are these.

55 Definition of host, etc.

<!--* Host:  the most elaborate part of the grammar.
    * reg-name, IPv4, IPv6, and IPvFuture.
    *-->

《 19 Definition of dec-octet 》

《 18 Definition of IPv4 and IPv6 》

<!ENTITY reg-name 
  "((&unreserved;|&pct-encoded;|&sub-delims;))*">

<!ENTITY host 
  "(&IP-literal;|&IPv4address;|&reg-name;)">

4.7.4 Paths and segments

There are several varieties of path, in a hierarchical or relative part of a URI or relative reference. In ABNF:

path          = path-abempty    ; begins with "/" or is empty
              / path-absolute   ; begins with "/" but not "//"
              / path-noscheme   ; begins with a non-colon segment
              / path-rootless   ; begins with a segment
              / path-empty      ; zero characters

The translation into entity notation is straightforward.

56 Definition of path entity

<!ENTITY path-1 
  "&path-abempty;|&path-absolute;|&path-noscheme;">
<!ENTITY path-2
  "&path-rootless;|&path-empty;">

<!ENTITY path 
 "(&path-1;|&path-2;)">

The individual forms of path are defined thus:

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty    = 0<pchar>

The translation into entity notation is again straightforward.

57 Kinds of uninternationalized path

<!ENTITY path-abempty "((/&segment;))*">

<!ENTITY path-absolute "(/((&segment-nz;((/&segment;))*))?)">

<!ENTITY path-noscheme "(&segment-nz-nc;((/&segment;))*)">

<!ENTITY path-rootless "(&segment-nz;((/&segment;))*)">

<!ENTITY path-empty "">
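Because the kinds of path overlap, it can be instructive to see which productions a given string satisfies. The following Python sketch classifies a path against all five; the names PATH_KINDS and path_kinds are invented here, and the patterns are a hand transcription of the ABNF above, not an expansion of the entity declarations.

```python
import re

PCT = "%[0-9A-Fa-f]{2}"
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|" + PCT + ")"
# Same as PCHAR but without ":", for the first segment of path-noscheme.
PCHAR_NC = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=@]|" + PCT + ")"

SEGMENT = PCHAR + "*"
SEGMENT_NZ = PCHAR + "+"
SEGMENT_NZ_NC = PCHAR_NC + "+"
REST = "(?:/" + SEGMENT + ")*"   # *( "/" segment )

PATH_KINDS = {
    "path-abempty": REST,
    "path-absolute": "/(?:" + SEGMENT_NZ + REST + ")?",
    "path-noscheme": SEGMENT_NZ_NC + REST,
    "path-rootless": SEGMENT_NZ + REST,
    "path-empty": "",
}


def path_kinds(path):
    """Names of all path productions that match the whole string."""
    return [name for name, pattern in PATH_KINDS.items()
            if re.fullmatch(pattern, path)]
```

For instance, "a:b/c" is a rootless path but not a path-noscheme, because its first segment contains a colon; "a/b:c" satisfies both productions.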

Individual segments of a path are made up of (uninternationalized) path characters:

segment       = *pchar
segment-nz    = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"

The translation into entity notation follows the usual pattern.

58 Definition of uninternationalized segment entity, etc.

<!ENTITY segment "(&pchar;)*">

<!ENTITY segment-nz "(&pchar;)+">

<!ENTITY segment-nz-nc 
  "((&unreserved;|&pct-encoded;|&sub-delims;|@))+">

The declarations relating to paths are pulled together in the following fragment:

59 Definition of uninternationalized paths

<!--* Path (second major part of hier-part):  
       * first segments, then various kinds of path *-->

《 58 Definition of uninternationalized segment entity, etc. 》

《 57 Kinds of uninternationalized path 》

《 56 Definition of path entity 》

4.7.5 The query

query         = *( pchar / "/" / "?" )

60 Definition of uninternationalized query

<!--* Query part *-->

<!ENTITY query "((&pchar;|/|\?))*">

4.7.6 The fragment identifier

fragment      = *( pchar / "/" / "?" )

61 Definition of uninternationalized fragment

<!--* Fragment part *-->

<!ENTITY fragment "((&pchar;|/|\?))*">
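Since query and fragment share one definition, a single check serves for both. The sketch below is a hand transcription into Python, with names chosen for illustration; it assumes that Python's re module agrees with an XSD processor on these simple patterns.

```python
import re

PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"

# query and fragment: any number of pchar, "/" or "?"
QUERY = re.compile("(?:" + PCHAR + "|[/?])*")
FRAGMENT = QUERY


def is_query(s):
    """True if s matches the query production (without the leading "?")."""
    return QUERY.fullmatch(s) is not None
```

Note that "#" is never accepted: it can occur in a URI only as the delimiter that introduces the fragment.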

4.7.7 Reserved, unreserved, and other character classes

The non-terminal pchar describes the characters usable in uninternationalized path expressions.

pchar         = unreserved / pct-encoded / sub-delims 
              / ":" / "@"

The definition pulls the literals ":" and "@" into the same character class expression as the unreserved characters and the sub-delimiters, leaving pct-encoded as a separate branch; otherwise it's a literal translation of the ABNF.

62 Definition of uninternationalized path characters

<!ENTITY pchar 
  "([A-Za-z0-9\-\._~!$&amp;'()*+,;=:@]|&pct-encoded;)">
<!--* literal translation:
<!ENTITY pchar 
  "(&unreserved;|&pct-encoded;|&sub-delims;|:|@)">
*-->

All these are pulled together by the following DTD fragment.

63 Definitions of character classes

<!--* Character classes, groups, what have you *-->

《 1 Definition of entities ALPHA, DIGIT, and HEXDIG 》
《 33 Definition of sub-delims 》
《 32 Definition of gen-delim 》
《 31 Definition of reserved 》
《 30 Definition of unreserved 》

<!--* pct-encoded isn't really a character class, but
    * it needs to fit in here before pchar
    *-->
《 2 Definition of pct-encoded 》
《 62 Definition of uninternationalized path characters 》

4.8 The `URI-RFC3986` schema document

        4.8.1 Overall structure
        4.8.2 The initial annotation
        4.8.3 The DTD internal subset
        4.8.4 Versioning policy

The URI-related types defined in this document are all formally defined by the schema document at http://www.w3.org/2001/03/XMLSchema/TypeLibrary-URI-3986.xsd, which gathers together the code fragments given above in a suitable order.

4.8.1 Overall structure

The overall structure of the schema document is as follows:

64 The URI-RFC3986 schema document

<?xml version="1.0"?>
《 36 XML stylesheet instruction 》
《 65 Document type declaration 》
<xs:schema 
  xmlns:xs="http://www.w3.org/2001/XMLSchema" 
  xmlns:lib = 
    "http://www.w3.org/2001/03/XMLSchema/TypeLibrary" 
  version="1.0" 
  elementFormDefault="qualified" 
  xml:lang="en" 
  targetNamespace =
    "http://www.w3.org/2001/03/XMLSchema/TypeLibrary">

  《 66 Description of the schema document 》 
  《 46 Simple type definition for URI-3986 》
  《 48 Simple type definition for absolute-URI-3986 》
  《 50 Simple type definition for relative-reference-3986 》
  《 44 Simple type definition for URI-reference-3986 》
  《 68 Versioning policy for URI-related types 》

</xs:schema>

The document-type declaration refers to the normative DTD for XSD schema documents, and again includes a fairly extensive internal DTD subset (described more fully below, The DTD internal subset (§4.8.3)).

65 Document type declaration

<!DOCTYPE xs:schema 
          PUBLIC "-//W3C//DTD XMLSchema 200102//EN" 
                 "http://www.w3.org/2001/XMLSchema.dtd" [

《 67 Internal DTD subset 》

]>

4.8.2 The initial annotation

The first xs:annotation element in the schema document provides a general description of the contents and origin of the document.

66 Description of the schema document

  <xs:annotation>
   <xs:documentation xmlns="http://www.w3.org/1999/xhtml">

     <h3>Introduction</h3>

     <p>This schema document describes a [draft]
     component of the XML Schema type library: datatypes for
     URIs as defined by RFC 3986.</p>
    
     <p>The types defined here check the conformance of
     literal strings against the ABNF grammar given in
     Appendix A of <a
     href="http://www.ietf.org/rfc/rfc3986.txt">RFC
     3986</a>, translated into XSD regular-expression
     notation.  See also
     the <a href="TypeLibrary-IRI-RFC3987.xsd">schema
     document for IRIs</a> located in the same directory
     as this document.</p>

     <p>Please send suggestions for improvements to
     www-xml-schema-comments@w3.org.  Mention the URI of
     this document: <code><a
     href="http://www.w3.org/2012/01/XMLSchema/TypeLibrary-URI-3986.xsd">
     http://www.w3.org/2012/01/XMLSchema/TypeLibrary-URI-3986.xsd
     </a></code> </p>

     <p>See below (at the bottom of this document) for
     information about the revision and namespace-versioning
     policy governing this schema document.</p>
     
   </xs:documentation>
  </xs:annotation>

4.8.3 The DTD internal subset

The internal subset of the DTD includes the entity declarations shown elsewhere in this document, in a suitable sequence.

67 Internal DTD subset

《 40 Miscellaneous element and attribute declarations 》

<!--* This schema document provides XSD patterns for URI,
    * URI-reference, and other constructs defined in RFC
    * 3986.  
    *-->

《 41 Initial explanatory comment 》
《 63 Definitions of character classes 》
《 12 Definition of scheme 》
《 52 Definition of hierarchical part of URI 》
《 60 Definition of uninternationalized query 》
《 61 Definition of uninternationalized fragment 》

<!--* Relative references *-->
《 53 Definition of relative-part entity 》
《 49 Definition of relative-ref 》

<!--* URIs, relative references, URI references *-->
《 45 Definition of URI entity 》
《 43 Definition of URI-reference entity 》
《 47 Definition of absolute-URI 》
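
The layering of entity declarations can be sketched in miniature as follows. This is a simplified illustration, not the text of the actual internal subset: the entity names ALPHA, DIGIT, and scheme follow the names used in this document, but the replacement texts shown here are assumptions derived directly from the scheme rule of RFC 3986, and the type name scheme-sketch is hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE xs:schema [
  <!-- Character classes come first, as small building blocks. -->
  <!ENTITY ALPHA "[A-Za-z]">
  <!ENTITY DIGIT "[0-9]">

  <!-- RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
       The references to ALPHA and DIGIT are expanded only when
       the scheme entity is itself referenced. -->
  <!ENTITY scheme "&ALPHA;(&ALPHA;|&DIGIT;|\+|-|\.)*">
]>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type: the pattern facet is written as a single
       entity reference, which the XML parser expands before the
       schema processor ever sees the regular expression. -->
  <xs:simpleType name="scheme-sketch">
    <xs:restriction base="xs:string">
      <xs:pattern value="&scheme;"/>
    </xs:restriction>
  </xs:simpleType>

</xs:schema>
```

After entity expansion, the pattern value seen by the schema processor is [A-Za-z]([A-Za-z]|[0-9]|\+|-|\.)*. The full internal subset builds up the patterns for URI, URI-reference, and the other constructs in the same way, one grammar rule per entity.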

4.8.4 Versioning policy

68 Versioning policy for URI-related types

  <xs:annotation>
  <xs:documentation>

    <h3>Versioning policy for this document</h3>
    
    <p>
      In keeping with the XML Schema WG's standard
      versioning policy, this schema document will 
      persist at the URI
      &lt;http://www.w3.org/2012/01/XMLSchema/TypeLibrary-URI-3986.xsd&gt;.
    </p>
    <p>
      At the date of issue it can also be found at
      &lt;http://www.w3.org/2001/03/XMLSchema/TypeLibrary-URI-3986.xsd&gt;.
      The schema document at that URI may however change in
      the future, in order to remain compatible with the
      latest version of XML Schema itself.  In other words,
      if the XML Schema namespace changes, the version of
      this document at &lt; 
      http://www.w3.org/2001/03/XMLSchema/TypeLibrary-URI-3986.xsd 
      &gt; will change accordingly; the version at &lt; 
      http://www.w3.org/2012/01/XMLSchema/TypeLibrary-URI-3986.xsd 
      &gt; will not change.
    </p>
    <p>
      Previous dated (and unchanging) versions of this
      schema document include:
     </p>
     <ul>
       <li>
       http://www.w3.org/2011/04/XMLSchema/TypeLibrary-URI-3986.xsd 
       </li>
     </ul>
    
  </xs:documentation>
  </xs:annotation>
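
One practical consequence of this policy is that a schema author can choose which copy of the schema document to name in an import. The following sketch, which is an illustration and not part of the schema document, names the dated copy, on the assumption that the author prefers a location whose content is not expected to change.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Dated URI: under the versioning policy, the schema document
       at this location will not change. -->
  <xs:import
    namespace="http://www.w3.org/2001/03/XMLSchema/TypeLibrary"
    schemaLocation="http://www.w3.org/2012/01/XMLSchema/TypeLibrary-URI-3986.xsd"/>

  <!-- Alternative: name the undated location
       http://www.w3.org/2001/03/XMLSchema/TypeLibrary-URI-3986.xsd
       instead, accepting that the document found there may be
       revised to track later versions of XML Schema. -->

</xs:schema>
```

Because schemaLocation is only a hint, a processor may still obtain the schema document from a local copy or catalog; the choice of URI mainly documents which version the author had in mind.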

W3C Working Group Note 19 January 2012