Copyright © 2012 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document defines two families of datatypes, one designed for strict checking of strings for conformance to the grammar for Internationalized Resource Identifiers (IRIs) defined in [RFC 3987], and the other for checking against the grammar for Uniform Resource Identifiers defined in [RFC 3986]. These datatypes can be used by any conforming XSD 1.0 or XSD 1.1 processor.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is a W3C Working Group Note as described in the World Wide Web Consortium Process Document. It defines several datatypes designed for strict checking of strings against the grammar for internationalized resource identifiers (IRIs) given in [RFC 3987] and the grammar for uniform resource identifiers (URIs) given in [RFC 3986]. These datatypes are included in the public type library maintained by the W3C XML Schema Working Group.
In its current state, this document contains a full description of the datatypes defined. It is substantially complete as a specification of the datatypes, though some further changes (listed in To-do list (non-normative) (§E)) may be made in a future revision of this document.
Comments on this document should be sent to the W3C XML Schema comments mailing list, www-xml-schema-comments@w3.org (archive). Each email message should contain only one comment.
Editorial Note: Mailing list? Or Bugzilla?
Publication as an Editors' Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document has been produced by the W3C XML Schema Working Group as part of the W3C XML Activity. The authors of this document are the members of the XML Schema Working Group.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Values of the anyURI
datatype defined by [XSD 1.0 Part 2: Datatypes] and [XSD 1.1 Part 2: Datatypes]
carry the semantic information that they are intended to be
IRIs, but the anyURI
datatype does not provide firm
assurance that they are in fact semantically or
syntactically correct. In [XSD 1.0 Part 2: Datatypes],
the type's lexical space is defined indirectly as the set of
strings which, taken as input to an algorithm defined in
[XML Linking Language], produce output strings which are
"legal URIs" according to [RFC 2396]. Empirical studies show variation in the
strictness with which XSD 1.0 processors enforce the
syntactic constraints thus described, and in any case
[RFC 2396] has since been made obsolete and
replaced by other specifications of URI syntax. In
[XSD 1.1 Part 2: Datatypes], the anyURI
datatype is
loosely, not tightly, coupled to the defining documents for
IRIs and URIs (which were [RFC 3987] and [RFC 3986] at the
time this document was published). No syntactic checks on
values of anyURI
are prescribed, and the value space is
described as the set of finite-length sequences of XML
characters.
So while declaring an element or attribute as having
type anyURI
can provide a useful clue as to the meaning of
the element or attribute, it does not provide any guarantees
of semantic or syntactic correctness.
Checking that IRIs and IRI references are semantically
correct is beyond the capacity of current automated systems.
But in some contexts, it is likely to be helpful to check to
see that anyURI
values are in fact syntactically
acceptable IRIs. There are limits to what is practical in
this area: the syntactic rules for URIs (and thus for IRIs)
depend on the URI scheme, and the set of recognized URI
schemes is subject to change, so it is impractical to define
a stable, unchanging type which checks candidate values
against all the relevant rules. But values can be checked
against the generic syntax for URIs and IRIs specified in
[RFC 3986] and [RFC 3987]; such checks will not detect all
errors in all ill-formed strings, but they will detect many.
This document defines a number of IRI- and URI-related
datatypes by systematically translating the augmented
Backus-Naur Form (ABNF) grammar used in the RFCs into the
regular-expression notation used in the XSD pattern facet.
Some applications and some XML vocabularies may impose further
constraints on IRI usage: in some contexts (for example in
setting a base IRI for resolution of relative references) it
may be a requirement that the IRI provided be an absolute IRI,
not a relative reference. This requires checking not against
the generic syntax for IRI references (which is what is
usually wanted for values intended to be IRIs) but against the
more restrictive grammar of absolute-IRI.
This document defines several XSD datatypes corresponding to
various subsets of IRIs. Most XML vocabularies, whether
intended to encode information for consumption by humans or by
machines, should use either anyURI
or an appropriate
IRI-based datatype. For completeness, however, and for use in
the specialized situations where they are appropriate,
analogous datatypes for URIs are also defined. The URI-based
datatypes should be used only where there are compelling
technical considerations that require the use of URIs and not
IRIs.
The primary purpose of this document is to provide the formal definitions, in XSD notation, for the IRI- and URI-related datatypes mentioned above, in such a way as to enable interested readers to verify the equivalence between the regular expressions used to define them and the ABNF grammars used in [RFC 3986] and [RFC 3987]. This document does not attempt to describe the purpose and correct use of IRIs or URIs, or to address any of the issues relating to the internationalization of resource identifiers (or to internationalization in general). Readers seeking such guidance should consult other sources of information. The W3C Internationalization Activity has an extensive set of documents with information about internationalization.
The datatypes defined here use the XSD pattern facet to constrain the lexical space to strings matching the appropriate construct in the ABNF grammars of [RFC 3987] and [RFC 3986]. (This is not possible for arbitrary ABNF grammars, because XSD patterns use regular expressions and thus define regular languages, while in the general case ABNF grammars define context-free languages. In the case of [RFC 3986] and [RFC 3987], the languages defined are regular, not context-free, and can be represented by XSD patterns without loss of any constraints.)
The translation of ABNF constructs (as defined in [RFC 2234] and [RFC 5234] and used in [RFC 3987] and [RFC 3986]) into XSD regular expressions is largely mechanical, but can be tedious and error-prone, and the resulting regular expressions are very long. To make it easier to verify the regular expressions against the ABNF grammar, this document builds up the regular expressions piece by piece, defining an XML entity for each non-terminal symbol in the ABNF grammar. The simple correspondence between entity declarations and ABNF productions makes it easier to check that the translation is correct. Both the ABNF productions and the entity declarations are presented in small blocks of code that can be compared individually. (For a brief description of the notation and display style used, see The literate-programming notation used here (§B).)
Both grammars rely on the core rules ALPHA, DIGIT, and HEXDIG, defined in [RFC 2234] thus:
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
DIGIT = %x30-39 ; 0-9
HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
In [RFC 2234], each of these non-terminals denotes not a set of printable symbols but a set of integers. Section 2.3 of [RFC 2234] specifies: "Rules resolve into a string of terminal values, sometimes called characters. In ABNF a character is merely a non-negative integer."
Both [RFC 3987] and [RFC 3986] re-define the terminal symbols
of ABNF as denoting characters, not integers (using the
integer code points of the ISO 10646 / Unicode Universal
Character Set to perform the integer → character
mapping). So ALPHA, DIGIT, and HEXDIG can be translated into the regular
expressions captured in the following entity declarations:
<!ENTITY ALPHA "([A-Za-z])">
<!ENTITY DIGIT "[0-9]">
<!ENTITY HEXDIG "[0-9A-Fa-f]">
The non-terminal pct-encoded is defined thus:
pct-encoded = "%" HEXDIG HEXDIG
This can be translated into an XSD regular expression using a reference to the HEXDIG entity defined elsewhere:
The entity reference to &HEXDIG; here corresponds directly to the use of the non-terminal HEXDIG in the ABNF; the entity declaration is slightly easier to verify in this form than an equivalent declaration with the entity reference already expanded:
The greater ease of verification is particularly valuable for higher-level constructs. The full regular expression pattern for the non-terminal IRI is over three thousand characters long, and would be very tedious to verify in that form.
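The piece-by-piece composition described here can be imitated outside XML. The following Python sketch is illustrative only: the dictionary `ENTITIES` and the function `expand` are hypothetical stand-ins for the DTD and for an XML parser's entity expansion, and the replacement text shown for pct-encoded is an assumed form of the declaration, not a quotation from the schema document.

```python
import re

# Hypothetical stand-in for the internal DTD subset: entity name -> replacement text.
ENTITIES = {
    "HEXDIG": "[0-9A-Fa-f]",
    "pct-encoded": "(%&HEXDIG;&HEXDIG;)",  # assumed form of the declaration
}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9_.-]*);")

def expand(text: str) -> str:
    """Replace &name; references repeatedly until none remain."""
    while ENTITY_REF.search(text):
        text = ENTITY_REF.sub(lambda m: ENTITIES[m.group(1)], text)
    return text

pattern = expand("&pct-encoded;")
print(pattern)                              # (%[0-9A-Fa-f][0-9A-Fa-f])
print(bool(re.fullmatch(pattern, "%2F")))   # True
print(bool(re.fullmatch(pattern, "%G1")))   # False
```

XSD patterns are implicitly anchored at both ends, which is why the sketch uses `re.fullmatch` rather than `re.search`.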
Following the pattern of [RFC 3987] and [RFC 3986], this document will discuss the grammar in a generally top-down sequence. The schema document being defined follows a different order; it defines the entities bottom-up, to work around bugs in some widely used XML parsers.
Note that in the ABNF grammars of [RFC 3987] and [RFC 3986], some productions are ambiguous. The "first-match-wins" (or "greedy") matching algorithm applies. For details, see [RFC 3986]. The greedy-match rule does not affect the translation of the grammar into regular expressions for purposes of validating strings. If a string matches the ABNF grammar in more than one way, the greedy-match rule determines which internal structure to assign to the string, but it does not affect the membership of any string in the language defined by the grammar.
The value space of each of the types defined in this section is the set of strings recognized by the corresponding grammatical production in [RFC 3987]; the production used for each type is identified in the section on that type.
The lexical mapping for these types, as for all datatypes
derived from anyURI
by restriction, is the identity
mapping.
anyURI
.
The IRI-reference-3987 datatype
The IRI-reference-3987 datatype includes all those strings which match the non-terminal IRI-reference in the ABNF grammar of [RFC 3987]; this includes both absolute and relative IRIs, with and without fragment identifiers.
This is the datatype appropriate when it is desired to require that a string be a (potentially) legal resource identifier without further restrictions.
IRI-reference = IRI / irelative-ref
That is, an IRI reference is either an IRI or an internationalized relative reference. The grammar rule can be translated into a regular expression; the corresponding entity declaration is:
The simple type definition for IRI-reference-3987, however, does not use the entity so defined; instead, it defines the datatype as the union of two separately defined types, IRI-3987 and relative-reference-3987. The lexical and value spaces so identified are the same, but defining the type as a union makes more explicit the relation between the class of IRI references and the two subclasses which make it up.
<xs:simpleType name="IRI-reference-3987">
  <xs:annotation>
    <xs:documentation xmlns="http://www.w3.org/1999/xhtml">
      <p>The <tt>IRI-reference-3987</tt> type checks the string against the regex grammar for IRI references in RFC 3987 Section 2.2. This is the one most users are likely to want when they say they want a generic URI or IRI type.</p>
      <p>The rule in the grammar is:</p>
      <pre>
   IRI-reference = IRI / irelative-ref
      </pre>
      <p>Rather than write this as a single pattern, however, we will just take a union of the two types already defined.</p>
    </xs:documentation>
  </xs:annotation>
  <xs:union memberTypes="lib:IRI-3987 lib:relative-reference-3987"/>
</xs:simpleType>
The IRI-3987 datatype
The IRI-3987 datatype includes all those strings which match the non-terminal IRI in the ABNF grammar of [RFC 3987]; this includes absolute IRIs with and without fragment identifiers. It excludes relative references and is thus appropriate only under special circumstances.
IRI = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ]
An IRI consists of a scheme, a colon, and an internationalized hierarchical part, optionally followed by a literal question mark and an internationalized query, and then (again optionally) by a literal hash mark and an internationalized fragment. The equivalent regular expression is used as the replacement text for the entity IRI:
The simple type definition for the IRI-3987 datatype restricts the built-in anyURI type by requiring that values conform to the pattern defined by the regular expression in the replacement text of the entity IRI.
<xs:simpleType name="IRI-3987">
  <xs:annotation>
    <xs:documentation xmlns="http://www.w3.org/1999/xhtml">
      <p>The IRI-3987 type checks the string against the regex grammar for IRI in RFC 3987 Section 2.2.</p>
      <p>Note that the grammar for IRI is essentially the same as that for absolute IRIs, with the addition of an optional hash mark (#) and fragment identifier:</p>
      <pre>
   IRI = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ]
      </pre>
    </xs:documentation>
  </xs:annotation>
  <xs:restriction base="xs:anyURI">
    <xs:pattern value="&IRI;"/>
  </xs:restriction>
</xs:simpleType>
The hierarchical part, query, and fragment can also occur in other top-level constructs; they are described in later sections (The hierarchical part (§3.8.2), The query (§3.8.6), and The fragment identifier (§3.8.7), respectively).
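The relation between the three top-level productions can be illustrated with a deliberately simplified Python sketch. It is not the pattern used by the schema document: it is restricted to ASCII (no ucschar or iprivate), its authority accepts only registered names (no IP literals), and all variable names are hypothetical.

```python
import re

# ASCII-only building blocks (assumption: ucschar and iprivate are left out).
unres = r"A-Za-z0-9\-\._~"
subd = "!$&'()*+,;="
pct = "(%[0-9A-Fa-f]{2})"
pchar = f"([{unres}{subd}:@]|{pct})"
seg = f"({pchar})*"
seg_nz = f"({pchar})+"
seg_nz_nc = f"([{unres}{subd}@]|{pct})+"

# Simplified authority: optional userinfo, registered name only, optional port.
authority = f"(([{unres}{subd}:]|{pct})*@)?([{unres}{subd}]|{pct})*(:[0-9]*)?"

abempty = f"(/{seg})*"
absolute = f"(/({seg_nz}(/{seg})*)?)"
rootless = f"({seg_nz}(/{seg})*)"
noscheme = f"({seg_nz_nc}(/{seg})*)"

# The empty path branch is expressed by making the whole group optional.
hier_part = f"(//{authority}{abempty}|{absolute}|{rootless})?"
relative_part = f"(//{authority}{abempty}|{absolute}|{noscheme})?"
query = f"({pchar}|[/?])*"
fragment = f"({pchar}|[/?])*"
scheme = r"[A-Za-z][A-Za-z0-9+\-.]*"

absolute_iri = rf"{scheme}:{hier_part}(\?{query})?"
iri = rf"{absolute_iri}(#{fragment})?"
relative_ref = rf"{relative_part}(\?{query})?(#{fragment})?"

def classify(s: str) -> dict:
    """Report which of the three top-level productions accept the string."""
    return {
        "IRI": bool(re.fullmatch(iri, s)),
        "absolute-IRI": bool(re.fullmatch(absolute_iri, s)),
        "relative-ref": bool(re.fullmatch(relative_ref, s)),
    }

print(classify("http://example.org/a?x=1#top"))
print(classify("http://example.org/a?x=1"))
print(classify("../a/b#c"))
```

A string with a fragment identifier is accepted as an IRI but not as an absolute IRI, and a string without a scheme is accepted only as a relative reference.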
The absolute-IRI-3987 datatype
The datatype absolute-IRI-3987 includes all and only those strings which match the absolute-IRI grammar production of [RFC 3987].
absolute-IRI = scheme ":" ihier-part [ "?" iquery ]
This differs from the IRI construct only in omitting the optional hash mark and fragment identifier. The corresponding entity declaration is:
The simple type definition defines absolute-IRI-3987 as a restriction of anyURI to the strings matching the pattern.
<xs:simpleType name="absolute-IRI-3987">
  <xs:annotation>
    <xs:documentation xmlns="http://www.w3.org/1999/xhtml">
      <p>The <tt>absolute-IRI-3987</tt> type checks the string against the regex grammar for absolute IRIs in RFC 3987 Section 2.2.</p>
      <p>The grammar is very like that for IRI, but it does not allow a fragment identifier.</p>
    </xs:documentation>
  </xs:annotation>
  <xs:restriction base="xs:anyURI">
    <xs:pattern value="&absolute-IRI;"/>
  </xs:restriction>
</xs:simpleType>
The relative-reference-3987 datatype
The datatype relative-reference-3987 includes the set of internationalized relative references, which are all and only those strings which match the irelative-ref production of [RFC 3987].
irelative-ref = irelative-part [ "?" iquery ] [ "#" ifragment ]
The corresponding entity declaration is:
The datatype relative-reference-3987 is unlikely to be of general utility, as it includes only IRI references relative to the base IRI of a given resource. The type is defined and given a name here primarily to simplify the definition of the IRI-reference-3987 datatype (defined above, The IRI-reference-3987 datatype (§3.4)). As with the other datatypes defined here, it restricts anyURI by restricting the lexical space to those strings matching the pattern.
<xs:simpleType name="relative-reference-3987">
  <xs:annotation>
    <xs:documentation xmlns="http://www.w3.org/1999/xhtml">
      <p>The <tt>relative-reference-3987</tt> type checks the string against the regex grammar for relative references in RFC 3987 Section 2.2.</p>
      <p>The top-level rules in the grammar are:</p>
      <pre>
   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]
   irelative-part = "//" iauthority ipath-abempty
                  / ipath-absolute
                  / ipath-noscheme
                  / ipath-empty
      </pre>
    </xs:documentation>
  </xs:annotation>
  <xs:restriction base="xs:anyURI">
    <xs:pattern value="&irelative-ref;"/>
  </xs:restriction>
</xs:simpleType>
This section outlines the ABNF rules and corresponding entity declarations for the constructs referred to by more than one of the constructs IRI, IRI-reference, irelative-ref, or absolute-IRI.
[RFC 3987] and [RFC 3986] define scheme the same way:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
A literal translation into an entity declaration would be:
<!ENTITY scheme "(&ALPHA;((&ALPHA;|&DIGIT;|\+|-|\.))*)">
In the interests of more compact regular expressions, however, the entity scheme is defined in an equivalent but terser way:
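The equivalence of a literal and a more compact rendering of the scheme rule can be spot-checked mechanically. In the Python sketch below, `literal` mirrors the literal translation of the ABNF, while `terse` is one possible compact form, assumed for illustration and not taken from the schema document.

```python
import re

ALPHA = "[A-Za-z]"
DIGIT = "[0-9]"

# Literal translation of: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
literal = f"({ALPHA}(({ALPHA}|{DIGIT}|\\+|-|\\.))*)"

# Assumed terser form: one character class for the repeated part.
terse = r"([A-Za-z][A-Za-z0-9+\-.]*)"

samples = ["http", "urn", "svn+ssh", "x-1.0", "", "1http", "ht tp", "a_b", "+a"]
for s in samples:
    same = bool(re.fullmatch(literal, s)) == bool(re.fullmatch(terse, s))
    assert same, s

print([s for s in samples if re.fullmatch(terse, s)])
# ['http', 'urn', 'svn+ssh', 'x-1.0']
```

Agreement on a handful of samples is of course not a proof of equivalence; it is only a cheap guard against transcription errors.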
The non-terminal ihier-part describes the hierarchical part of an IRI. Its ABNF definition is:
ihier-part = "//" iauthority ipath-abempty
           / ipath-absolute
           / ipath-rootless
           / ipath-empty
The definition of ihier-part breaks this declaration up into four parts, one for each line of the ABNF. A straightforward translation would be as follows.
<!ENTITY ihp-1 "(//&iauthority;&ipath-abempty;)">
<!ENTITY ihp-2 "&ipath-absolute;">
<!ENTITY ihp-3 "&ipath-rootless;">
<!ENTITY ihp-4 "&ipath-empty;">
<!ENTITY ihier-part "(&ihp-1;|&ihp-2;|&ihp-3;|&ihp-4;)">
The fourth branch, ipath-empty (which also appears in irelative-part), expands to the empty string.
ipath-empty = 0<ipchar>
This can be rendered as the following entity declarations:
<!ENTITY ipath-empty "">
<!--* ... *-->
<!ENTITY ihp-4 "&ipath-empty;">
Because ipath-empty expands to the empty string, however (as does, in consequence, also ihp-4), this is effectively the same as the following construct:
<!ENTITY ihier-part "(&ihp-1;|&ihp-2;|&ihp-3;|)">
The empty branch is legal in XSD regular expressions, but at least one widely used XSD validator has, in some versions, an error which causes it not to interpret the trailing empty branch correctly. The definition of ihier-part works around this problem by using an alternative formulation which omits the empty branch and makes the entire construct optional.
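That a trailing empty branch and an optional group accept the same strings can be checked with a small Python sketch. The three branches below are toy stand-ins, not the real ihp-1, ihp-2, and ihp-3 patterns; only the shape of the two formulations matters.

```python
import re

# Toy stand-ins for the three non-empty branches (assumptions for illustration).
branches = [
    "//[a-z.]+(/[a-z]*)*",       # plays the role of ihp-1
    "/([a-z]+(/[a-z]*)*)?",      # plays the role of ihp-2
    "[a-z]+(/[a-z]*)*",          # plays the role of ihp-3
]

with_empty_branch = "(" + "|".join(branches) + "|)"   # trailing empty branch
optional_group = "(" + "|".join(branches) + ")?"      # workaround formulation

samples = ["", "//example.org/a/b", "/a/b", "a/b", "/", "//", "a b", "?"]
for s in samples:
    a = bool(re.fullmatch(with_empty_branch, s))
    b = bool(re.fullmatch(optional_group, s))
    assert a == b, s

print(bool(re.fullmatch(optional_group, "")))  # True
```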
The various declarations relating to the hierarchical part are gathered together in the following fragment:
<!--* The hierarchical part of the IRI: authority and path *-->
<!--* Authority: user info, host, port number *-->
《 17 Definition of ihost, etc. 》
《 16 Definition of iauthority and port 》
《 23 Definition of internationalized paths 》
《 13 Definition of ihier-part 》
<!--* end of hier-part *-->
The non-terminal irelative-part is almost identical to ihier-part, but it excludes the non-terminal ipath-rootless and adds ipath-noscheme.
irelative-part = "//" iauthority ipath-abempty
               / ipath-absolute
               / ipath-noscheme
               / ipath-empty
Like the translation of ihier-part, the rendering of this rule breaks up the right-hand side into parts, to keep the line-length manageable. Again, the empty branch is represented by an optionality marker on the expression as a whole, rather than as a separate branch.
<!ENTITY irp-1 "(//&iauthority;&ipath-abempty;)">
<!ENTITY irp-2 "&ipath-absolute;">
<!ENTITY irp-3 "&ipath-noscheme;">
<!ENTITY irp-4 "&ipath-empty;">
<!ENTITY irelative-part "(&irp-1;|&irp-2;|&irp-3;)?">
<!--* Some regexp handlers turn out to have
    * problems with the trailing empty branch,
    * so we delete it and make the entire
    * expression optional instead. The bug has been
    * reported, but in the meantime let's work around it.
    *-->
iauthority = [ iuserinfo "@" ] ihost [ ":" port ]
iuserinfo = *( iunreserved / pct-encoded / sub-delims / ":" )
port = *DIGIT
The equivalent regular expressions and entities are these.
<!ENTITY port "&DIGIT;*">
<!ENTITY port "(&DIGIT;)*">
<!ENTITY iuserinfo "([&pcg-iunreserved;&pcg-sub-delims;:]|&pct-encoded;)*">
<!ENTITY iauthority "(((&iuserinfo;@))?&ihost;((:&port;))?)">
A literal translation of the ABNF would define iuserinfo this way:
<!ENTITY iuserinfo "((&iunreserved;|&pct-encoded;|&sub-delims;|:))*">
Here as in some other places the regular expressions merge a disjunction of character classes into a single character class. So instead of separate references to iunreserved and sub-delims, the definition of iuserinfo makes a single character class, with references to the positive character groups for those non-terminals. (For any non-terminal N which is logically a character class, an entity named pcg-N denotes the positive character group used to define N; in these cases the positive character group is, informally, the character class without the enclosing square brackets.)
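The effect of merging positive character groups can be illustrated in Python. The sketch below is an ASCII-only simplification (it uses unreserved in place of iunreserved), and the variable names are hypothetical; it compares a merged character class with the literal disjunction on a few sample strings.

```python
import re

# Positive character groups: the contents of a character class, without brackets.
PCG_UNRESERVED = r"A-Za-z0-9\-\._~"
PCG_SUB_DELIMS = "!$&'()*+,;="
PCT_ENCODED = "(%[0-9A-Fa-f][0-9A-Fa-f])"

# Merged form: one character class built from two groups plus the colon.
merged = f"([{PCG_UNRESERVED}{PCG_SUB_DELIMS}:]|{PCT_ENCODED})*"

# Literal form: one alternative per ABNF branch.
unreserved = f"[{PCG_UNRESERVED}]"
sub_delims = f"[{PCG_SUB_DELIMS}]"
literal = f"(({unreserved}|{PCT_ENCODED}|{sub_delims}|:))*"

samples = ["user:pw", "a%20b", "", "user@host", "a%2", "a b", "x;y=z"]
for s in samples:
    assert bool(re.fullmatch(merged, s)) == bool(re.fullmatch(literal, s)), s

print([s for s in samples if re.fullmatch(merged, s)])
# ['user:pw', 'a%20b', '', 'x;y=z']
```

The merged form is shorter after expansion because each group is written out once inside a single pair of brackets.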
ihost = IP-literal / IPv4address / ireg-name
ireg-name = *( iunreserved / pct-encoded / sub-delims )
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
An internationalized host name is an IP literal, an IPv4 address, or an ireg-name (internationalized registered name). An internationalized registered name is a sequence of zero or more unreserved characters, sub-delimiters, or percent-encoded characters. An IP literal is an IPv6 or an IPvFuture address enclosed in square brackets. An IPvFuture address is a sequence of one or more unreserved characters, sub-delimiters, or colons, preceded by "v", one or more hex digits, and a full stop.
The corresponding entity declarations are these.
<!--* Host: the most elaborate part of the grammar.
* reg-name, IPv4, IPv6, and IPvFuture.
*-->
《 19 Definition of dec-octet 》
《 18 Definition of IPv4 and IPv6 》
<!ENTITY ireg-name
"((&iunreserved;|&pct-encoded;|&sub-delims;))*">
<!ENTITY ihost
"(&IP-literal;|&IPv4address;|&ireg-name;)">
IPv6address =                            6( h16 ":" ) ls32
            /                       "::" 5( h16 ":" ) ls32
            / [               h16 ] "::" 4( h16 ":" ) ls32
            / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
            / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
            / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
            / [ *4( h16 ":" ) h16 ] "::"              ls32
            / [ *5( h16 ":" ) h16 ] "::"              h16
            / [ *6( h16 ":" ) h16 ] "::"
h16 = 1*4HEXDIG
ls32 = ( h16 ":" h16 ) / IPv4address
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
The corresponding entities are these. For legibility (shorter line length), each line of the rule for IPv6 is translated into a separate entity, and these entities are then aggregated. (For the same reason, the entity octet is introduced to give a shorter name for the content of dec-octet.)
<!ENTITY octet "&dec-octet;">
<!ENTITY IPv4address "(&octet;\.&octet;\.&octet;\.&octet;)">
<!ENTITY h16 "&HEXDIG;{1,4}">
<!ENTITY h16 "(&HEXDIG;){1,4}">
<!ENTITY ls32 "((&h16;:&h16;)|&IPv4address;)">
<!ENTITY IPv6-1 "((((&h16;:)){6}&ls32;)">
<!ENTITY IPv6-2 "(::((&h16;:)){5}&ls32;)">
<!ENTITY IPv6-3 "((&h16;)?::((&h16;:)){4}&ls32;)">
<!ENTITY IPv6-4 "(((((&h16;:))?&h16;))?::((&h16;:)){3}&ls32;)">
<!ENTITY IPv6-5 "(((((&h16;:)){0,2}&h16;))?::((&h16;:)){2}&ls32;)">
<!ENTITY IPv6-6 "(((((&h16;:)){0,3}&h16;))?::&h16;:&ls32;)">
<!ENTITY IPv6-7 "(((((&h16;:)){0,4}&h16;))?::&ls32;)">
<!ENTITY IPv6-8 "(((((&h16;:)){0,5}&h16;))?::&h16;)">
<!ENTITY IPv6-9 "(((((&h16;:)){0,6}&h16;))?::))">
<!ENTITY IPv6-1-3 "&IPv6-1;|&IPv6-2;|&IPv6-3;">
<!ENTITY IPv6-4-6 "&IPv6-4;|&IPv6-5;|&IPv6-6;">
<!ENTITY IPv6-6-9 "&IPv6-7;|&IPv6-8;|&IPv6-9;">
<!ENTITY IPv6address "&IPv6-1-3;|&IPv6-4-6;|&IPv6-6-9;">
<!ENTITY IPvFuture "(v&HEXDIG;+\.[&pcg-unreserved;&pcg-sub-delims;:]+)">
<!ENTITY IP-literal "(\[(&IPv6address;|&IPvFuture;)\])">
The definition of IPvFuture combines multiple non-terminals into a single character class in the fashion described above.
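The nine branches of the IPv6address rule are easy to mistranscribe, so a mechanical cross-check can be useful. The Python sketch below rebuilds the alternation from the ABNF of [RFC 3986] rather than from the entity declarations; the helper functions `lead` and `tail` are hypothetical names introduced only for this illustration.

```python
import re

dec_octet = "([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"
ipv4 = rf"({dec_octet}\.{dec_octet}\.{dec_octet}\.{dec_octet})"
h16 = "[0-9A-Fa-f]{1,4}"                     # h16 = 1*4HEXDIG
ls32 = f"(({h16}:{h16})|{ipv4})"

def lead(n: int) -> str:
    """Optional prefix [ *n( h16 ":" ) h16 ] of the ABNF."""
    return f"((({h16}:){{0,{n}}}{h16}))?"

def tail(n: int) -> str:
    """Exactly n repetitions of ( h16 ":" )."""
    return f"(({h16}:)){{{n}}}"

branches = [
    tail(6) + ls32,
    "::" + tail(5) + ls32,
    f"({h16})?::" + tail(4) + ls32,
    lead(1) + "::" + tail(3) + ls32,
    lead(2) + "::" + tail(2) + ls32,
    lead(3) + "::" + h16 + ":" + ls32,
    lead(4) + "::" + ls32,
    lead(5) + "::" + h16,
    lead(6) + "::",
]
ipv6 = "(" + "|".join(f"({b})" for b in branches) + ")"

for text in ["::", "::1", "2001:db8::1", "1:2:3:4:5:6:7:8", "::ffff:192.0.2.1"]:
    assert re.fullmatch(ipv6, text), text
for text in ["1:2:3:4:5:6:7", "1::2::3", "12345::1", ":::"]:
    assert not re.fullmatch(ipv6, text), text
```

Each branch is wrapped in its own group before the branches are joined, so that the alternation operator cannot bind to part of a neighbouring branch.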
The ABNF defines dec-octet formally. It allows numerals for any integer between 0 and 255, inclusive, and forbids unnecessary leading zeros.
dec-octet = DIGIT                 ; 0-9
          / %x31-39 DIGIT         ; 10-99
          / "1" 2DIGIT            ; 100-199
          / "2" %x30-34 DIGIT     ; 200-249
          / "25" %x30-35          ; 250-255
The equivalent regular expressions are these.
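Each of the five ABNF branches of dec-octet corresponds to one alternative of a regular expression. The Python sketch below uses one such direct translation (an assumption about the form, not a quotation of the schema's entity declaration) and checks it exhaustively against the integers 0 to 999.

```python
import re

# One alternative per ABNF branch: 0-9, 10-99, 100-199, 200-249, 250-255.
DEC_OCTET = "([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"

accepted = [n for n in range(1000) if re.fullmatch(DEC_OCTET, str(n))]
assert accepted == list(range(256))

# Unnecessary leading zeros are rejected.
for text in ["00", "01", "007", "0255"]:
    assert not re.fullmatch(DEC_OCTET, text), text
```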
ipath = ipath-abempty    ; begins with "/" or is empty
      / ipath-absolute   ; begins with "/" but not "//"
      / ipath-noscheme   ; begins with a non-colon segment
      / ipath-rootless   ; begins with a segment
      / ipath-empty      ; zero characters
The translation into entity notation makes separate entities for each line of the ABNF rule, solely for legibility reasons.
<!ENTITY ip-1 "&ipath-abempty;">
<!ENTITY ip-2 "&ipath-absolute;">
<!ENTITY ip-3 "&ipath-noscheme;">
<!ENTITY ip-4 "&ipath-rootless;">
<!ENTITY ip-5 "&ipath-empty;">
<!ENTITY ipath "(&ip-1;|&ip-2;|&ip-3;|&ip-4;|&ip-5;)">
ipath-abempty = *( "/" isegment )
ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
ipath-noscheme = isegment-nz-nc *( "/" isegment )
ipath-rootless = isegment-nz *( "/" isegment )
ipath-empty = 0<ipchar>
The translation into entity notation is straightforward.
<!ENTITY ipath-abempty "((/&isegment;))*">
<!ENTITY ipath-absolute "(/((&isegment-nz;((/&isegment;))*))?)">
<!ENTITY ipath-noscheme "(&isegment-nz-nc;((/&isegment;))*)">
<!ENTITY ipath-rootless "(&isegment-nz;((/&isegment;))*)">
<!ENTITY ipath-empty "">
isegment = *ipchar
isegment-nz = 1*ipchar
isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims / "@" )
               ; non-zero-length segment without any colon ":"
The translation into entity notation is again straightforward.
<!ENTITY isegment "(&ipchar;)*">
<!ENTITY isegment-nz "(&ipchar;)+">
<!ENTITY isegment-nz-nc "([&pcg-iunreserved;&pcg-sub-delims;@]|&pct-encoded;)+">
<!--* literal rendering:
<!ENTITY isegment-nz-nc "((&iunreserved;|&pct-encoded;|&sub-delims;|@))+">
*-->
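The difference between the five kinds of path is easiest to see on examples. The following Python sketch is restricted to ASCII (it uses pchar in place of ipchar) and uses hypothetical names; it reports which path productions accept a given string. A string may match more than one production, which does not affect whether it belongs to the language.

```python
import re

pcg_unreserved = r"A-Za-z0-9\-\._~"
pcg_sub_delims = "!$&'()*+,;="
pct_encoded = "(%[0-9A-Fa-f][0-9A-Fa-f])"
pchar = f"([{pcg_unreserved}{pcg_sub_delims}:@]|{pct_encoded})"

segment = f"({pchar})*"
segment_nz = f"({pchar})+"
# Like segment_nz, but the colon is not allowed.
segment_nz_nc = f"([{pcg_unreserved}{pcg_sub_delims}@]|{pct_encoded})+"

paths = {
    "abempty": f"((/{segment}))*",
    "absolute": f"(/(({segment_nz}((/{segment}))*))?)",
    "noscheme": f"({segment_nz_nc}((/{segment}))*)",
    "rootless": f"({segment_nz}((/{segment}))*)",
    "empty": "",
}

def kinds(path: str) -> list:
    """Names of the path productions that accept the whole string."""
    return sorted(name for name, rx in paths.items() if re.fullmatch(rx, path))

print(kinds(""))        # ['abempty', 'empty']
print(kinds("/a/b"))    # ['abempty', 'absolute']
print(kinds("a:b/c"))   # ['rootless']
print(kinds("a/b:c"))   # ['noscheme', 'rootless']
print(kinds("//a"))     # ['abempty']
```

The string "a:b/c" is rejected by the noscheme production because its first segment contains a colon, which would otherwise be mistaken for the end of a scheme name in a relative reference.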
The declarations relating to paths are pulled together in the following fragment:
<!--* Path (second major part of hier-part):
* first segments, then various kinds of path *-->
《 22 Definition of isegment entity, etc. 》
《 21 Kinds of ipath 》
《 20 Definition of ipath entity 》
iquery = *( ipchar / iprivate / "/" / "?" )
<!--* Query part *-->
《 29 Definition of iprivate entity 》
<!ENTITY iquery "(&ipchar;|[&pcg-iprivate;/?])*">
<!ENTITY iquery "((&ipchar;|&iprivate;|/|\?))*">
The lowest-level constructs in the grammar are the definitions of reserved character, unreserved character, and other character classes. This section presents the ABNF definitions of the classes and their regular-expression equivalents.
The non-terminal ipchar describes the characters usable in internationalized path expressions.
ipchar = iunreserved / pct-encoded / sub-delims / ":" / "@"
The definition pulls the literals ":" and "@" into the same character class expression as the sub-delimiters; otherwise it's a literal translation of the ABNF.
<!ENTITY ipchar "(&iunreserved;|&pct-encoded;|[&pcg-sub-delims;:@])">
<!--* Literal translation of ABNF:
<!ENTITY ipchar "(&iunreserved;|&pct-encoded;|&sub-delims;|:|@)">
*-->
The iunreserved class of characters extends the unreserved class of [RFC 3986] by adding the set of legal UCS characters.
iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
The translation groups all the characters in the class into a single character group, instead of translating the ABNF disjunction into a disjunction.
<!ENTITY pcg-iunreserved "&pcg-unreserved;&UCS_0;&UCS_4;&UCS_8;&UCS_C;">
<!ENTITY iunreserved "[&pcg-iunreserved;]">
<!--* literal translation of ABNF
<!ENTITY iunreserved "(&ALPHA;|&DIGIT;|-|\.|_|~|&ucschar;)">
*-->
The non-terminal ucschar contains all the legal code points of the UCS except those in the 7-bit ASCII / ISO 646 range, which are not all allowed and which have in any case already been dealt with.
ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
        / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
        / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
        / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
        / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
        / %xD0000-DFFFD / %xE1000-EFFFD
Note that this definition excludes both the private use areas and the surrogate code points in the range %xD800-DFFF; it includes the characters %x10000-EFFFD which lie outside the basic multilingual plane.
The translation uses several levels of entity redirection in an effort to keep the DTD more legible.
<!ENTITY ucs_01 " -퟿" > <!ENTITY ucs_02 "豈-﷏" > <!ENTITY ucs_03 "ﷰ-￯" > <!ENTITY ucs_10 "𐀀-🿽" > <!ENTITY ucs_20 "𠀀-𯿽" > <!ENTITY ucs_30 "𰀀-𿿽" > <!ENTITY ucs_40 "񀀀-񏿽" > <!ENTITY ucs_50 "񐀀-񟿽" > <!ENTITY ucs_60 "񠀀-񯿽" > <!ENTITY ucs_70 "񰀀-񿿽" > <!ENTITY ucs_80 "򀀀-򏿽" > <!ENTITY ucs_90 "򐀀-򟿽" > <!ENTITY ucs_A0 "򠀀-򯿽" > <!ENTITY ucs_B0 "򰀀-򿿽" > <!ENTITY ucs_C0 "󀀀-󏿽" > <!ENTITY ucs_D0 "󐀀-󟿽" > <!ENTITY ucs_E0 "󡀀-󯿽" > <!ENTITY UCS_0 "&ucs_01;&ucs_02;&ucs_03;&ucs_10;&ucs_20;&ucs_30;"> <!ENTITY UCS_4 "&ucs_40;&ucs_50;&ucs_60;&ucs_70;"> <!ENTITY UCS_8 "&ucs_80;&ucs_90;&ucs_A0;&ucs_B0;"> <!ENTITY UCS_C "&ucs_C0;&ucs_D0;&ucs_E0;"> <!ENTITY ucschar "[&UCS_0;&UCS_4;&UCS_8;&UCS_C;]">
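Because the ranges in these declarations consist largely of characters that are hard to display or compare by eye, it can help to regenerate the character class from the numeric code points of the ABNF. The Python sketch below does so; the names `UCSCHAR_RANGES` and `char_class` are hypothetical, and the result is a Python pattern, not the XSD pattern used in the schema document.

```python
import re

# Code point ranges of ucschar, taken from the ABNF of RFC 3987.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 to 13: %x10000-1FFFD through %xD0000-DFFFD.
    + [(plane << 16, (plane << 16) + 0xFFFD) for plane in range(1, 14)]
    # Plane 14 starts at E1000, not E0000.
    + [(0xE1000, 0xEFFFD)]
)

def char_class(ranges) -> str:
    """Build a regex character class from (low, high) code point pairs."""
    return "[" + "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in ranges) + "]"

UCSCHAR = char_class(UCSCHAR_RANGES)

assert re.fullmatch(UCSCHAR, "\u00e9")           # e with acute accent
assert re.fullmatch(UCSCHAR, "\U00020000")       # first code point of plane 2
assert not re.fullmatch(UCSCHAR, "a")            # ASCII is handled elsewhere
assert not re.fullmatch(UCSCHAR, "\ue000")       # private use area
assert not re.fullmatch(UCSCHAR, "\U000E0001")   # below E1000 in plane 14
```

The list contains seventeen ranges, one for each of the seventeen range entities declared above.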
The non-terminal iprivate recognizes the characters in the private use areas of UCS. It is used only by iquery, but conceptually it seems better to deal with it here together with the other UCS-based classes.
iprivate = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
The translation is straightforward, though it uses one level of indirection through a pcg- entity, and another level of indirection for the individual ranges.
<!ENTITY pcg-range1 "-" > <!ENTITY pcg-range2 "󰀀-󿿽" > <!ENTITY pcg-range3 "􀀀-􏿽" > <!ENTITY pcg-iprivate "&pcg-range1;&pcg-range2;&pcg-range3;" > <!--* literal translation: <!ENTITY pcg-iprivate "-󰀀-󿿽􀀀-􏿽"> *--> <!ENTITY iprivate "[&pcg-iprivate;]" > <!--* literal translation: <!ENTITY iprivate "([-]|[-]|[-])"> *-->
The unreserved character class in [RFC 3987] is taken over without change from [RFC 3986]:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
For unreserved, a separate entity is defined for the positive character group, to allow it to be combined with other positive character groups in other entity declarations.
<!ENTITY pcg-unreserved "A-Za-z0-9\-\._~">
<!ENTITY unreserved "[&pcg-unreserved;]">
<!--* literal translation of the ABNF:
<!ENTITY unreserved "(&ALPHA;|&DIGIT;|-|\.|_|~)">
*-->
reserved = gen-delims / sub-delims
The definition of reserved can use the pcg- entities defined below for the two delimiter classes.
<!ENTITY reserved "[&pcg-gen-delims;&pcg-sub-delims;]">
<!--* literal translation of the ABNF:
<!ENTITY reserved "(&gen-delims;|&sub-delims;)">
*-->
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
We define gen-delims indirectly, via pcg-gen-delims; this allows the class to be combined with other classes in simpler regular expressions.
<!ENTITY pcg-gen-delims ":/?#\[\]@">
<!ENTITY gen-delims "[&pcg-gen-delims;]">
<!--* literal translation of the ABNF:
<!ENTITY gen-delims "(:|/|\?|#|\[|\]|@)">
*-->
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Since sub-delims is often combined with other classes of characters in disjunctions, it's helpful to define an entity for the positive character group used in its character-class expression. So we define pcg-sub-delims for that positive character group, and define sub-delims as "[&pcg-sub-delims;]".
<!--* pcg-sub-delims: the 'positive character group' in
    * sub-delims
    * (We give it a name to make it more easily reusable.)
    *-->
<!ENTITY pcg-sub-delims "!$&amp;'()*+,;=">
<!ENTITY sub-delims "[&pcg-sub-delims;]">
<!--* literal translation:
<!ENTITY sub-delims "(!|$|&|'|\(|\)|\*|\+|,|;|=)">
*-->
All these are pulled together by the following DTD fragment.
<!--* Character classes, groups, what have you *-->
<!--* These are all unchanged from RFC 3986,
    * except for ipchar and iunreserved, which are
    * internationalized versions of pchar and unreserved.
    *-->
《 1 Definition of entities ALPHA, DIGIT, and HEXDIG 》
《 33 Definition of sub-delims 》
《 32 Definition of gen-delim 》
《 31 Definition of reserved 》
《 30 Definition of unreserved 》
《 28 Definition of UCS character class 》
《 27 Definition of iunreserved character class 》
<!--* pct-encoded isn't really a character class, but
    * it needs to fit in here before ipchar *-->
《 2 Definition of pct-encoded 》
《 26 Definition of ipchar 》
IRI-RFC3987 schema document
The IRI-related types defined in this document are all formally defined by the schema document at http://www.w3.org/2001/03/XMLSchema/TypeLibrary-IRI-3987.xsd, which gathers together the code fragments given above in a suitable order.
The overall structure of the schema document is as follows:
<?xml version="1.0"?> 《 36 XML stylesheet instruction 》 《 37 Document type declaration 》 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:lib = "http://www.w3.org/2001/03/XMLSchema/TypeLibrary" version="1.0" elementFormDefault="qualified" xml:lang="en" targetNamespace = "http://www.w3.org/2001/03/XMLSchema/TypeLibrary"> 《 38 Description of the schema document 》 《 7 Simple type definition for IRI-3987 》 《 9 Simple type definition for absolute-IRI-3987 》 《 11 Simple type definition for rel-ref-3987 》 《 5 Simple type definition for IRI-reference-3987 》 《 42 Versioning policy for IRI-related types 》 </xs:schema>
To make the XSD schema document display more legibly in Web browsers, we specify an XML stylesheet instruction pointing to an XSLT stylesheet for XSD schema documents.
<?xml-stylesheet href="http://www.w3.org/2008/09/xsd.xsl" type="text/xsl"?>
The document-type declaration refers to the normative DTD for XSD schema documents, and includes a fairly extensive internal DTD subset (described more fully below, The DTD internal subset (§3.9.3)).
<!DOCTYPE xs:schema PUBLIC "-//W3C//DTD XMLSchema 200102//EN" "http://www.w3.org/2001/XMLSchema.dtd" [ 《 39 Internal DTD subset 》 ]>
The first xs:annotation element in the schema document provides a general description of the contents and origin of the document.
<xs:annotation> <xs:documentation xmlns="http://www.w3.org/1999/xhtml"> <h3>Introduction</h3> <p>This schema document describes a [draft] component of the XML Schema type library: datatypes for IRIs as defined by RFC 3987.</p> <p>The types defined here check the conformance of literal strings against the grammar given in section 2.2 of <a href = "http://www.ietf.org/rfc/rfc3987.txt">RFC 3987</a>, translated into XSD notation. See also the <a href="TypeLibrary-URI-RFC3986.xsd">schema document for URIs</a> located in the same directory as this document. </p> <p>Please send suggestions for improvements to www-xml-schema-comments@w3.org. Mention the URI of this document: <code><a href= "http://www.w3.org/2012/01/XMLSchema/TypeLibrary-IRI-3987.xsd"> http://www.w3.org/2012/01/XMLSchema/TypeLibrary-IRI-3987.xsd </a></code></p> <p>See below (at the bottom of this document) for information about the revision and namespace-versioning policy governing this schema document.</p> </xs:documentation> </xs:annotation>
The internal subset of the DTD includes the entity declarations shown elsewhere in this document, in a suitable sequence.
《 40 Miscellaneous element and attribute declarations 》 《 41 Initial explanatory comment 》 《 34 Definitions of character classes 》 《 12 Definition of scheme 》 《 52 Definition of hierarchical part of URI 》 《 24 Definition of iquery 》 《 25 Definition of ifragment 》 <!--* Relative references *--> 《 15 Definition of irelative-part entity 》 《 10 Definition of irelative-ref 》 <!--* IRIs, relative references, IRI references *--> 《 6 Definition of IRI entity 》 《 4 Definition of IRI-reference entity 》 《 8 Definition of absolute-IRI 》
Because the DTD for schema documents does not include elements suitable for use within the xs:documentation element, we define p here. We also declare the namespace attribute xmlns for the xs:documentation element.
<!ATTLIST xs:documentation xmlns CDATA #IMPLIED> <!ELEMENT p (#PCDATA)>
Editorial Note: Are these declarations necessary? useful? They look a bit like an early effort to make the document suitable for editing in a DTD-driven editor, which may have been abandoned before completion. We should either make them complete (which means bringing in suitable XHTML modules) or suppress them.
The long internal DTD subset is likely to confuse some readers unless we explain what we are doing and why there are so many entity declarations. (From the XML specification's point of view, we could isolate the complex sequence of entity declarations in a separate DTD file, but in that case many Web browsers would fail to display the document usefully.)
<!--* The regex patterns will be rather complicated, and * will be hard to verify and debug if we're not careful. * So we build the regexes systematically by transforming * the ABNF grammar of the RFC into entity declarations: * references to literals turn into literals, and * references to non-terminals turn into entity * references. (Don't try this with a context-free * grammar; you'll get circular entity references.) *--> <!--* We give the entity declarations in a bottom-up * order, because some XML parsers make the mistake of * trying to expand the entities when reading the entity * declaration, and want declaration before use. (DV, * listen to me when I am talking to you.) * * When multiple entity declarations are given, the last * one shown is the one created by the mechanical * translation. The earlier ones are manual * reformulations of the expression mostly for * compactness and clarity, and occasionally to fix * problems with character escaping. *-->
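The mechanical translation described in this comment can be imitated outside an XML parser. The following Python sketch is an illustration only: it expands entity references recursively and reports circular references, which is why the approach works for the non-recursive ABNF rules used here but not for a general context-free grammar. The function name expand and the sample table are hypothetical; the replacement texts for DIGIT, HEXDIG, and pct-encoded are assumptions modelled on the RFC, while the one for port is taken from this document.

```python
import re

# A small, hypothetical entity table. Only "port" is copied from this
# document; the other replacement texts are assumptions for illustration.
ENTITIES = {
    "DIGIT": "[0-9]",
    "HEXDIG": "[0-9A-Fa-f]",
    "pct-encoded": "(%&HEXDIG;&HEXDIG;)",
    "port": "(&DIGIT;)*",
}

# An entity reference: "&", a name, ";".
REF = re.compile(r"&([A-Za-z][A-Za-z0-9.\-]*);")


def expand(text, entities, _active=()):
    """Recursively replace &name; references by their replacement text."""
    def repl(match):
        name = match.group(1)
        if name in _active:
            # A rule that refers to itself can never be fully expanded.
            raise ValueError(f"circular entity reference: {name}")
        return expand(entities[name], entities, _active + (name,))
    return REF.sub(repl, text)
```

For example, expand("&port;", ENTITIES) yields the regular expression ([0-9])*, and a table in which two entities refer to each other raises ValueError instead of looping forever.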
The versioning policy for this schema document is the same as for documents in the W3C Technical Reports area: there is a single standard location for the schema document, which will also contain the most recent version of the document approved by the Working Group, and for each revision of the document there is a dated version, which will not change in any substantive way.
<xs:annotation> <xs:documentation> <h3>Versioning policy for this document</h3> <p> In keeping with the XML Schema WG's standard versioning policy, this schema document will persist at the URI < http://www.w3.org/2012/01/XMLSchema/TypeLibrary-IRI-3987.xsd >. </p> <p> At the date of issue it can also be found at http://www.w3.org/2001/03/XMLSchema/URI-3987.xsd. The schema document at that URI may however change in the future, in order to remain compatible with the latest version of XML Schema itself. In other words, if the XML Schema namespace changes, the version of this document at < http://www.w3.org/2001/03/XMLSchema/TypeLibrary-IRI-3987.xsd > will change accordingly; the version at < http://www.w3.org/2012/01/XMLSchema/TypeLibrary-IRI-3987.xsd > will not change. </p> <p> Previous dated (and unchanging) versions of this schema document include: </p> <ul> <li> http://www.w3.org/2011/04/XMLSchema/TypeLibrary-IRI-3987.xsd </li> </ul> </xs:documentation> </xs:annotation>
This section of this document defines a set of datatypes based on [RFC 3986], which accept only values which match the syntactic definition of URIs in that document. Because these datatypes do not accept characters outside the 7-bit range of ASCII and ISO 646, they are not suitable for general-purpose use in human-readable documents. They should be used only in contexts where it is necessary to require that the mechanical transformation from IRI to URI has already been performed.
The value space of each of the types defined in this section is the set of strings recognized by the corresponding grammatical production in [RFC 3986]; the production used for each type is identified in the section on that type.
The lexical mapping and facet information for these types is the same as described above for the IRI types in Lexical Mapping (§3.2) and Facets (§3.3).
URI-reference-3986 datatype
The URI-reference-3986 datatype includes all those strings which match the non-terminal URI-reference in the ABNF grammar of [RFC 3986]; this includes both absolute and relative URIs, with and without fragment identifiers.
URI-reference = URI / relative-ref
That is, a URI reference is either a URI or a relative reference. The grammar rule can be translated into a regular expression; the corresponding entity declaration is:
Like the analogous IRI-based type, however, the simple type definition for URI-reference-3986 does not use the entity so defined; it defines the datatype as the union of the separately defined types URI-3986 and relative-reference-3986.
<xs:simpleType name="URI-reference-3986"> <xs:annotation> <xs:documentation xmlns="http://www.w3.org/1999/xhtml"> <p> The <tt>URI-reference-3986</tt> type checks the string against the regex grammar for URI references in RFC 3986 Appendix A. This is the one most users are likely to say they want when they are looking for a generic URI type and have thought about it for a bit. But it's not: what they are most likely to want in reality is the IRI-reference type defined in the schema document for IRIs, in this directory. </p> <p>The rule in the grammar is:</p> <pre> URI-reference = URI / relative-ref </pre> <p>Rather than write this as a single pattern, however, we will just take a union of the two types already defined.</p> </xs:documentation> </xs:annotation> <xs:union memberTypes="lib:URI-3986 lib:relative-reference-3986"/> </xs:simpleType>
URI-3986 datatype
The URI-3986 datatype includes all those strings which match the non-terminal URI in the ABNF grammar of [RFC 3986]; this includes absolute URIs with and without fragment identifiers. It excludes relative references and is thus appropriate only under special circumstances.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
A URI consists of a scheme, a colon, and a hierarchical part, optionally followed by a literal question mark and a query, and then (again optionally) by a literal hash mark and a fragment identifier. The equivalent regular expression is used as the replacement text for the entity URI:
The simple type definition for the URI-3986 datatype restricts the built-in anyURI type by requiring that values conform to the pattern defined by the regular expression in the replacement text of the entity URI.
<xs:simpleType name="URI-3986"> <xs:annotation> <xs:documentation xmlns="http://www.w3.org/1999/xhtml"> <p>RFC 3986 says:</p> <blockquote> <p> A URI is an identifier consisting of a sequence of characters matching the syntax rule named <URI> in Section 3. It enables uniform identification of resources via a separately defined extensible set of naming schemes (Section 3.1). How that identification is accomplished, assigned, or enabled is delegated to each scheme specification. </p> </blockquote> <p> The URI-3986 type checks the string against the regex grammar for URI in RFC 3986 Appendix A. (The regex in Appendix B would be simpler, but it accepts any string of Basic Latin characters, whether they satisfy the grammar for URIs or not. So for validation, it's useless.) </p> <p> Note that the grammar for URI is essentially the same as that for absolute URIs, with the addition of an optional hash mark (#) and fragment identifier: </p> <pre> URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] </pre> </xs:documentation> </xs:annotation> <xs:restriction base="xs:anyURI"> <xs:pattern value="&URI;"/> </xs:restriction> </xs:simpleType>
The hierarchical part, query, and fragment can also occur in other top-level constructs; they are described in later sections (The hierarchical part (§4.7.1), The query (§4.7.5), and The fragment identifier (§4.7.6), respectively).
absolute-URI-3986 datatype
The datatype absolute-URI-3986 includes all and only those strings which match the absolute-URI grammar production of [RFC 3986].
absolute-URI = scheme ":" hier-part [ "?" query ]
This differs from the URI construct only in omitting the optional hash mark and fragment identifier. The corresponding entity declaration is:
The simple type definition defines absolute-URI-3986 as a restriction of anyURI to the strings matching the pattern.
<xs:simpleType name="absolute-URI-3986"> <xs:annotation> <xs:documentation xmlns="http://www.w3.org/1999/xhtml"> <p> The <tt>absolute-URI-3986</tt> type checks the string against the regex grammar for absolute URIs in RFC 3986 Appendix A. </p> <p>The grammar is very like that for URI, but it does not allow a fragment identifier.</p> </xs:documentation> </xs:annotation> <xs:restriction base="xs:anyURI"> <xs:pattern value="&absolute-URI;"/> </xs:restriction> </xs:simpleType>
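To make the difference between URI and absolute-URI concrete, the following Python sketch assembles both patterns from the ABNF rules of [RFC 3986]. It is an approximation under stated assumptions, not the schema's own pattern: it uses Python's re syntax (including non-capturing groups, which XSD regular expressions do not have), it restricts host to the reg-name branch and omits IP literals, and all variable and function names are hypothetical.

```python
import re

PCT = r"%[0-9A-Fa-f]{2}"                    # pct-encoded
UNRES_SUB = r"A-Za-z0-9\-\._~!$&'()*+,;="   # unreserved plus sub-delims
PCHAR = rf"(?:[{UNRES_SUB}:@]|{PCT})"
SEGMENT = rf"{PCHAR}*"
SEGMENT_NZ = rf"{PCHAR}+"

SCHEME = r"[A-Za-z][A-Za-z0-9+\-.]*"
USERINFO = rf"(?:[{UNRES_SUB}:]|{PCT})*"
HOST = rf"(?:[{UNRES_SUB}]|{PCT})*"         # assumption: reg-name only
PORT = r"[0-9]*"
AUTHORITY = rf"(?:{USERINFO}@)?{HOST}(?::{PORT})?"

PATH_ABEMPTY = rf"(?:/{SEGMENT})*"
PATH_ABSOLUTE = rf"/(?:{SEGMENT_NZ}(?:/{SEGMENT})*)?"
PATH_ROOTLESS = rf"{SEGMENT_NZ}(?:/{SEGMENT})*"
# The empty path is expressed by making the whole group optional.
HIER_PART = rf"(?://{AUTHORITY}{PATH_ABEMPTY}|{PATH_ABSOLUTE}|{PATH_ROOTLESS})?"

QUERY = rf"(?:{PCHAR}|[/?])*"
FRAGMENT = QUERY

# absolute-URI stops after the optional query; URI adds the optional fragment.
ABSOLUTE_URI = rf"{SCHEME}:{HIER_PART}(?:\?{QUERY})?"
URI = rf"{ABSOLUTE_URI}(?:#{FRAGMENT})?"


def is_uri(text: str) -> bool:
    return re.fullmatch(URI, text) is not None


def is_absolute_uri(text: str) -> bool:
    return re.fullmatch(ABSOLUTE_URI, text) is not None
```

With these definitions, a string such as http://example.org/a?b#c is accepted by is_uri but rejected by is_absolute_uri, because the hash mark is not permitted in a query.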
relative-reference-3986 datatype
The datatype relative-reference-3986 includes the set of (uninternationalized) relative references, which are all and only those strings which match the relative-ref production of [RFC 3986].
relative-ref = relative-part [ "?" query ] [ "#" fragment ]
The corresponding entity declaration is:
The datatype relative-reference-3986 is unlikely to be of general utility, as it includes only URI references relative to the base URI of a given resource. The type is defined and given a name here primarily to simplify the definition of the URI-reference-3986 datatype (defined above, The URI-reference-3986 datatype (§4.3)).
As with the other datatypes defined here, it restricts anyURI by restricting the lexical space to those strings matching the pattern.
<xs:simpleType name="relative-reference-3986"> <xs:annotation> <xs:documentation xmlns="http://www.w3.org/1999/xhtml"> <p> The <tt>relative-reference-3986</tt> type checks the string against the regex grammar for relative references in RFC 3986 Appendix A. </p> <p>The top-level rules in the grammar are:</p> <pre> relative-ref = relative-part [ "?" query ] [ "#" fragment ] relative-part = "//" authority path-abempty / path-absolute / path-noscheme / path-empty </pre> </xs:documentation> </xs:annotation> <xs:restriction base="xs:anyURI"> <xs:pattern value="&relative-ref;"/> </xs:restriction> </xs:simpleType>
This section outlines the ABNF rules and corresponding entity declarations for the constructs referred to by more than one of the constructs URI, URI-reference, relative-reference, or absolute-URI, in so far as these are different from the corresponding definitions used for IRIs.
The non-terminal scheme and several of the character classes used in [RFC 3986] are the same as those used in [RFC 3987] and have already been treated above (Common constructs in the URI grammars (§4.7)).
hier-part describes the hierarchical part of a URI. Its ABNF definition is:
hier-part = "//" authority path-abempty / path-absolute / path-rootless / path-empty
As with the corresponding IRI-related construct, the translation into regular-expression notation breaks the right-hand side into several smaller pieces and renders the empty string in the last branch of the disjunction as an optionality indicator for the whole construct.
<!ENTITY hier-part-1 "(//&authority;&path-abempty;)"> <!ENTITY hier-part-2 "&path-absolute;|&path-rootless;|&path-empty;"> <!ENTITY hier-part "(&hier-part-1;|&hier-part-2;)">
The various declarations relating to the hierarchical part are gathered together in the following fragment:
<!--* The hierarchical part of the URI: authority and path *-->
<!--* Authority: user info, host, port number *-->
《 55 Definition of host, etc. 》
《 54 Definition of authority, user info, and port 》
《 59 Definition of uninternationalized paths 》
《 51 Definition of hier-part 》
<!--* end of hier-part *-->
The non-terminal relative-part is almost identical to hier-part, but it excludes the non-terminal path-rootless and adds path-noscheme.
relative-part = "//" authority path-abempty / path-absolute / path-noscheme / path-empty
The translation is similar to that for hier-part.
<!ENTITY relative-part-1 "(//&authority;&path-abempty;)"> <!ENTITY relative-part-2 "&path-absolute;|&path-noscheme;"> <!ENTITY relative-part "(&relative-part-1;|&relative-part-2;)?"> <!--* Some regexp handlers turn out to have problems with * the trailing empty branch, so delete it and make the * entire expression optional instead. The bug has been * reported, but in the meantime let's work around it. *-->
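The workaround mentioned in the comment relies on the fact that a trailing empty branch and an optional group accept the same strings. The following Python sketch illustrates the equivalence with two deliberately simplified, hypothetical stand-ins for the non-empty branches; it does not reproduce the real entities, and Python's re module is used only as an approximation of XSD regular expressions.

```python
import re

# Simplified, hypothetical stand-ins for the two non-empty branches.
BRANCH_1 = r"//[a-z.]+"   # in place of relative-part-1
BRANCH_2 = r"/[a-z]*"     # in place of relative-part-2

# Literal translation: the empty path appears as a trailing empty branch.
WITH_EMPTY_BRANCH = f"({BRANCH_1}|{BRANCH_2}|)"
# Workaround: drop the empty branch and make the whole group optional.
OPTIONAL_GROUP = f"({BRANCH_1}|{BRANCH_2})?"

SAMPLES = ["", "/", "/path", "//example.org", "//", "path"]


def accepts(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None
```

For every string in SAMPLES, accepts gives the same answer for both patterns, so the rewrite changes the form of the expression without changing the set of strings it recognizes.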
authority = [ userinfo "@" ] host [ ":" port ]
userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
port = *DIGIT
Note that port is the same in [RFC 3986] and [RFC 3987].
The equivalent regular expressions and entities are these.
<!ENTITY port "(&DIGIT;)*"> <!ENTITY userinfo "([A-Za-z0-9\-\._~!$&'()*+,;=:]|&pct-encoded;)*"> <!--* literal translation: <!ENTITY userinfo "((&unreserved;|&pct-encoded;|&sub-delims;|:))*"> *--> <!ENTITY authority "(((&userinfo;@))?&host;((:&port;))?)">
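The following Python sketch shows how these three declarations combine, as an illustration only. It keeps the grouping of the entity declarations above but makes two assumptions: pct-encoded is taken to be a percent sign followed by two hexadecimal digits, and host is reduced to the reg-name branch, so IP literals in square brackets are not accepted. The variable and function names are hypothetical.

```python
import re

PCT_ENCODED = r"(%[0-9A-Fa-f]{2})"   # assumed definition of pct-encoded
PORT = r"([0-9])*"
USERINFO = rf"([A-Za-z0-9\-\._~!$&'()*+,;=:]|{PCT_ENCODED})*"
# Assumption: host is limited to reg-name for the purposes of this sketch.
HOST = rf"([A-Za-z0-9\-\._~!$&'()*+,;=]|{PCT_ENCODED})*"
AUTHORITY = rf"((({USERINFO}@))?{HOST}((:{PORT}))?)"


def is_authority(text: str) -> bool:
    """Return True if the whole of text matches the authority sketch."""
    return re.fullmatch(AUTHORITY, text) is not None
```

Under these assumptions, user:pw@example.org:8080 is accepted, while example.org:80a is rejected because a port may contain only digits, and a@b@c is rejected because neither userinfo nor reg-name may contain an unescaped at-sign.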
host = IP-literal / IPv4address / reg-name
reg-name = *( unreserved / pct-encoded / sub-delims )
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
The corresponding entity declarations are these.
<!--* Host: the most elaborate part of the grammar.
* reg-name, IPv4, IPv6, and IPvFuture.
*-->
《 19 Definition of dec-octet 》
《 18 Definition of IPv4 and IPv6 》
<!ENTITY reg-name
"((&unreserved;|&pct-encoded;|&sub-delims;))*">
<!ENTITY host
"(&IP-literal;|&IPv4address;|&reg-name;)">
path = path-abempty    ; begins with "/" or is empty
     / path-absolute   ; begins with "/" but not "//"
     / path-noscheme   ; begins with a non-colon segment
     / path-rootless   ; begins with a segment
     / path-empty      ; zero characters
The translation into entity notation is straightforward.
<!ENTITY path-1 "&path-abempty;|&path-absolute;|&path-noscheme;"> <!ENTITY path-2 "&path-rootless;|&path-empty;"> <!ENTITY path "(&path-1;|&path-2;)">
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
The translation into entity notation is again straightforward.
<!ENTITY path-abempty "((/&segment;))*"> <!ENTITY path-absolute "(/((&segment-nz;((/&segment;))*))?)"> <!ENTITY path-noscheme "(&segment-nz-nc;((/&segment;))*)"> <!ENTITY path-rootless "(&segment-nz;((/&segment;))*)"> <!ENTITY path-empty "">
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"
The translation into entity notation follows the usual pattern.
<!ENTITY segment "(&pchar;)*"> <!ENTITY segment-nz "(&pchar;)+"> <!ENTITY segment-nz-nc "((&unreserved;|&pct-encoded;|&sub-delims;|@))+">
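The distinctions among the five kinds of path are easier to see on examples. The following Python sketch transcribes the segment and path declarations above into Python's re syntax and reports which kinds a given string matches. It is an approximation, not the schema's own code: pct-encoded is assumed to be a percent sign followed by two hexadecimal digits, the character classes are written out in compact form, and the names PATHS and kinds are hypothetical.

```python
import re

PCT_ENCODED = r"(%[0-9A-Fa-f]{2})"   # assumed definition of pct-encoded
PCHAR = rf"([A-Za-z0-9\-\._~!$&'()*+,;=:@]|{PCT_ENCODED})"

SEGMENT = rf"({PCHAR})*"
SEGMENT_NZ = rf"({PCHAR})+"
# Like segment-nz, but the character class omits the colon.
SEGMENT_NZ_NC = rf"([A-Za-z0-9\-\._~!$&'()*+,;=@]|{PCT_ENCODED})+"

PATHS = {
    "path-abempty": rf"((/{SEGMENT}))*",
    "path-absolute": rf"(/(({SEGMENT_NZ}((/{SEGMENT}))*))?)",
    "path-noscheme": rf"({SEGMENT_NZ_NC}((/{SEGMENT}))*)",
    "path-rootless": rf"({SEGMENT_NZ}((/{SEGMENT}))*)",
    "path-empty": "",
}


def kinds(path: str) -> list:
    """Return the names of all path kinds that match the whole string."""
    return [name for name, rx in PATHS.items() if re.fullmatch(rx, path)]
```

For instance, under these assumptions kinds("a:b") contains only path-rootless, since a colon is not allowed in the first segment of path-noscheme, and kinds("//a") contains only path-abempty, since path-absolute may not begin with two slashes.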
The declarations relating to paths are pulled together in the following fragment:
<!--* Path (second major part of hier-part):
* first segments, then various kinds of path *-->
《 58 Definition of uninternationalized segment entity, etc. 》
《 57 Kinds of uninternationalized path 》
《 56 Definition of path entity 》
The lowest-level constructs in the grammar are the definitions of reserved character, unreserved character, and other character classes. This section presents the ABNF definitions of the classes and their regular-expression equivalents.
pchar describes the characters usable in uninternationalized path expressions.
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
The definition pulls the literals ":" and "@" and both classes of delimiter into the same character class expression; otherwise it's a literal translation of the ABNF.
<!ENTITY pchar "([A-Za-z0-9\-\._~!$&'()*+,;=:@]|&pct-encoded;)"> <!--* literal translation: <!ENTITY pchar "(&unreserved;|&pct-encoded;|&sub-delims;|:|@)"> *-->
All these are pulled together by the following DTD fragment.
<!--* Character classes, groups, what have you *--> 《 1 Definition of entities ALPHA, DIGIT, and HEXDIG 》 《 33 Definition of sub-delims 》 《 32 Definition of gen-delim 》 《 31 Definition of reserved 》 《 30 Definition of unreserved 》 <!--* pct-encoded isn't really a character class, but * it needs to fit in here before pchar *--> 《 2 Definition of pct-encoded 》 《 62 Definition of uninternationalized path characters 》
URI-RFC3986 schema document
The URI-related types defined in this document are all formally defined by the schema document at http://www.w3.org/2001/03/XMLSchema/TypeLibrary-URI-3986.xsd, which gathers together the code fragments given above in a suitable order.
The overall structure of the schema document is as follows:
<?xml version="1.0"?> 《 36 XML stylesheet instruction 》 《 65 Document type declaration 》 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:lib = "http://www.w3.org/2001/03/XMLSchema/TypeLibrary" version="1.0" elementFormDefault="qualified" xml:lang="en" targetNamespace = "http://www.w3.org/2001/03/XMLSchema/TypeLibrary"> 《 66 Description of the schema document 》 《 46 Simple type definition for URI-3986 》 《 48 Simple type definition for absolute-URI-3986 》 《 50 Simple type definition for relative-reference-3986 》 《 44 Simple type definition for URI-reference-3986 》 《 68 Versioning policy for URI-related types 》 </xs:schema>
The document-type declaration refers to the normative DTD for XSD schema documents, and again includes a fairly extensive internal DTD subset (described more fully below, The DTD internal subset (§4.8.3)).
<!DOCTYPE xs:schema PUBLIC "-//W3C//DTD XMLSchema 200102//EN" "http://www.w3.org/2001/XMLSchema.dtd" [ 《 67 Internal DTD subset 》 ]>
The first xs:annotation element in the schema document provides a general description of the contents and origin of the document.
<xs:annotation> <xs:documentation xmlns="http://www.w3.org/1999/xhtml"> <h3>Introduction</h3> <p>This schema document describes a [draft] component of the XML Schema type library: datatypes for URIs as defined by RFC 3986.</p> <p>The types defined here check the conformance of literal strings against the regular expression given in Appendix A of <a href="http://www.ietf.org/rfc/rfc3986.txt">RFC 3986</a>, translated into XSD notation. See also the <a href="TypeLibrary-IRI-RFC3987.xsd">schema document for IRIs</a> located in the same directory as this document.</p> <p>Please send suggestions for improvements to www-xml-schema-comments@w3.org. Mention the URI of this document: <code><a href=""> http://www.w3.org/2011/04/XMLSchema/TypeLibrary-URI-3986.xsd </a></code> </p> <p>See below (at the bottom of this document) for information about the revision and namespace-versioning policy governing this schema document.</p> </xs:documentation> </xs:annotation>
The internal subset of the DTD includes the entity declarations shown elsewhere in this document, in a suitable sequence.
《 40 Miscellaneous element and attribute declarations 》 <!--* This schema document provides XSD patterns for URI, * URI-reference, and other constructs defined in RFC * 3986. *--> 《 41 Initial explanatory comment 》 《 63 Definitions of character classes 》 《 12 Definition of scheme 》 《 52 Definition of hierarchical part of URI 》 《 60 Definition of uninternationalized query 》 《 61 Definition of uninternationalized fragment 》 <!--* Relative references *--> 《 53 Definition of relative-part entity 》 《 49 Definition of relative-ref 》 <!--* URIs, relative references, URI references *--> 《 45 Definition of URI entity 》 《 43 Definition of URI-reference entity 》 《 47 Definition of absolute-URI 》
The versioning policy for this schema document is the same as for documents in the W3C Technical Reports area: there is a single standard location for the schema document, which will also contain the most recent version of the document approved by the Working Group, and for each revision of the document there is a dated version, which will not change in any substantive way.
<xs:annotation> <xs:documentation> <h3>Versioning policy for this document</h3> <p> In keeping with the XML Schema WG's standard versioning policy, this schema document will persist at the URI < http://www.w3.org/2012/01/XMLSchema/TypeLibrary-URI-3986.xsd >. </p> <p> At the date of issue it can also be found at <http://www.w3.org/2001/03/XMLSchema/TypeLibrary-URI-3986.xsd>. The schema document at that URI may however change in the future, in order to remain compatible with the latest version of XML Schema itself. In other words, if the XML Schema namespace changes, the version of this document at < http://www.w3.org/2001/03/XMLSchema/TypeLibrary-URI-3986.xsd > will change accordingly; the version at < http://www.w3.org/2012/01/XMLSchema/TypeLibrary-URI-3986.xsd > will not change. </p> <p> Previous dated (and unchanging) versions of this schema document include: </p> <ul> <li> http://www.w3.org/2011/04/XMLSchema/TypeLibrary-URI-3986.xsd </li> </ul> </xs:documentation> </xs:annotation>
The notation used in this document to present the contents of schema documents is a form of ‘literate programming’; the term was introduced by [Knuth 1984], who described the basic idea this way:
Let us change our traditional attitude to the construction of programs. Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.
Literate programs contain both prose, which describes the goals and logic of the program in natural language, and fragments of source code, which express the program in a formal notation (often a programming language, here the XML notation defined by XSD). In the style used here, fragments of source code are numbered and given headings. For example, the translation of the ALPHA non-terminal of [RFC 2234] into an entity declaration might be presented this way:
<!ENTITY ALPHA "([A-Za-z])">
" appears in the document
(here the internal DTD subset of an XML document) being
defined.
The source code is presented in a sequence chosen for clear exposition for human readers; often, this is not the same sequence as is required by processors of the formal notation used. References from one scrap to other scraps are used to show how the scraps fit together. If another scrap refers to the one just defined, the reference will look like this:
<!--* First, we declare several basic building * blocks defined by RFC 2234: *--> 《 69 Sample entity declaration for ALPHA 》 ... <!--* Next, we ... *--> ...
The entity declaration for ALPHA will appear in the output immediately following the comment reading "First, we declare several basic building blocks defined by RFC 2234:" and preceding the other comment shown.
When viewed using a Web browser, the references to other scraps are hyperlinked; references also give the number of the scrap referred to, for the convenience of those reading printed copies of the document.
Other notations that may be unfamiliar to the reader are documented elsewhere. For information on the ABNF notation used in [RFC 3986] and [RFC 3987], see [RFC 5234]. (Familiarity with the fundamentals of context-free grammars is assumed.)
For information on the regular-expression notation used here to reconstruct the grammatical constraints of the ABNF, see the definition of that notation in [XSD 1.0 Part 2: Datatypes] or (revised and clarified, but substantially the same) in [XSD 1.1 Part 2: Datatypes]. The syntax of XSD type declarations is described in [XSD 1.0 Part 1: Structures] and [XSD 1.1 Part 1: Structures], and in a number of tutorials and overviews of XSD.
This appendix is a temporary holding bin for material in the ABNFs and DTDs that has not yet been placed in a suitable location in the document.
This document was prepared by the W3C XML Schema Working Group. At the time of publication, the members in good standing were:
Some changes may be made in future work on this document: