- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Wed, 18 Mar 2009 18:21:01 -0700
- To: Dan Connolly <connolly@w3.org>
- Cc: public-html@w3.org
On Mar 18, 2009, at 6:56 AM, Dan Connolly wrote:

> Attached find a draft for ACTION-68 ... (converted to text)

> Web addresses in HTML 5

It would be better called "Referencing a URI in HTML5"

Seriously, I consider that to be a show-stopper. There are these
things we have variously called UDIs, Universal RIs, URLs, URNs,
and now Uniform Resource Identifiers. Those are all Web addresses.
They are defined by an Internet Standard. We spent over 10 years
getting the entire world to agree on what they are and how they are
communicated across the Internet. That work is done! Starting over
just because HTML5 can't get Web terminology right is ridiculous.

Web addresses are not defined by HTML5 any more than the physical
address of your house is defined by an envelope sent via USPS mail.
What appears on the envelope is a reference to the physical address
which is interpreted by various postal carriers in their attempt to
route that mail to your postal box. Some references are less
ambiguous than others. Some can be parsed by machines. Some
references can only be understood by the last carrier on the route.

What HTML5 defines is the process of getting from the reference to
the URI within the context of the generic syntax standard. There is
absolutely no reason to call that a new definition for Web Addresses,
URLs, or anything else like it. The URI standard applies to all Web
technology and formats, not just HTML5 as it might be used by future
browsers. Calling the input to that process a "Web address" is
needlessly confusing when the same term is already commonly
understood to be the output of that process.

> This specification defines the term Web address, and defines
> various

Web address is an already defined term. This is the same objection
I made to calling these references URLs. Just call them references.

What HTML documents contain are various attributes that have values
which reference one or more URIs through the use of a sequence of
characters in the document encoding.
In order to get from the CDATA to the referenced URI, the parser
must extract the sequence of characters, transcode it to UTF-8,
decompose the string into the generic syntax components as defined
by RFC 3986 (note that this works even if the reference is invalid
or an IRI), IDNA-encode (for host) or percent-encode each of those
components for any octets that are not allowed in the URI syntax,
and then resolve it relative to the base URI as defined in RFC 3986.

In fact, I would argue that the above paragraph is sufficient to
define these strings for HTML5, but I have no doubt that others
will want to explain each of those parts in detail. That's fine,
but only as long as you don't repeat or redefine the parts that are
already defined by 3986. There is no need to.

> algorithms for dealing with Web addresses, because for
> historical reasons the rules defined by the URI and IRI
> specifications are not a complete description of what HTML user
> agents need to implement to be compatible with Web content.

Bah! The URI and IRI specifications only define the parts that are
common to all Web technology. There has never been any reason for
them to define HTML5 parsing.

> 1 Terminology
>
> A Web address is a string used to identify a resource.
>
> The term "Web address" in this specification is used to include
> not only Uniform Resource Identifiers (URIs) as they are defined
> by RFC 3986 and Internationalized Resource Identifiers (IRIs) as
> they are defined by RFC 3987, but also other strings of characters
> which can be used to identify Web resources when processed
> appropriately.
>
> A Web address is a valid Web address if at least one of the
> following conditions holds:
>
> * The Web address is a valid URI reference (i.e. it matches the
>   grammar for <URI-reference> given in RFC 3986).
>
> * The Web address is a valid IRI reference (i.e. it matches the
>   grammar for <IRI-reference> given in RFC 3987), and it has no
>   query component.
>
> * The Web address is a valid IRI reference and its query
>   component contains no unescaped non-ASCII characters [RFC3987].
>
> * The Web address is a valid IRI reference and the character
>   encoding of the Web address's Document is UTF-8 or UTF-16
>   [RFC3987].
>
> A Web address has an associated URL character encoding,
> determined as follows:
>
> If the Web address came from a script (e.g. as an argument to a
> method)
>     The Web address character encoding is the script's character
>     encoding.
>
> If the Web address came from a DOM node (e.g. from an element)
>     The node has a Document, and the URL character encoding is
>     the document's character encoding.
>
> If the Web address had a character encoding defined when the Web
> address was created or defined
>     The Web address character encoding is as defined.

s/Web address/reference to a URI/g;

> 2 Parsing Web addresses

... sorry, this entire section is disconnected from reality. It
doesn't match any of the known implementations and directly
contradicts the standard.

> 3 Resolving Web addresses

... this section should be split into "establishing the base URI
for a given reference" and "resolving URI components."

Browser-specific error handling should be applied to each component
of the base URI and reference after they have been parsed into
components according to the algorithm in RFC3986. There is no need
to repeat the URI syntax and resolution algorithm here.

The algorithm in 3986 is specifically designed to accept any string,
which means that it is impossible for it to result in an "error" --
the bad characters are just placed in the corresponding component
and the resolution process (the algorithm that interprets the
meaning of those components as forming a scheme-specific address)
is responsible for identifying the intra-component errors and
intra-component encoding requirements.
That is necessary because the contents of a component are only an
error when interpreted according to the URI scheme's semantics.

Likewise, if the base URI is a valid URI and the reference
components are valid components in URI-encoded form, then the
resolution process (combining reference and base URI) will always
result in a valid URI.

Cheers,

Roy T. Fielding <http://roy.gbiv.com/>
Chief Scientist, Day Software <http://www.day.com/>
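
The reference-to-URI steps described in the message (decompose with the RFC 3986 generic syntax, percent-encode disallowed octets per component, then resolve against the base URI) can be illustrated with a minimal Python sketch. This is only an illustration, not the HTML5 algorithm or any browser's behavior. The regular expression is the one given in RFC 3986, Appendix B; the function names, the per-component "safe" character sets, and the example.com URLs are assumptions made for this example. IDNA encoding of the host is omitted, and the input is assumed to be already decoded from the document encoding into a Python string.

```python
import re
from urllib.parse import quote, urljoin

# Regular expression from RFC 3986, Appendix B. Every part is optional,
# so it matches any string, including invalid references and IRIs.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)

# Characters left unescaped in each component (an assumption for this
# sketch). "%" is kept so that existing escapes are not encoded twice.
SAFE = {
    "path": "/%:@!$&'()*+,;=",
    "query": "/?%:@!$&'()*+,;=",
    "fragment": "/?%:@!$&'()*+,;=",
}


def split_reference(reference):
    """Decompose a reference into the five generic components."""
    m = URI_REFERENCE.match(reference)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def recompose(c):
    """Reassemble components as in RFC 3986, section 5.3."""
    out = ""
    if c["scheme"] is not None:
        out += c["scheme"] + ":"
    if c["authority"] is not None:
        out += "//" + c["authority"]
    out += c["path"]
    if c["query"] is not None:
        out += "?" + c["query"]
    if c["fragment"] is not None:
        out += "#" + c["fragment"]
    return out


def reference_to_uri(reference, base):
    """Percent-encode each component (as UTF-8), then resolve against base."""
    c = split_reference(reference)
    for name, safe in SAFE.items():
        if c[name] is not None:
            c[name] = quote(c[name], safe=safe)  # quote() encodes as UTF-8
    return urljoin(base, recompose(c))
```

For instance, `split_reference("a b?x y#z")` never fails: the disallowed spaces simply end up inside the path and query components, which is the "accepts any string" property described above. Any judgment about whether those contents are an error is left to a later, scheme-aware step.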
Received on Thursday, 19 March 2009 01:21:30 UTC