- From: Larry Masinter <LMM@acm.org>
- Date: Tue, 18 May 2010 12:00:14 -0700
- To: <public-html@w3.org>, <public-iri@w3.org>
In the deluge of the last few months, I've gotten way behind, so forgive me if I've missed some of this and am treading old ground. I'm cc'ing both the W3C HTML and IETF IRI mailing lists, since the message addresses the interface between the two specifications.

Re the HTML working group change proposal:
http://lists.w3.org/Archives/Public/public-html/2010Apr/0147.html
for W3C HTML working group Issue-56:
http://www.w3.org/html/wg/tracker/issues/56

I was asked for a comment:

> Larry, can you comment on the feasibility of the requested IRIbis
> changes? It would be really great to get this HTML WG issue settled,
> so that remaining IRI/URI/URL work can move into the IRIbis WG.

I think this seems quite feasible and would accomplish the rationale, but I have comments and questions on the requirements:

# ISSUE-56
# ========
#
# SUMMARY
# The HTML specification is changed slightly to reference the IRI
# specification using a well-defined interface.

This is fine, although we need to agree on "well-defined", and I think there might be some difference of opinion.

# RATIONALE
# To ensure a clean modular separation of the IRI and HTML specifications,
# an interface is needed. This allows the specifications to co-exist in a
# well-defined way without each specification needing to be continually
# updated as the other is fixed (for example, changing references to section
# numbers or step numbers).

Agree, this would be much better and quite feasible.

# DETAILS
# Update the IRI specification to define two algorithms:

I'm reluctant to define "algorithms" for interfaces when "constraints on interface implementations" are more appropriate.
While it is important that scripts which parse hypertext references (whether you call them IRIs or URLs) work reliably for existing web content, it also seems reasonable to allow for variations between implementations that don't matter in practice, especially when currently deployed implementations actually vary in ways that are visible (it's possible to construct tests where the results don't match) but unimportant (e.g., parsing an invalid string might give different results).

That is, in some cases, it may be reasonable to allow implementations to "agree to disagree", because the disagreements are irrelevant. Allowing this is reasonable because URL/IRI parsing is implemented in so many systems, not just browsers, and making currently conforming implementations non-conforming requires stronger justification than just a first-principle preference for precision.

When interfaces are defined in terms of required relationships between input and output, some of those relationships may well be best specified as SHOULD rather than MUST.

< * parsing an address (relative or absolute): algorithm to obtain a
<   failure/success condition (not the same as whether the input is
<   valid or not, just whether it can be parsed), and the following
<   components, from parsing an arbitrary string:
<    - <scheme> component
<    - <host> component
<    - <port> component
<    - <hostport> component
<    - <path> component
<    - <query> component
<    - <fragment> component
<    - <host-specific> component

There are some things about this requirement that I don't really understand:

IRIs and IRIbis are defined in terms of "sequence of character" where character is taken from the Unicode repertoire. The requirement is written in terms of "string", though, and the mapping is unclear at the moment. Do all implementations first translate input strings in their native character encodings into Unicode before parsing?
Are there implementations which parse strings whose encoding is a charset that does not map fully into Unicode (I vaguely recall iso-2022-jp as an example)?

What is the range of the "failure" condition? I.e., is it allowed merely to signal a "failure", or does the nature of the failure also need to be available/discernible?

Is it necessary/possible/allowed to distinguish between "component empty" and "component not present"? E.g., for
   http://example.com:/ vs http://example.com/
does the "port" component distinguish (is it allowed to distinguish) between component empty and component not present?

Should <hostport> be part of this interface, given that it can be reconstructed from <host> and <port>?

In the case where the interface signals an error: Must all of the results be defined? Must they return exactly the same results? I think making that a requirement needs some justification.

> * resolving an address A relative to a base address B with an encoding C:
>   algorithm for parsing an arbitrary string A and resolving it relative
>   to address B (which will have been resolved, but may be invalid), using
>   a specified character encoding C, and returning either success or
>   failure, and in the case of success, a string, with the following
>   conditions:

What is the range of the failure signal? I.e.,
   resolverelative(A, B, C) ==> failure is the interfa

What does it mean for B to "have been resolved", or was this an aside? I think this alludes to a presumption that, for this interface resolverelative(A, B, C), the input B is in the range of a previous resolverelative step. Or is it? Is there some assumption that B is an absolute IRI without a fragment identifier? Or with one?

>    - the output of the algorithm must be idempotent even if the base
>      argument is changed (i.e. once resolved, resolving it again with
>      the same character encoding cannot change the result)

I think this is saying something like
   resolverelative(resolverelative(A, B1, C), B2, C) == resolverelative(A, B1, C)

I think the interface definition probably needs to be clearer about the role of character encoding, unless these are meant to be "sequence of octet" strings rather than "sequence of unicode character" strings.

>    - resolving preserves errors, e.g. resolving "http://example.com##"
>      returns "http://example.com/##" not "http://example.com/#%C3".

I think calling this "preserves errors" might be confusing in the context of an interface which allows "errors" to be signaled. I'm not sure why this is a requirement; I think there's been some discussion of this on the mailing list, but perhaps someone could recap.

> Update the HTML spec to use these algorithms and reference the IRI spec
> that defines them.

I would express this as: "Update the HTML spec to use the methods whose interfaces are specified in the IRI spec."
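
To make two of the questions above concrete, here is a minimal sketch in Python. It is illustrative only and not part of the change proposal. The function split_components is a hypothetical name; it uses the generic component regular expression from RFC 3986 Appendix B and reports None for "component not present" and "" for "component empty". The stand-in resolverelative simply wraps the standard library's urljoin and ignores the encoding argument C, which is an assumption and sidesteps exactly the character-encoding question raised above. is_idempotent is a hypothetical helper that checks the idempotence condition as I read it.

```python
import re
from urllib.parse import urljoin

# Generic component regex from RFC 3986, Appendix B.
_URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_components(ref):
    """Split a reference; None means "not present", "" means "empty"."""
    m = _URI_RE.match(ref)  # every group is optional, so this always matches
    parts = {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
    host, port = None, None
    if parts["authority"] is not None:
        # Simplification (assumption): no userinfo, no bracketed IPv6 literal.
        host, sep, rest = parts["authority"].partition(":")
        port = rest if sep else None
    parts["host"], parts["port"] = host, port
    return parts


def resolverelative(a, b, c="utf-8"):
    """Stand-in for resolverelative(A, B, C); the encoding C is ignored here."""
    return urljoin(b, a)


def is_idempotent(a, b1, b2, c="utf-8"):
    """Check resolverelative(resolverelative(A, B1, C), B2, C) == resolverelative(A, B1, C)."""
    once = resolverelative(a, b1, c)
    return resolverelative(once, b2, c) == once
```

With this sketch, split_components("http://example.com:/")["port"] is the empty string while split_components("http://example.com/")["port"] is None, so the distinction is representable if the interface chooses to expose it. A check such as is_idempotent("a/b?x#y", "http://b1.example/dir/", "http://b2.example/other/") expresses the idempotence condition as a constraint on inputs and outputs rather than as an algorithm.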
Received on Tuesday, 18 May 2010 19:07:07 UTC