Re: Advice on making IRI document suitable for reference by HTML (and other specs)

>> I'd be happy to perform some testing on the "widget"
>> URI's.  It would be helpful if you could describe
>> (offline perhaps) some of the general test cases or
>> goals you had in mind.

Our high-level goal is to automatically test successive versions of
each implementation and generate reports from the results. Two (or
more) implementations are compatible if, for the same input, they
generate the same outputs. For now, we have been focusing on
machine-readable outputs, such as DNS packets, HTTP packets and DOM
interfaces. Eventually, it would be nice to include human-readable
outputs, such as the address bar, status bar, etc.
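
As a rough illustration of that compatibility criterion (this is not
the actual curlies code; the function names and the sample request
lines are hypothetical), the comparison for one input could be
sketched as:

```python
from collections import defaultdict


def group_by_output(results):
    """Group implementation names by the output each produced for one input."""
    groups = defaultdict(list)
    for impl, output in results.items():
        groups[output].append(impl)
    return dict(groups)


def compatible(results):
    """True if every implementation produced the same output."""
    return len(set(results.values())) <= 1


# Hypothetical captured HTTP request lines for a single test input.
results = {
    "browser_a": b"GET /a%20b HTTP/1.1",
    "browser_b": b"GET /a%20b HTTP/1.1",
    "browser_c": b"GET /a b HTTP/1.1",
}

print(compatible(results))
for output, impls in group_by_output(results).items():
    print(output, impls)
```

Grouping by output, rather than only returning a yes/no answer, makes
it easy to report which implementations agree with each other.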

We have a very rudimentary setup at http://code.google.com/p/curlies
but if you prefer to work with a different framework such as
http://browserscope.org or any other setup, that would of course be
fine too.

So far we have tested all ASCII values (0 to 127) and a few specially
selected non-ASCII values, such as a Big5 character that is known not
to round-trip through Unicode, in every part of the URL (host, path,
query, and fragment). The tests cover <img> and <form> along with a
few <base> cases, and we intend to add <a> too.
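
A minimal sketch of how such a test matrix might be generated,
assuming one URL template per part (the templates and the
example.test host are placeholders, not what the curlies tests
actually use):

```python
PARTS = ("host", "path", "query", "fragment")

# Hypothetical templates; "{c}" marks where the test character goes.
TEMPLATES = {
    "host": "http://a{c}b.example.test/",
    "path": "http://example.test/a{c}b",
    "query": "http://example.test/?a{c}b",
    "fragment": "http://example.test/#a{c}b",
}


def generate_cases():
    """Yield (part, code, url) for every ASCII value in every URL part."""
    for part in PARTS:
        for code in range(128):
            # str.replace avoids any special handling of '{' and '}'.
            yield part, code, TEMPLATES[part].replace("{c}", chr(code))


cases = list(generate_cases())
print(len(cases))  # 4 parts x 128 values
```

Each generated URL would then be placed in an <img>, <form>, or
other element and the resulting DNS/HTTP/DOM output recorded.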

>> As far as what other places might browsers output
>> IRI's, I would like to do some testing of HTTP
>> request headers.  In some past testing I noticed
>> that response headers can emit pure IRI's from
>> the server-side.
>
> It is possible for ASP-based sites or nph-style CGI scripts to
> do all sorts of nasty things.  However, they are just bugs and
> should be treated as such.

We have done some non-ASCII tests on the HTTP request side. See, for
example, how IE handles non-ASCII ?query parts:

http://curlies.googlecode.com/svn/trunk/test_results/operating_systems/WinXP_SP3/query_big5_results.html

(Currently, the report generator emits '.' for non-ASCII bytes. This
will be fixed.)

We have not tested non-ASCII URLs on the HTTP response side, but it
would be nice to know how the browsers handle those too. (What
character encoding(s) do browsers use? ISO-8859-1? UTF-8?
Document-dependent encoding?)
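
To show why the choice matters, here is a small sketch that decodes
the same raw header bytes under two of those candidate encodings. The
header values are made up, and this says nothing about what any
particular browser actually does:

```python
# Hypothetical raw Location header values as they might appear on the wire.
utf8_bytes = "/caf\u00e9".encode("utf-8")         # b"/caf\xc3\xa9"
latin1_bytes = "/caf\u00e9".encode("iso-8859-1")  # b"/caf\xe9"


def decode_candidates(raw):
    """Decode raw header bytes under each candidate encoding."""
    out = {}
    for enc in ("iso-8859-1", "utf-8"):
        try:
            out[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            out[enc] = None  # bytes are not valid in this encoding
    return out


print(decode_candidates(utf8_bytes))
print(decode_candidates(latin1_bytes))
```

The UTF-8 bytes decode without error under ISO-8859-1 but yield a
different string, while the ISO-8859-1 bytes are not valid UTF-8 at
all, so a test can distinguish the two behaviours.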

>> Other than the address bar and status bar, there
>> would be dialog boxes such as File Download
>> dialogs, and dialogs from plugins such as Flash
>> and Silverlight. Also, bookmark lists, history and
>> favorites have been interesting places to get
>> IRI's recorded and displayed.
>
> It would be nice if we had a list of all the places that
> identifiers are input or stored in a common browser, though
> most of these are implementation-specific (it really does not
> matter for interoperability how a browser stores identifiers
> for its bookmark list, so we should not restrict legitimate
> experiments in implementation efficiency).

If bookmarks or favorites are stored as UTF-8 without recording the
original document encoding, the ?query part has to be kept as
percent-encoded bytes in that original encoding; otherwise the bytes
the server expects cannot be reconstructed later. (This is the
"self-contained" IRI issue.)
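
A small sketch of the idea, using Python's urllib.parse purely for
illustration (the freeze_query helper and the sample query are
hypothetical):

```python
from urllib.parse import quote, unquote


def freeze_query(query_text, page_encoding):
    """Percent-encode a query string using the page's original encoding."""
    return quote(query_text, safe="=&", encoding=page_encoding)


# U+4E2D on a Big5 page versus on a UTF-8 page.
stored_big5 = freeze_query("q=\u4e2d", "big5")   # "q=%A4%A4"
stored_utf8 = freeze_query("q=\u4e2d", "utf-8")  # "q=%E4%B8%AD"
print(stored_big5, stored_utf8)

# The percent-encoded form is plain ASCII, so it survives storage in a
# UTF-8 bookmark file. Decoding it is only lossless if the original
# encoding is known; guessing UTF-8 yields replacement characters.
print(unquote(stored_big5, encoding="big5") == "q=\u4e2d")
print(unquote(stored_big5, encoding="utf-8"))
```

Because the stored string already carries the original bytes, the
bookmark can be replayed without knowing the page encoding at all.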

Erik

Received on Tuesday, 5 January 2010 22:21:26 UTC