[minutes] URL rewriting / re: meeting 2008-12-16 from Eduardo Casais on 2009-01-05 (public-bpwg-ct@w3.org from January 2009)

From: Eduardo Casais <casays@yahoo.com>
Date: Mon, 5 Jan 2009 09:02:00 -0800 (PST)
To: public-bpwg-ct@w3.org
Message-ID: <857344.57827.qm@web45002.mail.sp1.yahoo.com>
I would like to comment on a statement made during the last teleconference:
"link rewriting is central to CT", which led to the conclusion that "the CT 
doc stands or falls by this question".

After pondering the question, I contend that the above viewpoint is overly
pessimistic because URL rewriting is not indissolubly linked to content 
transformation. My considerations revolve around
1) the mode of access to content transformation functions;
2) the algorithmic nature of individual transformations.


1.	Access modes.

There are two major ways a content transformation proxy may operate:

a) As a detour to applications. This is how on-line CT-proxies like the Google
Wireless Transcoder work: the end-user explicitly accesses the proxy, types in
the URL of the server to be accessed; the proxy fetches and modifies the 
content before sending it back to the terminal. 

b) As an in-band filter. This is how most operator-based CT-proxies work: all
requests are, without the user's intervention, conveyed to the CT-proxy, which 
forwards them to the server, then intercepts the responses, modifies them and
returns the outcome to the terminal.

The important point is that in (a) the end-user can access servers directly,
without making the detour via the CT-proxy, whereas in (b) all HTTP flows pass
(whether adapted or not) through the CT-proxy. This means that in (a) there is
truly no other possibility for the CT-proxy to present a seamless experience 
for adapted content than to rewrite every URL in application content so that 
subsequent requests are directed to the CT-proxy. Otherwise, further HTTP
requests would travel along the normal Internet routes, bypass the CT-proxy and
return unmodified data.
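
To make the detour case concrete, here is a minimal Python sketch of the kind
of rewriting a mode (a) proxy has to apply. The endpoint
http://ct-proxy.example/fetch?url= and both function names are invented for
illustration and do not describe any actual product; the regular expression
only handles double-quoted href and src attributes.

```python
import re
from urllib.parse import quote, urljoin

# Hypothetical entry point of a detour-style CT-proxy (an assumption).
PROXY_ENDPOINT = "http://ct-proxy.example/fetch?url="

LINK_ATTR = re.compile(r'\b(href|src)="([^"]*)"')


def via_proxy(link, base_url):
    """Resolve a link against the page URL and wrap it in a proxy URL."""
    absolute = urljoin(base_url, link)
    return PROXY_ENDPOINT + quote(absolute, safe="")


def rewrite_links(markup, base_url):
    """Redirect every double-quoted href/src in the markup to the proxy."""
    def replace(match):
        attribute, link = match.group(1), match.group(2)
        return f'{attribute}="{via_proxy(link, base_url)}"'
    return LINK_ATTR.sub(replace, markup)
```

Without such a step, a click on the original href would go straight to the
origin server and return unadapted content.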

In (b), the CT-proxy does not need to force HTTP flows to pass through it, since
they already do. Hence, there is fundamentally no obligation to rewrite all URLs.
I presume this is what Sean meant with his statement that "It is not necessary 
to do link rewriting if the CT proxy is in 'proxy mode'".


2.	Transformation algorithms.

Many content transformations have been implemented and deployed (as far back
as ten years ago) to enable mobile devices to retrieve content intended for
desktops. Here are the important ones, assuming access mode (b) (i.e. in-band
proxy), since mode (a) must rewrite URLs in essentially every case.

a) Document recoding, e.g. changing character encoding from UTF-8 to Shift_JIS
or from a non-ASCII format to UUENCODE.
No URL rewriting needed.
The URL representation might change because of the recoding, but domains, 
paths, ports, etc., remain unchanged.

b) Markup clean-up, e.g. similar to what tidy does.
No URL rewriting needed.
Adjustments to URL representations might take place (e.g. adjusting the syntax
of special entities, turning reserved characters into %NN notation), but no 
alterations to their domains, paths, ports, etc.
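
The distinction drawn in (a) and (b) between a change of representation and a
change of target can be checked mechanically. The following Python sketch is
only an illustration under simple assumptions: the function name same_target
is invented, and decoding the query string before comparison is a
simplification that a careful implementation would refine.

```python
from urllib.parse import unquote, urlsplit


def same_target(before, after):
    """True if two URL spellings designate the same scheme, host, port,
    path and query once percent-escapes are decoded."""
    a, b = urlsplit(before), urlsplit(after)
    return (
        (a.scheme, a.hostname, a.port, unquote(a.path), unquote(a.query))
        == (b.scheme, b.hostname, b.port, unquote(b.path), unquote(b.query))
    )
```

A clean-up that turns "my doc.html" into "my%20doc.html" passes this check,
whereas a URL redirected to a proxy host does not.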

c) Compaction, e.g. eliminating comments in source code. 
No URL rewriting needed.

d) Image redimensioning, e.g. cropping and scaling.
No URL rewriting needed.

e) Image colour reduction, e.g. conversions from colour to black-and-white.
No URL rewriting needed.

f) Image resolution reduction, e.g. resolution reduction for low-bandwidth
transmission. 
No URL rewriting needed.
Some operators allow the end-user to retrieve the original picture (with an 
unaltered resolution) by clicking on the image displayed by the browser. In 
this case, the corresponding URL in the markup document must be rewritten (and
extended to become a clickable image). In the basic case, where this possibility
is not given, no URL rewriting takes place.

g) Conversion, e.g. from HTML 4.0 to XHTML mobile profile 1.0, or JPEG to GIF.
No URL rewriting required.
A request for a document mydoc.html (resp. myimg.jpg) may thus return data that
is actually XHTML mp (resp. GIF), as indicated by the HTTP header field 
Content-type. The Content-location field can be set with a suitable name 
(whose extension is concocted to fit the content type) of the resource 
presented to the user. There is an issue if:
g.1: the user saves the document locally,
g.2: and the file store uses the original URL instead of the name in the 
Content-location field, or there is no Content-location field,
g.3: and the browser interprets data not on the basis of declarations or of
magic numbers inside the document, but on the file extension of the original
URL,
g.4: and the original extension is one that the browser can parse and interpret
(e.g. .html or .jpeg, but not .php or .exe).
I do not clearly see how this problem would be solved with URL rewriting,
anyway.
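
As an illustration of the header handling described in (g), the Python sketch
below derives the two response header fields for a converted resource. The
function name and the table mapping media types to file extensions are
assumptions made for this example; in particular, the ".xhtml" extension is an
arbitrary choice.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Assumed mapping from the media type after conversion to a file extension.
EXTENSIONS = {
    "application/vnd.wap.xhtml+xml": ".xhtml",
    "image/gif": ".gif",
}


def conversion_headers(original_url, new_type):
    """Build Content-Type and Content-Location for a converted resource.

    The original URL itself is left alone; only the name advertised in
    Content-Location gets an extension matching the new media type.
    Raises KeyError for media types missing from EXTENSIONS.
    """
    parts = urlsplit(original_url)
    stem, _old_extension = posixpath.splitext(parts.path)
    location = urlunsplit(parts._replace(path=stem + EXTENSIONS[new_type]))
    return {"Content-Type": new_type, "Content-Location": location}
```

For instance, a request for myimg.jpg that is answered with a GIF would carry
Content-Type image/gif and a Content-Location ending in myimg.gif, while the
links in the referring document stay untouched.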

h) Linearization.
URL rewriting needed in some cases.
Insofar as linearization involves rearranging elements on the page, or 
decomposing tables into lines, no URL rewriting is required. However, as soon
as frames must be coalesced into a single page, rewriting might be required 
at least to disambiguate fragment identifiers amongst individual frames (for
internal pointers, i.e. those URLs pointing from a frame to another place
within the frame).
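
The disambiguation of fragment identifiers can be sketched as follows in
Python. This is a simplified illustration, not a description of an existing
proxy: the function name and the "framename-" prefix scheme are assumptions,
and the regular expressions only cover double-quoted attributes.

```python
import re


def namespace_fragments(frame_markup, frame_name):
    """Prefix anchors and same-frame fragment links with the frame name,
    so that identical identifiers from different frames no longer clash
    once the frames are merged into a single page."""
    prefix = frame_name + "-"
    # Anchor targets: id="..." and name="...". A real proxy would restrict
    # the name attribute to anchor elements; this sketch does not.
    markup = re.sub(
        r'\b(id|name)="([^"]+)"',
        lambda m: f'{m.group(1)}="{prefix}{m.group(2)}"',
        frame_markup,
    )
    # Internal pointers: links consisting of a bare fragment only.
    return re.sub(
        r'href="#([^"]+)"',
        lambda m: f'href="#{prefix}{m.group(1)}"',
        markup,
    )
```

Links that designate another document, with or without a fragment, are left
as they are, which matches the observation that only internal pointers are
affected.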

i) Splitting, e.g. when a large document is subdivided into several interlinked
sub-documents.
URL rewriting needed.
Links pointing out of the original document (e.g. style sheets, images) do not
require rewriting. Additional links inserted to navigate between sub-documents
do not formally constitute rewriting of existing links. On the other hand, internal links
(i.e. those pointing within the original document itself) must be altered,
because the base URL has changed: it is the URL cached on the CT-proxy, or
generated and assigned to the Content-location field -- not the one of the 
original document.
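
A minimal Python sketch of the internal-link adjustment for splitting is given
below. Everything named here is hypothetical: the cache URL pattern on
ct-proxy.example, the function names, and the assumption that the proxy keeps
a table recording which sub-document each fragment identifier ended up in.
Only links made of a bare fragment are handled.

```python
import re


def part_url(index):
    """Hypothetical URL under which the proxy serves sub-document `index`."""
    return f"http://ct-proxy.example/cache/mydoc-part{index}.html"


def rewrite_internal_links(part_markup, part_index, fragment_to_part):
    """Redirect fragment-only links whose target moved to another part.

    fragment_to_part maps each fragment identifier of the original
    document to the index of the sub-document that now contains it.
    """
    def fix(match):
        fragment = match.group(1)
        target = fragment_to_part.get(fragment, part_index)
        if target == part_index:
            return match.group(0)  # target is still in this sub-document
        return f'href="{part_url(target)}#{fragment}"'
    return re.sub(r'href="#([^"]+)"', fix, part_markup)
```

Links pointing out of the document, such as style sheets and images, do not
match the pattern and pass through unchanged.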

j) Lossless compression, e.g. GZIP compression.
No URL rewriting needed.

k) Natural language translation, e.g. like babelfish.
No URL rewriting needed.


From this (admittedly succinct) analysis, I conclude that one must not infer
from the specific architecture of a few well-publicized CT-proxies that URL
rewriting is indispensable for transcoding. There is a large number of
important content adaptations that do not require any URL rewriting whatsoever,
under the largely dominant mode of "in-band" proxies deployed among telecom
operators.

Therefore, no matter how difficult the issue of URL rewriting appears to be,
the CT guidelines will be relevant and apply to CT-proxies -- and the work 
carried out during the last year by the task force will not have been wasted.
At least we start 2009 with good news.


E.Casais
Received on Monday, 5 January 2009 17:03:28 UTC