Re: [minutes] URL rewriting / re: meeting 2008-12-16 from Francois Daoust on 2009-01-06 (public-bpwg-ct@w3.org from January 2009)

From: Francois Daoust <fd@w3.org>
Date: Tue, 06 Jan 2009 13:03:23 +0100
To: casays@yahoo.com
CC: public-bpwg-ct@w3.org
Message-ID: <4963488B.90502@w3.org>

Thanks for the analysis.

I don't really understand either why the doc would stand or fall by this 
question.

I note that pagination, which requires URI rewriting, is commonly done 
because big pages are the rule rather than the exception. URI 
tokenization to reduce the size of URIs in the markup sent to the client 
could also fit in the list of transformations that require URI 
rewriting, but is probably not done on such a wide basis.
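
To make the idea of URI tokenization concrete, here is a minimal sketch, 
assuming the proxy simply hands out short numbered tokens and keeps a 
lookup table to resolve them on the next request. The class name 
UriTokenizer, the /t/<n> token format and the host ct-proxy.example are 
hypothetical, chosen for illustration only, and not taken from any 
deployed proxy.

```python
class UriTokenizer:
    """Map long URIs to short tokens and back (hypothetical sketch)."""

    def __init__(self, proxy_base="http://ct-proxy.example"):
        self.proxy_base = proxy_base   # assumed proxy host, not a real service
        self._by_uri = {}              # original URI -> token number
        self._by_token = []            # token number -> original URI

    def tokenize(self, uri):
        """Return a short proxy URI standing in for `uri`."""
        if uri not in self._by_uri:
            self._by_uri[uri] = len(self._by_token)
            self._by_token.append(uri)
        return f"{self.proxy_base}/t/{self._by_uri[uri]}"

    def resolve(self, short_uri):
        """Return the original URI for a short URI issued by tokenize()."""
        prefix = f"{self.proxy_base}/t/"
        if not short_uri.startswith(prefix):
            raise ValueError("not a token URI issued by this proxy")
        return self._by_token[int(short_uri[len(prefix):])]


tokenizer = UriTokenizer()
long_uri = ("http://www.example.com/catalogue/products/list.php"
            "?category=phones&page=3&sort=price")
short_uri = tokenizer.tokenize(long_uri)
print(short_uri)                     # http://ct-proxy.example/t/0
print(tokenizer.resolve(short_uri))  # prints the original long URI
```

Every tokenized link necessarily points back at the proxy, which is why 
this transformation belongs with those that require URI rewriting.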

The question of whether we want to recommend "good" practices on that 
particular topic still remains. There were not enough people on the BPWG 
call last time to get the opinion of the working group at large on this, so 
nothing new on that front. If we merge the task force back into the group, 
it would be easier to get everyone's opinion on this.

Francois.


Eduardo Casais wrote:
> I would like to comment on a statement made during the last teleconference:
> "link rewriting is central to CT", which led to the conclusion that "the CT 
> doc stands or falls by this question".
> 
> After pondering the question, I contend that the above viewpoint is overly
> pessimistic because URL rewriting is not indissolubly linked to content 
> transformation. My considerations revolve around
> 1) the mode of access to content transformation functions;
> 2) the algorithmic nature of individual transformations.
> 
> 
> 1.	Access modes.
> 
> There are two major ways a content transformation proxy may operate:
> 
> a) As a detour to applications. This is how on-line CT-proxies like the Google
> Wireless Transcoder work: the end-user explicitly accesses the proxy, types in
> the URL of the server to be accessed; the proxy fetches and modifies the 
> content before sending it back to the terminal. 
> 
> b) As an in-band filter. This is how most operator-based CT-proxies work: all
> requests are, without the user's intervention, conveyed to the CT-proxy, which 
> forwards them to the server, then intercepts the responses, modifies them and
> returns the outcome to the terminal.
> 
> The important point is that in (a) the end-user can access servers directly,
> without making the detour via the CT-proxy, whereas in (b) all HTTP flows pass
> (whether adapted or not) through the CT-proxy. This means that in (a) there is
> truly no other possibility for the CT-proxy to present a seamless experience 
> for adapted content than to rewrite every URL in application content so that 
> subsequent requests are directed to the CT-proxy. Otherwise, further HTTP
> requests would travel along the normal Internet routes, bypass the CT-proxy and
> return unmodified data.
> 
> In (b), the CT-proxy does not need to force HTTP flows to pass through it, since
> they already do. Hence, there is fundamentally no obligation to rewrite all URLs.
> I presume this is what Sean meant with his statement that "It is not necessary 
> to do link rewriting if the CT proxy is in 'proxy mode'".
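
The rewriting that mode (a) forces on the proxy can be sketched roughly as 
follows. This is only an illustration under stated assumptions: the 
endpoint http://ct-proxy.example/fetch?u= and the function names are 
hypothetical, and matching href attributes with a regular expression is a 
simplification (a real proxy would parse the markup and also handle src, 
form actions and so on).

```python
import re
from urllib.parse import quote, urljoin

# Hypothetical entry point of a "detour" CT-proxy; not a real service.
PROXY_ENDPOINT = "http://ct-proxy.example/fetch?u="

# Simplification: only double-quoted href attributes are considered.
HREF = re.compile(r'href="([^"]*)"')


def via_proxy(link, page_url):
    """Resolve `link` against the page URL and point it at the proxy."""
    absolute = urljoin(page_url, link)
    return PROXY_ENDPOINT + quote(absolute, safe="")


def rewrite_links(html, page_url):
    """Rewrite every href so that the next request returns to the proxy."""
    return HREF.sub(
        lambda m: 'href="%s"' % via_proxy(m.group(1), page_url), html
    )


page = '<a href="news/today.html">News</a>'
print(rewrite_links(page, "http://www.example.com/index.html"))
# <a href="http://ct-proxy.example/fetch?u=http%3A%2F%2Fwww.example.com%2Fnews%2Ftoday.html">News</a>
```

In mode (b) none of this is needed, since the request for 
news/today.html reaches the proxy anyway.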
> 
> 
> 2.	Transformation algorithms.
> 
> Many content transformations have been implemented and deployed (as far back
> as ten years ago) to enable mobile devices to retrieve content intended for
> desktops. Here are the important ones, assuming access mode (b) (i.e. in-band
> proxy), since mode (a) must rewrite URLs in essentially every case.
> 
> a) Document recoding, e.g. changing character encoding from UTF-8 to Shift_JIS
> or from a non-ASCII format to UUENCODE.
> No URL rewriting needed.
> The URL representation might change because of the recoding, but domains, 
> paths, ports, etc, remain unchanged.
> 
> b) Markup clean-up, e.g. similar to what tidy does.
> No URL rewriting needed.
> Adjustments to URL representations might take place (e.g. adjusting the syntax
> of special entities, turning reserved characters into %NN notation), but no 
> alterations to their domains, paths, ports, etc.
> 
> c) Compaction, e.g. eliminating comments in source code. 
> No URL rewriting needed.
> 
> d) Image redimensioning, e.g. cropping and scaling.
> No URL rewriting needed.
> 
> e) Image colour reduction, e.g. conversions from colour to black-and-white.
> No URL rewriting needed.
> 
> f) Image resolution reduction, e.g. resolution reduction for low-bandwidth
> transmission. 
> No URL rewriting needed.
> Some operators allow the end-user to retrieve the original picture (with an 
> unaltered resolution) by clicking on the image displayed by the browser. In 
> this case, the corresponding URL in the markup document must be rewritten (and
> extended to become a clickable image). In the basic case, where this possibility
> is not given, no URL rewriting takes place.
> 
> g) Conversion, e.g. from HTML 4.0 to XHTML Mobile Profile 1.0, or JPEG to GIF.
> No URL rewriting required.
> A request for a document mydoc.html (resp. myimg.jpg) may thus return data that
> is actually XHTML MP (resp. GIF), as indicated by the HTTP header field 
> Content-Type. The Content-Location field can be set with a suitable name 
> (whose extension is concocted to fit the content type) of the resource 
> presented to the user. There is an issue if:
> g.1: the user saves the document locally,
> g.2: and the file store uses the original URL instead of the name in the 
> Content-Location field, or there is no Content-Location field,
> g.3: and the browser interprets data not on the basis of declarations or of
> magic numbers inside the document, but on the file extension of the original
> URL,
> g.4: and the original extension is one that the browser can parse and interpret
> (e.g. .html or .jpeg, but not .php or .exe).
> I do not clearly see how this problem would be solved with URL rewriting,
> anyway.
> 
> h) Linearization.
> URL rewriting needed.
> Insofar as linearization involves rearranging elements on the page, or 
> decomposing tables into lines, no URL rewriting is required. However, as soon
> as frames must be coalesced into a single page, rewriting might be required 
> at least to disambiguate fragment identifiers amongst individual frames (for
> internal pointers, i.e. those URLs pointing from a frame to another place
> within the frame).
> 
> i) Splitting, e.g. when a large document is subdivided into several interlinked
> sub-documents.
> URL rewriting needed.
> Links pointing out of the original document (e.g. style sheets, images) do not
> require rewriting. Additional links inserted to navigate between sub-documents
> are formally no rewriting of existing links. On the other hand, internal links
> (i.e. those pointing within the original document itself) must be altered,
> because the base URL has changed: it is the URL cached on the CT-proxy, or
> generated and assigned to the Content-Location field -- not the one of the 
> original document.
> 
> j) Lossless compression, i.e. GZIP compression.
> No URL rewriting needed.
> 
> k) Natural language translation, e.g. like babelfish.
> No URL rewriting needed.
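
For the splitting case (i) above, the rewriting of internal links could 
look roughly like the sketch below. It assumes the proxy already knows 
which sub-document each anchor of the original page ended up in; the 
function name, the anchor_to_part table and the mydoc-partN.html naming 
scheme are hypothetical, and the regular expression is again a 
simplification of real markup parsing.

```python
import re

# Simplification: only double-quoted href attributes are considered.
HREF = re.compile(r'href="([^"]*)"')


def rewrite_internal_links(part_html, part_index, anchor_to_part, part_url):
    """Rewrite '#fragment' links whose target moved to another sub-document.

    anchor_to_part maps each anchor name of the original document to the
    index of the sub-document that now contains it; part_url(i) returns the
    URL under which the proxy serves sub-document i (both are assumptions).
    """
    def fix(match):
        link = match.group(1)
        if not link.startswith("#"):
            return match.group(0)      # link out of the document: untouched
        target = anchor_to_part.get(link[1:])
        if target is None or target == part_index:
            return match.group(0)      # unknown anchor or same sub-document
        return 'href="%s%s"' % (part_url(target), link)

    return HREF.sub(fix, part_html)


anchors = {"intro": 0, "details": 1}
part0 = ('<a href="#details">Details</a> <a href="#intro">Top</a> '
         '<a href="http://www.example.com/style.css">CSS</a>')
print(rewrite_internal_links(part0, 0, anchors,
                             lambda i: "mydoc-part%d.html" % i))
# <a href="mydoc-part1.html#details">Details</a> <a href="#intro">Top</a> <a href="http://www.example.com/style.css">CSS</a>
```

Only the link whose target moved to another sub-document is altered; the 
link out of the document is left as it was, in line with the analysis 
above.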
> 
> 
> From this (admittedly succinct) analysis, I conclude that one must not infer
> from the specific architecture of a few well-publicized CT-proxies that URL
> rewriting is indispensable for transcoding. There is a large number of
> important content adaptations that do not require any URL rewriting whatsoever,
> under the largely dominant mode of "in-band" proxies deployed among telecom
> operators.
> 
> Therefore, no matter how difficult the issue of URL rewriting appears to be,
> the CT guidelines will be relevant and apply to CT-proxies -- and the work 
> carried out during the last year by the task force will not have been wasted.
> At least we start 2009 with good news.
> 
> 
> E.Casais
> 
Received on Tuesday, 6 January 2009 12:03:58 UTC