- From: Robin Berjon <robin@w3.org>
- Date: Tue, 26 Feb 2013 11:59:41 +0100
- To: Jeni Tennison <jeni@jenitennison.com>
- CC: "www-tag@w3.org List" <www-tag@w3.org>
On 26/02/2013 08:45 , Jeni Tennison wrote:
> On 25 Feb 2013, at 17:08, Robin Berjon <robin@w3.org> wrote:
>> On 25/02/2013 17:40 , John Kemp wrote:
>>> The reason that text/plain vs. text/html comes up so often is
>>> that it is a very clear description of one problem with sniffing
>>> - that the author intended the representation to be displayed as
>>> text without HTML interpretation.
>>
>> I'd be less charitable. I think that this example keeps coming up
>> because proponents of authoritative metadata cannot think of any
>> other example :)
>
> I think the reason that this comes up is that HTML is one of the
> only formats that (a) you can sensibly view as some other format and
> (b) would otherwise be handled differently by a browser. Other
> formats that would fall into the same category would be SVG, MathML
> and XML with an associated stylesheet. I guess iCal too, given that's
> likely to open up in a calendar app otherwise.

And in all of these cases (one can list a few more) you're looking at
interpreting the information as text. Can you think of cases in which
you would want to interpret it as something else? (Ones that don't
involve deliberate security attacks; I can only think of those.)

If the only use case is displaying something as text instead of having
it interpreted, this points towards a solution that allows one to
indicate that something is text, rather than a general-purpose
mechanism.

>> Well, there's <plaintext> for that if you're sure that that's what
>> you want. For all the other cases it would seem that View Source
>> can work.
>
> The one example I came across recently where a big site is
> purposefully using Content-Type: text/plain is github. Look at:
>
> https://raw.github.com/darobin/respec/develop/tests/SpecRunner.html
>
> for example.

But it's rather unclear why they're doing that. This behaviour is
usually considered an annoyance.
> I'm sure that they could have implemented this differently, though I
> can't think of a way that would both avoid the security issues of
> serving arbitrary user-provided HTML out of the github.com domain and
> make it easy for people to download individual files without editing
> them.

That's not the motivation. The only thing you need to do to serve
user-provided content off github.com is to place it in a branch called
"gh-pages". See the file you pointed to above here:

    http://darobin.github.com/respec/tests/SpecRunner.html

Nothing prevents you from grabbing that too!

> 1. What is the alternative?

I don't believe that we can fix what we already have (at least, it
would be very very hard). But we can prevent the issues from
propagating further by recommending some form of magic number for new
data formats. It's not as if there isn't precedent in W3C:

   • PNG has an 8 byte signature (that includes "PNG")
     http://www.w3.org/TR/PNG/#5PNG-file-signature
   • Cache manifests use "CACHE MANIFEST"
     http://www.w3.org/html/wg/drafts/html/master/browsers.html#manifests
   • EXI uses an "EXI Cookie" that is $EXI (I wonder who's to blame
     for that)
     http://www.w3.org/TR/exi/#key-exiCookie

The EXI case is interesting in part because they ran into a limitation
of authoritative metadata: they wanted it to be possible to receive a
representation over HTTP that would be e.g. Content-Type: image/svg+xml
*and* Content-Encoding: exi, and keep that information when saved to
disk. That's problematic, because those headers are not stored with
the file.

> 2. How could we transition to that alternative?

A TAG finding indicating the issues with authoritative metadata would
be a good start. It could include discussion of how to produce a good
magic number. We have three in-house file formats that each use a
different magic number; maybe one is better than the others?

> 3. What should the TAG do to enable that transition to
> happen?

A new finding, and raising awareness.

-- 
Robin Berjon - http://berjon.com/ - @robinberjon
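
To make the magic-number idea above a little more concrete, here is a
minimal Python sketch that identifies a payload from its leading bytes
alone, with no Content-Type involved. It only covers the three W3C
signatures listed in the message. The function name sniff_format and
the labels it returns are hypothetical, chosen for illustration, and
the cache manifest check is simplified (it ignores any leading byte
order mark).

```python
from typing import Optional

# Signatures taken from the three W3C formats cited in the message.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"          # 8-byte PNG file signature
CACHE_MANIFEST_SIGNATURE = b"CACHE MANIFEST"  # first line of a cache manifest
EXI_COOKIE = b"$EXI"                          # the "EXI Cookie"

# (label, signature) pairs; the labels are hypothetical, not registered names.
SIGNATURES = [
    ("png", PNG_SIGNATURE),
    ("cache-manifest", CACHE_MANIFEST_SIGNATURE),
    ("exi", EXI_COOKIE),
]


def sniff_format(data: bytes) -> Optional[str]:
    """Return a label for the first signature that prefixes data, else None."""
    for label, signature in SIGNATURES:
        if data.startswith(signature):
            return label
    return None


if __name__ == "__main__":
    print(sniff_format(PNG_SIGNATURE + b"...image data..."))
    print(sniff_format(b"CACHE MANIFEST\nindex.html\n"))
    print(sniff_format(b"<!DOCTYPE html>"))
```

Because the check reads only the bytes of the representation, it gives
the same answer for a file on disk as for one fetched over HTTP, which
is the property the EXI example was after.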
Received on Tuesday, 26 February 2013 10:59:51 UTC