Re: Revisiting Authoritative Metadata

From: Robin Berjon <robin@w3.org>
Date: Tue, 26 Feb 2013 11:59:41 +0100
Message-ID: <512C959D.2070703@w3.org>
To: Jeni Tennison <jeni@jenitennison.com>
CC: "www-tag@w3.org List" <www-tag@w3.org>
On 26/02/2013 08:45 , Jeni Tennison wrote:
> On 25 Feb 2013, at 17:08, Robin Berjon <robin@w3.org> wrote:
>> On 25/02/2013 17:40 , John Kemp wrote:
>>> The reason that text/plain vs. text/html comes up so often is
>>> that it is a very clear description of one problem with sniffing
>>> - that the author intended the representation to be displayed as
>>> text without HTML interpretation.
>> I'd be less charitable. I think that this example keeps coming up
>> because proponents of authoritative metadata cannot think of any
>> other example :)
> I think the reason that this comes up is that HTML is one of the
> only formats that (a) you can sensibly view as some other format and
> (b) would otherwise be handled differently by a browser. Other
> formats that would fall into the same category would be SVG, MathML
> and XML with an associated stylesheet. I guess iCal too, given that's
> likely to open up in a calendar app otherwise.

And in all of these cases (one can list a few more) you're looking at 
interpreting the information as text. Can you think of cases in which 
you would want to interpret as something else? (That don't involve 
deliberate security attacks; I can only think of those.)

If the only use case is displaying something as text instead of having 
it interpreted, this points towards a solution that lets authors indicate
that something is text, rather than a general-purpose mechanism.

>> Well, there's <plaintext> for that if you're sure that that's what
>> you want. For all the other cases it would seem that View Source
>> can work.
> The one example I came across recently where a big site is
> purposefully using Content-Type: text/plain is github. Look at:
> https://raw.github.com/darobin/respec/develop/tests/SpecRunner.html
> for example.

But it's rather unclear why they're doing that. This behaviour is 
usually considered an annoyance.

> I'm sure that they could have implemented this differently, though I
> can't think of a way that would both avoid the security issues of
> serving arbitrary user-provided HTML out of the github.com domain and
> make it easy for people to download individual files without editing
> them.

That's not the motivation. The only thing you need to do to serve
user-provided HTML off github.com is to place it in a branch called
"gh-pages". See the file you pointed to above here:


Nothing prevents you from grabbing that too!

> 1. What is the alternative?

I don't believe that we can fix what we already have (at least, it would 
be very very hard). But we can prevent the issues from propagating 
further by recommending some form of magic number for new data formats. 
It's not as if there isn't precedent in W3C:

* PNG has an 8-byte signature (that includes "PNG")

* Cache manifests use "CACHE MANIFEST"

* EXI uses an "EXI Cookie" that is $EXI (I wonder who's to blame for that)
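
To make the idea concrete, here is a minimal sketch of what detecting those three formats from their leading bytes could look like. The byte values are the ones given in the respective specifications; the function name and return labels are hypothetical, and the cache manifest check is deliberately simplified (it ignores an optional leading BOM and does not validate what follows the keyword).

```python
from typing import Optional

# Leading bytes as defined by each format's specification.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # 8 bytes, includes "PNG"
EXI_COOKIE = b"$EXI"                   # optional 4-byte EXI cookie
CACHE_MANIFEST = b"CACHE MANIFEST"     # start of the first line of a manifest


def sniff_format(data: bytes) -> Optional[str]:
    """Return a label for the format whose magic number starts `data`.

    Hypothetical helper for illustration; returns None when no known
    magic number matches.
    """
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(EXI_COOKIE):
        return "exi"
    if data.startswith(CACHE_MANIFEST):
        # Simplified: a stricter check would also handle a leading BOM
        # and inspect the character that follows the keyword.
        return "cache-manifest"
    return None
```

The point is that the identification travels with the bytes themselves, so it is still there after the representation has been saved to disk and the headers are gone.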

The EXI case is interesting in part because they ran into a limitation 
of authoritative metadata: they wanted it to be possible to receive a 
representation over HTTP that would be e.g. Content-Type: image/svg+xml 
*and* Content-Encoding: exi, and keep that information when saved to 
disk. That's problematic.

> 2. How could we transition to that alternative?

A TAG finding indicating the issues with authoritative metadata would be 
a good start. It could include discussion of how to produce a good magic 
number. We have three in-house file formats that each use a different
magic number; maybe one is better than the others?
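
One possible axis of comparison, offered as an assumption rather than something established in this thread, is how well a magic number survives common transfer corruption. The PNG specification's rationale for its signature mentions detecting line-ending translation and 7-bit stripping; the sketch below applies a few such transformations to the three signatures. The helper names are hypothetical.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
EXI_COOKIE = b"$EXI"
CACHE_MANIFEST = b"CACHE MANIFEST"


def crlf_to_lf(data: bytes) -> bytes:
    """Simulate a text-mode transfer that rewrites CRLF as LF."""
    return data.replace(b"\r\n", b"\n")


def lf_to_crlf(data: bytes) -> bytes:
    """Simulate a text-mode transfer that rewrites every LF as CRLF."""
    return data.replace(b"\n", b"\r\n")


def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel that clears the top bit of each byte."""
    return bytes(b & 0x7F for b in data)


def detects(signature: bytes, transform) -> bool:
    """True if the transformation changes the signature, i.e. a reader
    checking the signature would notice the corruption."""
    return transform(signature) != signature


if __name__ == "__main__":
    for name, sig in [("PNG", PNG_SIGNATURE), ("EXI", EXI_COOKIE),
                      ("manifest", CACHE_MANIFEST)]:
        results = {t.__name__: detects(sig, t)
                   for t in (crlf_to_lf, lf_to_crlf, strip_high_bit)}
        print(name, results)
```

Under these three transformations only the PNG signature changes; the two pure-ASCII markers pass through untouched, which matters far more for a binary format like EXI than for a text format like the cache manifest.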

> 3. What should the TAG do to enable that transition to
> happen?

A new finding, and raising awareness.

Robin Berjon - http://berjon.com/ - @robinberjon
Received on Tuesday, 26 February 2013 10:59:51 UTC