Re: let authors choose text/html or application/xhtml+xml (detailed review of section 1. Introduction) from Robert Burns on 2007-08-31 (public-html@w3.org from August 2007)

From: Robert Burns <rob@robburns.com>
Date: Fri, 31 Aug 2007 14:25:15 -0500
To: Roy T.Fielding <fielding@gbiv.com>
Cc: "public-html@w3.org WG" <public-html@w3.org>
Message-Id: <217F2A1D-84ED-44CD-8A81-9ABC2B58538E@robburns.com>
Hi Roy,

On Aug 31, 2007, at 1:20 PM, Roy T. Fielding wrote:

> On Aug 31, 2007, at 10:59 AM, Robert Burns wrote:
>> On Aug 31, 2007, at 12:31 PM, Roy T. Fielding wrote:
>>> On Aug 31, 2007, at 8:01 AM, Robert Burns wrote:
>>>>> One of the main reasons for this is because the W3C hasn't made  
>>>>> it clear to developers and browser manufacturers that it's the  
>>>>> media-type ("application/xhtml+xml") that people need to get  
>>>>> used to, not just the XML syntax of XHTML, and it's the media- 
>>>>> type that makes the document XHTML.
>>>>
>>>> We've been discussing this at length on the "review of content  
>>>> type rules by IETF/HTTP community"  thread (see also the wiki  
>>>> page [1]). I think a more accurate way to think of it is that a  
>>>> file's type is determined by the internals of the file and the  
>>>> authoring tool.
>>>
>>> No, that is the completely wrong way to think of it.  Media types
>>> define how a given sequence of bytes are intended to be processed
>>> by the recipient.  I can author dozens of types in vim.  It is
>>> impossible to determine the media type of content by sniffing.
>>> It is sometimes possible to determine a range of possible media
>>> types and pick one based on configuration, but there are always
>>> exceptions that will cause such a pick to be wrong.
>>
>> I'm not sure what I said conflicts with what you're saying. My  
>> point is that an author and the tool the author uses creates a  
>> file of a certain type (even before it reaches an HTTP server). No  
>> sniffing is necessary at this stage because the author and  
>> authoring tool combination already know the type of file they're  
>> creating. As you said "I can author dozens of types in vim". And  
>> you are the one in charge of deciding what type you're authoring.  
>> You may be saving it to disk with each edit and each time the HTML  
>> file you're authoring is made available as a PNG file through an  
>> http daemon. Does that misconfigured server say anything about the  
>> file type you're authoring in vim? No and it shouldn't
>
> That is still wrong.  Media Type != Data Format.  Authoring tools know
> data formats (at least supersets, like text/*).  Authoring tools never
> know HTTP's value for Content-Type.  Never.

I'm trying to understand what you're saying, but you're using many  
different terms here.
  • "Media Type != Data Format" OK, however data formats are often  
expressed through media types, right?
  • "Authoring tools know data formats (at least supersets, like
text/*)". Isn't text/* a media type? So here the authoring tool knows the
data format as expressed as a media type like "text/plain". Also, for
an authoring tool that authors only HTML (not plain text), wouldn't
that data format be expressed as the media type "text/html"? So if
data formats are expressed with the same names as media types, where
is the difference? Is media type only about expressing how the author
wants the data format handled (e.g., as text/plain instead of
text/html)? If so, I think we're missing a place for metadata that
expresses the file's data format (and allows that to be efficiently
retrieved over the network).
  • "Authoring tools never know HTTP's value for Content-Type." Here
I think is the problem. It's the HTTP Content-Type that should be set
based on the author's and the authoring tool's specification, and
therefore there's no reason for the authoring tool to know HTTP's
value. Rather, the HTTP Content-Type value should be dependent on the
author / authoring tool determination.

> You are thinking of Content-Type as a data format.  That is not its
> purpose in MIME and HTTP.

Would you say that media types can express a data format, but that
MIME and HTTP instead use them to express the author's desired
handling of the data format?

>>> If you are going to make rules for sniffing, you need to be honest
>>> about the nature of that beast -- no matter what you define, it will
>>> be wrong some percentage of the time.  It is the user's choice to
>>> determine when that is acceptable, not the choice of a standard.
>>
>> Sniffing is certainly a problem. However, browsers vendors are  
>> finding sniffing to be more reliable than content-type headers. So  
>> there's problems with sniffing and there's problems in the process  
>> of affixing and retaining the author/authoring tool intended media  
>> type to a file.
>
> No, sniffing is impossible, and the authoring tool doesn't know the
> intended media type.

However, the authoring tool, along with the author, is in the best  
position to know the intended media type. A big part of the problem  
is that frequently author != server administrator. If we want to  
create a seamless process from author to consumer that passes through  
a network, there needs to be a better way of expressing the media  
type in the authoring process that can be retained throughout  
delivery until it reaches the final consumer of that authored  
content. Filename extensions might be used, but the filename
extension cannot always express both the data format for a file and  
its author-intended handling (as might be expressed in the HTTP  
Content-Type header).

> Media types are a protocol issue that is
> related to the data format, but every data format has at least three
> overlapping potential media types (and usually much more than three,
> since the extension space for media types is bounded only by string
> sizes).

Could you provide an example of these overlapping potential media
types? I'm not following you here.

> The only way that a media type can be assigned is when a
> human makes a choice, by various configuration mechanisms, to assign
> such a type.  DefaultType is one such choice -- it only becomes a bug
> when authors are ignorant of the configuration choices, which in turn
> is a direct result of sniffing in silence.

Part of the problem here is thinking that an author and the server  
admin are the same person. Authors may create content which then gets  
distributed in all sorts of ways beyond their control. Each time the  
authored content changes hands, there's an opportunity to lose the  
metadata that accompanies the file: for the author's intentions to be  
lost. If changing the data handling (as opposed to the data format)  
of a file is important, then we should find some better way to retain  
the metadata with the file content.

Add to this that *nix has evolved beyond the simple filesystems it
once had, and it is clear that not every file without a filename
extension should necessarily be treated as a text file. More
importantly though, a server shouldn't even be configurable to give
a catch-all response when the Content-Type is unknown (that is, when
server-side MIMEMagic sniffing, a filename extension lookup, or any
other method it uses to determine the Content-Type value fails). This
is especially true since it is impossible to determine whether the
filename extension metadata is missing or it is a null filename
extension indicating "text/plain" (and the server also applies
DefaultType to unknown extensions, which it treats as null
extensions instead). Since servers are often repositories for large
and diverse groups of users, it is inevitable that files will get
loaded without known filename extensions (since we just don't have
decent protocols in place to ensure these things). If every upload/save
operation to a filesystem in any protocol required a consistent
way to store metadata (and one not as fragile and decentralized as
filename extensions), then we might expect servers to never have
insufficient Content-Type information. Since that's not the case, the
server has to allow for this missing metadata.
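
To make the ambiguity concrete, here is a minimal Python sketch of an
extension lookup with a catch-all. It is only an illustration under
stated assumptions, not how any particular server is implemented: the
table EXTENSION_TYPES, the constant DEFAULT_TYPE, and the function
content_type_for are hypothetical names made up for this example.

```python
import os.path

# Hypothetical extension table; a real server would load a much larger one.
EXTENSION_TYPES = {
    ".html": "text/html",
    ".xhtml": "application/xhtml+xml",
    ".png": "image/png",
    ".txt": "text/plain",
}

# Hypothetical catch-all, in the spirit of a DefaultType setting.
DEFAULT_TYPE = "text/plain"


def content_type_for(filename):
    """Return (media_type, known).

    known is False when the lookup failed and the catch-all was used.
    """
    _, ext = os.path.splitext(filename)
    media_type = EXTENSION_TYPES.get(ext.lower())
    if media_type is None:
        # A missing extension and an unrecognized extension both end up
        # here, so the response cannot tell the two cases apart.
        return DEFAULT_TYPE, False
    return media_type, True
```

With this sketch, content_type_for("README") and
content_type_for("photo.xyz") both come back as "text/plain", even
though only the first is plausibly a text file. The extra "known" flag
is the kind of information that is lost once the catch-all value is
sent as the Content-Type.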

I think this underscores one of the reasons I don't particularly like
the term "media type". It contributes to this ambiguity (it also is
easily confused with media description, where the term "media" has a
very different meaning in each case, as far as I can tell). If a media
type can be used to express a data format and it can also be used to  
express a Content-Type, then this language does nothing to create  
clarity in the conversations about the topic. We just get this  
dizzying array of terms that contributes to everyone talking past one  
another.

Take care,
Rob
Received on Friday, 31 August 2007 19:25:52 UTC