Definitions: RFC 2046 and application/*

RFC 2046 defines application/* and text/*.

The only default charset rules are for text/*, not for application/*.

==== Proposal

We can preserve existing behavior exactly.

1/ We state that N-Triples data is currently served as text/plain and
that the default charset in this case is therefore ASCII.

2/ We register a new MIME type, application/n-triples, default charset 
UTF-8 (see below for rationale)

3/ (Optional) We could also register text/n-triples (default charset ASCII).

cf. application/xml and text/xml.

==== Rationale

== RFC 2046

I went to RFC 2046, which I think defines application/*.

[[RFC 2046
4.5.3.  Other Application Subtypes

    It is expected that many other subtypes of "application" will be
    defined in the future.  MIME implementations must at a minimum treat
    any unrecognized subtypes as being equivalent to "application/octet-
    stream".
]]

That's the nearest I could find to a general statement about
application/*.  There isn't anything about a default charset.  If no
charset is given, it's octets.

If it's octets, the interpretation is up to the content-type
registration.  The registration can either name a default or require
that a charset parameter be present.  A default seems better.

== text/*

Only text/* has any special rules, and the defaulting rule is
text/*-specific.
[see section 4.1 of RFC 2046]

1) the default for text/plain is us-ascii

2) other subtypes must default to us-ascii

3) Unrecognised types can be treated as text/plain

4) Types with unrecognised charsets are treated as
    application/octet-stream.
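
As a rough illustration of the four rules above (not normative, just
my reading of them), a receiver's fallback logic could look like the
sketch below.  The function names and the KNOWN_TEXT_SUBTYPES set are
made up for the example, and "recognised charset" is approximated by
whatever Python's codec registry knows about.

```python
import codecs

# Hypothetical: the text/* subtypes this particular receiver understands.
KNOWN_TEXT_SUBTYPES = {"plain", "html"}


def charset_known(name):
    """Approximate 'recognised charset' by Python's codec registry."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def treat_text_as(subtype, charset=None):
    """Return the (media type, charset) to fall back to for text/<subtype>."""
    # Rules 1 and 2: with no charset parameter, default to us-ascii.
    charset = charset or "us-ascii"
    # Rule 4: an unrecognised charset means the body is just octets.
    if not charset_known(charset):
        return "application/octet-stream", None
    # Rule 3: an unrecognised subtype can be treated as text/plain.
    if subtype not in KNOWN_TEXT_SUBTYPES:
        return "text/plain", charset
    return "text/" + subtype, charset
```

So a receiver that has never heard of text/n-triples would, on this
reading, handle it as text/plain in us-ascii.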

== Implications

This works well for us because ASCII is a subset of UTF-8 so existing 
N-Triples data can be read, as bytes, as both ASCII and UTF-8 without loss.

If there is no charset on application/n-triples, then the data is passed
to the processor untouched (binary octets), and whatever rules this WG
defines in the MIME type registration apply.
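
To make the defaulting concrete, here is a minimal sketch of how a
consumer might pick the charset from a Content-Type header under this
proposal.  The effective_charset function and the REGISTERED_DEFAULTS
table are hypothetical names for the example; the UTF-8 entry is just
item 2 of the proposal, not an existing registration.

```python
from email.message import Message

# Assumption: defaults named by a type's own registration (proposal item 2).
REGISTERED_DEFAULTS = {"application/n-triples": "utf-8"}


def effective_charset(content_type_header):
    """Return the charset to decode with, or None for plain octets."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    media_type = msg.get_content_type()
    charset = msg.get_param("charset")
    if charset is not None:
        # An explicit charset parameter always wins.
        return charset.lower()
    if media_type.startswith("text/"):
        # The text/* defaulting rule.
        return "us-ascii"
    # application/*: no general default, only what the registration says.
    return REGISTERED_DEFAULTS.get(media_type)
```

With this, "text/plain" gives us-ascii, "application/n-triples" gives
utf-8, and an application type with no registered default gives None,
i.e. the octets are handed over as they are.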

== Existing data works in all content types

Reading existing N-Triples data as UTF-8 or as ASCII yields the same
codepoints, i.e. set the default to UTF-8 and there is no problem for
existing N-Triples data.  What is more, new-style data is detectable
because it is outside legal US-ASCII (even better, treating it as binary
octets preserves the data).
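
A small sketch of that argument, with made-up example triples: the
same ASCII-only octets decode identically either way, and new-style
data shows up as octets at or above 0x80.  The decode_ntriples helper
is hypothetical, purely for illustration.

```python
def decode_ntriples(octets):
    """Decode as UTF-8 and report whether the input was pure US-ASCII."""
    text = octets.decode("utf-8")
    # Any octet >= 0x80 is outside US-ASCII, so it marks new-style data.
    is_legacy = all(b < 0x80 for b in octets)
    return text, is_legacy


# Existing-style data: non-ASCII characters are written as \u escapes.
old = b'<http://example.org/s> <http://example.org/p> "caf\\u00E9" .\n'

# New-style data: the same literal written directly in UTF-8.
new = '<http://example.org/s> <http://example.org/p> "caf\u00e9" .\n'.encode("utf-8")

# Existing data yields the same codepoints under either charset.
same_codepoints = old.decode("ascii") == old.decode("utf-8")
```

Here decode_ntriples(old) reports legacy data and decode_ntriples(new)
does not, which is the detectability point above.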

== Existing software

That leaves existing software fed with new data.

But that software is expecting text/plain.  Adding text/n-triples may
help if there is existing use of such a content type.  (I have seen
N-Triples served up as all sorts of things.)

We handle this by noting in the spec that text/plain is also used for
N-Triples, for compatibility, and that it is required to default to
ASCII.

Existing software that is not MIME-type sensitive is at the mercy of 
what's fed in regardless of what the working group decides, including 
Turtle.

Existing software fed with existing data for any content-type/charset
combination described here will work and be correct.

== Test cases, please!

Let's move to working on specific examples. If something looks broken,
please provide specific test cases.  It's going to be easier to make
progress if we deal with concrete examples now.

 Andy

Received on Thursday, 8 March 2012 10:37:23 UTC