Re: [i18n-activity] [WoT Profile] Unclear character set constraints and non-UTF-8 html (#1664)

@himorin Thanks for working on this.

For the table I would change this:

| Relation-Type | Constraint | Remarks |
|---|---|---|
| `service-doc`  | human readable documentation, supported formats are Unicode Text, markdown, HTML and PDF.  |   |

to use the remarks more clearly:

| Relation-Type | Constraint | Remarks |
|---|---|---|
| `service-doc`  | supported media types are: `text/plain`, `text/html`, `text/markdown` and `application/pdf`  | Human-readable documentation  |

And I would go on to add a paragraph under the table:

> The types `text/plain`, `text/html`, and `text/markdown` MUST include a `charset` parameter (for example, `text/plain;charset=utf-8`) and the linked files MUST use the UTF-8 character encoding. The type `application/pdf` uses Unicode in its encoding.

Note well: RFC2854 defines `text/html` and is not obsolete. When the `charset` parameter is missing, the default encoding is Latin-1 (and specifically `iso-8859-1`). In practice browsers treat Latin-1 as `windows-1252` and HTML5 sniffs the encoding in various ways (weighted towards trying to find UTF-8). However, it is still a good idea to use `charset=UTF-8`.
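
To illustrate why the explicit `charset=UTF-8` matters, here is a minimal Python sketch. The helper `decode_body` is hypothetical and is a deliberate simplification: it only mimics the "missing charset means Latin-1, which browsers treat as `windows-1252`" behavior described above and does not attempt any HTML5 encoding sniffing.

```python
from typing import Optional


def decode_body(body: bytes, charset: Optional[str] = None) -> str:
    """Decode a text/html body using the declared charset (hypothetical helper).

    Assumption: a missing charset falls back to iso-8859-1, and any
    Latin-1 label is decoded as windows-1252, as browsers do in practice.
    """
    label = (charset or "iso-8859-1").lower()
    if label in {"iso-8859-1", "latin-1", "latin1"}:
        label = "windows-1252"
    return body.decode(label, errors="replace")


utf8_bytes = "café".encode("utf-8")

print(decode_body(utf8_bytes, "utf-8"))  # café
print(decode_body(utf8_bytes))           # cafÃ© (mojibake from the fallback)
```

The same bytes produce different text depending only on whether the label is present, which is the failure the `charset` requirement is meant to prevent.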

Annoyingly, the definition of the type `text/markdown` in RFC7763 is unhelpful: it *requires* a `charset` parameter but does not make UTF-8 (or any other encoding) the default because (and I quote):

> \[...\] its syntax rules operate on characters (specifically, on punctuation) rather than code points. Many Markdown processors will get along just fine by operating on characters in the US-ASCII repertoire (specifically punctuation), blissfully oblivious to other characters or codes.

Therefore, in 6.6.2 I would include `charset=UTF-8` on each of the first three rows. I would then add a paragraph similar to the one in 6.6.1, saying approximately:

> The types `text/plain`, `text/html`, and `text/markdown` MUST include a `charset` parameter (for example, `text/plain;charset=utf-8`) and the linked files MUST use the UTF-8 character encoding. The types `application/json` and `application/ld+json` are already restricted to UTF-8. The type `application/pdf` uses Unicode in its encoding. Binary types, such as `image/jpeg` or `application/octet-stream`, either do not have a character encoding associated with them or define the encoding internally.
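
As a rough illustration of how an implementation might check the proposed rule, here is a small Python sketch. The names `split_media_type` and `charset_ok` are hypothetical, the parameter parsing is intentionally naive (it does not handle quoted strings containing `;`), and it only checks the label, not whether the linked file's bytes are actually valid UTF-8.

```python
from typing import Dict, Tuple

# Types that the proposed paragraph requires to carry charset=utf-8.
TEXT_TYPES = {"text/plain", "text/html", "text/markdown"}


def split_media_type(value: str) -> Tuple[str, Dict[str, str]]:
    """Split 'type/subtype; name=value' into (essence, parameters)."""
    essence, _, rest = value.partition(";")
    params: Dict[str, str] = {}
    for part in rest.split(";"):
        name, sep, val = part.partition("=")
        if sep:  # skip empty or malformed fragments
            params[name.strip().lower()] = val.strip().strip('"')
    return essence.strip().lower(), params


def charset_ok(value: str) -> bool:
    """Return True if the media type satisfies the proposed charset rule."""
    essence, params = split_media_type(value)
    if essence in TEXT_TYPES:
        return params.get("charset", "").lower() == "utf-8"
    # Other types either fix their encoding or have none to declare.
    return True


print(charset_ok("text/plain;charset=utf-8"))  # True
print(charset_ok("text/markdown"))             # False
print(charset_ok("image/jpeg"))                # True
```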

-- 
GitHub Notification of comment by aphillips
Please view or discuss this issue at https://github.com/w3c/i18n-activity/issues/1664#issuecomment-1460913124 using your GitHub account


Received on Wednesday, 8 March 2023 21:40:46 UTC