
Re: HTML5 review comments

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Thu, 21 Jul 2011 18:41:45 +0900
Message-ID: <4E27F459.50206@it.aoyama.ac.jp>
To: Richard Ishida <ishida@w3.org>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>

Hello Richard,

On 2011/07/20 23:55, Richard Ishida wrote:

> 8.2.2.2 Character encodings
> http://www.w3.org/TR/html5/parsing.html#character-encodings-0
>
> "When a user agent is to use the UTF-16 encoding but no BOM has been
> found, user agents must default to UTF-16LE."
>
> If the HTTP header declares the file to be UTF-16BE, which I believe it
> can, and in which case a BOM should *not* be used, then I think that
> this would not be true.

This strictly depends on what "the UTF-16 encoding" means in the 
sentence you cite. If it means "the encoding labeled as 'UTF-16'", then 
this doesn't include encodings labeled UTF-16BE, and therefore there is 
no problem. If "the UTF-16 encoding" means "any encoding that works like 
UTF-16, independent of the label and other details", then you are right.

My impression from reading "8.2.2.2 Character encodings" is that it's 
talking about the encoding labeled "UTF-16", but it might be helpful to 
check and/or clarify.
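To make the "label" reading concrete, here is a minimal Python sketch. It
is not the HTML5 algorithm, only an illustration under that reading: the
function name decode_by_label and the set of accepted labels are
assumptions made for this example. Only the bare 'UTF-16' label triggers
the BOM check and the little-endian fallback; 'UTF-16BE' and 'UTF-16LE'
fix the byte order themselves.

```python
import codecs


def decode_by_label(data: bytes, label: str) -> str:
    """Decode bytes under the 'label' reading (hypothetical helper)."""
    label = label.strip().lower()
    if label == "utf-16":
        # Bare label: a BOM, if present, decides the byte order and is dropped.
        if data.startswith(codecs.BOM_UTF16_LE):
            return data[2:].decode("utf-16-le")
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode("utf-16-be")
        # No BOM: fall back to little-endian, as in the quoted draft sentence.
        return data.decode("utf-16-le")
    if label == "utf-16be":
        # Explicit byte order in the label: no BOM sniffing, no LE fallback.
        return data.decode("utf-16-be")
    if label == "utf-16le":
        return data.decode("utf-16-le")
    raise ValueError(f"label not handled in this sketch: {label!r}")
```

Under this reading, BOM-less big-endian data labeled 'UTF-16BE' never
reaches the little-endian fallback, so the conflict does not arise.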

UTF-16 is a very special case (UTF-32 has similar issues, but is much 
less important in practice, in particular across the network), because 
it's easy to mix up UTF-16 the general encoding method used for Unicode 
with code units of 16 bits and 'UTF-16' the character encoding (charset) 
label. (Also, in implementations, it's sometimes important to be able to 
separately set "BOM/noBOM", "LE/BE", and the actual label, which is 
difficult if a converter or output routine only takes a 'charset' label 
as a parameter.)
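As an illustration of that last point, the following Python sketch keeps
byte order and BOM as two independent settings. The function encode_utf16
and its parameters are hypothetical, invented for this example; the label
sent alongside the data (for instance in an HTTP header) would be a third
setting, chosen separately by the caller.

```python
import codecs


def encode_utf16(text: str, *, big_endian: bool, with_bom: bool) -> bytes:
    """Encode text with byte order and BOM chosen independently (sketch)."""
    codec = "utf-16-be" if big_endian else "utf-16-le"
    if with_bom:
        bom = codecs.BOM_UTF16_BE if big_endian else codecs.BOM_UTF16_LE
    else:
        bom = b""
    # The -be/-le codecs never write a BOM themselves, so it is added here.
    return bom + text.encode(codec)


# By contrast, Python's plain "utf-16" codec decides both settings for the
# caller: it always writes a BOM and uses the platform's native byte order.
native = "A".encode("utf-16")
```

A routine that accepts only a single charset label, like the plain
"utf-16" codec above, leaves no way to ask for, say, big-endian output
with a BOM.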

> If the HTTP header declares the file to be
> UTF-16, then there must be a BOM, so I assume that this is a recovery
> mechanism if someone does declare UTF-16 in HTTP but omits the BOM. I'd
> think that some kind of error message would be in order though.

You want an error message like "missing BOM on UTF-16 page"? That's good 
for a validator, but not for a browser.


> 4.6.7 The q element
> http://www.w3.org/TR/html5/text-level-semantics.html#the-q-element
>
> The default stylesheet of browsers should render quotes differently
> according to the language of the text. It would be helpful to point this
> out in this section. It would also be helpful to clarify that the
> default stylesheet rendering can be overridden by a user stylesheet. It
> would be nice to have an example that illustrated this.
>
> It would also be useful to provide a few ready-made examples in section
> http://www.w3.org/TR/html5/rendering.html#punctuation-and-decorations,
> including styles for quotes within quotes, which are also done
> differently in non-English text.
>
> See http://www.w3.org/TR/CSS2/generate.html#quotes-specify for the CSS
> quotes property, which would be more appropriate for the rendering section.
>
> [I need to consider this last comment more carefully after reading the
> relevant CSS info. I'm leaving here just to remind me to do that.]

The story of <q> is really interesting. I think Francois was the one 
proposing it, or at least the one proposing the language-dependent 
quotes thing. Semantically, this was the right thing, but for about 15 
years, implementations were hopelessly behind, to the extent that I 
thought we'd have to give up on the quotes (adding them in the text 
wouldn't be that big of a problem; people are used to adding 
./;/:/!/?/...). Apparently, lately browsers have finally caught up, and 
it looks like this is going to work out.

As for the default stylesheet, it would be great to have lots of 
languages specified, but it would be a lot of work, with no end in sight.

In any case, please make sure that the quotes are added based on the 
language outside of the quotation, not the language of the quotation 
itself. As an example,

   <p lang='fr'>Il dit <q lang='en'>Hello everybody!</q>.</p>

should be rendered something like

    Il dit «Hello everybody!».

and

  <p lang='en'>He said <q lang='fr'>Bonjour tout le monde!</q>.</p>

should be rendered something like

   He said "Bonjour tout le monde!".

But if you look at 
http://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks, you 
see that's only the start: there are issues with spacing, with 
multi-line quotes, and so on.
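The basic rule from the two examples above (marks follow the language
outside the quotation) can be sketched in a few lines of Python. This is
only an illustration: the QUOTES table and the render_q function are
hypothetical, the table covers just the two languages used in the
examples, and it ignores spacing, nesting, and multi-line quotes.

```python
# Hypothetical table: opening and closing marks per language, limited to
# the two languages in the examples above.
QUOTES = {
    "fr": ("«", "»"),
    "en": ('"', '"'),
}


def render_q(outer_lang: str, q_lang: str, text: str) -> str:
    """Wrap text in the marks of the enclosing language (sketch)."""
    # q_lang is accepted but deliberately ignored: the language of the
    # quotation itself does not select the marks.
    open_mark, close_mark = QUOTES[outer_lang]
    return f"{open_mark}{text}{close_mark}"


french = "Il dit " + render_q("fr", "en", "Hello everybody!") + "."
english = "He said " + render_q("en", "fr", "Bonjour tout le monde!") + "."
```

Here french and english come out as the two renderings shown above.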


Regards,    Martin.
Received on Thursday, 21 July 2011 09:43:07 GMT
