Re: New Tutorial: Character sets & encodings in XHTML, HTML and CSS

Nice doc. From a quick glance:
> Select an encoding that maximizes the opportunity to directly represent
characters and minimizes the need to represent characters by using character
escapes.

Important: you should warn readers that charsets such as Shift-JIS exist in
many variants, and strongly recommend that they escape *all* characters that
vary between those variants. Cf.

XML Japanese profile
MURATA Makoto Ed., XML Japanese Profile, W3C Note. (See
http://www.w3.org/TR/japanese-xml/.)
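
To make the variant problem concrete, here is a minimal Python sketch. It assumes that Python's built-in `shift_jis` and `cp932` codecs are acceptable stand-ins for two Shift-JIS variants; the helper names `decode_with_variants` and `escape_varying` are hypothetical and not from the tutorial or the Note.

```python
# Two Shift-JIS variants shipped with Python; chosen only as an illustration.
VARIANTS = ("shift_jis", "cp932")


def decode_with_variants(raw: bytes, codecs=VARIANTS) -> dict:
    """Decode the same bytes under each codec and return {codec: text}."""
    return {name: raw.decode(name) for name in codecs}


def escape_varying(text: str, varying: set) -> str:
    """Replace every character found in `varying` with a hexadecimal NCR."""
    return "".join(f"&#x{ord(ch):04X};" if ch in varying else ch for ch in text)


# The byte pair 0x81 0x60 is a commonly cited case where the two codecs
# map the same bytes to different Unicode characters.
decoded = decode_with_variants(b"\x81\x60")
for name, text in decoded.items():
    print(name, [f"U+{ord(ch):04X}" for ch in text])

# Escaping the affected characters removes the dependency on whichever
# mapping table the reader's software happens to use.
varying = set("".join(decoded.values()))
print(escape_varying("1" + decoded["shift_jis"] + "2", varying))
```

The point of the sketch is only that identical bytes can decode differently; which characters actually vary should be taken from the Note itself.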

That also needs to be clarified in the section starting:
>Only use escapes for characters in exceptional circumstances - create pages
using an encoding that supports all the characters you need

>user agent
Although "user agent" is glossed at first reference, it is still a rather
awkward term. Since this is a tutorial, it might be better to just use the word
browser, and say near the top that the term 'browser' is used for simplicity,
but that the text really applies to a broader range of so-called user agents,
including [list other examples!].

>- We recommend the use of XHTML wherever possible; and if you serve XHTML as
text/html we assume that you are conforming to the compatibility guidelines in
Appendix C of the XHTML 1.0 specification.

>- We recognize that XHTML served as XML is still not widely supported, and
that therefore many XHTML 1.0 pages will be served as text/html.

Isn't this a pretty counter-productive recommendation? It sounds like you are
saying: "we recommend that you use something that won't work on the vast
majority of your users' browsers".

>Where appropriate, declare the page's character encoding by setting the charset
parameter in the HTTP Content-Type header.

This 'feature' is a real pain. The advice needs to be much clearer, something
like:

If all those who will be posting pages can reset the charset parameter, then you
can impose a default on all the pages. If not, don't.
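
Why this matters can be sketched in a few lines of Python. The sketch assumes the usual HTML rule that a charset parameter in the HTTP Content-Type header overrides any declaration inside the page; the function names are hypothetical, and only the standard library's `email.message.Message` is used to parse the header value.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lower-cased, None when absent


def effective_encoding(http_content_type, in_document_charset):
    """Assumed precedence: the HTTP charset, if present, wins over the page."""
    return charset_from_content_type(http_content_type) or in_document_charset


# A server-wide default silently overrides what the page author declared.
print(effective_encoding("text/html; charset=ISO-8859-1", "shift_jis"))
# Without a server-side charset, the author's own declaration is used.
print(effective_encoding("text/html", "shift_jis"))
```

If authors cannot change the header for their own pages, the first case mislabels every page that is not in the server's default encoding.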

>There are three characters which should always appear in content as escapes
should => must
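
As a quick illustration of why "must" is the right word, here is a small Python sketch. It assumes the three characters in question are `<`, `>` and `&`, and uses the standard library's `html.escape`; the sample string is made up.

```python
import html

# Hypothetical page content containing all three markup-significant characters.
raw_text = "a < b & c > d"

# quote=False limits escaping to &, < and >, leaving quotation marks alone.
escaped = html.escape(raw_text, quote=False)
print(escaped)

# Unescaping gives back the original text, so nothing is lost by escaping.
print(html.unescape(escaped) == raw_text)
```

Left unescaped, these characters can be read as the start of a tag or of an entity reference, which is why the rule cannot be optional.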

>The following table lists Unicode characters that should not be used in a
markup context, according to the W3C Note and Unicode Technical Report Unicode
in XML & Other Markup Languages. You should use markup instead.

This needs to be a bit clearer. Many of these recommendations are
HTML-specific. Unless the author of the XML DTD or schema has provided
equivalent facilities, a markup replacement for LRE, for example, may not be
available.

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

----- Original Message ----- 
From: "Matitiahu Allouche" <matial@il.ibm.com>
To: "Richard Ishida" <ishida@w3.org>
Cc: <www-international@w3.org>; <www-international-request@w3.org>;
<www-i18n-comments@w3.org>
Sent: Thu, 2004 Mar 25 00:08
Subject: Re: New Tutorial: Character sets & encodings in XHTML, HTML and CSS


>
> A few comments.  Sorry that they are mostly nitpickings.
>
> 1) Section "Character escapes" mentions &#x597D; as the escape for the
> Hebrew letter Alef.  I don't know how this value was obtained; U+597D is
> a CJK ideograph, not a Hebrew letter.  The right escape IMHO should be
> &#x05D0;.
>
> 2) In section "Consider using a Unicode encoding", instead of
> "Unicode encodings support many languages with a single encoding across
> all pages and forms, regardless of language."
> I suggest
> "All Unicode encodings support many languages and can accommodate all kinds
> of pages and forms containing any mixture of those languages."
>
> 3) In section "When to do this" following mention of the IANA registry,
> add "are" after "there" in "there no disadvantages".
>
> 4) In section "Precedence rules", add "is" after "it" in "since it
> likely".
>
> 5) In the title "entities and numeric character references (ncrs)", the
> acronym should be spelled "NCRs", to be consistent with further
> occurrences, and to distinguish the plural "s" from the acronym itself.
>
> 6) In the next paragraph, "are way" should be "are ways".
>
> 7) The example for CSS escape is *not* terminated by a space, despite
> stating in the previous line that it should be.
>
> 8) In section "When to use escapes", the sentence "For example, to
> represent Chinese characters in an ISO Latin 1 document." is not a
> complete sentence, and should be an added clause to the previous sentence
> (separated by comma).
>
> 9) In the table contained in section "Other Unicode characters are OK",
> LRM and RLM are commented as "Deprecated in Unicode".  I am very
> surprised.  What is the basis for such a statement?
>
> 10) In section "Compatibility characters vary in appropriateness", add a
> comma before "in some other cases it denotes a property".
>
>
> Shalom (Regards),  Mati
>            Bidi Architect
>            Globalization Center Of Competency - Bidirectional Scripts
>            IBM Israel
>            Phone: +972 2 5888802    Fax: +972 2 5870333
>            Mobile: +972 52 554160
>
>
>
>
>
> "Richard Ishida" <ishida@w3.org>
> Sent by: www-international-request@w3.org
> 24/03/2004 14:44
>
> To: <www-international@w3.org>
> cc:
> Subject: New Tutorial: Character sets & encodings in XHTML, HTML and CSS
>
>
>
>
>
> The GEO task force has published its first tutorial:
>
>                  Character sets & encodings in XHTML, HTML and CSS
>
> At: http://www.w3.org/International/tutorials/tutorial-char-enc.html
>
>
> This tutorial has been worked on for quite some time by the GEO Task Force
> of the W3C Internationalization Working Group, and it is thought to be
> ready for publication. For an undetermined initial period we will leave
> the status as Draft to indicate that we invite feedback on the document.
>
>
> You can find links to internationalization specifications, FAQs, articles,
> tools, tests, and soon tutorials at http://www.w3.org/International/
>
> ============
> Richard Ishida
> W3C
>
> contact info: http://www.w3.org/People/Ishida/
>
> http://www.w3.org/International/
>
>

Received on Thursday, 25 March 2004 10:56:41 UTC