Re: Proposed addition to Display problems caused by the UTF-8 BOM

+1

Notepad is the default editor on almost 90% of PCs, since most PCs run 
Windows, yet I have had bad experiences with it because of the BOM 
(loss of original content created by other editors).

What I would like to add here (unrelated, of course, to 
http://www.w3.org/International/questions/qa-utf8-bom)
is that there is a similar problem with "some" HTML authoring tools and 
the charset meta tag. Example:
1- You create your text containing some Arabic with NVU.
2- You save your HTML file (it is saved as utf-8 encoding).
3- You reopen your file later with your favorite tool NVU. The content 
is then interpreted as ISO Latin, and the bytes (e.g. اي) are 
converted to their HTML entity equivalents (e.g. 
اي)

The cause is that the file is saved with a meta tag that still declares 
the default charset, ISO-8859-1:
<meta content="text/html; charset=ISO-8859-1" http-equiv="content-type" />

The next time the file is opened, NVU reinterprets it as ISO Latin and 
converts every non-ASCII byte to an HTML entity.
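
To make the mechanism concrete, here is a minimal Python sketch of that 
misreading. It is only an illustration, not what NVU does internally: the 
function name is hypothetical, and I am assuming the tool emits numeric 
character references for each misread byte.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then misread the bytes as ISO-8859-1 and
    turn every non-ASCII character into a numeric HTML entity.
    (Hypothetical helper; the entity style is an assumption.)"""
    utf8_bytes = text.encode("utf-8")
    # ISO-8859-1 maps each byte to exactly one character, so this never fails.
    misread = utf8_bytes.decode("iso-8859-1")
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in misread)


arabic = "\u0627\u064a"            # alef + yeh, two characters
print(arabic.encode("utf-8"))      # b'\xd8\xa7\xd9\x8a' (four bytes)
print(misread_as_latin1(arabic))   # &#216;&#167;&#217;&#138;
```

Two Arabic characters become four unrelated entities, and once the file is 
saved again in that form the original text can only be recovered by 
reversing each step by hand.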

The solution is to change the charset in the meta tag to utf-8 before 
saving (the same problem occurs if you delete the meta tag), or to use a 
text editor to change the charset from ISO-8859-1 to utf-8 before 
re-opening the file with NVU.
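
The second workaround can also be scripted. The following is a rough 
sketch, assuming the file really is UTF-8 on disk; the names declare_utf8 
and fix_file are hypothetical, and the regular expression is a shortcut 
rather than a real HTML parser.

```python
import re
from pathlib import Path

# Matches the charset value inside a <meta ...> tag, quoted or not.
_CHARSET = re.compile(r'(<meta\b[^>]*\bcharset\s*=\s*["\']?)([\w.:-]+)',
                      re.IGNORECASE)


def declare_utf8(html: str) -> str:
    """Rewrite any charset declared in a meta tag to utf-8."""
    return _CHARSET.sub(lambda m: m.group(1) + "utf-8", html)


def fix_file(path: str) -> None:
    """Re-declare a UTF-8 file as utf-8 before re-opening it in the editor."""
    raw = Path(path).read_bytes()
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8,
    # so a genuinely Latin-1 file is left untouched.
    html = raw.decode("utf-8")
    Path(path).write_bytes(declare_utf8(html).encode("utf-8"))


tag = '<meta content="text/html; charset=ISO-8859-1" http-equiv="content-type" />'
print(declare_utf8(tag))
# <meta content="text/html; charset=utf-8" http-equiv="content-type" />
```

Writing the bytes back with a plain "utf-8" codec does not add a BOM, so 
this does not reintroduce the Notepad problem.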

Best, Najib


Martin Duerst wrote:
>  At 00:59 07/07/26, Addison Phillips wrote:
>
> > So I would tend to replace the bit above thusly:
> >
> > -- Some applications, such as text editors, look for the BOM as a
> > signature indicating the use of a Unicode encoding. These
> > applications, such as Windows Notepad, will automatically add a
> > UTF-8 BOM to any file you save as UTF-8 so that they can detect it
> > later. Browsers, however, don't look for the BOM and Web pages
> > always need to declare the character encoding explicitly at the top
> > of the file or in the HTTP header, making a BOM unnecessary (and,
> > as noted above, sometimes harmful). --
>
>  I think this is a good direction, but I'm a bit worried by "such as
>  text editors". This implies that all or most text editors silently
>  add a BOM, which is not true. I would change "such as text editors"
>  to "such as some text editors".
>
>  Also, the "Browsers, however," is a bit of a problem, because it's
>  written as a counterpoint to editors. So I'd rewrite that part a bit,
>  too.
>
>  Regards, Martin.
>
>
> > Just a thought.
> >
> > Addison
> >
> > Richard Ishida wrote:
> >> Chaps, I propose to add the following paragraph to
> >> http://www.w3.org/International/questions/qa-utf8-bom in the
> >> section By the Way: "Applications that look at the text to work
> >> out the character encoding can tell straight away that the text
> >> is encoded in UTF-8 if they find a BOM at the beginning. This
> >> can save time if the only non-ASCII characters occur a long way
> >> down the file (such as a copyright symbol in text at the very
> >> end). Web pages, however, ought to declare the character
> >> encoding explicitly at the top of the file or in the HTTP header,
> >> so a BOM should not be necessary." Unless I hear any objections,
> >> I will make the change, unannounced, in a couple of days time.
> >> Cheers, RI
> >>
> >> ============ Richard Ishida Internationalization Lead W3C (World
> >> Wide Web Consortium)
> >>
> >> http://www.w3.org/People/Ishida/ http://www.w3.org/International/
> >> http://people.w3.org/rishida/blog/
> >> http://www.flickr.com/photos/ishida/
> >>
> >>
> > -- Addison Phillips Globalization Architect -- Yahoo! Inc. Chair --
> > W3C Internationalization Core WG
> >
> > Internationalization is an architecture. It is not a feature.
> >
>
>
>  #-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>  #-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
>
>
>


-- 
Najib TOUNSI (mailto:tounsi @ w3.org)
Bureau W3C au Maroc (http://www.w3c.org.ma/)
Ecole Mohammadia d'Ingenieurs, BP 765 Agdal-RABAT Maroc (Morocco)
Phone : +212 (0) 37 68 71 50 (P1711) Fax : +212 (0) 37 77 88 53
Mobile: +212 (0) 61 22 00 30

Received on Saturday, 28 July 2007 14:20:48 UTC