Re: utf8 bom FAQ: lets publish this week

Still limping without a disk, so it is hard to write. Here
are a few comments. It is very good; these are mostly
editorial nits.

1) Can the Question be more succinct? I don't think
"removing" is needed; it is implied.

Perhaps change:

When I'm using a UTF-8 encoding, why does an extra line
appear at the top of my web page in some user agents, and
how do I remove it?

to

Why do UTF-8 encoded pages show extra lines at the top?


2) Suggested changes to the answer section, para by para:

a) Perhaps change:

This may be caused by the presence of a UTF-8 signature at
the beginning of the file that the user agent doesn't
recognize. Note that a number of more recent browsers, such
as the latest Windows-based versions of Internet Explorer,
Mozilla (Netscape) and Opera, do not exhibit this
behaviour.

to:

Some user agents do not (yet) treat the UTF-8 signature at
the beginning of the file properly.
Many of the more recent browsers, such as the latest
Windows-based versions of Internet Explorer, Mozilla
(Netscape) and Opera, do process the UTF-8 signature
correctly.
(Should we say "Windows-based"? For Mozilla and Opera it is
probably true on other platforms too.)


b) For the next para, perhaps after "manually" add "with an
editor". In the second sentence perhaps change "can remove"
to "automate removal of":

To remove the extra line or spaces that appear in the
browser, remove the bytes that represent the UTF-8
signature. You can remove them manually or with a script.
One of the benefits of using a script is that you can
remove the signature from multiple files.
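
For what it is worth, a minimal sketch of what "with a script"
could look like, in Python. The function names
strip_utf8_signature and strip_files are made up for
illustration and are not part of the FAQ; the sketch assumes the
files are small enough to read into memory and rewrites them in
place.

```python
import codecs
from pathlib import Path


def strip_utf8_signature(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_files(paths):
    """Rewrite in place only the files that start with the signature.

    Returns the list of paths that were changed.
    """
    changed = []
    for path in map(Path, paths):
        original = path.read_bytes()
        stripped = strip_utf8_signature(original)
        if stripped != original:
            path.write_bytes(stripped)
            changed.append(path)
    return changed
```

Calling strip_files on a list of paths leaves files without a
signature untouched, so it should be safe to run over a whole
directory of pages.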

c) For the next para, I would change "cause of extra line..."
to "UTF-8 signature".
Also, we should include the other way to remove the BOM: an
editor that is BOM-aware might have an option to save
without the BOM. So you might not need to edit the file,
just open it and save it correctly.

You may not be able to see the cause of the extra line or
space in your editor, if it handles the UTF-8 signature
correctly. An editor which does not handle the UTF-8
signature correctly displays the bytes that compose that
signature according to its own character encoding setting.
With the Latin 1 (ISO 8859-1) character encoding, the
signature displays as extraneous characters ï»¿. With a
binary editor capable of displaying the hexadecimal byte
values in the file, the UTF-8 signature displays as EF BB
BF.
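
A quick way to confirm both views of the signature, sketched in
Python (the helper name has_utf8_signature is hypothetical, and
bytes.hex with a separator assumes Python 3.8 or later):

```python
import codecs

signature = codecs.BOM_UTF8  # the three signature bytes


def has_utf8_signature(data: bytes) -> bool:
    """True if data begins with the UTF-8 signature."""
    return data.startswith(codecs.BOM_UTF8)


# What a binary editor would show for the signature.
hex_view = signature.hex(" ").upper()

# What an editor that assumes Latin 1 would show for the same bytes.
latin1_view = signature.decode("latin-1")

print(hex_view)     # EF BB BF
print(latin1_view)  # ï»¿
```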

d) The next para is fine, but I would move "thoroughly" after
"signature". Also, instead of "but that" use "but note that
only", and add to the end something like ", so look
specifically for some of these chars to check."

You should check thoroughly the result of removing the
signature, bearing in mind that pages with a high
proportion of Latin characters may look correct
superficially but that characters outside the ASCII range
(U+0000 to U+007F) may be incorrectly encoded.
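
One possible way to "look specifically" for those characters,
as a rough Python sketch. The function name non_ascii_report is
invented for this example; it assumes the file is meant to be
UTF-8 and simply lists every character outside the ASCII range
so a person can eyeball them.

```python
def non_ascii_report(data: bytes):
    """Decode data as strict UTF-8 and list the non-ASCII characters.

    Returns (character index, character) pairs. Raises
    UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    text = data.decode("utf-8")  # strict error handling is the default
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0x7F]
```

A page that is really Latin 1 but labelled UTF-8 will usually
fail to decode here, which is a stronger check than looking at
the page in a browser.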

If there is no evidence of a UTF-8 signature at the
beginning of the file, then your problem lies elsewhere.

e) For the following para: what do we mean by "include"?
Also, if the signature is embedded in the middle of the file,
it is likely going to display as something other than blank
lines, no?

Note that if you include text from a separate file that has
a UTF-8 signature at the top you may find blank lines
appearing within the page, rather than just at the top.
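
To make the "include" case concrete, here is a small Python
sketch under the assumption that "include" means byte-level
concatenation of two files. The fragment contents are invented.
It only shows that the signature survives as U+FEFF in the
middle of the decoded text; it says nothing about how a given
user agent renders that character.

```python
header = b"<h1>Title</h1>\n"
# Hypothetical included fragment that was saved with a signature.
included = b"\xef\xbb\xbf<p>Included text</p>\n"

# Naive concatenation keeps the signature in the middle of the page.
page = (header + included).decode("utf-8")
position = page.find("\ufeff")  # index of the embedded U+FEFF

# Decoding each piece with "utf-8-sig" drops a leading signature
# (if there is one) before the pieces are joined.
clean = header.decode("utf-8-sig") + included.decode("utf-8-sig")
```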



On Tue, 25 Nov 2003 17:49:19 -0000
 "Richard Ishida" <ishida@w3.org> wrote:
> 
> Chaps,
> 
> After discussion with Deborah I have uploaded another version of
> http://www.w3.org/International/questions/qa-utf8-bom.html
> that includes Martin's comments.
> 
> Let's try to publish this on Thursday.  Please send in any other
> comments asap, then we'll have a final discussion during the
> meeting tomorrow.
> 
> Cheers,
> RI
> 
> ============
> Richard Ishida
> W3C
> 
> contact info: http://www.w3.org/People/Ishida/ 
> 
> http://www.w3.org/International/ 
> http://www.w3.org/International/geo/ 
> 
> W3C Internationalization FAQs
> http://www.w3.org/International/questions.html
> RSS feed: http://www.w3.org/International/questions.rss
> 
> 

Received on Wednesday, 26 November 2003 13:42:57 UTC