
Re: Opening files in Notepad with/without UTF-8 signature

From: Tex Texin <tex@i18nguy.com>
Date: Thu, 06 Nov 2003 00:19:03 -0500
Message-ID: <3FA9D9C7.8D7AB887@i18nguy.com>
To: ishida@w3.org
Cc: public-i18n-geo@w3.org

Richard,

The issue is that the FAQ simply recommends removing the BOM as a solution and
offers no indication of possible consequences. We just need to indicate that
removing the BOM can have some impact.

In terms of what that impact may be, it depends on how the file is used. We can
look at risks to Notepad users, but the file might be touched by many other
applications, any of which might depend on the BOM as a signature.
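
To make the signature dependency concrete, here is a minimal Python sketch of
how an application might use the BOM to pick an encoding. The function names
and the cp1252 fallback are my assumptions for illustration; this is not how
Notepad or any particular tool is known to work.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Hypothetical sniffing rule: trust the signature, else use a fallback.

    The fallback value is an assumption; a real application may use the
    system code page, a charset declaration, or a heuristic instead.
    """
    # "utf-8-sig" is Python's codec that decodes UTF-8 and drops a leading BOM.
    return "utf-8-sig" if has_utf8_signature(data) else fallback


with_bom = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
without_bom = "caf\u00e9".encode("utf-8")

print(guess_encoding(with_bom))
print(guess_encoding(without_bom))
```

Under this rule, the same content is read differently once the signature is
gone, which is exactly the dependency the FAQ should warn about.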

With respect to Notepad, one of the considerations we discussed is that Windows
has 3 file systems. One of them, NTFS, maintains additional information about
files. We speculated that it records the encoding. If so, and if Notepad makes
use of that, the BOM might be less necessary in that environment. (Which I
suspect is also your environment, as NTFS is the default for hard disks.)

However, it won't have that information on other file systems. I think one test
is to open a file on a floppy, which won't be NTFS.
Also, Notepad has some awareness of HTML and perhaps other markup file types,
so to test this properly we need to control for those dependencies. E.g. I
am sure all your files have a meta charset statement, but others' files may not.

Also, you and I can remove BOMs on our systems, since our files aren't being
processed by complex development or web publishing systems. However, files
that live in such an environment, as Deborah's might, could easily have
some dependency on BOM detection.

But the point is that the few tests we made today are not going to cover all
the possible scenarios where removing a BOM can have an impact.

As for concerns, they are two-fold.

a) In an environment where people are blissfully letting Notepad and other
software use the BOM as an indicator, if one individual reads the FAQ and starts
removing BOMs, and he/she and others are unaware of the consequences, they
won't know to start paying attention to the encoding when they open the file.
So it's nice that the encoding is shown on the Open dialog, but nobody will pay
attention to it until after a problem occurs. And if they use right-click, they
won't even get the choice.

b) The danger is that a file is opened as ISO 8859-1, for example, and as markup
is mostly ASCII, it is possible for the encoding problem not to be noticed. If
the file is then saved as UTF-8, the non-ASCII data will be corrupt.
(The misread Latin-1 characters are written out as UTF-8, instead of the
original Unicode characters.)
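
The corruption in (b) can be reproduced in a few lines of Python. This is only
a sketch of the byte-level effect; the helper name and the sample markup are
made up for illustration.

```python
def misread_and_resave(text: str, wrong_encoding: str = "iso-8859-1") -> bytes:
    """Simulate opening a UTF-8 file with the wrong encoding, then saving as UTF-8."""
    utf8_bytes = text.encode("utf-8")            # what is really on disk
    misread = utf8_bytes.decode(wrong_encoding)  # editor guesses wrong
    return misread.encode("utf-8")               # user saves as UTF-8


# "\u00e9" is e-acute; the markup around it is plain ASCII.
original = "<p>caf\u00e9</p>"
damaged = misread_and_resave(original)

print(original.encode("utf-8"))  # b'<p>caf\xc3\xa9</p>'
print(damaged)                   # b'<p>caf\xc3\x83\xc2\xa9</p>'
```

The ASCII markup survives untouched, which is why the problem is easy to miss,
while the single e-acute has become two unrelated characters.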

Saving it as UTF-8 might require some overt user action, but is not beyond
belief. Another scenario is that the user copies and pastes data from the
misread file into another, correctly opened UTF-8 file; the pasted data will
then be incorrect.

The fix for the FAQ is just to include a sentence or two indicating that if
you are going to remove a BOM, you should know how the file is used and verify
whether removal will have an impact (or remove it and monitor that subsequent
processing doesn't break).
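
For anyone who does decide to remove the signature, a cautious removal might
look like the following Python sketch. The function name and the choice to
validate the remaining bytes as UTF-8 are my assumptions, not something the
FAQ prescribes, and it cannot tell you whether downstream tools rely on the BOM.

```python
import codecs
from typing import Tuple


def strip_utf8_signature(data: bytes) -> Tuple[bytes, bool]:
    """Return (bytes without a leading UTF-8 BOM, whether a BOM was removed)."""
    if not data.startswith(codecs.BOM_UTF8):
        return data, False
    body = data[len(codecs.BOM_UTF8):]
    # Sanity check: raises UnicodeDecodeError if the rest is not valid UTF-8,
    # in which case the leading bytes were probably not a signature at all.
    body.decode("utf-8")
    return body, True


stripped, removed = strip_utf8_signature(codecs.BOM_UTF8 + b"<p>hello</p>")
print(removed, stripped)
```

Reporting whether anything was removed makes it easier to log which files
changed, so that later breakage can be traced back to the removal.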

Also, as we noted, a BOM is not the only reason there might be problem lines or
characters at the start of files. Other causes include misinterpretation of the
encoding, or perhaps an older or incorrect version of the software.

And an alternative solution is to upgrade or replace software to make it BOM
aware.

(Although that doesn't work for CSS yet.)

hth
tex





Richard Ishida wrote:
> 
> Tex,
> 
> I have realised this evening that although I use Notepad all the time, I
> almost never used the Open dialog box !  I simply right click on a file
> and say Open with > Notepad, or drag and drop the file into an open
> Notepad window.  I find that when using either of these approaches I never
> have an issue with Notepad failing to recognise that this is utf-8 and I
> always see my file displayed correctly.
> 
> If I use the Open dialog box I note that it appears to auto-detect the
> signature before opening a file and automatically changes the Encoding
> option when it is able to detect a utf-8 file.  But if this fails, it is
> not forcing me to open the file as ANSI.  If there's no auto-detection
> of the utf-8 signature, it doesn't seem to me that you are in any way in
> an unusual situation when compared to general editing of files.  If your
> file is encoded as iso-8859-5, you'd generally need to know that and
> specify that when opening the file in any editor.
> 
> So I think that with Notepad, rather than us being faced with a problem,
> we are, on the contrary, simply being offered some useful help if it can
> sniff and detect that this is a utf-8 file.  Otherwise, all things are
> equal with what you would naturally expect.
> 
> So I'm not convinced that there is any particular issue to address here.
> 
> RI
> 
> ============
> Richard Ishida
> W3C
> 
> contact info: http://www.w3.org/People/Ishida/
> 
> http://www.w3.org/International/
> http://www.w3.org/International/geo/
> 
> W3C Internationalization FAQs
> http://www.w3.org/International/questions.html
> RSS feed: http://www.w3.org/International/questions.rss

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
Received on Thursday, 6 November 2003 00:20:01 UTC
