- From: Tex Texin <tex@i18nguy.com>
- Date: Thu, 06 Nov 2003 00:19:03 -0500
- To: ishida@w3.org
- Cc: public-i18n-geo@w3.org
Richard,

The issue is that the FAQ simply recommends removing the BOM as a solution and offers no indication of possible consequences. We just need to indicate that removing the BOM can have some impact.

In terms of what that impact may be, it depends on how the file is used. We can look at risks to Notepad users, but the file might be touched by many other applications, any of which might have a dependency on the BOM as a signature.

With respect to Notepad, one of the considerations we discussed is that Windows has 3 file systems. One of them, NTFS, maintains additional information about files. We speculated it recorded encoding. So, if Notepad made use of that, the BOM might be less necessary in that environment. (Which I suspect is also your environment, as NTFS is the default for hard disks.) However, it won't have that information on other file systems. I think one test is to open a file on a floppy, which won't be NTFS.

Also, Notepad has some awareness of HTML format and perhaps other markup filetypes, so to test this, we need to make sure of these dependencies. E.g. I am sure all your files have a meta charset statement. But others may not.

Also, you and I can remove BOMs on our systems; our files aren't being processed by complex development or complicated web publishing systems. However, files in such an environment, such as Deborah might have, could easily have some dependency on BOM detection.

But the point is that the few tests we made today are not going to cover all the possible scenarios where removing a BOM can have an impact.

As for concerns, they are two-fold.

a) In an environment where people are blissfully letting Notepad and other software use the BOM as an indicator, if one individual reads the FAQ and starts removing BOMs, and he/she and others are unaware of the consequences, they don't know to start paying attention to encoding when they open the file.
So it's nice that the encoding is shown on the Open dialog, but nobody will pay attention to it until after a problem occurs. And if they use right-click, they won't even get the choice.

b) The danger is that a file is opened as ISO 8859-1, for example, and as markup is mostly ASCII, it is possible for the encoding problem not to be noticed, and then if the file is saved as UTF-8, the non-ASCII data will be corrupt. (Latin-1 characters written out instead of Unicode, as UTF-8.) Saving it as UTF-8 might require some overt user action, but is not beyond belief. Another scenario is that the user copies and pastes data from the ISO file to another, correctly UTF-8, file, but the data pasted will now be incorrect.

The solution for the FAQ is just to include a sentence or two indicating that if you are going to remove a BOM, you should know how the file is used and verify whether it will have an impact (or remove it and monitor that subsequent processing doesn't break).

Also, as we noted, a BOM is not the only reason there might be problem lines or characters in front of files, e.g. misinterpretation of the encoding, perhaps an older or incorrect version of the software.

And an alternative solution is to upgrade or replace software to make it BOM aware. (Although that doesn't work for CSS yet.)

hth
tex

Richard Ishida wrote:
>
> Tex,
>
> I have realised this evening that although I use Notepad all the time, I
> almost never used the Open dialog box ! I simply right click on a file
> and say Open with > Notepad, or drag and drop the file into an open
> Notepad window. I find that when either of these approaches I never
> have an issue with Notepad failing to recognise that this is utf-8 and I
> always see my file displayed correctly.
>
> If I use the Open dialog box I note that it appears to auto-detect the
> signature before opening a file and automatically changes the Encoding
> option when it is able to detect a utf-8 file. But if this fails, it is
> not forcing me to open the file as ANSI.
> If there's no auto-detection of the utf-8 signature, it doesn't seem
> to me that you are in any way in an unusual situation when compared to
> general editing of files. If your file is encoded as iso-8859-5, you'd
> generally need to know that and specify that when opening the file in
> any editor.
>
> So I think that with Notepad, rather than us being faced with a problem,
> we are, on the contrary, simply being offered some useful help if it can
> sniff and detect that this is a utf-8 file. Otherwise, all things are
> equal with what you would naturally expect.
>
> So I'm not convinced that there is any particular issue to address here.
>
> RI
>
> ============
> Richard Ishida
> W3C
>
> contact info: http://www.w3.org/People/Ishida/
>
> http://www.w3.org/International/
> http://www.w3.org/International/geo/
>
> W3C Internationalization FAQs
> http://www.w3.org/International/questions.html
> RSS feed: http://www.w3.org/International/questions.rss

--
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
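
For illustration only, scenario (b) in the message can be sketched in a few lines of Python. This is not how Notepad or any particular tool behaves: `naive_open` is a hypothetical editor that trusts a UTF-8 BOM when one is present and otherwise falls back to ISO 8859-1, and the sample text is made up. It shows how a BOM-less UTF-8 file can be silently misread and then corrupted when saved back as UTF-8.

```python
import codecs

# Hypothetical sample content: mostly ASCII with one non-ASCII character.
original = "café"

# The same text stored as UTF-8, with and without the BOM signature.
without_bom = original.encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom


def naive_open(data: bytes) -> str:
    """Hypothetical editor: use the BOM as the only encoding signal."""
    if data.startswith(codecs.BOM_UTF8):
        # Signature found: strip it and decode as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No signature: assume a legacy default (ISO 8859-1 in this sketch).
    return data.decode("iso-8859-1")


# With the BOM the text round-trips correctly.
text_with_bom = naive_open(with_bom)

# Without the BOM the two UTF-8 bytes of "é" are read as two Latin-1 characters.
text_without_bom = naive_open(without_bom)

# Saving the misread text as UTF-8 bakes the damage into the file.
resaved = text_without_bom.encode("utf-8")

print(repr(text_with_bom))
print(repr(text_without_bom))
print(resaved != without_bom)
```

The ASCII portion survives unchanged in both cases, which is why the problem can go unnoticed in a file that is mostly markup.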
Received on Thursday, 6 November 2003 00:20:01 UTC