FW: [WSG] choosing encoding, charset and using special characters

 
On Nov 18, 2004 2:08 PM Matthew Cruickshank wrote:

> I think it's even a difficult article for techies, because there's little good advice. 
> So here's some good advice, http://www.joelonsoftware.com/articles/Unicode.html

He was talking about the W3C character encoding tutorial (http://www.w3.org/International/tutorials/tutorial-char-enc/).
Maybe we should rethink this and some of our other documents.

Russ
________________________________

From: info@webstandardsgroup.org [mailto:info@webstandardsgroup.org] On Behalf Of Matthew
Sent: Thursday, November 18, 2004 2:08 PM
To: wsg@webstandardsgroup.org
Subject: Re: [WSG] choosing encoding, charset and using special characters


Hi Julián,

I think it's even a difficult article for techies, because there's little good advice. So here's some good advice, http://www.joelonsoftware.com/articles/Unicode.html



	"In this article I'll fill you in on exactly what every working programmer should know. All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs. Please do not write another line of code until you finish reading this article."



//

"1) Question: Is there a way to use special characters directly in the code? "

If those characters are in ISO 8859-1, then you can use them directly. But 8859-1 shares its upper byte range (values above 0x7F) with lots of other 8-bit encodings, so the same byte means a different character in each of them. Some software (like Google) can get confused when it tries to merge content from multiple charsets. That might be the Google problem you were seeing.
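
To make the overlap concrete, here is a small Python sketch (not from the original thread; the byte value 0xE9 and the three charsets are arbitrary picks for illustration). It decodes one and the same byte under several ISO 8859 variants:

```python
import unicodedata

# A single byte above 0x7F; the value 0xE9 is an arbitrary illustrative choice.
raw = bytes([0xE9])

# A few of the ISO 8859 family; each maps the upper byte range differently.
charsets = ["iso-8859-1", "iso-8859-5", "iso-8859-7"]

# Decode the same byte under each charset.
readings = {name: raw.decode(name) for name in charsets}

for name, char in readings.items():
    # Print the code point and Unicode name so the output itself stays ASCII.
    print(f"{name}: U+{ord(char):04X} {unicodedata.name(char)}")
```

The byte is identical in all three cases; only the declared charset decides whether it is read as a Latin, Cyrillic or Greek letter. Software that mixes pages without tracking each one's charset has no way to know which reading was intended.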


"2) I have seen a lot of webpages that directly use the special characters and don't code them as HTML entities. These pages are displayed correctly. 
Question: Is this a good or bad practice (to use special characters in code, instead of entities)? "

Character entities are written using only ASCII characters, so they survive in any ASCII-compatible file encoding, whereas directly encoded "special characters" depend on the file encoding (whether that's a Unicode encoding or 8859). So if your software supports Unicode encodings (e.g. a UTF-8 encoded file with 'extended characters' doesn't get mangled) then it doesn't really matter which you use.
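
As a rough illustration of that difference (a Python sketch under my own assumptions; the sample word is simply a name with one accented letter), compare the bytes produced by a numeric character reference with the bytes produced by encoding the character directly:

```python
import html

# "Julián", written with an escape so this snippet itself stays pure ASCII.
text = "Juli\u00e1n"

# Entity form: unencodable characters become numeric character references,
# so every resulting byte is plain ASCII.
as_entities = text.encode("ascii", errors="xmlcharrefreplace")

# Direct form: the bytes depend entirely on the chosen file encoding.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("iso-8859-1")

print(as_entities)  # b'Juli&#225;n'
print(as_utf8)      # b'Juli\xc3\xa1n'
print(as_latin1)    # b'Juli\xe1n'

# The entity form round-trips without knowing anything beyond ASCII.
restored = html.unescape(as_entities.decode("ascii"))
```

The entity version reads the same whatever ASCII-compatible encoding the file is declared as, while the two direct versions are only correct if the reader is told which encoding was used.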

There are very few browsers that don't display Unicode correctly when given encoded characters or entities. Browsers that aren't Unicode aware tend to display unknown entities as question marks, whereas encoded characters they don't understand come out as garbled text, if that matters.
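
The two failure modes can be imitated in Python (only an analogy for what an old browser does, not a model of any particular browser; the sample word is my own choice):

```python
# "Julián", written with an escape so this snippet itself stays pure ASCII.
text = "Juli\u00e1n"

# Garbled text: UTF-8 bytes read by software that assumes ISO 8859-1.
# The two bytes of the accented letter are shown as two separate characters.
garbled = text.encode("utf-8").decode("iso-8859-1")

# Question mark: a character the target cannot represent is substituted.
substituted = text.encode("ascii", errors="replace").decode("ascii")

print(ascii(garbled))      # 'Juli\xc3\xa1n'
print(ascii(substituted))  # 'Juli?n'
```

The garbled form is arguably worse for the reader, because the damage looks like real text rather than an obvious placeholder.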

So it seems that it's mostly to do with your internal software support, rather than browsers.


"3. In Google results, I found that those special characters aren't always correctly displayed."

It seems that Google uses Unicode (it has the meta tag, and the special characters are Unicode encoded rather than written as entities). If you do a Google search for "macron site:e-government.govt.nz" you'll see that the Maori language is displaying correctly in Google. So it seems that Google doesn't have a problem with Unicode, but maybe it has a problem with merging multiple 'extended-ascii' charsets on a single page.

I think the general opinion is that, unless you've got a legacy system, Unicode via UTF-8 is where people should already be.



.Matthew Cruickshank
http://holloway.co.nz/ 

Received on Thursday, 18 November 2004 22:54:39 UTC