THIS IS A TEMPORARY PAGE TO HELP ASSEMBLE SUGGESTED EDITS FOR THE UNICODE EDCOM. IT WILL DISAPPEAR ONCE THOSE PROPOSED EDITS HAVE BEEN COMMUNICATED.

Unicode Frequently Asked Questions

Unicode and the Web

Q: My web page is in Latin-1 (ISO-8859-1). So I don't need a charset declaration, right?

The FAQ should probably start with the question: What character encoding should I use for my Web pages? The answer is: Only UTF-8! See https://www.w3.org/International/questions/qa-choosing-encodings.en.html

Then this answer should begin with a reminder that if the reader intends to create a new page, they should not be using Latin-1 at all, unless they have a very obscure reason.

Wrong. You always need a charset declaration, even when you are using Latin-1. To quote from the HTML specification:

The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter. — HTML 4.01

Thus you should always include a charset declaration of the following form in the <head> element:

I don't see any reason to recommend use of this old-fashioned, complicated markup. Just recommend <meta charset="utf-8">. All modern browsers recognise that.

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Do we really want to tell them how to save a file as Latin1? Btw, is FrontPage still a thing?

Your HTML editor will usually give you an easy way to do this. For example, in Microsoft FrontPage you select File > Properties > Language > HTML Encoding, and pick the encoding you want.

Apart from the fact that this will be redundant given the comments higher up, it's incorrect. Latin-1 doesn't become UTF-8 just because you're using HTML5!

In HTML5 this declaration changes to

<meta charset="utf-8">

It's not the only allowed value. But it's definitely not recommended to use anything else for new documents.

with "utf-8" (or case-insensitive variations) the only allowed value. Other charsets are no longer supported.

Q: How should I encode international characters in URLs?

This document is obsolete and out of date. Maybe point to https://www.w3.org/International/articles/idn-and-iri/index.en.html ?

See http://www.w3.org/TR/charmod/#URIs

Q: The appearance of some of the pages on the Unicode site is flawed by the inclusion of illegal characters. John Walker has written an amusing article and an excellent program to purge documents of these problems; see http://www.fourmilab.ch/webtools/demoroniser/

I agree that this question should be removed.

NOTE TO REVIEWERS: this is not actually a question and it contains the answer in the Q part, with the rest being a comment. Suggest deleting it, or moving it to the end if someone wants to reword it.

The demoronizer seems to have a bug in it. The page is written in UTF-8, and it contains the character U+2014 (EM DASH), which is a perfectly reasonable character. In UTF-8 it is encoded as the byte sequence E2 80 94. It appears that demoronizer is ignoring the charset parameter, interpreting it as iso-8859 or some other charset, seeing the 80 byte and marking it as an error. We generally try to run our pages through the W3C validator, which has been upgraded to recognize UTF-8.
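The misinterpretation described above can be reproduced directly. This Python sketch shows the UTF-8 byte sequence for U+2014 and what a tool sees if it ignores the declared charset and assumes a Windows code page (cp1252 is used here as a stand-in for whatever legacy charset the tool assumes):

```python
# U+2014 (EM DASH) is a legitimate character; in UTF-8 it is three bytes.
em_dash = "\u2014"
utf8_bytes = em_dash.encode("utf-8")
print(utf8_bytes.hex(" "))  # e2 80 94

# A tool that ignores the charset parameter and assumes a Windows code
# page misreads those bytes as mojibake, tripping over the 0x80 byte:
misread = utf8_bytes.decode("cp1252")
print(misread)  # â€”
```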

Q: We are setting up a database for use with our web server. I understand that if I want to store data into a database, I need to use a consistent character encoding scheme. Does Unicode cover all the character sets we need, for a web server?

Yes. For a database, it is important to use a consistent character encoding scheme. Unicode works perfectly on the backend for keeping all of your data in a consistent format.

Q: Since we will have text from different languages and scripts on our pages, what are our options for delivery of pages?

The answer is simply: use UTF-8! No need to talk about NCRs as a workaround here.

The note about HTML5 being limited to UTF-8 is not only incorrect but also misleading. All browsers handle characters internally as Unicode, regardless of what encoding you declare.

In HTML (or XML) you can either use NCRs (numerical character references), or you can choose a charset that will contain all of the characters on the page. Note that the HTML5 charset declaration is limited to UTF-8.

Q: What are NCRs and CERs?

The terminology here is well out of date. 'CERs' are nowadays called named character references (in the HTML5 spec).

Instead of simply including a character such as an “a” in a file, you can write it using the character code, as “&#x61;” (the hex value) or “&#97;” (the decimal value). For help with calculating hexadecimal and decimal NCRs, see: https://r12a.github.io/app-conversion/.

You can also add the trademark sign and alpha directly as characters. The usefulness of character escapes here hinges only on difficulties in typing these things. Since UTF-8 includes all of Unicode, there is no need to circumvent encoding limitations as there was when serving with another code page. At the very least this should carry a heavy disclaimer indicating that it is useful only for those who persist in not using UTF-8, and it's not a great solution even then.

Few people use this for ASCII, of course, but it does allow you to put the occasional character such as a trademark sign (™) or alpha (α) in your text. CERs (character entity references) are similar, except that they use abbreviations, such as “&eacute;” instead of numbers.
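As a sketch of how the references above are formed, the following Python snippet computes hex and decimal NCRs from a character's code point, and resolves a named reference with the standard library (the function names `ncr_hex` and `ncr_dec` are illustrative, not part of any standard API):

```python
import html

def ncr_hex(ch):
    """Hexadecimal numeric character reference, e.g. 'a' -> '&#x61;'."""
    return f"&#x{ord(ch):X};"

def ncr_dec(ch):
    """Decimal numeric character reference, e.g. 'a' -> '&#97;'."""
    return f"&#{ord(ch)};"

print(ncr_hex("™"), ncr_dec("α"))  # &#x2122; &#945;
print(html.unescape("&eacute;"))   # é  (a named character reference)
```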

Q: What are the pros of using NCRs (and CERs)?

NCRs can be useful when:

a) You know what the Unicode value is (or the abbreviation), but don’t have a way to enter the character directly in the output character set.

b) Your tools don’t let you edit Unicode text directly.

c) You cannot tell which of several similar-looking characters your editor is using, and want to pin down the precise value.

Q: What are the cons of using NCRs?

NCRs are:

a) hard to maintain (can you read code points and/or abbreviations as easily as plain text?)

b) hard to format

c) not well handled by many search engines

I'm not sure how this was ever true. Certainly not now.

d) most importantly: not compatible with as many browsers as UTF-8

NOTE TO REVIEWERS: is item (d) still correct?

Q: How can I ensure that my document uses an encoding that will not require the use of NCRs?

This somewhat misses the point, since NCRs are never required. If you have this problem you should just encode your document as UTF-8!

If you need a multilingual document that spans charsets or you do not want to have to keep track of such things, then UTF-8 is the best alternative. Using UTF-8 directly is much more maintainable than using NCRs, since it is far easier for people to work with the text than with the codepoints. To set the charset to be UTF-8, use the following meta tag:

See above: just recommend the shorter syntax.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

In HTML5 this becomes:

<meta charset="utf-8">

Q: Will my HTML editor automatically fix NCRs for me?

This whole question and answer are moot, since you shouldn't be using any encoding other than UTF-8. If we are going to entertain edge questions about legacy encodings, for people who refuse to just save their page as UTF-8 and change the declaration, a better comment might be that NCRs always use Unicode code point values, and are therefore not affected by the declared encoding. And they are all supported by browsers, regardless of the declared encoding (because everything is converted to Unicode internally anyway).

Yes, when you reset the charset on a page, if the right option is set a good editor will add NCRs when necessary, and convert unnecessary ones into regular characters. For example,

<p>Σ and Я</p> // charset=utf-8
<p>&#931; and Я</p> // charset=iso-8859-5
<p>&#931; and &#1071;</p> // charset=iso-8859-1
<p>Σ and &#1071;</p> // charset=iso-8859-7
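The conversion such an editor performs can be sketched in Python: the standard `xmlcharrefreplace` error handler substitutes an NCR exactly when the target charset cannot encode a character, reproducing the four cases above (the helper name `to_charset` is illustrative):

```python
def to_charset(text, charset):
    """Encode text for `charset`, replacing unencodable characters with NCRs."""
    return text.encode(charset, errors="xmlcharrefreplace").decode(charset)

s = "Σ and Я"
print(to_charset(s, "utf-8"))       # Σ and Я             (both encode directly)
print(to_charset(s, "iso-8859-5"))  # &#931; and Я        (Cyrillic charset: Σ escaped)
print(to_charset(s, "iso-8859-1"))  # &#931; and &#1071;  (neither encodes)
print(to_charset(s, "iso-8859-7"))  # Σ and &#1071;       (Greek charset: Я escaped)
```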

Q: We are using forms in HTML. If we use Unicode for all of our HTML pages, does that mean that once the forms are submitted, the user input also gets back to Unicode (i.e. the webserver is able to map the local charset with the Unicode one)?

Again, this should start by asking why your page isn't served using UTF-8. This is such an out of date way of thinking about things.

If you have a single CGI and a single HTML form, then the browsers will return the data in the encoding of the original form, so there is no ambiguity about the charset. If you have a single CGI and multiple (localized) HTML forms which may use different charsets, then it may not be so simple. While there is a protocol for revealing the charset of a submitted form, it is not always used. Some people use the following skanky trick to get around this: include a hidden field in your form with known characters in it. Based upon the bytes that get sent to you, you can determine the charset that the user typed in. Ugly, but it seems to work.

Q: How does that work, exactly?

The trick works, because the hidden characters will be converted to the user’s charset (like the rest of the form) when it is submitted to you. So by putting, say, a YE in for a Russian page, you can look at the bytes that you receive. Based on those bytes, you decide which of the Russian character sets was used.
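A minimal Python sketch of this trick, assuming the hidden field carries CYRILLIC CAPITAL LETTER IE and that the candidate list below covers the charsets your Russian forms might be served in (the function name and candidate list are illustrative):

```python
PROBE = "\u0415"  # Е, CYRILLIC CAPITAL LETTER IE: the known hidden-field character
CANDIDATES = ["utf-8", "koi8-r", "windows-1251", "iso-8859-5"]

def detect_charset(probe_bytes):
    """Match the bytes received for the hidden field against each candidate."""
    for charset in CANDIDATES:
        if probe_bytes == PROBE.encode(charset):
            return charset
    return None  # unrecognized; fall back to a default or reject the submission

# Each candidate encodes Е differently, so the submitted bytes identify it:
print(detect_charset("\u0415".encode("koi8-r")))  # koi8-r
```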

NOTE TO REVIEWERS: this question cannot be understood out of context, so I suggested folding it into the previous Answer. Also, is this still the state of the art?

Q: When we send email to people in each country with their data – do we need to convert the Unicode data coming from the database into each individual charset?

Although all modern browsers and email programs will handle UTF-8, some people may be using emailers that do not handle UTF-8. Since unlike HTTP there is no handshake to determine what charsets the email program will accept, at this point in time you probably do need to translate the charset to one that is specific for the user. If you retain the character set in which the user corresponded with you, you can use that. Otherwise you can use one of the common character sets used in email in the user’s country of origin, if you know that.
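The fallback logic described above can be sketched as follows; the function name and the order of preference (user's last-seen charset, then a per-country default, then UTF-8) are assumptions for illustration:

```python
def encode_for_email(text, preferred=None, country_default=None):
    """Try the recipient's own charset first, then a country default,
    then UTF-8 (which always succeeds)."""
    for charset in filter(None, (preferred, country_default, "utf-8")):
        try:
            return text.encode(charset), charset
        except UnicodeEncodeError:
            continue  # this charset can't represent the text; try the next

body, used = encode_for_email("Grüße", preferred="iso-8859-1")
print(used)  # iso-8859-1

body, used = encode_for_email("Привет", preferred="iso-8859-1",
                              country_default="koi8-r")
print(used)  # koi8-r  (Latin-1 can't encode Cyrillic, so it falls through)
```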

 

Q: I'm worried about the extra size that my web pages will take if they are encoded in UTF-8. Won't some languages be at a disadvantage?

Are people still worried about this??

As far as size goes, it is worthwhile looking at some real data samples. The following are from a page on the Unicode site that was translated into different languages, so it had essentially the same information on each page.

Size    Page

8882    s-chinese.html
8946    t-chinese.html
9347    esperanto.html
9498    maltese.html
9739    icelandic.html
9833    czech.html
9944    welsh.html
10064   danish.html
10109   swedish.html
10127   polish.html
10219   interlingua.html
10221   italian.html
10297   spanish.html
10308   portuguese.html
10312   lithuanian.html
10329   german.html
10376   romanian.html
10401   korean.html
10506   french.html
10726   japanese.html
10953   hebrew.html
11192   arabic.html
13292   greek.html
13870   russian.html
13892   persian.html
14549   hindi.html
15337   georgian.html
15853   deseret.html

So the best case is about 50% of the worst case. Some of this is due to the encoding, and some is due to different languages simply using different numbers of characters. However, when you look at web pages in general use, the amount of text (in bytes) is swamped by graphics, JavaScript, HTML markup, and so on. So fundamentally, even the variations above are not that important in practice.
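The per-script part of the size variation comes from UTF-8's variable-length encoding: 1 byte for ASCII, 2 for most alphabetic scripts such as Cyrillic, 3 for most Indic and CJK characters. A quick Python check (the sample words are illustrative):

```python
# Bytes-per-character in UTF-8 varies by script.
samples = {
    "english": "page",      # ASCII: 1 byte per character
    "russian": "страница",  # Cyrillic: 2 bytes per character
    "hindi": "पृष्ठ",          # Devanagari: 3 bytes per character
}
for lang, word in samples.items():
    encoded = word.encode("utf-8")
    print(f"{lang}: {len(word)} chars, {len(encoded)} bytes")
```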

Q: Where can I find out more about using Unicode on the Web?

Perhaps point to https://www.w3.org/International/techniques/authoring-html#charset ?

The W3C maintains FAQs and HTML authoring guidelines under the auspices of the Internationalization Working Group: http://www.w3.org/International/core. You can also find information there about subscribing to lists that specialize in answering questions about Web technology and Unicode. [AP]