
UTF-8 and Chinese

From: Pete Cordell <petexmldev@codalogic.com>
Date: Mon, 10 Jan 2011 09:57:43 -0000
Message-ID: <D1248BCD76ED464495B507B48AB05341@Codalogic>
To: <xmlschema-dev@w3.org>
Liam mentioned below:

> Twitter's limit of 140 bytes is much shorter when you're typing in
> Chinese, and most characters use 2, 3 or 4 bytes each...

I've previously assumed that languages such as Chinese generally require 
fewer characters than European languages do to say what they want. 
Therefore, although UTF-8 requires more bytes per character, as fewer 
characters are required, you can express in Chinese much the same as you can 
in English (for example) in the same number of bytes.

Is there any validity to this hypothesis, and does anybody have any 
experience of this?
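
One way to test the hypothesis is to measure it. The sketch below compares code point counts with UTF-8 byte counts for a pair of roughly equivalent phrases. The helper name utf8_cost and the sample phrases are my own assumptions, chosen only for illustration, and a single pair says nothing about typical text. Note also that XSD's maxLength on xs:string counts characters, not bytes, so the two limits behave differently.

```python
def utf8_cost(text):
    """Return (code points, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


# Illustrative phrases only (assumed, not taken from the thread).
samples = {
    "English": "Hello, world",
    "Chinese": "你好，世界",
}

for label, text in samples.items():
    chars, size = utf8_cost(text)
    print(f"{label}: {chars} characters, {size} bytes")
```

Running the same function over a larger parallel corpus would give a fairer picture than one greeting.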

Thanks,

Pete Cordell
Codalogic Ltd
Interface XML to C++ the easy way using C++ XML
data binding to convert XSD schemas to C++ classes.
Visit http://codalogic.com/lmx/ or http://www.xml2cpp.com
for more info
----- Original Message ----- 
From: "Liam R E Quin" <liam@w3.org>
To: "Costello, Roger L." <costello@mitre.org>
Cc: <xmlschema-dev@w3.org>
Sent: Monday, January 03, 2011 11:34 PM
Subject: RE: Express length constraints in a regex or use maxLength and 
minLength?



On Mon, 2011-01-03 at 15:25 -0500, Costello, Roger L. wrote:

> I did an investigation into the set of characters that are used in
> English family names and the length of English family names. I found
> that 99.999% of all English family names are no longer than 100
> characters.

So now you ask yourself about the cost of processing one in a thousand
names by hand.

When I lived in Boston, Verizon was unable to give me Internet service
because of "a problem with your name," they said.  I have two middle
initials, and their application form didn't allow this, which meant that
my name in their database didn't match the name on my credit card.

When I lived in the UK, if a firm sent you a bill/invoice with your name
incorrectly spelled, you didn't have to pay it.  So there were people
who received electricity bills with their name hand-corrected (e.g.,
famously at the time, a Mr. 9smith, with a digit in his name, which he
did because he knew this and was a COBOL programmer... for a long time
the electricity company was unable to bill him, and he had free
electricity, until they started writing the bill out by hand).

>  If I don't impose a limit then the risk of getting unwanted/malicious
> values increases

What evidence do you have for this assertion?

> By identifying the operational restrictions and incorporating them
> into my XML Schema I reduce risk and lower costs. Do you agree?

Not if you get it wrong.

Mike Kay's 'phone number example is another common one.

I was once unable to buy a ticket for a flight because the airline's Web
site checked that your US ZIP code matched the city for your credit card
billing address... and their page for Canada still required a postal
code to have all digits, when in fact they contain letters. I was over
an hour on the 'phone with them and in the end they managed to find an
old hand credit-card swiping machine to process the order!  Later they
fixed it, and wrote to me to say so, but it's reasonable to imagine that
other people had tried and had simply given up and booked with a
different airline.

UK postal addresses tend to be longer than US ones, but US and Canadian
street numbers are often much higher; a UK form that allowed street
numbers to have 4 digits would seem plenty, but I've seen 5-digit street
numbers (12,314 Main street) here.

Twitter's limit of 140 bytes is much shorter when you're typing in
Chinese, and most characters use 2, 3 or 4 bytes each...

I can tell you, though, that most hotel booking systems use 15
characters for a last name; I know this because of a colleague with a
sixteen-character last name.

I once wrote a database that stored people's names, and, although it did
not impose length limits, it did check they were alphabetic. Then I had
to deal with someone whose name had | in it... the character I'd used as
a field separator.  And that's how I learned about regional ASCII
(ISO646) variants in which e.g. \ | [ { ] } are letters.
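
A sketch of one way to avoid that kind of collision, assuming a |-delimited record format like the one described: let a quoting-aware writer handle any field that contains the separator, rather than forbidding the character. The function names and the sample name "Sm|th" are hypothetical, and this is not a description of the original database.

```python
import csv
import io


def encode_record(fields):
    """Serialize fields as one |-delimited line, quoting where needed."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, delimiter="|").writerow(fields)
    return buffer.getvalue()


def decode_record(line):
    """Parse one |-delimited line back into its fields."""
    return next(csv.reader(io.StringIO(line, newline=""), delimiter="|"))


# A name containing the separator survives the round trip.
record = ["Sm|th", "Boston"]
line = encode_record(record)
print(repr(line))
print(decode_record(line))
```

The csv module quotes only the fields that need it, so ordinary names are stored unchanged.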

So, in general, it's a cost/benefit analysis - what are the costs of
being unable to represent some people's names, addresses, phone numbers,
and how much does it cost to be able to represent them?  If you are an
emergency service, it might cost lives. If you're a utility company, you
might have a legal obligation to provide service, and have to find a way
to deal with it, even if it's a pile of invoices someone writes out by
hand each month. If you're a business, you might lose the occasional
customer and you might be willing to accept that.

As for the regular expression vs maxLength, use maxLength of course,
because it's clearer.  Use a regular expression if you want to validate
the content, e.g. to make sure it's all alphabetic (but remember names
with spaces in them, like de Fontaine, or von Roessler, or with an
apostrophe, d’Artagne, or a hyphen, fforbes-Hamilton, yes with ff at the
start and not F)... and for 'phone numbers leave room to write, "ask for
extension 46 and if there's no reply tell them to page me."
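
As a rough sketch of keeping the two checks separate, the Python below applies a length limit and a content pattern independently, much as maxLength and a pattern facet would sit side by side in a schema. The names MAX_LENGTH, NAME_PATTERN and check_name are assumptions of mine, the limit of 100 simply echoes the figure quoted earlier in the thread, and the pattern is only approximately "letters separated by a space, hyphen or apostrophe".

```python
import re

MAX_LENGTH = 100  # assumed limit, echoing the 100-character figure above

# Runs of letters (any script, roughly) joined by a single space,
# hyphen, or apostrophe (ASCII ' or typographic ’).
NAME_PATTERN = re.compile(r"[^\W\d_]+(?:[ '’\-][^\W\d_]+)*")


def check_name(name):
    """Return a list of problems found; an empty list means accepted."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("too long")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("unexpected characters")
    return problems


for name in ["de Fontaine", "von Roessler", "d’Artagne",
             "fforbes-Hamilton", "9smith"]:
    print(name, check_name(name))
```

Even this permissive pattern turns away Mr. 9smith, which is the cost/benefit question again: each rule added is another way to reject a real name.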

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org
Received on Monday, 10 January 2011 09:58:22 GMT
