- From: Pete Cordell <petexmldev@codalogic.com>
- Date: Mon, 10 Jan 2011 09:57:43 -0000
- To: <xmlschema-dev@w3.org>
Liam mentioned below:

> Twitter's limit of 140 bytes is much shorter when you're typing in
> Chinese, and most characters use 2, 3 or 4 bytes each...

I've previously assumed that languages such as Chinese generally require fewer characters than European languages do to say what they want. Therefore, although UTF-8 requires more bytes per character, as fewer characters are required, you can express in Chinese much the same as you can in English (for example) in the same number of bytes.

Is there any validity to this hypothesis, and does anybody have any experience of this?

Thanks,

Pete Cordell
Codalogic Ltd
Interface XML to C++ the easy way using C++ XML data binding to convert XSD schemas to C++ classes.
Visit http://codalogic.com/lmx/ or http://www.xml2cpp.com for more info

----- Original Message -----
From: "Liam R E Quin" <liam@w3.org>
To: "Costello, Roger L." <costello@mitre.org>
Cc: <xmlschema-dev@w3.org>
Sent: Monday, January 03, 2011 11:34 PM
Subject: RE: Express length constraints in a regex or use maxLength and minLength?

On Mon, 2011-01-03 at 15:25 -0500, Costello, Roger L. wrote:
> I did an investigation into the set of characters that are used in
> English family names and the length of English family names. I found
> that 99.999% of all English family names are no longer than 100
> characters.

So now you ask yourself about the cost of processing one in a thousand names by hand.

When I lived in Boston, Verizon was unable to give me Internet service because of "a problem with your name" they said. I have two middle initials, and their application form didn't allow this, which meant that my name in their database didn't match the name on my credit card.

When I lived in the UK, if a firm sent you a bill/invoice with your name incorrectly spelled, you didn't have to pay it. So there were people who received electricity bills with their name hand-corrected (e.g., famously at the time, a Mr. 9smith, with a digit in his name, which he did because he knew this and was a COBOL programmer... for a long time the electricity company was unable to bill him, and he had free electricity, until they started writing the bill out by hand).

> If I don't impose a limit then the risk of getting unwanted/malicious
> values increases

What evidence do you have for this assertion?

> By identifying the operational restrictions and incorporating them
> into my XML Schema I reduce risk and lower costs. Do you agree?

Not if you get it wrong. Mike Kay's 'phone number example is another common one.

I was once unable to buy a ticket for a flight because the airline's Web site checked that your US ZIP code matched the city for your credit card billing address... and their page for Canada still required a postal code to have all digits, when in fact they contain letters. I was over an hour on the 'phone with them and in the end they managed to find an old hand credit-card swiping machine to process the order! Later they fixed it, and wrote to me to say so, but it's reasonable to imagine that other people had tried and had simply given up and booked with a different airline.

UK postal addresses tend to be longer than US ones, but US and Canadian street numbers are often much higher; a UK form that allowed street numbers to have 4 digits would seem plenty, but I've seen 5-digit street numbers (12,314 Main Street) here.

Twitter's limit of 140 bytes is much shorter when you're typing in Chinese, and most characters use 2, 3 or 4 bytes each...

I can tell you, though, that most hotel booking systems use 15 characters for a last name; I know this because of a colleague with a sixteen-character last name.

I once wrote a database that stored people's names, and, although it did not impose length limits, it did check they were alphabetic. Then I had to deal with someone whose name had | in it... the character I'd used as a field separator.
And that's how I learned about regional ASCII (ISO 646) variants in which e.g. \ | [ { ] } are letters.

So, in general, it's a cost/benefit analysis - what are the costs of being unable to represent some people's names, addresses, phone numbers, and how much does it cost to be able to represent them?

If you are an emergency service, it might cost lives. If you're a utility company, you might have a legal obligation to provide service, and have to find a way to deal with it, even if it's a pile of invoices someone writes out by hand each month. If you're a business, you might lose the occasional customer and you might be willing to accept that.

As for the regular expression vs maxLength, use maxLength of course, because it's clearer. Use a regular expression if you want to validate the content, e.g. to make sure it's all alphabetic (but remember names with spaces in them, like de Fontaine, or von Roessler, or with an apostrophe, d’Artagne, or a hyphen, fforbes-Hamilton, yes with ff at the start and not F)... and for 'phone numbers leave room to write, "ask for extension 46 and if there's no reply tell them to page me."

Liam

--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org
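
To make the two technical points in this thread concrete, here is a minimal Python sketch, not a definitive implementation. It keeps the length check separate from the content check, as the maxLength advice suggests, and shows how a character count differs from a UTF-8 byte count. The limit of 100, the pattern, and the function names are assumptions chosen for illustration; none of them come from the thread. The pattern is deliberately loose: roughly "letters, with single spaces, apostrophes or hyphens between them".

```python
import re

# Hypothetical limit, echoing the 100-character figure quoted in the thread.
MAX_NAME_CHARS = 100

# Hypothetical pattern: runs of letters joined by one space, apostrophe
# (ASCII ' or typographic U+2019) or hyphen. [^\W\d_] means "a word
# character that is not a digit or underscore", i.e. roughly a letter.
NAME_PATTERN = re.compile(r"[^\W\d_]+(?:[ '\u2019-][^\W\d_]+)*")


def char_and_byte_lengths(text):
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


def check_name(name):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    # Length check, counted in characters rather than bytes.
    if len(name) > MAX_NAME_CHARS:
        problems.append("too long")
    # Content check, kept separate from the length check.
    if not NAME_PATTERN.fullmatch(name):
        problems.append("unexpected characters")
    return problems


if __name__ == "__main__":
    for sample in ["hello", "\u4f60\u597d"]:
        print(sample, char_and_byte_lengths(sample))
    for sample in ["de Fontaine", "fforbes-Hamilton", "9smith"]:
        print(sample, check_name(sample))
```

With this pattern the names mentioned above (de Fontaine, von Roessler, d’Artagne, fforbes-Hamilton) pass, while 9smith is rejected, which is exactly the kind of rejection the cost/benefit question is about. Note that a limit expressed in characters and one expressed in bytes are different constraints: the two-character string in the sketch occupies six bytes in UTF-8.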
Received on Monday, 10 January 2011 09:58:22 UTC