[whatwg] finding a number... from Ian Hickson on 2008-05-07 (public-whatwg-archive@w3.org from May 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 7 May 2008 02:16:49 +0000 (UTC)
Message-ID: <Pine.LNX.4.62.0805070201210.21650@hixie.dreamhostps.com>
On Tue, 12 Dec 2006, Charles McCathieNevile wrote:
> 
> why does finding a number in text [1] insist on "." as a decimal 
> separator, when "," is also very commonly used?
> 
> [1] http://www.whatwg.org/specs/web-apps/current-work/#steps

The short answer is that we had to pick something, and "." was consistent 
with the attribute syntax, CSS, and JS. (The idea is that CSS can be used 
(once number formatting is supported) to render the number as desired, 
though of course that won't be relevant for <meter> and <progress> since 
they are coming first.) In practice, anyway, attributes can be used to 
disambiguate.

Notice that the spec doesn't really support English either -- in English 
large numbers have thousands separated by commas, and the spec doesn't 
support that at all.


On Tue, 12 Dec 2006, Henri Sivonen wrote:
> 
> I think the format should be kept simple (and potentially politically 
> incorrect), because the human-readability is only a legacy fallback 
> issue. That is, users aren't exposed to the number formatting in UAs 
> that actually implement progress bars and gauges.

Indeed.


On Tue, 12 Dec 2006, Anne van Kesteren wrote:
> 
> You might also want to use this algorithm for the proposed 
> class="price". In that case you really want to take into account "," as 
> well.

class="price" is gone; q.v. the currency Microformat.


On Tue, 12 Dec 2006, Henri Sivonen wrote:
> 
> What would 2,500 mean? Would it mean two and a half or two-thousand- 
> five-hundred?

It actually means 2 out of 500. :-)
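
To make that concrete, here is a rough sketch of a "." only number scan.
It is an approximation for illustration, not the spec's algorithm: the
name find_numbers and the regular expression are assumptions. A number is
taken to be a run of ASCII digits with an optional "." fraction, and any
other character (including ",") ends it.

```python
import re

# Assumed approximation: ASCII digits with an optional "." fraction.
# A "," is not part of a number, so it splits "2,500" into two numbers.
NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?")

def find_numbers(text):
    """Return every number found in text, in order, as floats."""
    return [float(m) for m in NUMBER.findall(text)]

print(find_numbers("2,500"))  # two numbers: 2 and 500
print(find_numbers("2.5"))    # one number: 2.5
print(find_numbers("12%"))    # one number: 12
```

Under these assumptions "2,500" yields two numbers, which is what lets it
be read as a ratio, while "12%" yields the same single number whatever
the language of the surrounding text.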


On Wed, 13 Dec 2006, Charles McCathieNevile wrote:
> 
> It can't. But why bother making a standard that so clearly fails to work 
> in major world languages? Everything should be as simple as possible 
> *and no simpler* - this is too simple. Maybe assuming you can parse 
> numbers out of text is just a dumb idea as a normative part of a spec.

It's only supported as an easy way of doing fallback. In practice I would 
expect people to use integers less than 1000, so the issue doesn't even 
arise.


On Wed, 13 Dec 2006, Charles McCathieNevile wrote:
> > 
> > The attributes always work for any language. For English, the 
> > textContent works as a *bonus*. It isn't that the spec fails to work 
> > for non-English. It is just that a particular *redundant* bonus 
> > feature doesn't work for non-English.
> 
> The problem with this is that it means copying code the natural way 
> doesn't work for some non-english speakers, and they have to read the 
> spec or guess why. And that you therefore aren't really handling the Web 
> as people actually write it, just some part of it.

Yup.


On Wed, 13 Dec 2006, Mikko Rantalainen wrote:
> 
> Perhaps the parser could be specified as follows:
> 
> regexp for "numeric value" is [0-9 ,.]
> scan the numeric value backwards from end
> first character matching regexp [,.] is the decimal separator
> 
> This would correctly interpret numbers such as
> 
> 1,251,152.124
> 634.46
> 453.436.346,235
> 23 236 435 123,121
> 
> It would fail for numbers such as
> 
> 1,234,456.789,012
> 1.234.456,789.012
> 
> but are such formats used in any locale?

That's not a bad idea... Bit complicated though...
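
For reference, a minimal sketch of the heuristic as I read it. The
function name parse_last_separator and the handling of a leading or
trailing separator are my assumptions; they are not part of the proposal.

```python
def parse_last_separator(text):
    """Treat the last ',' or '.' in text as the decimal separator."""
    # Keep only digits and the two candidate separators (drops spaces).
    s = "".join(c for c in text if c in "0123456789,.")
    # Scanning backwards from the end, the first separator found is the
    # one with the highest index.
    pos = max(s.rfind(","), s.rfind("."))
    if pos == -1:
        return float(s)  # no separator at all: a plain integer
    # Every earlier separator is assumed to be a grouping separator.
    whole = s[:pos].replace(",", "").replace(".", "")
    frac = s[pos + 1:]
    return float((whole or "0") + "." + (frac or "0"))

for sample in ["1,251,152.124", "634.46", "453.436.346,235",
               "23 236 435 123,121", "1,234"]:
    print(sample, "->", parse_last_separator(sample))
```

Note the last sample: with this reading, a lone separator is always the
decimal separator, so "1,234" comes out as 1.234.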


On Wed, 13 Dec 2006, J. King wrote:
> 
> It would also fail for integers, which are bound to be very common.

Why would it fail for integers?


On Wed, 13 Dec 2006, Charles McCathieNevile wrote:
> 
> Of course there are a handful of other types of numbers. One thing that 
> is helpful is that in hebrew and arabic, numbers are written LTR even 
> though the rest of the text isn't. I am not sure about other RTL 
> languages - apparently there are a couple of Indic ones. On the other 
> hand, since I am going to meet a handful of people this weekend who 
> specialise in publishing for the Indian government, in at least their 22 
> constitutionally official languages, I will try to remember to ask. One 
> thing that is unhelpful is that in some languages numbers are written 
> using ordinary letters. Although I suspect this use is very rare on the 
> web, as I believe it is pretty much archaic in the relevant languages.
> 
> This is, of course, going down the path of specifying internationalised 
> number picking - something that some people are just dead against.

Did you learn anything relevant to this discussion from that weekend? (Or 
any other weekend. :-) )


On Wed, 13 Dec 2006, Thomas Broyer wrote:
> 
> If you start using HTML5 new features, why not use them "fully"? I think 
> the rationale behind the 'textContent' stuff is that you then have only 
> one place to edit when you change the <meter> or <progress> values; so 
> that you never have a <progress> showing 25% while the "fallback text 
> content" says it is now 75% (or the opposite).

Right.


> Note that the <meter> description reads (as an "Authoring requirements" !!!):
>
> "The recommended way of giving the value is to include it as contents of 
> the element, either as two numbers (the higher number represents the 
> maximum, the other number the current value), or as a percentage or 
> similar (using one of the characters such as "%"), or as a fraction."
>
> And the <progress> description reads:
>
> "Instead of using the attributes, authors are recommended to simply 
> include the current value and the maximum value inline as text inside 
> the element."
> 
> I fully understand the rationale behind this (ease of editing - 
> changing the value once, at a single place - means ease of 
> adoption), but I18N is important too.

Sure, but when would "12%" not work for non-English cases?


> Having mandatory attributes also makes writing parsers easier, because 
> you no longer need the "steps for finding one or two numbers of a ratio 
> in a string" part ;-)

Are they that big a problem?


> And how about "vulgar fraction characters" like ¼ (U+00BC VULGAR 
> FRACTION ONE QUARTER) and other numeric characters from the ０ (U+FF10 
> FULLWIDTH DIGIT ZERO) to ９ (U+FF19 FULLWIDTH DIGIT NINE) or Ⅰ (U+2160 
> ROMAN NUMERAL ONE) to ↂ (U+2182 ROMAN NUMERAL TEN THOUSAND) ranges?

What about them?
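
For anyone wanting to see what the Unicode database says about those
characters, a small sketch using Python's standard unicodedata module.
The helper name describe is made up for illustration; nothing like it is
in the spec.

```python
import unicodedata

def describe(ch):
    """Return (decimal digit, numeric value) for ch, None where undefined."""
    return unicodedata.decimal(ch, None), unicodedata.numeric(ch, None)

# Code points taken from the message quoted above.
for cp in (0x00BC, 0xFF10, 0xFF19, 0x2160, 0x2182):
    ch = chr(cp)
    print("U+%04X" % cp, unicodedata.name(ch), describe(ch))
```

The fullwidth digits carry a decimal digit value, so they could be
folded into a base-10 parse. The vulgar fraction and the Roman numerals
only have a numeric value, so they would need separate rules.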


> I don't know which is better (parsing only "English-form numbers" out of 
> textContent, or having attributes mandatory), but the two 
> recommendations towards using textContent to convey the values, 
> in preference to using attributes, are only relevant to English 
> documents/sections (or actually any language with the same number 
> formatting rules).
> 
> What I know is that I don't want language-dependent number parsing to 
> appear in the spec ;-)

I think for the common cases (small integers) the current text works in 
all languages. That it works in mildly more complex cases in English is a 
minor bonus.

On Wed, 13 Dec 2006, Michel Fortin wrote:
> 
> I find this recommendation misguided too. In fact, it should say the 
> contrary: it should say that it is recommended to also put the numbers 
> in attributes for non-English documents because the parser is only 
> capable of handling the English format. That may sound a little dumb, but 
> that's because it is, and I fully support making the attributes 
> mandatory.
> 
> At the very least, I think find-a-number should be limited to integer 
> numbers. But even then it'll be limited to languages using our western 
> digits and possibly not suitable for full internationalization.

The problem with putting values in attributes is that it breaks legacy 
UAs. I would guess the "best practice" recommendation would change once 
UAs that support these features are widely used.


On Wed, 13 Dec 2006, Henri Sivonen wrote:
> 
> Actually, with proper Unicode libraries it would be reasonably easy to 
> handle any base-10 integers.
> 
> Such a scheme could be extended to numbers with a fractional decimal 
> part if the decimal separator was given in an attribute explicitly 
> (defaulting to "."). Thousand separators could be dealt with by skipping 
> certain character classes.
> 
> It is reasonable to expect desktop and server software to have access to 
> the Unicode database (e.g. by bundling ICU if there isn't a proper 
> platform API). However, JScript-based emulation layers on top of IE and 
> mobile software may not have access to the Unicode database.

I don't really see much point in going out of our way to support more 
complex numbers.
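
For the record, the scheme Henri describes might look roughly like the
sketch below. Everything specific in it is an assumption made for
illustration: the name parse_decimal, the decimal_sep parameter standing
in for the proposed attribute, and the set of characters skipped as
grouping separators.

```python
import unicodedata

def parse_decimal(text, decimal_sep="."):
    """Parse a base-10 number written with any Unicode decimal digits."""
    whole, frac = [], []
    seen_sep = False
    for ch in text:
        digit = unicodedata.decimal(ch, None)  # None if not a decimal digit
        if digit is not None:
            (frac if seen_sep else whole).append(str(digit))
        elif ch == decimal_sep:
            if seen_sep:
                raise ValueError("more than one decimal separator")
            seen_sep = True
        elif ch in ",.'" or unicodedata.category(ch) == "Zs":
            continue  # assumed grouping separators: skipped
        else:
            raise ValueError("unexpected character: %r" % ch)
    if not whole and not frac:
        raise ValueError("no digits found")
    return float("".join(whole or ["0"]) + "." + "".join(frac or ["0"]))

print(parse_decimal("1,251,152.124"))
print(parse_decimal("453.436.346,235", decimal_sep=","))
print(parse_decimal("\u0661\u0662\u0663"))  # Arabic-Indic digits 1, 2, 3
```

Because the separator is given explicitly, "1,234" and "1.234" are no
longer ambiguous, but this does rely on the Unicode database being
available.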

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 6 May 2008 19:16:49 UTC