Re: Unicode conference papers

Some quick answers.

On 11/21/06, Richard Ishida <ishida@w3.org> wrote:
>
> Hi Mark,
>
> Thanks for making these available.  Would it be possible to add PDF
> versions of the slides?


Possible, but way down on my todo list ;-)

> I have a couple of questions about Unicode at Google:
>
> 1. could you explain slide 17 a little (Queries vs. pages)?   What
> quantity does the y axis represent in each case?


The Y axis is a log scale of counts of characters in the respective scripts.
The scale for query characters and web page characters is different; the
values are chosen to normalize the result, so that a comparison can be made
between the relative frequency of characters in queries vs pages across
scripts.

> 2. what is doubly-encoded utf-8?


It is where someone converts text (say Latin-1) to UTF-8, then takes the
result and converts it again (as if it were Latin-1) to UTF-8. A
surprisingly common occurrence (the Yahoo talks, which were quite good, also
mentioned it).
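
To make that concrete, here is a rough Python sketch of how the double
encoding comes about and how it can sometimes be undone. The function names
are made up for illustration, and the repair only works if the text really
went through exactly this Latin-1 misreading; it is not a description of
what any particular system does.

```python
def double_encode(text: str) -> bytes:
    """Reproduce the mistake: encode to UTF-8, misread as Latin-1, encode again."""
    utf8_once = text.encode("utf-8")        # correct first conversion
    misread = utf8_once.decode("latin-1")   # the bug: UTF-8 bytes treated as Latin-1
    return misread.encode("utf-8")          # second conversion to UTF-8


def undo_double_encoding(data: bytes) -> str:
    """Reverse the steps above. Raises an error if data was not doubly encoded this way."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    broken = double_encode("café")
    print(broken)                        # the e-acute is now four bytes instead of two
    print(undo_double_encoding(broken))  # recovers the original string
```

The telltale sign in the decoded text is a run like "Ã©" where a single
accented letter was expected.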

> 3. Slide 20 (Charset tagging trends) seems to indicate that around 72% of
> HTML pages now contain encoding declarations in the meta tag.  Is that
> correct? (eg. Is the declaration for some pages in the xml
> declaration?)  That seems like a high number (though I'm not complaining).


It surprised us as well. I'm not sure it is reason for celebration quite
yet, since adding encoding declarations only helps if they are more accurate
;-)

> I'm surprised that the HTTP header isn't at least as high, though, since I'd
> have thought that many servers are set up to serve a default encoding.  Do
> you have any explanation for that result?


No explanation for the result; this is just what we see.

> 4. It would be interesting to know what proportion of character encodings
> and language declarations shown are considered to be incorrect (presumably
> the graphs alluded to in question 3 include those).


There is no one way to measure this. One minimal measure of correctness is
whether the conversion encounters any errors, or any of the resulting
characters are unassigned. But this is a lowball estimate, since it doesn't
include cases where there are no formal errors, but the result is garbage
(e.g., misidentifying Latin-3 as Latin-1).
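
As a rough illustration of that minimal measure, here is a Python sketch.
The function name is hypothetical, and using the general category "Cn" from
the unicodedata module as the test for "unassigned" is my assumption for the
example; the result also depends on the Unicode version your Python ships
with.

```python
import unicodedata


def passes_minimal_check(data: bytes, declared_encoding: str) -> bool:
    """Return True if data decodes without error under the declared encoding
    and none of the resulting characters is unassigned (category Cn)."""
    try:
        text = data.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # Conversion error, or the declared encoding name is unknown.
        return False
    return all(unicodedata.category(ch) != "Cn" for ch in text)


if __name__ == "__main__":
    # Invalid UTF-8 is caught by the check.
    print(passes_minimal_check(b"\xff", "utf-8"))
    # Latin-3 bytes mislabeled as Latin-1 still pass, although the text is wrong.
    latin3_bytes = "ĝ".encode("iso8859_3")
    print(passes_minimal_check(latin3_bytes, "latin-1"))
    print(latin3_bytes.decode("latin-1"))
```

The second case is exactly why this is a lowball estimate: every byte is a
valid Latin-1 character, so the check passes even though the decoded letter
is not the one the author wrote.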

> Cheers,
> RI
>
>
>
> ============
> Richard Ishida
> Internationalization Lead
> W3C (World Wide Web Consortium)
>
> http://www.w3.org/People/Ishida/
> http://www.w3.org/International/
> http://people.w3.org/rishida/blog/
> http://www.flickr.com/photos/ishida/
>
>
>
>
>
> ________________________________
>
>         From: unicode-bounce@unicode.org [mailto:
> unicode-bounce@unicode.org] On Behalf Of Mark Davis
>         Sent: 21 November 2006 02:44
>         To: Unicode
>         Subject: Unicode conference papers
>
>
>         A few people asked about getting my slides from last week's
> conference. I posted them on my site, at http://macchiato.com :
>
>
>         *       Unicode at Google
>         *       Globalization News
>
>         Mark
>
>
>
>

Received on Tuesday, 21 November 2006 23:33:19 UTC