[duerst@w3.org: Re: Privacy Watchdog and internationalization]

------- Start of forwarded message -------
Return-Path: <duerst@w3.org>
X-Sender: duerst@sh.w3.mag.keio.ac.jp
Date: Sun, 25 Oct 1998 15:58:53 +0900
To: "Rolf H. Nelson" <rnelson@tux.w3.org>
From: "Martin J. Duerst" <duerst@w3.org>
Subject: Re: Privacy Watchdog and internationalization
In-Reply-To: <199810141337.JAA11596@tux.w3.org>
Content-Type: text/plain; charset="us-ascii"

Hello Rolf,

I have read your paper on Privacy Watchdog. I'm not really sure
what to think about the overall approach, but then I'm not a
specialist in privacy issues. A success rate of 50% strikes me
as both rather low to be satisfactory and fairly easy to improve
on quite a bit.

Also, at some points I had difficulty understanding whom the
paper was written for. In some places it looked like a spec for
an implementor or an experiment design report for a grant
proposal; in other places it read like a research paper for
a conference. But that's probably because you don't yet know
what it will end up as.

For i18n, your paper looks quite good in that you regularly
mention the issue of different languages at the appropriate
places. Actually doing some internationalization would of course
be very nice, and I think it should not be too difficult.

As the whole thing is heuristics anyway, I guess the easiest
trick would be to just add the equivalents of "name", "first name",
"last name", and so on, in other languages to the list of things
the program searches for. For a reasonable set of five or ten
European languages, the chance that one of these words has an
actual (and different) meaning in one of the other languages is
still very low. The contribution of such collisions to the overall
failure rate would probably be minimal, and it would be extremely
easy to extend the program, because only a list has to be extended;
no additional mechanisms are needed.
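To make this concrete, here is a rough sketch of the idea in
Python (the function name and the word lists are mine, purely
for illustration, and far from complete):

    # Sketch of the "just extend the word list" idea.
    # Word lists are illustrative, not exhaustive.
    FIELD_NAME_WORDS = {
        "name",                     # English (also German "Name")
        "first name", "last name",  # English
        "nom", "prenom",            # French (accents omitted here)
        "vorname", "nachname",      # German
        "nombre", "apellido",       # Spanish
    }

    def find_name_fields(page_text):
        # Return the field-name keywords occurring in the page text,
        # case-insensitively.
        text = page_text.lower()
        return sorted(word for word in FIELD_NAME_WORDS if word in text)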

Even for non-Western languages, things probably wouldn't get much
worse. What comes in, in addition to language, is character encoding.
For Western Europe, Tim started out with iso-8859-1, so this
area is quite nicely uniform, but most other parts of the world
didn't get that message. In Japan, for example, three encodings
are customarily used in parallel. The easiest way to deal with
this is to have the search engine work on bytes (which it probably
already does, although nobody sees it that way) and, for example,
for the Japanese word for "name", just generate the byte sequence
in every encoding and add all three to the list of things that get
checked. Because many such encodings use the higher part of the
byte range (8-bit), collisions with Latin-based languages are rare.
One additional complication might be that e.g. Japanese does
not separate words, so you may have to call a slightly different
search function.
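A rough sketch of the byte-pattern generation, again in Python
(the helper names are my own; the codec names are Python's labels
for the three Japanese encodings in question):

    # Generate the byte patterns for the Japanese word for "name"
    # (namae) in each encoding, so the byte-level search matches
    # however the page happens to be encoded.
    NAMAE = "\u540d\u524d"  # the two kanji of "namae"

    def encode_pattern(word, encoding):
        data = word.encode(encoding)
        if encoding == "iso2022_jp":
            # The encoder wraps the text in mode-switching escape
            # sequences; for substring searching we want only the
            # inner bytes.
            data = data.removeprefix(b"\x1b$B").removesuffix(b"\x1b(B")
        return data

    JAPANESE_PATTERNS = [
        encode_pattern(NAMAE, enc)
        for enc in ("iso2022_jp", "shift_jis", "euc_jp")
    ]

    def page_mentions_namae(page_bytes):
        # A pure byte search; the page itself is never decoded.
        return any(p in page_bytes for p in JAPANESE_PATTERNS)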

What you should also add to the search are more equivalents
of the same thing in the same language. For example, the term
"first name" is very popular in the US, but because different
parts of the world arrange names in different orders, it's somewhat
inappropriate in an international context. Usually, "given name"
is suggested instead (I think our meeting registration forms have
also now been updated to this), so you should check for that too.

After adding more and more languages, it may at some point turn
out that there are too many false positives. At that point, some
more refined strategy may have to be introduced, but I don't think
it's worth doing that from the start for a test implementation.
Refinements are possible in various directions:

- Based on the Accept-Language header of the request (see the
  sketch after this list)
- Based on the Content-Language header of the page
  [both of these are not very reliable, because language
   negotiation, and therefore these headers, are not used
   that much]
- Based on heuristics for detecting the language and encoding
  of the page: such heuristics can be very efficient, but I don't
  know of any freely available source for them.
- Based on heuristics over more than one page that the user sees
  (e.g. collecting all the Content-Language headers, where they
   are present, to get an idea of the languages the user
   understands; also sketched below)
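To make the two header-based refinements concrete, here is a
rough sketch (the names are again my own invention, just for
illustration):

    from collections import Counter

    def parse_accept_language(header):
        # Parse e.g. "ja, en;q=0.8, de;q=0.5" into a list of
        # language tags ordered by the user's stated preference.
        ranked = []
        for part in header.split(","):
            fields = part.strip().split(";")
            lang = fields[0].strip().lower()
            q = 1.0
            for field in fields[1:]:
                field = field.strip()
                if field.startswith("q="):
                    q = float(field[2:])
            ranked.append((q, lang))
        return [lang for q, lang in sorted(ranked, reverse=True)]

    class LanguageGuesser:
        # Collect Content-Language headers over many pages to build
        # up a picture of which languages the user actually reads.
        def __init__(self):
            self.seen = Counter()

        def observe(self, content_language):
            if content_language:
                self.seen[content_language.strip().lower()] += 1

        def likely_languages(self):
            return [lang for lang, count in self.seen.most_common()]

The result of parse_accept_language could also serve as the
initial entry for a user-configured language list, as suggested
next.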

It may of course also be possible to let the user configure the
setup with the languages she thinks she will look at. In that case,
the initial entry for the setup should be taken from Accept-Language,
because it's probably just what is needed if the user took the
trouble to configure that correctly.

Even if a preference only contains e.g. Japanese, the program
should nevertheless also sniff for the English equivalents of
field names, as forms are often copied from a book or from another
site and keep the original field names.


I hope this gives you some ideas of what to do. It should be rather
easy to try the simplest heuristic (just add more words) and see
what it can do.

I'm looking forward to further discussion.

> The best time to contact me by phone is between 3 and 4 eastern
> standard time, when I make a point of being reachable in my office.

That won't work, because that's in the middle of the night here.

Regards,  Martin.


P.S.: Why does your return address go to tux.w3.org?
------- End of forwarded message -------


-- 
| Rolf Nelson (rolf@w3.org), Project Manager, W3C at MIT
|   "Try to learn something about everything
|             and everything about something."  --Huxley

 

Received on Friday, 30 October 1998 12:47:31 UTC