- From: Rolf H. Nelson <rnelson@tux.w3.org>
- Date: Fri, 30 Oct 1998 12:47:29 -0500
- To: www-privacy-evaluator@w3.org
------- Start of forwarded message -------
Return-Path: <duerst@w3.org>
X-Sender: duerst@sh.w3.mag.keio.ac.jp
Date: Sun, 25 Oct 1998 15:58:53 +0900
To: "Rolf H. Nelson" <rnelson@tux.w3.org>
From: "Martin J. Duerst" <duerst@w3.org>
Subject: Re: Privacy Watchdog and internationalization
In-Reply-To: <199810141337.JAA11596@tux.w3.org>
Content-Type: text/plain; charset="us-ascii"

Hello Rolf,

I have read your paper on Privacy Watchdog. I'm not really sure what to think about the overall approach, but then I'm not a specialist in privacy issues. A success rate of 50% strikes me as rather low to be satisfactory, but also as something that should be fairly easy to improve quite a bit.

Also, at some points I had difficulty understanding whom or what the paper was written for. In some places it read like a spec for an implementor or an experiment-design report for a grant proposal; in other places it read like a research paper for a conference. But that's probably because you don't yet know what it will end up as.

For i18n, your paper looks quite good in that you regularly mention the issue of different languages at the appropriate places. Actually doing some internationalization would of course be very nice, and I think it should not be too difficult. As the whole thing is just heuristics, I guess the easiest trick would be to add the equivalents of "name", "first name", "last name", ... to the list of things the program searches for. For a reasonable set of five or ten European languages, the chance that one of these words has an actual (and different) meaning in one of the other languages is still very low. The contribution of such collisions to the overall failure percentage would probably be minimal, and it would be extremely easy to extend the program, because only a list has to be extended; no additional mechanisms are needed. Even for non-Western languages, things probably wouldn't get much worse.

What comes into play in addition to language is character encoding. For western Europe, Tim started out with iso-8859-1, so this area is quite nicely uniform, but most other parts of the world didn't get that message. In Japan, for example, three encodings are customarily used in parallel. The easiest way to deal with this is to have the search engine work on bytes (which it probably already does, although nobody sees it that way), and, for example for the Japanese word for "name", just generate the byte sequence in every encoding and add all three to the list of things that get checked. Because many such encodings use the higher part of the byte range (8-bit), collisions with Latin-based languages are rare. One additional complication is that Japanese, for example, does not separate words, so you may have to call a slightly different search function.

What you should also add to the search are more equivalents of the same thing in the same language. For example, the term "first name" is very popular in the US, but because different parts of the world arrange names in different orders, it is somewhat inappropriate in an international context. Usually "given name" is suggested (I think our meeting registration forms have also been updated to this). So you should check for this as well.

After adding more and more languages, it may at some point turn out that there are too many false positives. At that point, some refined strategy may have to be introduced. But I don't think it's worth doing that from the start for a test implementation.
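Just to make the idea concrete, a rough sketch of such a byte-level keyword check could look as follows (Python is used purely for illustration; the keyword list, the set of encodings, and the function names are placeholders, not part of any existing implementation):

    # Placeholder keyword list; the real list would be much longer and cover
    # more languages and more synonyms per language.
    KEYWORDS = [
        "name", "first name", "last name", "given name",  # English
        "Vorname", "Nachname",                            # German
        "nom", "pr\u00e9nom",                             # French
        "\u540d\u524d",                                    # Japanese "namae"
    ]

    # Encodings a page is likely to arrive in; Japanese pages customarily
    # use any of the last three.
    ENCODINGS = ["iso-8859-1", "shift_jis", "euc-jp", "iso-2022-jp"]

    def byte_patterns():
        """Expand every keyword into the byte sequences it may appear as."""
        patterns = set()
        for word in KEYWORDS:
            for variant in (word, word.lower(), word.capitalize()):
                for enc in ENCODINGS:
                    try:
                        patterns.add(variant.encode(enc))
                    except UnicodeEncodeError:
                        pass  # keyword not representable in this encoding
        return patterns

    def mentions_personal_data(page_bytes):
        """Byte-level search, so the page's encoding need not be known up front.

        Note: stateful encodings such as iso-2022-jp wrap the bytes in escape
        sequences, so a real implementation would need to treat them more
        carefully than this simple substring test does.
        """
        return any(p in page_bytes for p in byte_patterns())

Extending the check to a new language is then just a matter of appending a few strings to the keyword list.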
Refinements are possible in various directions:

- Based on the Accept-Language header of the request
- Based on the Content-Language header of the page [both of these are not very reliable, because language negotiation, and therefore these headers, are not used that much]
- Based on heuristics for detecting the language and encoding of the page: such heuristics can be very efficient, but I don't know of any freely available source for them.
- Based on heuristics over more than one page that the user sees (e.g. collecting all the Content-Language headers where they are present, to get an idea of the languages the user understands)

It may of course also be possible for the user to configure the setup with the languages she thinks she will look at. In that case, the initial entry for the setup should be taken from Accept-Language, because it's probably just what is needed if the user took the trouble to configure that correctly. Even if a preference contains only e.g. Japanese, the program should nevertheless also sniff for the English field-name equivalents, as forms are often copied from a book or another site and keep the original field names.

I hope this gives you some ideas of what to do. It should be rather easy to try out what the simplest heuristic (just adding more words) can do. I'm looking forward to further discussion.

> The best time to contact me by phone is between 3 and 4 eastern
> standard time, when I make a point of being reachable in my office.

That won't work, because that's in the middle of the night here.

Regards,   Martin.

P.S.: Why does your return address go to tux.w3.org?

------- End of forwarded message -------
--
| Rolf Nelson (rolf@w3.org), Project Manager, W3C at MIT
| "Try to learn something about everything
| and everything about something." --Huxley
Received on Friday, 30 October 1998 12:47:31 UTC