- From: Kevin Smathers <kevin.smathers@hp.com>
- Date: Mon, 10 Nov 2003 07:57:34 -0800
- To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
- Cc: SIMILE public list <www-rdf-dspace@w3.org>
I'm sorry, I thought I had included some indication of what I was
talking about.
Normal forms 'C' and 'D' are two alternative ways of constructing
Unicode strings. Unicode defines several characters that are
not themselves printable but instead are used to modify the presentation
of a previous character. The German umlaut (diaeresis) and the French
accent grave are two examples.
In normal form 'C' the modifier and the character are combined into a
single code-point, thus 'a' (0x0061) + '̈' (COMBINING DIAERESIS, 0x0308)
becomes 'ä' (0x00E4). In normal form 'D' the reverse happens: the
code-points are always split into a basic character plus combining
diacriticals. It becomes slightly more complex than this because a
single letter can have multiple combining diacritical marks, but that is
the basic rule.
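To make the rule concrete, here is a minimal sketch, assuming Python and
its standard unicodedata module (this is not what canon.pl does, and the
helper names to_nfc, to_nfd and code_points are made up for
illustration):

```python
import unicodedata


def to_nfc(text):
    """Normal form 'C': compose base character + combining marks where possible."""
    return unicodedata.normalize("NFC", text)


def to_nfd(text):
    """Normal form 'D': split precomposed characters into base + combining marks."""
    return unicodedata.normalize("NFD", text)


def code_points(text):
    """Show a string as a list of U+XXXX code points."""
    return [f"U+{ord(ch):04X}" for ch in text]


decomposed = "a\u0308"                 # 'a' followed by COMBINING DIAERESIS
composed = to_nfc(decomposed)

print(code_points(composed))           # ['U+00E4']
print(code_points(to_nfd(composed)))   # ['U+0061', 'U+0308']

# A letter with two marks: U+01D6 (u with diaeresis and macron)
print(code_points(to_nfd("\u01d6")))   # ['U+0075', 'U+0308', 'U+0304']
```

The last line shows the "multiple marks" case: one precomposed letter
becomes a base character followed by two combining code-points.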
Normal form 'D' turns out to be better for searching because most users
who don't have a foreign language keyboard attached to their computers
will use the simple form of the character when requesting a search.
With the diacritical marks split out into their own code-points, a
matcher can ignore them and compare only the basic characters. Thus
OCLC's matching service will retrieve 'Häss' even if what you type is
'Hass'.
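I don't know how OCLC actually implements its matching, but the idea can
be sketched as follows, again assuming Python's unicodedata module; the
functions fold_diacritics and matches are hypothetical names used only
for this example:

```python
import unicodedata


def fold_diacritics(text):
    """Decompose to normal form 'D', then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero combining class for combining marks.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


def matches(query, candidate):
    """True if the two strings are equal once diacritics are ignored."""
    return fold_diacritics(query) == fold_diacritics(candidate)


print(fold_diacritics("H\u00e4ss"))    # Hass
print(matches("Hass", "H\u00e4ss"))    # True
```

Starting from normal form 'C' the same comparison would need a lookup
table from each precomposed letter to its basic form; with form 'D' the
marks are already separate code-points and can simply be skipped.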
Cheers,
-kls
Butler, Mark wrote:
>Hi Kevin,
>
>Please can you give a bit more details about what Normal Form 'C' and 'B'
>look like?
>
>thanks,
>
>Mark
>
>
>
>>-----Original Message-----
>>From: Kevin Smathers [mailto:kevin.smathers@hp.com]
>>Sent: 07 November 2003 23:54
>>To: SIMILE public list
>>Subject: Another update to IMS
>>
>>
>>
>>Hi all,
>>
>>While working with IsaViz I found some oddities in the graphs that
>>turned out to have happened because IsaViz inexplicably refuses to
>>create any node whose content isn't in Normal Form 'C'. So now
>>canon.pl converts its data to the required normal form, from the OCLC
>>results which were in Normal Form 'B'.
>>
>>I've also taken the liberty of adjusting the OCLC search results so
>>that responses from OCLC where there are more than five results and
>>the lastname field doesn't match are ignored unless they use different
>>character sets (ie they obviously were translated into another
>>language.) This allows me to get rid of several useless links in the
>>result set.
>>
>>Cheers,
>>-kls
>>
>>--
>>========================================================
>> Kevin Smathers kevin.smathers@hp.com
>> Hewlett-Packard kevin@ank.com
>> Palo Alto Research Lab
>> 1501 Page Mill Rd. 650-857-4477 work
>> M/S 1135 650-852-8186 fax
>> Palo Alto, CA 94304 510-247-1031 home
>>========================================================
>>use "Standard::Disclaimer";
>>carp("This message was printed on 100% recycled bits.");
>>
>>
>>
>>
Received on Monday, 10 November 2003 10:59:54 UTC