Re: Another update to IMS

I'm sorry, I thought I had included some indication of what I was 
talking about.

Normal forms 'C' and 'D' are two alternative ways of constructing 
Unicode strings.  Unicode defines several characters that are 
not themselves printable but instead are used to modify the presentation 
of a previous character.  The German umlaut (diaeresis) and the French 
grave accent are two examples.

In normal form 'C' the modifier and the character are combined into a 
single code-point wherever a precomposed character exists, thus 
'a' (0x0061) + COMBINING DIAERESIS (0x0308) becomes 'ä' (0x00E4).  In 
normal form 'D' the code-points are always split into basic character 
plus combining diacritical marks, the reverse of the previous 
description.  It becomes slightly more complex than this because a 
single letter can have multiple combining diacritical marks, but that is 
the basic rule.
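
As a minimal sketch of the two forms, here is the same conversion using 
Python's standard unicodedata module.  This is only an illustration; it 
is not what canon.pl does, and the helper name code_points is made up 
for the example.

```python
import unicodedata

composed = "\u00e4"      # 'ä' as one precomposed code-point
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

# Normal form 'C' composes; normal form 'D' decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

def code_points(s):
    """Hypothetical helper: list the code-points of s as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in s]

print(code_points(nfc))  # one code-point for the composed form
print(code_points(nfd))  # base letter plus combining mark

# A letter carrying two marks: U+01DF is 'a' with diaeresis and macron.
two_marks = unicodedata.normalize("NFD", "\u01df")
print(code_points(two_marks))
```

The two strings display identically but compare unequal until both are 
brought to the same normal form.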

Normal form 'D' turns out to be better for searching because most users 
who don't have a foreign language keyboard attached to their computers 
will use the simple form of the character when requesting a search.  
With the marks split out into their own code-points, a matcher can 
ignore them and compare only the basic characters.  Thus OCLC's matching 
service will retrieve 'Häss' even if what you type is 'Hass'.
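
A rough sketch of that kind of matching, assuming the approach is to 
decompose to normal form 'D' and then drop the combining marks.  I don't 
know how OCLC actually implements it; fold_diacritics and matches are 
hypothetical names for this example only.

```python
import unicodedata

def fold_diacritics(text):
    """Decompose to normal form 'D', then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def matches(query, candidate):
    """Compare two strings, ignoring diacritics and letter case."""
    return fold_diacritics(query).casefold() == fold_diacritics(candidate).casefold()

print(fold_diacritics("Häss"))
print(matches("Hass", "Häss"))
```

The folding works whether the input arrives in form 'C' or form 'D', 
since it is normalized before the marks are removed.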

Cheers,
-kls

Butler, Mark wrote:

>Hi Kevin,
>
>Please can you give a bit more details about what Normal Form 'C' and 'B'
>look like? 
>
>thanks,
>
>Mark
>
>  
>
>>-----Original Message-----
>>From: Kevin Smathers [mailto:kevin.smathers@hp.com]
>>Sent: 07 November 2003 23:54
>>To: SIMILE public list
>>Subject: Another update to IMS
>>
>>
>>
>>Hi all,
>>
>>While working with IsaViz I found some oddities in the graphs that 
>>turned out to have happened because IsaViz inexplicably refuses to 
>>create any node whose content isn't in Normal Form 'C'.  So 
>>now canon.pl 
>>converts its data to the required normal form, from the OCLC results 
>>which were in Normal Form 'B'.
>>
>>I've also taken the liberty of adjusting the OCLC search 
>>results so that 
>>responses from OCLC where there are more than five results and the 
>>lastname field doesn't match are ignored unless they use different 
>>character sets (ie they obviously were translated into another 
>>language.)  This allows me to get rid of several useless links in the 
>>result set.
>>
>>Cheers,
>>-kls
>>
>>-- 
>>========================================================
>>   Kevin Smathers                kevin.smathers@hp.com    
>>   Hewlett-Packard               kevin@ank.com            
>>   Palo Alto Research Lab                                 
>>   1501 Page Mill Rd.            650-857-4477 work        
>>   M/S 1135                      650-852-8186 fax         
>>   Palo Alto, CA 94304           510-247-1031 home        
>>========================================================
>>use "Standard::Disclaimer";
>>carp("This message was printed on 100% recycled bits.");
>>
>>
>>    
>>


-- 
========================================================
   Kevin Smathers                kevin.smathers@hp.com    
   Hewlett-Packard               kevin@ank.com            
   Palo Alto Research Lab                                 
   1501 Page Mill Rd.            650-857-4477 work        
   M/S 1135                      650-852-8186 fax         
   Palo Alto, CA 94304           510-247-1031 home        
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");

Received on Monday, 10 November 2003 10:59:54 UTC