- From: Kevin Smathers <kevin.smathers@hp.com>
- Date: Mon, 10 Nov 2003 07:57:34 -0800
- To: "Butler, Mark" <Mark_Butler@hplb.hpl.hp.com>
- Cc: SIMILE public list <www-rdf-dspace@w3.org>
I'm sorry, I thought I had included some indication of what I was talking about.

Normal forms 'C' and 'D' are two alternative ways of constructing Unicode strings. Unicode defines several characters that are not themselves printable but instead modify the presentation of the preceding character. The German umlaut (diaeresis) and the French accent grave are two examples. In normal form 'C' the character and its modifier are combined into a single code-point wherever a precomposed one exists, thus 'a' (0x0061) + '̈' (COMBINING DIAERESIS, 0x0308) becomes 'ä' (0x00E4). In normal form 'D' the code-points are always split into basic character plus combining diacritical, the reverse of the previous description. It becomes slightly more complex than this because a single letter can have multiple combining diacritical marks, but that is the basic rule.

Normal form 'D' turns out to be better for searching because most users who don't have a foreign language keyboard attached to their computers will use the simple form of the character when requesting a search. With the base letter held in its own code-point, a matcher can skip the combining marks when comparing. Thus OCLC's matching service will retrieve 'Häss' even if what you type is 'Hass'.

Cheers,
-kls

Butler, Mark wrote:

>Hi Kevin,
>
>Please can you give a bit more details about what Normal Form 'C' and 'B'
>look like?
>
>thanks,
>
>Mark
>
>
>
>>-----Original Message-----
>>From: Kevin Smathers [mailto:kevin.smathers@hp.com]
>>Sent: 07 November 2003 23:54
>>To: SIMILE public list
>>Subject: Another update to IMS
>>
>>
>>
>>Hi all,
>>
>>While working with IsaViz I found some oddities in the graphs that
>>turned out to have happened because IsaViz inexplicably refuses to
>>create any node whose content isn't in Normal Form 'C'. So
>>now canon.pl
>>converts its data to the required normal form, from the OCLC results
>>which were in Normal Form 'B'.
>>
>>I've also taken the liberty of adjusting the OCLC search
>>results so that
>>responses from OCLC where there are more than five results and the
>>lastname field doesn't match are ignored unless they use different
>>character sets (ie they obviously were translated into another
>>language.) This allows me to get rid of several useless links in the
>>result set.
>>
>>Cheers,
>>-kls
>>
>>--
>>========================================================
>> Kevin Smathers kevin.smathers@hp.com
>> Hewlett-Packard kevin@ank.com
>> Palo Alto Research Lab
>> 1501 Page Mill Rd. 650-857-4477 work
>> M/S 1135 650-852-8186 fax
>> Palo Alto, CA 94304 510-247-1031 home
>>========================================================
>>use "Standard::Disclaimer";
>>carp("This message was printed on 100% recycled bits.");
>>
>>
>>
>>

--
========================================================
Kevin Smathers kevin.smathers@hp.com
Hewlett-Packard kevin@ank.com
Palo Alto Research Lab
1501 Page Mill Rd. 650-857-4477 work
M/S 1135 650-852-8186 fax
Palo Alto, CA 94304 510-247-1031 home
========================================================
use "Standard::Disclaimer";
carp("This message was printed on 100% recycled bits.");
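
For anyone who wants to see the two normal forms side by side, here is a minimal sketch in Python. It assumes the standard unicodedata module, whose "NFC" and "NFD" names correspond to normal forms 'C' and 'D'. The helper strip_marks is hypothetical: it only illustrates one way a matcher could ignore diacriticals once the text is in form 'D', and is not a description of how OCLC's service or canon.pl actually works.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to form D, then drop every combining mark (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Form C: 'a' + COMBINING DIAERESIS collapses into one code-point.
composed = unicodedata.normalize("NFC", "a\u0308")
# Form D: the precomposed letter splits into base character + mark.
decomposed = unicodedata.normalize("NFD", "\u00e4")

print([hex(ord(ch)) for ch in composed])    # ['0xe4']
print([hex(ord(ch)) for ch in decomposed])  # ['0x61', '0x308']
print(strip_marks("H\u00e4ss") == "Hass")   # True
```

Comparing strip_marks(query) against strip_marks(candidate) is one simple way to let 'Hass' match 'Häss'; a real service may well weigh the marks differently.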
Received on Monday, 10 November 2003 10:59:54 UTC