[Bug 7935] New: [F&O] normalize-unicode on codepoints that are not characters.

http://www.w3.org/Bugs/Public/show_bug.cgi?id=7935

           Summary: [F&O] normalize-unicode on codepoints that are not
                    characters.
           Product: XPath / XQuery / XSLT
           Version: 2nd Edition Recommendation
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Functions and Operators
        AssignedTo: mike@saxonica.com
        ReportedBy: oliver@cbcl.co.uk
         QAContact: public-qt-comments@w3.org


The behaviour of normalize-unicode is defined by the Unicode normalization
specification.

Based on my (somewhat woolly) understanding of the Unicode specification, there
are 66 codepoints that do not map to characters (the "noncharacters"), and
Unicode normalization is only defined on strings of characters.  Although use
of these is not recommended, all of them except U+FFFE and U+FFFF are valid XML
characters.

xs:string contains a string of codepoints, which can quite happily include
noncharacters.
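
To make the figure of 66 concrete, here is a minimal Python sketch that
enumerates the noncharacters.  The names is_noncharacter and NONCHARACTERS are
made up for this illustration and are not part of F&O or any library.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if the code point is a Unicode noncharacter."""
    # U+FDD0..U+FDEF accounts for 32 code points; the last two code
    # points of each of the 17 planes (U+nFFFE, U+nFFFF) add another 34.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Scan the whole code space U+0000..U+10FFFF and collect the matches.
NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]

print(len(NONCHARACTERS))  # 66
```

A Python str, like xs:string, is a sequence of codepoints and will hold any of
these without complaint.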

For example, what should happen with the following query?

normalize-unicode("", "NFC")

It is worth noting that in .NET, the following expression throws an exception:

"\ufdd0".Normalize(NormalizationForm.FormC)


I am somewhat loath to catch this exception and add a workaround when it
is clear that these characters are a bad thing.

Perhaps it is worth allowing implementations to raise an error if these
characters appear in a string that is to be normalized, as the result is not a
valid Unicode string.
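
As a rough illustration of that suggestion, the following Python sketch rejects
noncharacters before normalizing.  The function name normalize_strict and the
choice of ValueError are assumptions made for this example only; they are not
taken from F&O or from any implementation.  Python's unicodedata.normalize is
used as a stand-in normalizer and does not itself reject noncharacters.

```python
import unicodedata


def normalize_strict(value: str, form: str = "NFC") -> str:
    """Normalize value, raising an error if it contains a noncharacter."""
    for ch in value:
        cp = ord(ch)
        # Noncharacters: U+FDD0..U+FDEF and U+nFFFE/U+nFFFF in every plane.
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            raise ValueError(f"noncharacter U+{cp:04X} in string to be normalized")
    return unicodedata.normalize(form, value)


# "e" followed by COMBINING ACUTE ACCENT composes to U+00E9 under NFC.
print(normalize_strict("e\u0301") == "\u00e9")  # True

try:
    normalize_strict("\ufdd0")
except ValueError as err:
    print(err)  # noncharacter U+FDD0 in string to be normalized
```

An implementation taking the other route would simply pass such codepoints
through unchanged.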

On a similar note, Constr-cont-document-3 has some of these characters in its
expected result, and I believe that canonicalization is not defined on these
characters for a similar reason.



Received on Friday, 16 October 2009 16:53:48 UTC