RE: more on extended filtering: bug in step 2?

I don’t read the RFC that way.

I don’t think “*-x-banana” should match “x-banana”. The leading “-” in the range is not optional. Another way to say this is that the “*” on the front is non-optional (rule 1, rule 2).

“*-DE” should not match “x-banana-DE” because of rule 3D. The “DE” in the range would match a concrete (non-private) subtag. The “DE”-bearing range that matches “x-banana-DE” is “*-x-DE”. Rule 3D says that when you see a singleton (including ‘x’) in the tag that doesn’t have a match in the range, the match fails. This prevents false positives in which you want to select region “DE” and find tags with some non-region private/extension gorp containing “DE”.

Addison


From: Jeremy J Carroll [mailto:jjc@syapse.com]
Sent: Wednesday, May 07, 2014 9:25 AM
To: www-international@w3.org
Subject: more on extended filtering: bug in step 2?

I have now implemented this - although I have still got to do testing etc.

http://sourceforge.net/p/bigdata/code/HEAD/tree/branches/BIGDATA_RELEASE_1_3_0/bigdata/src/java/com/bigdata/search/ConfigurableAnalyzerFactory.java

the method is LanguageRange.extendedFilterMatch at line 178. Note the license is GPL.

While writing it, I think I found an error in the description in RFC 4647, section 3.3.2, concerning private-use tags starting with "x-":

If the language range is "*-x-banana", I think it should match "x-banana", but it does not;
and if the language range is "*-DE", I think it should not match "x-banana-DE", but it does.

Here is the text of the RFC:


To determine a match:

   1.  Split both the extended language range and the language tag being
       compared into a list of subtags by dividing on the hyphen (%x2D)
       character.  Two subtags match if either they are the same when
       compared case-insensitively or the language range's subtag is the
       wildcard '*'.

   2.  Begin with the first subtag in each list.  If the first subtag in
       the range does not match the first subtag in the tag, the overall
       match fails.  Otherwise, move to the next subtag in both the
       range and the tag.

   3.  While there are more subtags left in the language range's list:

       A.  If the subtag currently being examined in the range is the
           wildcard ('*'), move to the next subtag in the range and
           continue with the loop.

       B.  Else, if there are no more subtags in the language tag's
           list, the match fails.

       C.  Else, if the current subtag in the range's list matches the
           current subtag in the language tag's list, move to the next
           subtag in both lists and continue with the loop.

       D.  Else, if the language tag's subtag is a "singleton" (a single
           letter or digit, which includes the private-use subtag 'x')
           the match fails.

       E.  Else, move to the next subtag in the language tag's list and
           continue with the loop.

   4.  When the language range's list has no more subtags, the match
       succeeds.
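
The quoted steps can be transcribed fairly directly, which makes it easy to check the two cases mentioned above. What follows is only a minimal Python sketch of the steps as literally written, not the Java method from the linked file; the function name `extended_filter_match` is hypothetical.

```python
def extended_filter_match(language_range: str, language_tag: str) -> bool:
    """Literal transcription of the matching steps quoted above (illustrative only)."""
    # Step 1: split on the hyphen; subtags are compared case-insensitively.
    range_subtags = language_range.lower().split("-")
    tag_subtags = language_tag.lower().split("-")

    # Step 2: the first subtags must match (equal, or the range subtag is '*').
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    r, t = 1, 1  # move to the next subtag in both the range and the tag

    # Step 3: loop while subtags remain in the range.
    while r < len(range_subtags):
        if range_subtags[r] == "*":                # 3A: skip the wildcard
            r += 1
        elif t >= len(tag_subtags):                # 3B: tag exhausted, fail
            return False
        elif range_subtags[r] == tag_subtags[t]:   # 3C: match, advance both
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:             # 3D: singleton in the tag, fail
            return False
        else:                                      # 3E: skip this tag subtag
            t += 1

    # Step 4: the range is exhausted, so the match succeeds.
    return True


print(extended_filter_match("*-x-banana", "x-banana"))  # False
print(extended_filter_match("*-DE", "x-banana-DE"))     # True
```

Traced by hand, the leading "x" of the tag is consumed by the wildcard in step 2, so step 3D never gets to examine it in either case.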

In some sense I am pointing to a problem in step 2. In my code I fix it at line 186, which corresponds to the following variant of step 2:




   2.  Begin with the first subtag in each list:

       A.  If the first subtag in the range does not match the first
           subtag in the tag, the overall match fails.

       B.  Else, if the first subtag in the range is '*' and the
           first subtag in the language is 'x' then move to the
           next subtag in the range and continue at step 3.

       C.  Otherwise, move to the next subtag in both the
           range and the language tag and continue at step 3.
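
For comparison, the variant of step 2 can be sketched the same way. This is an illustrative Python transcription of the variant wording above combined with the unchanged RFC step 3, not the Java code at line 186; the name `variant_filter_match` is hypothetical.

```python
def variant_filter_match(language_range: str, language_tag: str) -> bool:
    """Matching with the variant step 2 (2A/2B/2C); step 3 is unchanged."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = language_tag.lower().split("-")
    first_range, first_tag = range_subtags[0], tag_subtags[0]

    # 2A: the first subtags must match (equal, or the range subtag is '*').
    if first_range != "*" and first_range != first_tag:
        return False
    if first_range == "*" and first_tag == "x":
        # 2B: advance only the range, so step 3 still examines the 'x'.
        r, t = 1, 0
    else:
        # 2C: advance both the range and the tag.
        r, t = 1, 1

    # Step 3, as in the RFC text.
    while r < len(range_subtags):
        if range_subtags[r] == "*":                # 3A
            r += 1
        elif t >= len(tag_subtags):                # 3B
            return False
        elif range_subtags[r] == tag_subtags[t]:   # 3C
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:             # 3D
            return False
        else:                                      # 3E
            t += 1
    return True                                    # Step 4


print(variant_filter_match("*-x-banana", "x-banana"))  # True
print(variant_filter_match("*-DE", "x-banana-DE"))     # False
```

Under this variant the private-use 'x' stays visible to step 3, so it can either match an 'x' in the range (3C) or trigger the singleton failure (3D).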

However, I understand that no one is much interested in this functionality anyway!

Jeremy
Syapse, Inc.

Received on Wednesday, 7 May 2014 16:56:32 UTC