[Bug 2299] Distance constraints do not work on phrases (formerly Cluster G, Issue 63)

http://www.w3.org/Bugs/Public/show_bug.cgi?id=2299

------- Additional Comments From doerre@de.ibm.com  2005-11-28 18:41 -------
To fix this we decided to model phrases within matches as StringMatches that
span multiple tokens (intervals). In order to do so, the TokenInfo model has to
be extended to also model token intervals. At the same time, this change lets
us accommodate tokenizers that produce overlapping tokens.

Summary of the discussion/decision:
 - We need to allow for overlapping tokens for multiple reasons.
 - A phrase can be modeled as a "token" spanning multiple positions. This 
allows it to be treated as a unit in constraints such as FTDistance.
 - FTDistance constraints always disallow overlapping tokens. A 
distance of 0 words (sentences/paragraphs) means adjacent words 
(sentences/paragraphs).


Summary of changes to the semantics:
In 4.3.1 AllMatches
Change the type TokenInfo to now include the attributes
+startPos: integer
+endPos: integer
+startSent: integer
+endSent: integer
+startPara: integer
+endPara: integer

(as an aside: we also drop the "queryString", because it is not needed in the
semantics.)
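
As a rough illustration of the extended model (a Python sketch only, not the
XML representation used in the spec), a TokenInfo with the six attributes
above could look as follows. The snake_case field names and the overlaps
helper are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenInfo:
    # Inclusive interval boundaries, mirroring the six attributes above.
    start_pos: int
    end_pos: int
    start_sent: int
    end_sent: int
    start_para: int
    end_para: int


def overlaps(a: TokenInfo, b: TokenInfo) -> bool:
    """Hypothetical helper: True if the two position intervals intersect."""
    return a.start_pos <= b.end_pos and b.start_pos <= a.end_pos


# A single token occupies one position; a phrase spans several.
single = TokenInfo(5, 5, 1, 1, 1, 1)
phrase = TokenInfo(5, 7, 1, 1, 1, 1)
```

With this shape a single-token match is simply the special case where each
start value equals the corresponding end value.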

4.3.1.3 XML representation (of AllMatches) adapted to the model above.

4.3.1.4 and 4.3.1.5 (Normalization). To be adapted, but not yet done.

4.3.2.9 FTOrder

Throughout the function, we should test "tokenInfo/@startPos" instead of 
"tokenInfo/@pos"; that is, the order constraint is sensitive only to the 
starting positions of matched tokens.
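
To illustrate that only the starting positions matter, here is a rough
Python sketch (not the XQuery function from the spec). Each StringInclude is
reduced to a hypothetical (query_pos, start_pos, end_pos) tuple, where
query_pos is an assumed stand-in for the order in which the search terms
appear in the query; that attribute is not described in this comment.

```python
from itertools import combinations


def satisfies_order(includes):
    """Check an order constraint using start positions only.

    includes: list of (query_pos, start_pos, end_pos) tuples; the tuple
    layout and the name query_pos are assumptions made for this sketch.
    """
    for (q1, s1, _), (q2, s2, _) in combinations(includes, 2):
        # A pair is in order if query order and start order agree.
        in_order = (q1 <= q2 and s1 <= s2) or (q1 >= q2 and s1 >= s2)
        if not in_order:
            return False
    return True
```

Under these assumptions a phrase covering positions 3 to 5 followed in the
query by a token at position 4 still counts as ordered, because the end
position of the phrase is never consulted.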

4.3.2.10 FTScope

Same sentence: for each match in the input AllMatches, all sentence 
positions covered by the StringIncludes must be the same. Retain only 
those StringExcludes that cover that same sentence (or, if there are no 
StringIncludes, that cover at most one sentence).

Different sentence: for each match, the StringIncludes must cover disjoint 
sentences. Keep StringExcludes that cover sentences not covered by any 
StringInclude (drop a StringExclude if some sentence is covered by both it 
and a StringInclude).

Same/different paragraph is analogous.
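
The StringInclude side of the two scope checks can be sketched in Python as
below. This is an illustration under assumptions: each StringInclude is
reduced to a hypothetical (start_sent, end_sent) pair, and the filtering of
StringExcludes is left out.

```python
def covered_sentences(span):
    """Return the set of sentence numbers covered by an inclusive span."""
    start_sent, end_sent = span
    return set(range(start_sent, end_sent + 1))


def same_sentence(includes):
    """True if all StringIncludes together cover at most one sentence."""
    sentences = set()
    for span in includes:
        sentences |= covered_sentences(span)
    return len(sentences) <= 1


def different_sentence(includes):
    """True if no sentence is covered by two different StringIncludes."""
    seen = set()
    for span in includes:
        sentences = covered_sentences(span)
        if seen & sentences:
            return False
        seen |= sentences
    return True
```

The paragraph variants would use the same logic on (start_para, end_para)
pairs.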


4.3.2.12 FTDistance

Distance constraints are never satisfied for a match that contains two 
overlapping StringIncludes. For each match, check that in the list of 
StringIncludes sorted by startPos, every pair of consecutive 
StringIncludes SI1, SI2 is such that the end position (sentence/paragraph) 
of SI1 (the preceding one) is within the required distance of the start 
position (sentence/paragraph) of SI2 (the succeeding one). Keep only those 
StringExcludes that are within the required distance of one of the 
StringIncludes.

(changed all 12 functions).
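
A minimal Python sketch of the word-distance check on StringIncludes,
assuming each one is a hypothetical (start_pos, end_pos) pair and that the
required distance is given as an inclusive range min_dist..max_dist (both
names are assumptions for this example). StringExclude handling is omitted.

```python
def satisfies_word_distance(includes, min_dist, max_dist):
    """Check consecutive StringIncludes against a word-distance range.

    includes: list of (start_pos, end_pos) pairs with inclusive bounds.
    """
    ordered = sorted(includes)  # sorts by start_pos first
    for (_, end1), (start2, _) in zip(ordered, ordered[1:]):
        # Overlapping StringIncludes never satisfy a distance constraint.
        if start2 <= end1:
            return False
        # Number of words strictly between the two; 0 means adjacent.
        gap = start2 - end1 - 1
        if not (min_dist <= gap <= max_dist):
            return False
    return True
```

Because the list is sorted by start position, checking consecutive pairs is
enough to catch any overlap: if two StringIncludes overlap, the earlier one
also overlaps its immediate successor.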

4.3.2.13 FTWindow

For each match, the minimal startPos and the maximal endPos of the 
StringIncludes must fit into a window of N positions. Drop 
StringExcludes that cannot be completely covered by any window that covers 
the StringIncludes.
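
The window test on StringIncludes can be sketched in Python as follows. It
assumes each StringInclude is a hypothetical (start_pos, end_pos) pair with
inclusive bounds, so that a window of N positions holds the span when
maxEnd - minStart + 1 <= N; dropping StringExcludes is not shown.

```python
def fits_window(includes, n):
    """True if all StringIncludes fit into a window of n positions.

    includes: list of (start_pos, end_pos) pairs with inclusive bounds.
    """
    if not includes:
        return True  # nothing to constrain
    min_start = min(start for start, _ in includes)
    max_end = max(end for _, end in includes)
    return max_end - min_start + 1 <= n
```

For example, a phrase at positions 2 to 4 and a token at position 6 span
five positions, so under these assumptions they fit a window of 5 but not
a window of 4.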

Received on Monday, 28 November 2005 18:41:38 UTC