[FT] FTTimes from m to n from andrewc on 2005-05-30 (public-qt-comments@w3.org from May 2005)

From: andrewc <andrew.cao@cisra.canon.com.au>
Date: Tue, 31 May 2005 09:46:58 +1000
To: public-qt-comments@w3.org
Message-ID: <429BA5F2.9010104@cisra.canon.com.au>

Dear editors,

If I have XML data:
<book>A B A B A C B</book>
and query:
/book ftcontains "A" && "B" occurs from 2 to 3 times

FTWords ("A") will return 3 matches, at positions (1) (3) (5).
FTWords ("B") will return 3 matches, at positions (2) (4) (7).
FTAnd will return 9 matches, at positions (1,2) (1,4) (1,7) (3,2) (3,4) 
(3,7) (5,2) (5,4) (5,7).
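
As a quick check of the positions above, here is a minimal Python sketch. It assumes whitespace tokenization and 1-based word positions, and it models FTAnd simply as the cross product of the two position lists; the variable names are illustrative only and do not come from the specification.

```python
from itertools import product

# Tokenize the content of <book> on whitespace; positions are 1-based.
words = "A B A B A C B".split()

pos_a = [i for i, w in enumerate(words, start=1) if w == "A"]
pos_b = [i for i, w in enumerate(words, start=1) if w == "B"]

# FTAnd modeled as every pairing of an "A" match with a "B" match.
and_matches = list(product(pos_a, pos_b))

print(pos_a)             # positions of "A"
print(pos_b)             # positions of "B"
print(len(and_matches))  # number of FTAnd matches
```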

According to the semantics, "occurs from 2 to 3 times" is evaluated as 
"occurs at least 2 times && ! occurs at least 4 times".
Now if we evaluate "! occurs at least 4 times":
FTTimes "occurs at least 4 times" has 9 input matches, and will produce 
126 output matches.
After eliminating duplicates, we get 13 output matches (numbers in 
parentheses are word positions):

AllMatches
--- Match (1,2,3,4,7)
--- Match (1,2,3,4,5,7)
--- Match (1,2,4,5,7)
--- Match (1,2,4,5)
--- Match (1,2,3,4)
--- Match (1,2,3,7)
--- Match (1,2,3,4,5)
--- Match (1,2,3,5,7)
--- Match (1,2,5,7)
--- Match (1,3,4,7)
--- Match (1,3,4,5,7)
--- Match (2,3,4,5,7)
--- Match (2,3,5,7)
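
The figure of 126 output matches before deduplication is the number of ways to choose 4 of the 9 input matches. A small Python sketch confirms this, assuming that "occurs at least 4 times" builds one output match per 4-element combination of distinct input matches:

```python
from itertools import combinations
from math import comb

input_matches = 9  # the FTAnd matches listed earlier

# Closed form: 9 choose 4.
closed_form = comb(input_matches, 4)

# Explicit enumeration of every 4-element combination of input matches.
enumerated = sum(1 for _ in combinations(range(input_matches), 4))

print(closed_form, enumerated)
```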

If we apply FTUnaryNot to this AllMatches, then because the first Match 
has 5 StringMatches, the second Match has 6 StringMatches, and so on, we 
will have:
5 x 6 x 5 x 4 x 4 x 4 x 5 x 5 x 4 x 4 x 5 x 5 x 4 = 384 million
possible combinations of StringMatches. Even if we do not materialize 
all of these combinations at once, the potential computation is huge. 
And the query evaluation is only halfway done.
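
The product can be checked mechanically. The sketch below takes the 13 matches exactly as listed in this message and multiplies their sizes; it assumes, as described above, that FTUnaryNot picks one StringMatch from each Match.

```python
from math import prod

# The 13 deduplicated matches exactly as listed in this message.
all_matches = [
    (1, 2, 3, 4, 7), (1, 2, 3, 4, 5, 7), (1, 2, 4, 5, 7), (1, 2, 4, 5),
    (1, 2, 3, 4), (1, 2, 3, 7), (1, 2, 3, 4, 5), (1, 2, 3, 5, 7),
    (1, 2, 5, 7), (1, 3, 4, 7), (1, 3, 4, 5, 7), (2, 3, 4, 5, 7),
    (2, 3, 5, 7),
]

# Number of StringMatches in each Match, in list order.
sizes = [len(m) for m in all_matches]

# One StringMatch chosen from each Match gives the product of the sizes.
combinations_count = prod(sizes)

print(sizes)
print(f"{combinations_count:,}")
```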

This is a simple query over simple data, yet evaluating it may incur 
such a huge computation. What if "A" and "B" each occur 10 times in 
/book?
My question is: Is my evaluation correct? If it is correct, could the 
semantics of FTTimes, FTUnaryNot, etc. be simplified? Or are there many 
places where this can be optimized?
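
To give a rough sense of the growth in the 10-occurrence case, the sketch below counts only the matches that "occurs at least 4 times" would produce before deduplication. It assumes the same construction as in the walkthrough above (n occurrences of each word give n x n FTAnd matches, and FTTimes takes every 4-element combination); the function name is hypothetical.

```python
from math import comb

def fttimes_output_matches(n: int, times: int = 4) -> int:
    """Pre-deduplication output size of 'occurs at least <times> times'
    when "A" and "B" each occur n times (assumed model, see lead-in)."""
    and_matches = n * n  # every "A" position paired with every "B" position
    return comb(and_matches, times)

for n in (3, 10):
    print(n, fttimes_output_matches(n))
```

This covers only the first step; the FTUnaryNot step that follows would multiply the sizes of the resulting matches, so it grows much faster still.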

Thanks,
Andrew

Received on Monday, 30 May 2005 23:48:34 UTC