W3C home > Mailing lists > Public > public-qt-comments@w3.org > May 2005

[FT] FTTimes from m to n

From: andrewc <andrew.cao@cisra.canon.com.au>
Date: Tue, 31 May 2005 09:46:58 +1000
Message-ID: <429BA5F2.9010104@cisra.canon.com.au>
To: public-qt-comments@w3.org

Dear editors,

If I have XML data:
<book>A B A B A C B</book>
and query:
/book ftcontains "A" && "B" occurs from 2 to 3 times

FTWords ("A") will return 3 matches at position (1) (3) (5).
FTWords ("B") will return 3 matches at position (2) (4) (7).
FTAnd will return 9 matches at position (1,2) (1,4) (1,7) (3,2) (3,4) 
(3,7) (5,2) (5,4) (5,7)

According to the semantics, "occurs from 2 to 3 times" is evaluated by 
"occurs at least 2 times && ! occurs at least 4 times".
Now if we evaluate "! occurs at least 4 times":
FTTimes "occurs at least 4 times" has 9 input matches, and will produce 
126 output matches.
After eliminating duplication, we get (numbers in parenthesis are word 
positions) 13 output matches:

AllMatches
--- Match (1,2,3,4,7)
--- Match (1,2,3,4,5,7)
--- Match (1,2,4,5,7)
--- Match (1,2,4,5)
--- Match (1,2,3,4)
--- Match (1,2,3,7)
--- Match (1,2,3,4,5)
--- Match (1,2,3,5,7)
--- Match (1,2,5,7)
--- Match (1,3,4,7)
--- Match (1,3,4,5,7)
--- Match (2,3,4,5,7)
--- Match (2,3,5,7)

If we apply FTUnaryNot on this AllMatches, because the first Match has 5 
StringMatches, the second Match has 6 StringMatches, and so on, we will 
have:
5 x 6 x 5 x 4 x 4 x 4 x 5 x 5 x 4 x 4 x 5 x 5 x 4 = 384 million
possible combinations of StringMatches. Even if we do not materialize 
these combinations at one time, the potential computation is huge. And 
the query evaluation is only half way.

This is simple query and simple data. But evaluating this query may 
incur such a a huge computation. What if "A" and "B" happen 10 times 
each in /book?
My question is: Is my evaluation correct? If it is correct, would it be 
nice if we could simplify the semantics of FTTimes, FTUnaryNot, etc.? Or 
is there a lot of places to optimize?

Thanks,
Andrew
Received on Monday, 30 May 2005 23:48:34 UTC

This archive was generated by hypermail 2.4.0 : Friday, 17 January 2020 16:57:05 UTC