Re: Eliminate duplicates in expansion?

> On 05/23/2012 07:17 AM, Markus Lanthaler wrote:
> > in a recent update to the test suite Dave changed the behavior of
> > expansion to remove duplicates in sets. Is this what we wanna do?
> 
> Yes. It doesn't make sense for a set to contain a duplicate of the same
> member (per the mathematical definition of a set):
> 
> "A set is a gathering together into a whole of definite, distinct
> objects of our perception [Anschauung] and of our thought – which are
> called elements of the set."
> 
>     -- Georg Cantor, Beiträge zur Begründung der transfiniten
> Mengenlehre
> 
> The key phrase there being "distinct objects".

I understand what a set is. My concern is that we introduce a lot of overhead in expansion for very little benefit. An application will have to eliminate duplicates again anyway, since sets aren't merged yet at that phase of the processing pipeline. A subject can appear several times in the expanded output, and each occurrence can hold a subset of "the set". In contrast, the subject map generation algorithm collects all data that belongs to a subject, so it makes sense to eliminate duplicates there.
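
To make that concrete, here is a minimal Python sketch. It is not taken from any implementation: the helper names (dedupe, merge_by_subject), the example IRIs, and the simplified node layout are all assumptions for illustration. It shows that even if expansion cleans each occurrence of a subject, the merged set can still contain duplicates, so the later step has to clean it again.

```python
PROP = "http://example.com/prop"  # hypothetical property IRI


def dedupe(values):
    """Remove duplicates by pairwise comparison, keeping first occurrences."""
    result = []
    for value in values:
        if value not in result:
            result.append(value)
    return result


def merge_by_subject(nodes):
    """Collect all values of every property under one entry per @id."""
    merged = {}
    for node in nodes:
        target = merged.setdefault(node["@id"], {"@id": node["@id"]})
        for key, values in node.items():
            if key != "@id":
                target.setdefault(key, []).extend(values)
    return merged


# The same subject appears twice; each occurrence holds part of "the set".
expanded = [
    {"@id": "http://example.com/a",
     PROP: [{"@value": 1}, {"@value": 2}, {"@value": 2}]},
    {"@id": "http://example.com/a",
     PROP: [{"@value": 2}, {"@value": 3}]},
]

# Step 1: clean every occurrence on its own, as expansion would.
per_node = [
    {k: (v if k == "@id" else dedupe(v)) for k, v in node.items()}
    for node in expanded
]

# Step 2: merge the occurrences; the value 2 shows up twice again.
merged_values = merge_by_subject(per_node)["http://example.com/a"][PROP]

# Step 3: only a second pass after merging yields the real set.
final_values = dedupe(merged_values)
```

The per-occurrence cleaning in step 1 does not save the cleaning in step 3; it only adds to it.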


> > So, e.g., "prop": [ 1, 2, 2, 2, 2, 3 ] will now get expanded to
> > "prop": [ 1, 2, 3 ] (of course as @value objects). Is this what we
> > wanna do? Or is this something we should do as part of framing resp.
> > subject map generation?
> 
> I think all of the algorithms should clean sets... we could also take
> the position that no cleaning should be done for performance reasons.
> That's really the strongest counter-point I can see now - performance...
> because multi-hundred-thousand-member sets are not going to be
> performant for this algorithm.

Yes, exactly. The problem is that expansion is the basis of every other algorithm and is sometimes even called multiple times. In framing, for example, it gets called 2 (+1) times, and then we check for duplicates again in subject map generation. With large sets this adds up (3 x n²).
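
As a rough back-of-the-envelope check, the following Python sketch counts comparisons. It assumes a naive pairwise duplicate check, which is my assumption here and not necessarily what any particular implementation does; dedupe_counting is a hypothetical helper.

```python
def dedupe_counting(values):
    """Pairwise duplicate removal that also reports how many
    equality comparisons were needed."""
    result = []
    comparisons = 0
    for value in values:
        found = False
        for seen in result:
            comparisons += 1
            if seen == value:
                found = True
                break
        if not found:
            result.append(value)
    return result, comparisons


n = 1000
values = [{"@value": i} for i in range(n)]  # worst case: all distinct

total = 0
for _ in range(3):  # three passes over the same set
    values, comparisons = dedupe_counting(values)
    total += comparisons

# With n distinct members, one pass costs n * (n - 1) / 2 comparisons.
one_pass = n * (n - 1) // 2
```

With all members distinct, every pass costs the same again, so the total grows quadratically with the size of the set, times the number of passes.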

If you feel strongly about this, I'm not opposed to changing my implementation, but I think it a) does not bring any advantages and b) has a potentially huge performance cost.


Cheers,
Markus


---
Markus Lanthaler
@markuslanthaler

Received on Thursday, 24 May 2012 03:47:38 UTC