Re: id's in the SVG2 spec from Cameron McCormack on 2012-07-29 (www-svg@w3.org from July 2012)

From: Cameron McCormack <cam@mcc.id.au>
Date: Mon, 30 Jul 2012 09:03:18 +1000
To: Chaals McCathieNevile <w3b@chaals.com>
CC: SVG public list <www-svg@w3.org>
Message-ID: <5015C136.9020503@mcc.id.au>

Chaals McCathieNevile:
> This is not a simple problem. Your proposal works so long as you put
> everything in the right place first time - which is probably going to
> happen in a majority of cases but not all. A similar approach is to have
> sub-headings at pretty detailed granularity, so even if you move them
> around, they exist. And if you remove something, changelogs help as a
> place to collect the orphan ids.

One technique that comes to mind is to assign each paragraph within a 
section a unique ID that is a hash of its text content.  Whenever a 
change to the spec is made, we compute the edit distance between each 
paragraph in the section of the old revision of the spec and those in 
the new revision.  (I think having the edit operations work on whole 
words rather than characters makes sense, and would help keep 
computation costs down.)  For each paragraph in the new revision of the 
spec, we choose the corresponding paragraph in the old revision that has 
the lowest edit distance.  If the edit distance is below some threshold, 
for example less than 50% of the number of words in the old revision's 
paragraph, then we treat the new paragraph as being the same one and 
re-use the existing unique ID.  Otherwise, we generate a new one.

I'm not sure how well that particular threshold would work or if you'd 
want to add in additional heuristics, but it would be a start.
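
A minimal sketch of the matching scheme described above, in Python, might 
look like the following.  Everything named here is hypothetical: the 
function names, the "p-" prefix, and the choice of a truncated SHA-1 as 
the hash are assumptions for illustration, since the message does not 
specify them.  The sketch also does not stop two new paragraphs from 
claiming the same old ID, which is one of the additional heuristics a 
real tool would probably need.

```python
import hashlib


def words(text):
    """Split a paragraph into whitespace-separated words."""
    return text.split()


def word_edit_distance(a, b):
    """Levenshtein distance between two word lists.

    Each insert, delete, or substitute operates on a whole word
    rather than a single character.
    """
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete wa
                           cur[j - 1] + 1,             # insert wb
                           prev[j - 1] + (wa != wb)))  # substitute
        prev = cur
    return prev[-1]


def new_id(text):
    """Hypothetical ID format: 'p-' plus 8 hex digits of a SHA-1 hash."""
    return "p-" + hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]


def assign_ids(old, new, threshold=0.5):
    """Assign an ID to each paragraph of the new revision of a section.

    old: dict mapping existing ID -> paragraph text (old revision)
    new: list of paragraph texts (new revision)
    Returns a list of (id, text) pairs in the order of `new`.
    """
    result = []
    for para in new:
        best_id, best_dist = None, None
        for old_id, old_text in old.items():
            d = word_edit_distance(words(old_text), words(para))
            if best_dist is None or d < best_dist:
                best_id, best_dist = old_id, d
        # Re-use the old ID only if the distance is below the threshold,
        # measured against the word count of the old paragraph.
        if best_id is not None and best_dist < threshold * len(words(old[best_id])):
            result.append((best_id, para))
        else:
            result.append((new_id(para), para))
    return result
```

For instance, with old = {"p-aaaa": "The quick brown fox jumps over the 
lazy dog"}, a new paragraph that only changes "jumps" to "leaps" has a 
word edit distance of 1 and keeps the ID "p-aaaa", while an unrelated 
paragraph gets a freshly generated ID.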

Received on Sunday, 29 July 2012 23:03:45 UTC