Re: Minimizing data volume

On 10-9-2013 13:33, Andy Turner wrote:
>
> Hi,
>
> At least these two OGC standards might be worth having a look at in 
> this context:
>
> http://www.opengeospatial.org/standards/geosparql
>
> http://www.opengeospatial.org/standards/tjs
>
> The latter is a Georeferenced Table Joining Service Implementation 
> Standard. In the development of this a lot of thought went into 
> different kinds of linking of geographical data. Sorry, but I know 
> very little about the GeoSPARQL standard.
>
> The notion of keeping geometry data separate and providing metadata 
> about geometries in standard forms is useful. For vector data, the 
> number of points in the geometry is one of the key attributes an 
> application might consider before pulling that geometry. (The size 
> of its representation in bytes, both compressed and uncompressed, is 
> useful info too -- thanks Leigh.)
>
> So, for vector data, the attributes for individual vectors (almost 
> like features) can be kept separate from the spatial geometries, and 
> some linkage code can be used to join the data together. Yes, there 
> are advantages in terms of storage organisation for keeping attributes 
> and geometries separate, but for many applications some attributes of 
> the geometries are also wanted, so this geometrical metadata is important 
> to think about. Computationally some of it can be hard to calculate, 
> so once calculated it is perhaps worth storing in optional metadata.
>
> Individual points with a single attribute, where the point is defined 
> with respect to axes in some geographical coordinate and projection 
> system are simple geo-vectors. Lines built from multiple such points 
> (and equations) are more detailed/complex, yet these can have simply 
> attributed generalised point representations (the location of a 
> smallest circle/sphere encompassing all the points in the line -- 
> perhaps with a measure of the radius of this). There are similar 
> things for regional polygons in two and three dimensions.
>
> With lines and points, their geometries can be simplified in other 
> ways which can result in other lines and polygons. Simplifying 
> contiguous polygons to maintain topological relationships is not 
> necessarily straightforward.
>
> The point I am trying to make with the above is that there are 
> multiple different geometries, not a single geometry for a real world 
> object that can be described/defined with RDF. Some of the more 
> generalised forms of the spatial geometries can be calculated and 
> stored as metadata in table representations with a fixed number of 
> fields. Often so-called bounding boxes and bounding circles are used, 
> as are line lengths, perimeters, surface areas, volumes, 
> average distances, and ratios of these geometrical attributes. By 
> combining the geometrical attributes with other attributes, further 
> attributes can be derived (e.g. density).
>
> Consider something complex, like a city. This has multiple geometrical 
> representations.
>
> Two more things:
>
> Geohashes (http://en.wikipedia.org/wiki/Geohash), which interleave 
> coordinates represented by positions on axes using some predetermined 
> axis order and prescription, are useful in the context of linking data 
> because they are string representations: the more a geohash is 
> truncated, the less precisely it locates a point, but the truncated 
> and full strings start with the same character sequence.
>
> The other key dimension to think about in geographical relations is 
> time. How time relates to all this is important, but this email is 
> already long, so all I will state is that a city now could be very 
> different to a city some years ago (in terms of spatial 
> dimension/geometry), yet in some ways they are the same place. There 
> are ways to derive (very) complex geometries of ephemeral events; 
> consider one like the Olympic Games.
>
> HTH and sorry for the long post.
>
Hello Andy,

Thank you for the long post and for sharing your thoughts.

Yes, I agree that any real world object can have many different 
geometries, depending on coordinate reference system, level of detail, 
time, method of measurement and whatnot. But I don't think that is a 
problem. Linked Data is very capable of sharing different perspectives 
of a single real world phenomenon, and also of annotating those 
different perspectives to help with correctly interpreting them.

The problem that I see is how to handle those cases where geometry 
literals become unwieldy. The GeoSPARQL specification that you mention 
provides a way of writing a geometry as a literal in RDF. There may be 
several approaches to serializing a geometry, but ending up with a 
series of coordinates is inescapable. And I am worried about the impact 
of these series of coordinates becoming very long. That is why I do 
like the idea of providing some extra data to enable a client to 
distinguish between large and small geometries. The small ones could be 
downloaded and processed right away, but the bigger ones might need some 
extra care.
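
As an illustration of such extra data, here is a minimal Python sketch that derives a few summary values from a geometry literal. It assumes the literal is plain 2D WKT text; the function name geometry_summary and the choice of zlib for the compressed size are assumptions made for this example, not part of GeoSPARQL or any other standard.

```python
import re
import zlib

# One number, optionally signed, with optional fraction and exponent.
_NUMBER = r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
# One 2D coordinate pair such as "4.89 52.37" (assumption: no Z or M values).
_COORD_PAIR = re.compile(rf"({_NUMBER})\s+({_NUMBER})")


def geometry_summary(wkt: str) -> dict:
    """Return size hints a client could inspect before fetching a geometry."""
    raw = wkt.encode("utf-8")
    points = [(float(x), float(y)) for x, y in _COORD_PAIR.findall(wkt)]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return {
        "coordinate_count": len(points),
        "bytes_uncompressed": len(raw),
        "bytes_compressed": len(zlib.compress(raw)),
        # Bounding box as (min x, min y, max x, max y); None for empty input.
        "bbox": (min(xs), min(ys), max(xs), max(ys)) if points else None,
    }
```

A publisher could store these values next to each geometry, so that a client can decide whether to download a geometry right away or treat it with extra care.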

Thinking about this, I wonder if the idea of a general compression 
function for literals has ever been considered for SPARQL. That would 
enable a query like

SELECT ?name ?population (COMPRESS(?geometry) AS ?geometry_compressed)
FROM <http://example.org/cities>

Such a function could be used only for those literals whose size exceeds 
a certain threshold. And it would be applicable to all kinds of big 
literals.
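
To make the threshold idea concrete, here is a minimal Python sketch of what such a function might do. COMPRESS is not an existing SPARQL function; the 1024-byte threshold, the use of zlib with base64 encoding, and the names compress_if_large and decompress are all assumptions chosen for illustration.

```python
import base64
import zlib
from typing import Tuple

THRESHOLD_BYTES = 1024  # hypothetical cut-off; a real service would tune this


def compress_if_large(literal: str, threshold: int = THRESHOLD_BYTES) -> Tuple[str, bool]:
    """Return (value, was_compressed); small literals pass through unchanged."""
    raw = literal.encode("utf-8")
    if len(raw) <= threshold:
        return literal, False
    # base64 keeps the compressed bytes representable as a text literal.
    packed = base64.b64encode(zlib.compress(raw)).decode("ascii")
    return packed, True


def decompress(value: str, was_compressed: bool) -> str:
    """Reverse compress_if_large on the client side."""
    if not was_compressed:
        return value
    return zlib.decompress(base64.b64decode(value)).decode("utf-8")
```

The flag returned with each value stands in for whatever marker (a datatype, for instance) would tell the client that a literal needs decompressing.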

About Geohash: Yes, it is a kind of compression for geometry, but as far 
as I can tell it only applies to single points.
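
For reference, a minimal Python sketch of Geohash encoding for a single point, following the publicly documented algorithm (alternating longitude and latitude bisection, five bits per base-32 character). The function name geohash_encode and the default precision are assumptions for this example.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    """Encode one latitude/longitude pair as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

For example, geohash_encode(57.64911, 10.40744, 5) gives "u4pru", and a longer hash of the same point starts with those same five characters. This is the shared-prefix property mentioned above.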

Regards,
Frans

> Andy
> http://www.geog.leeds.ac.uk/people/a.turner/
>
> *From:*Frans Knibbe | Geodan [mailto:frans.knibbe@geodan.nl]
> *Sent:* 10 September 2013 11:11
> *To:* Leigh Dodds
> *Cc:* public-lod community
> *Subject:* Re: Minimizing data volume
>
> On 9-9-2013 16:48, Leigh Dodds wrote:
>
>     Hi,
>
>     Before using compression you might also make a decision about whether
>     you need to represent all of this information as RDF in the first
>     place.
>
>     For example, rather than include the large geometries as literals, why
>     not store them as separate documents and let clients fetch the
>     geometries when needed, rather than as part of a SPARQL query?
>
>     Geometries can be served using standard HTTP compression techniques
>     and will benefit from caching.
>
>     You can provide summary statistics (including size of the document,
>     and properties of the described area, e.g. centroids) in the RDF to
>     help address a few common requirements, allowing clients to only fetch
>     the geometries they need, as they need them.
>
>     This can greatly reduce the volume of data you have to store and
>     provides clients with more flexibility.
>
>     Cheers,
>
>     L.
>
> Yes, that is something to consider. Thanks for broadening my mind! I 
> think such an approach may be suited for certain kinds of high volume 
> data, like images or video. But I do have some doubts about its 
> effectiveness for geographical data:
>
> 1) In geographical data sets geometries typically have different 
> sizes. Some may be very big, others may be reasonably small. So where 
> to draw the limit?
>
> 2) When using SPARQL and RDF it is already possible to provide summary 
> statistics and leave it to the client to fetch the geometries if 
> needed. However, it is not standard practice to provide summaries like 
> centroid, bounding box or coordinate count for each geometry, but 
> perhaps it should be.
>
> 3) On the surface, this approach seems to add complexity to data 
> retrieval, for both clients and servers. Instead of one way of 
> publishing and getting data, there will be two ways.
>
> 4) Having to fetch geometries one at a time, instead of processing 
> them all from one data set, could complicate matters and also 
> introduce some loss of performance. I can imagine this method working 
> well for things like images, videos or files, because they are 
> typically used one at a time. But in many cases geometries should be 
> available all at once, to draw on a map for instance.
>
> 5) I think most geometries are stored as attribute data in relational 
> databases. Preprocessing them to make them available as separate files 
> can be done offline. But in other cases the geometries are transient: 
> they could be generated by a function in a query. The method should 
> work with performance gains in those cases too.
>
>
> Regards,
> Frans
>
>     On Mon, Sep 9, 2013 at 10:47 AM, Frans Knibbe | Geodan
>     <frans.knibbe@geodan.nl> wrote:
>
>         Hello,
>
>         In my line of work (geographical information) I often deal with high volume
>         data. The high volume is caused by single facts having a big size. A single
>         2D or 3D geometry is often encoded as a single text string and can consist
>         of thousands of numbers (coordinates). It is easy to see that this can cause
>         performance issues with transferring and processing data. So I wonder about
>         the state of the art in minimizing data volume in Linked Data. I know that
>         careful publication of data will help a bit: multiple levels of detail could
>         be published, coordinates could use significant digits (they almost never
>         do), but it seems to me that some kind of compression is needed too. Is
>         there something like a common approach to data compression at the moment?
>         Something that is understood by both publishers and consumers of data?
>
>         Regards,
>
>         Frans
>
> -- 
> --------------------------------------
> *Geodan*
> President Kennedylaan 1
> 1079 MB Amsterdam (NL)
>
> T +31 (0)20 - 5711 347
> E frans.knibbe@geodan.nl <mailto:frans.knibbe@geodan.nl>
> www.geodan.nl <http://www.geodan.nl> | disclaimer 
> <http://www.geodan.nl/disclaimer>
> --------------------------------------
>


Received on Tuesday, 10 September 2013 19:10:06 UTC