Re: Minimizing data volume from Frans Knibbe | Geodan on 2013-09-10 (public-lod@w3.org from September 2013)

From: Frans Knibbe | Geodan <frans.knibbe@geodan.nl>
Date: Tue, 10 Sep 2013 12:10:31 +0200
To: Leigh Dodds <leigh@ldodds.com>
CC: public-lod community <public-lod@w3.org>
Message-ID: <522EF017.4020005@geodan.nl>

On 9-9-2013 16:48, Leigh Dodds wrote:
> Hi,
>
> Before using compression you might also make a decision about whether
> you need to represent all of this information as RDF in the first
> place.
>
> For example, rather than include the large geometries as literals, why
> not store them as separate documents and let clients fetch the
> geometries when needed, rather than as part of a SPARQL query?
>
> Geometries can be served using standard HTTP compression techniques
> and will benefit from caching.
>
> You can provide summary statistics (including size of the document,
> and properties of the described area, e.g. centroids) in the RDF to
> help address a few common requirements, allowing clients to only fetch
> the geometries they need, as they need them.
>
> This can greatly reduce the volume of data you have to store and
> provides clients with more flexibility.
>
> Cheers,
>
> L.

Yes, that is something to consider. Thanks for broadening my view! I
think such an approach may well suit certain kinds of high-volume
data, like images or video. But I do have some doubts about its
effectiveness for geographical data:

1) In geographical data sets, geometries vary widely in size. Some may
be very big, others reasonably small. So at what size should a geometry
be moved out of the RDF and into a separate document?

2) When using SPARQL and RDF it is already possible to provide summary 
statistics and leave it to the client to fetch the geometries if needed. 
However, it is not standard practice to provide summaries like centroid, 
bounding box or coordinate count for each geometry, but perhaps it 
should be.

3) On the surface, this approach seems to add complexity to data
retrieval, for both clients and servers. Instead of one way of
publishing and getting data, there will be two: SPARQL for the RDF and
separate HTTP requests for the geometry documents.

4) Having to fetch geometries one at a time, instead of processing them 
all from one data set, could complicate matters and also introduce some 
loss of performance. I can imagine this method working well for things 
like images, videos or files, because they are typically used one at a 
time. But in many cases geometries should be available all at once, to 
draw on a map for instance.

5) I think most geometries are stored as attribute data in relational
databases. Preprocessing them to make them available as separate files
can be done offline. But in other cases the geometries are transient:
they could be generated by a function in a query. The method should
deliver performance gains in those cases too.
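
To make point 2 a bit more concrete, here is a minimal sketch of the kind
of per-geometry summary that could be published next to each geometry. It
assumes simple 2D WKT literals with plain decimal coordinates; the
function name summarize_wkt and the result keys are made up for
illustration, and the "vertex_mean" is just the average of the listed
vertices, not a true area-weighted centroid.

```python
import re

# Matches one "x y" coordinate pair in a 2D WKT string (assumption: plain
# decimal numbers, no Z/M values, no scientific notation).
_PAIR = re.compile(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)")


def summarize_wkt(wkt: str) -> dict:
    """Return simple summary statistics for a 2D WKT geometry string."""
    pairs = _PAIR.findall(wkt)
    if not pairs:
        raise ValueError("no coordinate pairs found in WKT string")
    xs = [float(x) for x, _ in pairs]
    ys = [float(y) for _, y in pairs]
    return {
        # Number of coordinate pairs (a closed ring repeats its first vertex).
        "coordinate_count": len(pairs),
        # Bounding box as (min_x, min_y, max_x, max_y).
        "bbox": (min(xs), min(ys), max(xs), max(ys)),
        # Plain average of the vertices; only a rough stand-in for a centroid.
        "vertex_mean": (sum(xs) / len(xs), sum(ys) / len(ys)),
        # Size of the literal in bytes, so a client can judge the download.
        "wkt_bytes": len(wkt.encode("utf-8")),
    }


if __name__ == "__main__":
    print(summarize_wkt("POLYGON((0 0, 4 0, 4 2, 0 2, 0 0))"))
```

A client could then filter on the bounding box or the coordinate count and
only ask for the full geometry where it is really needed.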


Regards,
Frans

>
>
> On Mon, Sep 9, 2013 at 10:47 AM, Frans Knibbe | Geodan
> <frans.knibbe@geodan.nl> wrote:
>> Hello,
>>
>> In my line of work (geographical information) I often deal with high volume
>> data. The high volume is caused by single facts having a big size. A single
>> 2D or 3D geometry is often encoded as a single text string and can consist
>> of thousands of numbers (coordinates). It is easy to see that this can cause
>> performance issues with transferring and processing data. So I wonder about
>> the state of the art in minimizing data volume in Linked Data. I know that
>> careful publication of data will help a bit: multiple levels of detail could
>> be published, coordinates could use significant digits (they almost never
>> do), but it seems to me that some kind of compression is needed too. Is
>> there something like a common approach to data compression at the moment?
>> Something that is understood by both publishers and consumers of data?
>>
>> Regards,
>> Frans
>>
>>
>
>


-- 
--------------------------------------
*Geodan*
President Kennedylaan 1
1079 MB Amsterdam (NL)

T +31 (0)20 - 5711 347
E frans.knibbe@geodan.nl
www.geodan.nl | disclaimer: http://www.geodan.nl/disclaimer
--------------------------------------
Received on Tuesday, 10 September 2013 10:11:06 UTC