Re: Use machine-readable standardized data formats / Use non-proprietary data formats from Phil Archer on 2015-08-12 (public-dwbp-wg@w3.org from August 2015)

From: Phil Archer <phila@w3.org>
Date: Wed, 12 Aug 2015 14:49:48 +0100
To: Mark Harrison <mark.harrison@cantab.net>
Cc: Public DWBP WG <public-dwbp-wg@w3.org>
Message-ID: <55CB4EFC.3050909@w3.org>

Thanks Mark,

That's helpful.

As an admin thing, you sent this from your cantab.net e-mail address and 
it bounced from the list because the WG knows you through your GS1 
address (and only WG members can post to this list).

Phil.

On 12/08/2015 14:42, Mark Harrison wrote:
> Hi Phil,
>
> I agree with your proposal to merge these two best practices.
> However, I think we could go even further and point to some good
> examples of web-based data visualisation frameworks (such as D3.js),
> which make it much easier (and more likely) that people could provide
> the underlying data in addition to the human-friendly charts and
> graphs - and also benefit from greater interactivity with the data.
> Lack of awareness of these frameworks/toolkits may be one reason why
> people currently use proprietary application software to publish
> data that ends up obfuscated within static documents.
>
> Perhaps the way we can formulate this is to say that some document
> formats (such as PDF, .doc / .docx and even .xls / .xlsx ) are
> concerned with presentation of information in a particular format or
> layout and therefore carry a significant amount of typesetting /
> formatting information overhead in addition to the underlying data.
> Furthermore, at the time those document-centric formats were
> developed, ease of access to the underlying data and the unambiguous
> meaning of specific data fields might not have been the main priority
> in their design.
>
> When the main priority is to ensure that the underlying data is
> available on the web so that others can re-use it, we recommend using
> simpler data formats such as CSV, TSV, JSON (or better still JSON-LD),
> RDF or XML.
>
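
A minimal sketch of the recommendation above, assuming a small made-up
table of rainfall observations. The field names, the values and the
http://example.org/vocab# vocabulary are all hypothetical and do not
come from the thread. It serialises the same rows as CSV, JSON and
JSON-LD using only the Python standard library:

```python
import csv
import io
import json

# Hypothetical rows: station names, years and values are invented.
ROWS = [
    {"station": "A", "year": 2014, "rainfall_mm": 612.5},
    {"station": "B", "year": 2014, "rainfall_mm": 480.0},
]
FIELDS = ["station", "year", "rainfall_mm"]


def to_csv(rows):
    """Serialise the rows as CSV with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Serialise the rows as a plain JSON array of objects."""
    return json.dumps(rows, indent=2)


def to_jsonld(rows):
    """Wrap the same rows in a JSON-LD document.

    The @vocab IRI is a placeholder; a real publisher would point it
    at a vocabulary that actually defines these properties.
    """
    doc = {
        "@context": {"@vocab": "http://example.org/vocab#"},
        "@graph": rows,
    }
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    print(to_csv(ROWS))
    print(to_json(ROWS))
    print(to_jsonld(ROWS))
```

The only difference between the JSON and JSON-LD outputs here is the
@context, which maps each key to an IRI so that the meaning of a field
does not depend on the consumer guessing it from the column name.
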
> Recognising that data visualisation in the form of graphs and charts
> will remain an important way of communicating the meaning of the data
> in a visually appealing manner that is intuitive for human
> consumption, we might point out that there are now excellent toolkits
> and frameworks (such as D3.js and Raphael.js) which generate
> interactive charts from the data as Scalable Vector Graphics (SVG).
> With these, publishers can embed both the charts and the underlying
> data (e.g. as JSON or JSON-LD) within web pages, and can even
> hyperlink from a graph or chart back to the underlying datapoints
> within a dataset in a very granular manner.  They can also provide
> interactive web-based controls (such as range sliders and checkboxes)
> to adjust thresholds, show/hide/combine traces in a graph or zoom in
> on fine details.  People can then fully interact with the data and
> its visualisation from within a web page, without the need to use any
> proprietary application software.
>
>
>
> We need to understand why people have in the past used proprietary
> software to analyse data and prepare graphical charts, and then point
> them to the best examples of modern web-based alternatives, including
> references to 'cookbook' documentation that helps them to quickly
> overcome the learning curve of using such tools.  We should also
> point to some 'gold standard' exemplar websites that are already
> using such tools in that way.
>
>
> One reason why PDF and .doc (and even .rtf) etc. are much less
> suitable is that they carry too much additional typesetting
> information, so the underlying data is somewhat obscured within it.
> Even if the data is somehow present, its meaning may not be readily
> apparent.
>
> To give you a concrete example, I use the TCPDF library to prepare PDF
> documents for customised photo calendars for family and friends.  I
> have a database table of holiday dates in each country and another
> database table of birthdays and anniversaries, so that these can be
> marked with red digits instead of black digits.  My PHP script that
> uses the TCPDF library programmatically builds a calendar table by
> generating a set of instructions that specify the (x,y) co-ordinates,
> font colour and font size for all text within the calendar grid, then
> draws the calendar grid as a series of horizontal and vertical lines.
> The photos are then inserted and the PDF document is ready to send to
> the printing company.  In this process of preparing the PDF document,
> there is a loss of machine-readable semantic information that was
> present in the original database.  The database table recorded a tuple
> that '31-12' (31st December) should be marked as "New Year's Eve" in
> the English version or "Silvesterabend" in the German version - but in
> the rendered PDF document, this semantic information has been
> diminished and is no longer machine-readable.  A human being can still
> understand that the rectangle containing a large '31' and a large page
> heading of 'December' or 'Dezember' likely corresponds to 31st
> December in some particular year - and that the observation "New
> Year's Eve" that appears in the same rectangle is related to that
> date.  Even the rectangle for each day is not drawn as a distinct
> rectangle but just as an intersection of horizontal and vertical
> lines.  For purely typesetting purposes, it was not necessary to
> retain the semantic information - and as a result, it is often lost.
>
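
The loss described above can be sketched in a few lines. This is an
illustration in Python, not the PHP/TCPDF script from the message, and
it does not call TCPDF at all: the Holiday and TextOp types, the
render_cell function, and the coordinates, colours and font sizes are
hypothetical stand-ins for "a database row" and "a positioned piece of
text on a page".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Holiday:
    """A database-style record: the date and its labels stay linked."""
    day_month: str   # "31-12", the form used in the message
    labels: dict     # language code -> holiday name


@dataclass(frozen=True)
class TextOp:
    """A typesetting instruction: only position and appearance survive."""
    x: float
    y: float
    text: str
    colour: str
    size: int


def render_cell(holiday, x, y, lang):
    """Flatten one record into positioned text for a calendar cell.

    The output has no field that says "this is a date" or "this label
    names a holiday"; a reader can infer it only from the layout.
    """
    day = holiday.day_month.split("-")[0]
    return [
        TextOp(x, y, day, "red", 24),
        TextOp(x, y + 8.0, holiday.labels[lang], "red", 8),
    ]


nye = Holiday("31-12", {"en": "New Year's Eve", "de": "Silvesterabend"})
ops = render_cell(nye, 150.0, 200.0, "de")
for op in ops:
    print(op)
```

Going from the Holiday record to the list of TextOp values is easy;
going back means guessing from coordinates and colours which strings
belong together, which is the extra processing effort that
document-centric formats impose on anyone re-using the data.
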
> Of course that is not to say that PDF or .doc etc. can never contain
> semantic information.  XMP provides one way of embedding semantics
> within such files in a manner that can be relatively easily extracted
> and converted to RDF triples or JSON-LD.  Furthermore, form templates
> in PDF or .doc / .rtf etc. could in principle be tagged with specific
> property names - although in practice, this is not often done unless
> the completed forms are intended to be processed by computers rather
> than human operators.
>
> I hope this helps.
>
> Best wishes,
>
> - Mark
>
> P.S.  If this does not reach the DWBP list, please forward it [ or
> some of it ].  (I'm travelling at the moment and using webmail, so
> the sender might not appear as my GS1 address and the message may be
> rejected by the list.)
>
>
> On 8/12/15, Phil Archer <phila@w3.org> wrote:
>> Looking at issue-138 and the BPs on Use machine-readable standardized
>> data formats and Use non-proprietary data formats - I can't see that
>> they need to be separate.
>>
>> We want to say that things like CSV, XML, RDF and JSON are good and that
>> PDF, Excel etc. are bad. It's not that they're not machine readable
>> (they are); they're just much more difficult to process.
>>
>> Splitting up machine readable standardised and non-proprietary suggests
>> we'd need to come up with a proprietary format that's machine readable
>> that's OK in one BP and then in the next say that, oh no, hang on, don't
>> use that, use this non-proprietary one instead.
>>
>> And, Microsoft and Adobe have both made their respective formats
>> available as ISO standards so we can't refer to formal standards as a
>> differentiator.
>>
>> There's also text in there that I have problems with. The how to test
>> section of BP: Use machine-readable standardized data formats  says:
>> "Check that the data format conforms to a known machine-readable data
>> format specification in current use among anticipated data users."
>>
>> I believe the point of sharing data on the Web is that the publisher
>> shouldn't anticipate what someone else will do with the data.
>>
>> So... I'd like to propose to merge those two BPs and amend the text to
>> talk about the value of open standards in making data available with no
>> preconceived ideas of what it might be used for.
>>
>> WDYT?
>>
>> Phil.
>>
>>
>> --
>>
>>
>> Phil Archer
>> W3C Data Activity Lead
>> http://www.w3.org/2013/data/
>>
>> http://philarcher.org
>> +44 (0)7887 767755
>> @philarcher1
>>
>>
>>
>

-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1
Received on Wednesday, 12 August 2015 13:49:55 UTC