- From: Phil Archer <phila@w3.org>
- Date: Wed, 12 Aug 2015 14:49:48 +0100
- To: Mark Harrison <mark.harrison@cantab.net>
- Cc: Public DWBP WG <public-dwbp-wg@w3.org>
Thanks Mark,

That's helpful.

As an admin thing, you sent this from your cantab.net e-mail address and
it bounced from the list because the WG knows you through your GS1
address (and only WG members can post to this list).

Phil.

On 12/08/2015 14:42, Mark Harrison wrote:
> Hi Phil,
>
> I agree with your proposal to merge these two best practices.
> However, I think we could go even further and point to some good
> examples of web-based data visualisation frameworks (such as D3.js),
> which make it much easier (and more likely) that people could provide
> the underlying data in addition to the human-friendly charts and
> graphs - and also benefit from greater interactivity with the data.
> It may be that lack of awareness of these frameworks/toolkits may be
> one reason why people currently use proprietary application software
> for publishing data obfuscated within static documents.
>
> Perhaps the way we can formulate this is to say that some document
> formats (such as PDF, .doc / .docx and even .xls / .xlsx ) are
> concerned with presentation of information in a particular format or
> layout and therefore carry a significant amount of typesetting /
> formatting information overhead in addition to the underlying data.
> Furthermore, at the time those document-centric formats were
> developed, ease of access to the underlying data and the unambiguous
> meaning of specific data fields might not have been the main priority
> in their design.
>
> When the main priority is to ensure that the underlying data is
> available on the web so that others can re-use it, we recommend using
> simpler data formats such as CSV, TSV, JSON (or better still JSON-LD),
> RDF or XML.
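As a minimal sketch of what publishing in the simpler formats named above might look like, the following Python (standard library only) writes the same records as CSV and as JSON-LD. The records, the field names and the choice of schema.org terms in the @context are all assumptions made for illustration; none of them come from the thread.

```python
import csv
import io
import json

# Hypothetical records; field names and values are illustrative only.
rows = [
    {"date": "2015-12-25", "name": "Christmas Day", "country": "GB"},
    {"date": "2015-12-31", "name": "New Year's Eve", "country": "GB"},
]


def to_csv(records):
    """Serialise the records as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["date", "name", "country"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_jsonld(records):
    """Wrap the same records in a JSON-LD document.

    The @context maps each key to a schema.org term. That vocabulary
    choice is an assumption; any suitable vocabulary could be used.
    """
    doc = {
        "@context": {
            "name": "http://schema.org/name",
            "date": {
                "@id": "http://schema.org/startDate",
                "@type": "http://www.w3.org/2001/XMLSchema#date",
            },
            "country": "http://schema.org/addressCountry",
        },
        "@graph": records,
    }
    return json.dumps(doc, indent=2)


csv_text = to_csv(rows)
jsonld_text = to_jsonld(rows)
```

Both outputs carry the same values; the JSON-LD version additionally states what each field means, which is the "better still" part of the recommendation.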
>
> Recognising that data visualisation in the form of graphs and charts
> will remain an important way of communicating the meaning of the data
> in a visually appealing manner that is intuitive for human
> consumption, we might point out that there are now excellent toolkits
> and frameworks (such as D3.js and Raphael.js) which provide a way to
> generate interactive charts from the data as Scalable Vector Graphics
> (SVG), with the possibility of not only embedding the charts and
> underlying data (e.g. as JSON or JSON-LD) within web pages but even
> hyperlinking from the graph or chart back to the underlying datapoints
> within a dataset in a very granular manner and even providing
> interactive web-based controls (such as range sliders, checkboxes
> etc.) that could be used to adjust thresholds etc., show/hide/combine
> traces in a graph or zoom in on fine details, so that people can fully
> interact with the data and its visualisation all from within a web
> page, without the need to use any proprietary application software.
>
> We need to understand why in the past people have used proprietary
> software to analyse data and prepare graphical charts and point them
> to the best examples of modern web-based alternatives, including
> references to 'cookbook' documentation that helps them to quickly
> overcome the learning curve of using such tools as an alternative. We
> should also point to some 'gold standard' exemplar websites that are
> already using such tools in that way.
>
> One reason why PDF and .doc (and even .rtf) etc. are much less
> suitable is because they are carrying too much additional typesetting
> information - so the underlying data is somewhat obscured within it.
> Even if the data is somehow present, its meaning may not be readily
> apparent.
>
> To give you a concrete example, I use the TCPDF library to prepare PDF
> documents for customised photo calendars for family and friends. I
> have a database table of holiday dates in each country and another
> database table of birthdays and anniversaries, so that these can be
> marked with red digits instead of black digits. My PHP script that
> uses the TCPDF library programmatically builds a calendar table by
> generating a set of instructions that specify the (x,y) co-ordinates,
> font colour and font size for all text within the calendar grid, then
> draws the calendar grid as a series of horizontal and vertical lines.
> The photos are then inserted and the PDF document is ready to send to
> the printing company. In this process of preparing the PDF document,
> there is a loss of machine-readable semantic information that was
> present in the original database. The database table recorded a tuple
> that '31-12' (31st December) should be marked as "New Year's Eve" in
> the English version or "Silvesterabend" in the German version - but in
> the rendered PDF document, this semantic information has been
> diminished and is no longer machine-readable. A human being can still
> understand that the rectangle containing a large '31' and a large page
> heading of 'December' or 'Dezember' likely corresponds to 31st
> December in some particular year - and that the observation "New
> Year's Eve" that appears in the same rectangle is related to that
> date. Even the rectangle for each day is not drawn as a distinct
> rectangle but just as an intersection of horizontal and vertical
> lines. For purely typesetting purposes, it was not necessary to
> retain the semantic information - and as a result, it is often lost.
>
> Of course that is not to say that PDF or .doc etc. can never contain
> semantic information. XMP provides one way of embedding semantics
> within such files in a manner that can be relatively easily extracted
> and converted to RDF triples or JSON-LD. Furthermore, form templates
> in PDF or .doc / .rtf etc. could in principle be tagged with specific
> property names - although in practice, this is not often done unless
> the completed forms are intended to be processed by computers rather
> than human operators.
>
> I hope this helps.
>
> Best wishes,
>
> - Mark
>
> P.S. If this does not reach the DWBP list, please forward it [ or some of it ]
> (I'm travelling at the moment and using webmail but the sender might
> not appear as my GS1 address, so it may be rejected by the list)
>
>
> On 8/12/15, Phil Archer <phila@w3.org> wrote:
>> Looking at issue-138 and the BPs on Use machine-readable standardized
>> data formats and Use non-proprietary data formats - I can't see that
>> they need to be separate.
>>
>> We want to say that things like CSV, XML, RDF and JSON are good and that
>> PDF, Excel etc. are bad. It's not that they're not machine readable,
>> they are, but they're just much more difficult to process.
>>
>> Splitting up machine readable standardised and non-proprietary suggests
>> we'd need to come up with a proprietary format that's machine readable
>> that's OK in one BP and then in the next say that, oh no, hang on, don't
>> use that, use this non-proprietary one instead.
>>
>> And, Microsoft and Adobe have both made their respective formats
>> available as ISO standards so we can't refer to formal standards as a
>> differentiator.
>>
>> There's also text in there that I have problems with. The how to test
>> section of BP: Use machine-readable standardized data formats says:
>> "Check that the data format conforms to a known machine-readable data
>> format specification in current use among anticipated data users."
>>
>> I believe the point of sharing data on the Web is that the publisher
>> shouldn't anticipate what someone else will do with the data.
>>
>> So... I'd like to propose to merge those two BPs and amend the text to
>> talk about the value of open standards in making data available with no
>> preconceived ideas of what it might be used for.
>>
>> WDYT?
>>
>> Phil.
>>
>> --
>>
>> Phil Archer
>> W3C Data Activity Lead
>> http://www.w3.org/2013/data/
>>
>> http://philarcher.org
>> +44 (0)7887 767755
>> @philarcher1
>>
>> CONFIDENTIALITY / DISCLAIMER: The contents of this e-mail are confidential
>> and are not to be regarded as a contractual offer or acceptance from GS1
>> (registered in Belgium).
>> If you are not the addressee, or if this has been copied or sent to you in
>> error, you must not use data herein for any purpose, you must delete it, and
>> should inform the sender.
>> GS1 disclaims liability for accuracy or completeness, and opinions expressed
>> are those of the author alone.
>> GS1 may monitor communications.
>> Third party rights acknowledged.
>> (c) 2012.
>

--
Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1
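The loss of semantics in Mark's calendar example can be sketched roughly as follows. This is not his PHP/TCPDF script; it is a hypothetical Python illustration in which render_cell, as_data, the coordinates, the font sizes and the year 2015 are all invented for the purpose.

```python
import json

# Hypothetical source record, modelled on the '31-12' tuple in the message.
record = {
    "day_month": "31-12",
    "label": {"en": "New Year's Eve", "de": "Silvesterabend"},
}


def render_cell(rec, lang, x, y):
    """Turn a record into positioned text instructions, as a typesetter might.

    Only coordinates, colour, size and strings survive. Nothing records
    that the digits denote a date or that the label belongs to that date.
    """
    day = rec["day_month"].split("-")[0]
    return [
        {"x": x, "y": y, "text": day, "colour": "red", "size": 24},
        {"x": x, "y": y + 10, "text": rec["label"][lang],
         "colour": "black", "size": 8},
    ]


def as_data(rec, year):
    """Publish the same record as JSON with explicit field names."""
    day, month = rec["day_month"].split("-")
    return json.dumps(
        {"date": f"{year}-{month}-{day}", "label": rec["label"]},
        ensure_ascii=False,
    )


instructions = render_cell(record, "de", x=120, y=200)
published = as_data(record, 2015)  # the year is an assumed value
```

A consumer of `instructions` would have to guess from positions and font sizes that "31" and "Silvesterabend" belong together, whereas a consumer of `published` can read the date and both labels directly.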
Received on Wednesday, 12 August 2015 13:49:55 UTC