Re: NY Property Tax Explorer

From: Phil Archer <phila@w3.org>
Date: Fri, 27 Mar 2015 15:47:07 +0000
Message-ID: <55157B7B.5090102@w3.org>
To: Steven Adler <adler1@us.ibm.com>
CC: DWBP WG <public-dwbp-wg@w3.org>
On 27/03/2015 14:41, Steven Adler wrote:
>
> Bart,
>
> A PDF might not conform to your definition of a best practice,

It does not.

> but NYC is
> publishing tens of thousands of PDFs that describe property taxes,
> hospitals, crime reports, and housing inspections.

All of which were derived from actual data somewhere, data that may have 
been buried and obfuscated along the way. The NYC tax office does not 
use PDFs to do its calculations, and the NYPD doesn't use them to record 
its crime stats.

We want to encourage NYC to publish that stuff as close to the original 
format as they can, preferably as part of business as usual, part of the 
everyday workflow.

>
> My point is that if we restrict our recommendations of best practices to
> only conform to what we define as the best file types, we are deliberately
> limiting the relevance of our work in the real world.

PDFs, JPGs or whatever are one-star data (assuming they're openly 
licensed). Yes, it's there. Yes, you can access it - but you have to work 
hard to do so, essentially reverse-engineering the document to get what 
you want out of it. Structured data like spreadsheets is better; 
non-proprietary structured data, like CSV, is better still (because 
you're not locked into a vendor's tools).

Publishing data in PDF is not best practice. It's lazy practice. It's "I 
really don't care about this but my boss says I have to do it" practice, 
or it's "I can't be bothered" practice, or it's "if I do this will they 
leave me alone?" practice, or it's "how can I present the story I want 
to tell" practice. It's better than not doing it at all, but any document 
that calls itself a Best Practice doc won't be taken seriously if it 
encourages data publication in PDF, or videos, or images of graphs. 
That's what you get *after* you've done something with the data so that 
humans can understand it.

As discussed in the 5 star thread, I don't think we should push everyone 
into publishing 5 star Linked Data, but I *do* think we should encourage 
people to publish data that can easily be transformed into it, or any 
other format. And, again, PDF fails that test. CSV+ (i.e. the output of 
the CSV on the Web WG) is an example that passes it.

Taking Makx's words:

"... if you want to do A, then if you publish data as X you will have 
the following advantages and disadvantages, and you should really 
consider format Y to increase usefulness of your data."

If you want to present a report, PDF is fine since you're publishing 
information for a human to read and understand. The disadvantage of PDF 
is that it is more difficult to extract data from it. HTML is better.

But, what you should consider doing is publishing your report in PDF, 
OK, but also publishing the underlying data in CSV (plus metadata) so 
other people can manipulate the data for themselves.
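
As a rough illustration of what "CSV (plus metadata)" could look like in 
practice, here is a minimal Python sketch. The column names, the sample 
rows and the file name property_tax.csv are all made up for the example 
(they are not real NYC figures), and the metadata is only loosely 
modelled on the vocabulary being worked on in the CSV on the Web WG, so 
treat the exact keys as an assumption rather than a finished spec.

```python
import csv
import io
import json

# Hypothetical sample rows for illustration only; not real NYC data.
ROWS = [
    {"parcel_id": "0001", "tax_year": 2015, "tax_due_usd": "1234.50"},
    {"parcel_id": "0002", "tax_year": 2015, "tax_due_usd": "987.00"},
]

# Assumed column names and datatypes, shared by the CSV and its metadata.
COLUMNS = [
    ("parcel_id", "string"),
    ("tax_year", "integer"),
    ("tax_due_usd", "decimal"),
]


def to_csv(rows):
    """Serialise the rows as plain CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=[name for name, _ in COLUMNS], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_metadata(csv_url):
    """Describe the CSV in a small JSON document published alongside it."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": name, "datatype": datatype}
                for name, datatype in COLUMNS
            ]
        },
    }


csv_text = to_csv(ROWS)
metadata_text = json.dumps(to_metadata("property_tax.csv"), indent=2)

print(csv_text)
print(metadata_text)
```

The PDF report can still be published next to these two files; the point 
is only that the data behind it is available without scraping.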

Phil.

>
>
> From: Bart van Leeuwen <bart_van_leeuwen@netage.nl>
> To: Steven Adler/Somers/IBM@IBMUS
> Cc: "DWBP WG" <public-dwbp-wg@w3.org>
> Date: 03/27/2015 10:35 AM
> Subject: Re: NY Property Tax Explorer
>
> I think we try to assemble a 'best practice' with this working group.
> I sincerely hope you don't consider data published in a PDF to conform to
> this best practice.
>
> I'm not disputing that it is possible to get usable data from these formats,
> but they were not intended to carry data in a machine readable way.
>
> Bart
>
> Steven Adler <adler1@us.ibm.com> wrote on 27-03-2015 15:09:32:
>
>> From: Steven Adler <adler1@us.ibm.com>
>> To: "DWBP WG" <public-dwbp-wg@w3.org>
>> Date: 27-03-2015 15:10
>> Subject: NY Property Tax Explorer
>>
>> You may recall I submitted a use case about this example from NYC
>> last year.  The developer, Chris Wong, who works for Socrata, wrote
>> a Ruby routine to scrape 1000 PDF files for property tax data to
>> fill out this map app:
>>
>> http://www.w3.org/2013/dwbp/track/issues/56
>>
>> Chris is a self-taught developer, by no means a pro.  I think this
>> story well demonstrates that Data on the Web today is quite
>> innovative and PDF, JPG, AVI, MP3, and MP4 are commonly machine readable.
>
>>
>> Restricting our recommendations to only those file formats covered
>> by W3C WGs (JSON, CSV, RDF, etc) ignores the reality
>> of how Open Data is published and used.
>>
>>
>> Best Regards,
>>
>> Steve
>>
>> Motto: "Do First, Think, Do it Again"
>

-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1
Received on Friday, 27 March 2015 15:46:55 UTC
