Re: Capturing the discussion (was Re: NY Property Tax Explorer)

+1 on this paragraph, which I believe to be a true innovation. This level
of file metadata extraction and publication is not done today, and it would
be a huge service to the industry to include this in our BP. It would
foster more consistent metadata communication across archival, legacy,
current, and future file types.

> So maybe we need to say explicitly that when publishing data, metadata
> should be available independently. Where the dataset contains embedded
> metadata, the publishing environment might extract it automatically but
> where this is not the case, the publisher should provide it separately.


Best Regards,

Steve

Motto: "Do First, Think, Do it Again"


From:    Phil Archer <phila@w3.org>
To:      "'DWBP WG'" <public-dwbp-wg@w3.org>
Date:    03/29/2015 04:09 AM
Subject: Capturing the discussion (was Re: NY Property Tax Explorer)

I thought I'd look at the doc and see how and where this discussion
might be reflected.

There are two relevant sections: metadata and data formats.

I believe we have consensus that when we talk about metadata we
shouldn't talk about the format of the data described.

BP1 and 2 say that metadata should be provided for humans, ideally as a
Web page, and (BP2) that the metadata should also be machine readable,
using either an alternative representation of that page or one of the
technologies for embedding it.

So here we're talking about providing something separate from the
dataset itself. That's because in our heads we're thinking data portals,
catalogues, landing pages etc. We are not thinking about documents,
images and videos that have their own embedded metadata, which is what
Steve has been championing.

Suppose I create a PDF and embed within that a bunch of metadata, have I
done the job?

Well, it depends on the context. As far as Google is concerned, yes. As
far as a less sophisticated portal or catalogue is concerned, usually
no. In other words, that's only enough *if* there is a machine to read
that embedded metadata. And I believe this is not (currently) true in
CKAN and CKAN-like portals for example (dunno about Socrata).

So maybe we need to say explicitly that when publishing data, metadata
should be available independently. Where the dataset contains embedded
metadata, the publishing environment might extract it automatically but
where this is not the case, the publisher should provide it separately.
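
To make that rule concrete, here is a minimal sketch in Python. The function
name resolve_metadata and both of its parameters are hypothetical, and the
precedence (publisher-supplied values fill gaps and override extracted ones)
is an assumption, not something the group has agreed. It assumes the
publishing environment has already extracted any embedded metadata into a
plain mapping.

```python
from typing import Mapping, Optional


def resolve_metadata(
    embedded: Optional[Mapping[str, str]],
    supplied: Optional[Mapping[str, str]],
) -> dict:
    """Build the metadata record published independently of the dataset.

    embedded: metadata the publishing environment extracted from the file,
              or None when nothing could be extracted (hypothetical input).
    supplied: metadata the publisher provided separately, or None.
    """
    # Start from whatever was extracted automatically, dropping empty values.
    record = {key: value for key, value in (embedded or {}).items() if value}

    # Assumption: publisher-supplied values fill gaps and take precedence.
    for key, value in (supplied or {}).items():
        if value:
            record[key] = value

    # Neither source produced anything: the publisher has to step in.
    if not record:
        raise ValueError("no metadata available; publisher must provide it separately")
    return record
```

A portal that cannot read embedded metadata would simply pass None as the
first argument, which is the case where the publisher has to supply
everything.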

As for the data formats section, I think it's pretty good as it is, but
suggest that we do encourage the publication of any tabular data in a
machine readable format separate from, but linked to, any document that
describes, presents or summarises it - which is a long-winded way of
saying that if you publish a PDF that includes tabular data, you are
encouraged to publish and refer to the original data as well (like a
researcher does/should).

HTH

Phil.



On 28/03/2015 21:58, Phil Archer wrote:
>
>
> On 28/03/2015 13:21, Makx Dekkers wrote:
>> Hi Laufer,
>>
>>
>>
>> Maybe a misunderstanding. I was not saying people should not publish
>> metadata. It is not clear to me how you could get that impression from
>> the message I sent.
>>
>>
>>
>> I was saying we should not make absolute and unqualified statements
>> that may be read as if we think that people who publish PDFs are stupid.
>
> I didn't mean to imply that and apologise if I did.
>
> I'm not calling anyone stupid and, as usual, agree with all that you say.
>
> As Laufer says, there's a lot of agreement in this thread.  Let's
> capture that in the doc.
>
> Phil.
>
>> We should on the contrary convey positive messages pointing out that
>> it is more useful to publish tabular data in formats that are better
>> suited for machine-processing.
>>
>>
>>
>> Makx.
>>
>>
>>
>>
>>
>> From: Laufer [mailto:laufer@globo.com]
>> Sent: 28 March 2015 13:08
>> To: Makx Dekkers
>> CC: DWBP WG
>> Subject: Re: NY Property Tax Explorer
>>
>>
>>
>>   Makx,
>>
>>
>>
>> I cannot see where our document tells people not to publish
>> metadata. On the contrary, the first BP in the document is to publish
>> metadata.
>>
>>
>>
>> I think we all agree (in general) about this matter. What we need is a
>> very good text in the introduction of our document summarizing the
>> text of this thread.
>>
>>
>>
>> Best Regards,
>>
>> Laufer
>>
>>
>>
>> On Saturday, 28 March 2015, Makx Dekkers <mail@makxdekkers.com
>> <mailto:mail@makxdekkers.com> > wrote:
>>
>>>
>>> Anyone publishing tabular data in a PDF really needs to have a word
>>> with themselves.
>>>
>>
>> Can we maybe try not to get into these kinds of absolute, unqualified
>> statements?
>>
>> I agree that someone who has tabular data and creates a PDF that contains
>> only that table is not doing anyone a service. However, if such a table
>> is included in a document that contains explanations and analysis of the
>> data, aimed at a human readership, I don't think PDF is a bad choice. Of
>> course, the data in the table should be published in a better
>> machine-readable format alongside the PDF. What I would not want to see
>> is that we encourage service providers to publish data only as CSV and
>> discontinue publication of any human-readable information.
>>
>> As Annette says, it depends on the intention.
>>
>> Makx.
>>
>>
>>
>>
>>
>>
>

--


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1

Received on Monday, 30 March 2015 14:02:17 UTC