Re: Capturing the discussion (was Re: NY Property Tax Explorer)

Hi all,

In my opinion, if we are going to adopt the DCAT definition of dataset,
then a dataset can be seen as a package, and the metadata that will be
provided (Data Discovery, Locale Parameters, Data Licenses, Data
Provenance, Data Quality, Data Versioning) concerns the dataset itself,
i.e., there won't be specific metadata for each of the different files
available in a given dataset. If a file needs metadata of its own, for
example a CSV file that is part of the dataset, then that file-specific
metadata should be provided according to the work done by the CSV group.
The same can apply to other file types (pdf, odt, mp3...). Does that make
sense to you?
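
To make the dataset-as-package idea concrete, here is a minimal sketch in
Python. It is only an illustration under my own assumptions: the class and
field names (Dataset, Distribution, file_metadata, describe) and the
example URLs are hypothetical and are not taken from DCAT or from the CSV
group's work.

```python
from dataclasses import dataclass, field


@dataclass
class Distribution:
    """One file in the dataset (roughly what DCAT calls a distribution)."""
    download_url: str
    media_type: str
    # File-specific metadata, e.g. a CSV column description. The shape of
    # this dict is an assumption made for illustration only.
    file_metadata: dict = field(default_factory=dict)


@dataclass
class Dataset:
    """The package: descriptive metadata lives here, once, for all files."""
    title: str
    license: str
    version: str
    distributions: list = field(default_factory=list)

    def describe(self, dist: Distribution) -> dict:
        # Dataset-level metadata applies to every file; file-specific
        # metadata is layered on top for that one file only.
        merged = {"title": self.title, "license": self.license,
                  "version": self.version, "media_type": dist.media_type}
        merged.update(dist.file_metadata)
        return merged


csv_file = Distribution(
    download_url="http://example.org/budget.csv",  # hypothetical URL
    media_type="text/csv",
    file_metadata={"delimiter": ",", "header_rows": 1},
)
pdf_file = Distribution(
    download_url="http://example.org/budget.pdf",  # hypothetical URL
    media_type="application/pdf",
)
dataset = Dataset(
    title="City budget 2015",
    license="CC-BY-4.0",
    version="1.0",
    distributions=[csv_file, pdf_file],
)

for dist in dataset.distributions:
    print(dataset.describe(dist))
```

The license and version are stated once on the dataset and are inherited by
both files, while only the CSV file carries the extra CSV-specific fields.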

The BPs for data formats refer to data, not to datasets. In this case, we
are talking about the contents of the dataset. In general, the data format
BPs say that publishers should use machine-readable and open data formats,
but they don't prescribe a specific data format.

We may give a list of data formats with examples in order to illustrate the
usage of specific data formats, like XML, CSV and RDF. However, it will
be difficult to do this for the different audio, video and image file formats.

Kind regards,
Bernadette

2015-03-30 16:25 GMT-03:00 Laufer <laufer@globo.com>:

> Hello, All,
>
> I am afraid we are coming back to the scope discussion again. In my
> opinion, we cannot cover every form of publishing data and every way
> in which consumers will manipulate it.
>
> I think we have two main types of actors consuming data: humans and
> machines (and, of course, there are humans programming the machines).
>
> For me, the central thing that we can address is metadata. We can talk
> about data too, but I think that each case will be very particular and
> will probably be handled by a dedicated WG.
>
> So I think we need Best Practices about publishing metadata in these
> two forms, for humans and for machines. We can suggest some types of
> metadata, such as structure, license, quality, version, etc. We can note
> that this information could be embedded in documents, could be separate, etc.
> Maybe we can define a BP that metadata should be published in a separate
> document (as Phil and Steve suggested).
>
> For metadata for machines, we can suggest some vocabularies.
>
> But I think that if we get into all the techniques for consuming/harvesting
> data and the ways people should or must publish to facilitate them,
> we will have to take the whole world in our hands.
>
> Best Regards,
> Laufer
>
>
>
> 2015-03-30 15:57 GMT-03:00 yaso@nic.br <yaso@nic.br>:
>
>> Hi all
>>
>> I would like to point out that there are other kinds of data that can be
>> extracted from files that are available on the Web, and other methods that
>> are not connected with linked data or the use of tabular data out there :-)
>>
>> In the call Steve mentioned facial recognition on videos, for example,
>> but there are a lot of examples that I could give.
>>
>> On 03/29/2015 05:08 AM, Phil Archer wrote:
>>
>>> I thought I'd look at the doc and see how and where this discussion
>>> might be reflected.
>>>
>>> There are two relevant sections: metadata and data formats.
>>>
>>> I believe we have consensus that when we talk about metadata we
>>> shouldn't talk about the format of the data described.
>>>
>>> BP1 and 2 say that metadata should be provided for humans, ideally as a
>>> Web page, and (BP2) that the metadata should also be machine readable using
>>> either an alternative representation of that page or one of the
>>> technologies for embedding it.
>>>
>>> So here we're talking about providing something separate from the
>>> dataset itself. That's because in our heads we're thinking data portals,
>>> catalogues, landing pages etc. We are not thinking about documents, images
>>> and videos that have their own embedded metadata, which is what Steve has
>>> been championing.
>>>
>>
>> For most publishers of open data that is true, but there is also the point
>> of view of the data consumer who harvests data on the Web from these
>> files.
>> A lot of data is extracted from files on the Web, files in different
>> formats representing documents (pdf included!), videos, images, etc. It
>> seems to me that the collecting, processing and delivery for reuse of
>> this kind of data is something that the working group could address.
>>
>> Although we should not mention file formats, which I agree is not in
>> our scope, maybe we can work on best practices for using and reusing data
>> extracted from documents that are on the Web (we might call it public
>> data). I see that there is room for this in some best practices: the
>> Privacy Best Practices, for example.
>>
>> It is interesting to think about best practices that can help people who
>> work with data harvested from files to make this data reusable and readable
>> by machines and humans. Software like FRED [1] (it produces RDF/OWL
>> ontologies and linked data from natural language sentences) needs to be on
>> our horizon when we think about data on the Web. I'm sure that we
>> already have Best Practices for these cases, but we can work on the
>> approaches and specify some use cases.
>>
>> For me, metadata is not only the data that users provide on purpose
>> (when people upload pictures to Flickr, for example) but also data that
>> can be extracted from files using certain techniques. For instance,
>> considering Yahoo (Flickr) as a publisher and a developer analyzing a set
>> of pictures to make sense of, say, colors, my question is: do our BPs
>> serve as guidance for both of them?
>>
>>
>> cheers
>> yaso
>>
>> [1] http://wit.istc.cnr.it/stlab-tools/fred
>>
>>> Suppose I create a PDF and embed within that a bunch of metadata, have I
>>> done the job?
>>>
>>> Well, it depends on the context. As far as Google is concerned, yes. As
>>> far as a less sophisticated portal or catalogue is concerned, usually no.
>>> In other words, that's only enough *if* there is a machine to read that
>>> embedded metadata. And I believe this is not (currently) true in CKAN and
>>> CKAN-like portals for example (dunno about Socrata).
>>>
>>> So maybe we need to say explicitly that when publishing data, metadata
>>> should be available independently. Where the dataset contains embedded
>>> metadata, the publishing environment might extract it automatically but
>>> where this is not the case, the publisher should provide it separately.
>>>
>>> As for the data formats section, I think it's pretty good as it is, but
>>> suggest that we do encourage the publication of any tabular data in a
>>> machine readable format separate from, but linked to, any document that
>>> describes, presents or summarises it - which is a long-winded way of saying
>>> that if you publish a PDF that includes tabular data, you are encouraged to
>>> publish and refer to the original data as well (like a researcher
>>> does/should).
>>>
>>> HTH
>>>
>>> Phil.
>>>
>>>
>>>
>>> On 28/03/2015 21:58, Phil Archer wrote:
>>>
>>>>
>>>>
>>>> On 28/03/2015 13:21, Makx Dekkers wrote:
>>>>
>>>>> Hi Laufer,
>>>>>
>>>>>
>>>>>
>>>>> Maybe a misunderstanding. I was not saying people should not publish
>>>>> metadata. It is not clear to me how you could get that impression from
>>>>> the message I sent.
>>>>>
>>>>>
>>>>>
>>>>> I was saying we should not make absolute and unqualified statements
>>>>> that may be read as if we think that people who publish PDFs are
>>>>> stupid.
>>>>>
>>>>
>>>> I didn't mean to imply that and apologise if I did.
>>>>
>>>> I'm not calling anyone stupid and, as usual, agree with all that you
>>>> say.
>>>>
>>>> As Laufer says, there's a lot of agreement in this thread. Let's
>>>> capture that in the doc.
>>>>
>>>> Phil.
>>>>
>>>>> We should on the contrary convey positive messages pointing out that
>>>>> it is more useful to publish tabular data in formats that are better
>>>>> suited for machine-processing.
>>>>
>>>>>
>>>>>
>>>>>
>>>>> Makx.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> From: Laufer [mailto:laufer@globo.com]
>>>>> Sent: 28 March 2015 13:08
>>>>> To: Makx Dekkers
>>>>> CC: DWBP WG
>>>>> Subject: Re: NY Property Tax Explorer
>>>>>
>>>>>
>>>>>
>>>>>   Makx,
>>>>>
>>>>>
>>>>>
>>>>> I cannot see where our document tells people not to publish
>>>>> metadata. On the contrary, the first BP in the document is to publish
>>>>> metadata.
>>>>>
>>>>>
>>>>>
>>>>> I think we all agree (in general) about this matter. What we need is a
>>>>> very good text in the introduction of our document summarizing the
>>>>> text of this thread.
>>>>>
>>>>>
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Laufer
>>>>>
>>>>>
>>>>>
>>>>> On Saturday, 28 March 2015, Makx Dekkers <mail@makxdekkers.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>> Anyone publishing tabular data in a PDF really needs to have a word
>>>>>> with themselves.
>>>>>>
>>>>>>
>>>>> Can we maybe try not to get into these kinds of absolute, unqualified
>>>>> statements?
>>>>>
>>>>> I agree that someone who has tabular data and creates a PDF that
>>>>> contains nothing but that table is not doing anyone a service.
>>>>> However, if such a table is included in a document that contains
>>>>> explanations and analysis of the data, aimed at a human readership, I
>>>>> don't think PDF is a bad choice. Of course, the data in the table
>>>>> should also be published in a better machine-readable format
>>>>> alongside the PDF. What I would not want to see is that we encourage
>>>>> service providers to publish data only as CSV and discontinue
>>>>> publication of any human-readable information.
>>>>>
>>>>> As Annette says, it depends on the intention.
>>>>>
>>>>> Makx.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>
> --
> .  .  .  .. .  .
> .        .   . ..
> .     ..       .
>



-- 
Bernadette Farias Lóscio
Centro de Informática
Universidade Federal de Pernambuco - UFPE, Brazil
----------------------------------------------------------------------------

Received on Tuesday, 31 March 2015 19:06:05 UTC