RE: CSVW and fixed record length, multiple record length from Wackerow, Joachim on 2016-06-02 (public-csvw@w3.org from June 2016)

From: Wackerow, Joachim <Joachim.Wackerow@gesis.org>
Date: Thu, 2 Jun 2016 06:27:26 +0000
To: "public-csvw@w3.org" <public-csvw@w3.org>, "ddi-rdf-vocabulary@googlegroups.com" <ddi-rdf-vocabulary@googlegroups.com>
Message-ID: <4AC500D943687843823EC9042103F9F0024B9EBCD5@SVMAEXC01.gesis.intra>
Hi Gregg,

Sorry for the late response. I’m currently travelling.

As I understand it now, one approach could be a separate specification that describes data with fixed record length and data with multiple records per unit, both of which can be mapped to the tabular data model. This would also support a related data transformation if desired.
Is this correct?

Regarding data with multiple records per unit:
The data usually has a fixed record length. Two variants seem to be common.

1. Fixed number of records per unit and fixed logical record length. An identifier per unit in each record is not required.

2. An identifier per unit in each record and possibly a record sequence number per unit. The number of records per unit may vary.
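
As a rough sketch of the second variant, the following Python groups physical records into one logical row per unit. The column positions (identifier in columns 0-3, sequence number in column 4) and the sample lines are purely hypothetical and are not taken from any of the example files; the sketch also assumes that records of the same unit are adjacent in the file.

```python
from itertools import groupby

# Hypothetical layout (an assumption for illustration only):
# columns 0-3 hold the unit identifier, column 4 the record
# sequence number, and the remainder is the payload.
UNIT_ID = slice(0, 4)
SEQ_NO = slice(4, 5)
PAYLOAD = slice(5, None)


def group_records_by_unit(lines):
    """Group physical records into one logical row per unit.

    Assumes records belonging to the same unit are adjacent.
    """
    rows = []
    for unit_id, records in groupby(lines, key=lambda line: line[UNIT_ID]):
        # Order by sequence number so payloads appear in a stable order.
        ordered = sorted(records, key=lambda line: int(line[SEQ_NO]))
        rows.append({
            "unit_id": unit_id,
            "payloads": [line[PAYLOAD] for line in ordered],
        })
    return rows


# Synthetic sample: unit 0001 has two records, unit 0002 has three.
sample = [
    "00011AAAA",
    "00012BBBB",
    "00021CCCC",
    "00022DDDD",
    "00023EEEE",
]
rows = group_records_by_unit(sample)
```

Because the number of records per unit may vary, the payloads are kept as a list here; mapping them onto fixed columns of a table would need an additional rule that this sketch does not attempt to define.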

Wendy Thomas from the Minnesota Population Center provided some examples which are publicly available. Wendy is happy to answer any questions regarding these examples. She is subscribed to the list ddi-rdf-vocabulary@googlegroups.com (in CC).
http://users.pop.umn.edu/~wlt/MultiRecordCases/


The first example has two physical records per unit. The physical record lengths are 1800 and 1696 characters.
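
A minimal sketch of how such a file could be folded into one logical record per unit, assuming the physical records simply alternate in file order and carry no unit identifier. Only the two record lengths come from the example; the function name and the synthetic data are illustrative assumptions.

```python
# Record lengths as stated for the first example; everything else is illustrative.
RECORD_LENGTHS = (1800, 1696)


def merge_physical_records(lines, lengths=RECORD_LENGTHS):
    """Concatenate each run of len(lengths) physical records into one logical record."""
    per_unit = len(lengths)
    if len(lines) % per_unit != 0:
        raise ValueError("incomplete unit at end of file")
    logical = []
    for start in range(0, len(lines), per_unit):
        chunk = lines[start:start + per_unit]
        # Check each physical record against its expected length.
        for record, expected in zip(chunk, lengths):
            if len(record) != expected:
                raise ValueError(
                    f"expected {expected} characters, got {len(record)}"
                )
        logical.append("".join(chunk))
    return logical


# Synthetic data: two units, each made of an 1800- and a 1696-character record.
sample = ["a" * 1800, "b" * 1696, "c" * 1800, "d" * 1696]
logical_records = merge_physical_records(sample)
```

Each resulting logical record is 3496 characters long and could then be sliced into columns like any other fixed-width record.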

The second example has a compound identifier per record.
Wendy: What is the unit identifier here?

Achim


From: Gregg Kellogg [mailto:gregg@greggkellogg.net]
Sent: Thursday, 26 May 2016 19:11
To: Wackerow, Joachim
Cc: public-csvw@w3.org; ddi-rdf-vocabulary@googlegroups.com
Subject: Re: CSVW and fixed record length, multiple record length

On May 26, 2016, at 1:25 AM, Wackerow, Joachim <Joachim.Wackerow@gesis.org> wrote:

Hello,

I’m wondering whether possibilities were discussed (during the development of CSVW) for describing data with fixed record length and data with multiple records per case/unit.

The use cases [1] have examples of fixed-length records, but I wasn’t personally involved in discussions about incorporating this; others in the group were likely involved in these discussions.

However, note that the Tabular Data Model [2] allows for other formats, and only non-normatively describes parsing CSV to create an Annotated Data Model. See Embedding Tabular Metadata in HTML [3] which describes extracting tabular data from HTML tables, for example. Ultimately, it’s up to other standards to describe specific media types, which can be mapped to the tabular data model using a separate document, such as [3].


The DDI Alliance developed a draft vocabulary (PHDD) for the physical data description of tabular data. We have now compared CSVW and PHDD. Our understanding is that CSVW is very powerful for all things described in the original scope of CSVW. It looks like CSVW could be interesting for users of the main DDI specifications. We are now hesitant to work further on the development of PHDD and to publish a final version.

The only area where PHDD has additional features is the description of data with fixed record length and data with multiple records per case/unit. I understand that this is beyond the original scope of CSVW. Nevertheless I’m wondering if it would make sense to add these features to CSVW.

Describing a process for converting fixed-length record files into tabular data would allow you to minimally describe how to work with the Tabular Data Model.
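
To make the idea concrete, here is a minimal sketch of such a conversion: each fixed-length record is sliced into columns and emitted as CSV, which the Tabular Data Model parsing rules can then handle. The column layout, names, and sample records are hypothetical; a real specification would need to define how such a layout is declared.

```python
import csv
import io

# Hypothetical column layout: (name, start, end) as 0-based,
# end-exclusive character offsets within each record.
LAYOUT = [("id", 0, 4), ("age", 4, 7), ("region", 7, 9)]


def fixed_width_to_csv(lines, layout=LAYOUT):
    """Slice fixed-length records into columns and return CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Header row built from the column names in the layout.
    writer.writerow([name for name, _, _ in layout])
    for line in lines:
        # Strip the padding that fixed-width fields typically carry.
        writer.writerow([line[start:end].strip() for _, start, end in layout])
    return out.getvalue()


csv_text = fixed_width_to_csv(["0001 34DE", "0002 51FR"])
```

The resulting text has a header line followed by one row per record, so existing CSVW metadata could describe the columns from that point on.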

I’m unclear on the use of multiple records per case/unit, and what the implications for mapping that over might be. Some examples for discussion might be useful.


Both features, data with fixed record length and data with multiple records per case/unit, are used heavily in legacy data from earlier days, when storage space limitations played a major role. The DDI Alliance published a couple of specifications for data that result from observational methods in the social, behavioral, economic, and health sciences. DDI is used by social science data archives, research data producers in the social sciences, and national statistical institutes (NSIs).
Archives and NSIs still have a large amount of data with fixed record length and data with multiple records per case/unit.

I’m hoping this is the right forum to raise these questions. I copied the message to the discussion forum on DDI RDF vocabularies.

Certainly, that’s the purpose of this forum and the Community Group.

Gregg Kellogg

[1] http://www.w3.org/TR/csvw-ucr/

[2] http://www.w3.org/TR/tabular-data-model/



Cheers,
Achim


References

PHDD
http://rdf-vocabulary.ddialliance.org/phdd.html

http://ddi-alliance.org/Specification/RDF/PHDD


DDI main specifications
http://ddi-alliance.org/Specification/


DDI Alliance
http://ddi-alliance.org/

List of main DDI Adopters
http://ddi-alliance.org/ddi-adopters



--
GESIS - Leibniz Institute for the Social Sciences
Department: Monitoring Society and Social Change
Team: Social Science Metadata Standards
Visiting address: B2 1, 68159 Mannheim, Germany
Postal address: P.O. Box 122155, 68072 Mannheim, Germany
Phone: +49 (0)621 1246 262
Fax: +49 (0)621 1246 100
E-mail: joachim.wackerow@gesis.org
www.gesis.org
Received on Thursday, 2 June 2016 06:28:09 UTC