
Fwd: Cleaning up the requirements in the UCR doc

From: Jeremy Tandy <jeremy.tandy@gmail.com>
Date: Tue, 09 Feb 2016 16:49:45 +0000
Message-ID: <CADtUq_3+1QrJe5qoV+Q7UFMAdcT2O=Xu6WqQsMkfxrkjUd8Nzg@mail.gmail.com>
To: Jeni Tennison <jeni@theodi.org>, Dan Brickley <danbri@google.com>, Gregg Kellogg <gregg@greggkellogg.net>, Ivan Herman <ivan@w3.org>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Hi- I've not been able to find 'spare' time during my current workshop to
update the UCR document ... and am well past the limit of useful
productivity today.

I think there's a WG call planned for tomorrow. Again, I must miss it - but
... a special request.

Please can you have a quick skim through the classification of the
requirements I provided (see attached email) to see if you agree. Also
interested to know if there are particular aspects of the REC / NOTE
deliverables that you think that I must reference in the requirement.

My plan is to update the UCR doc on Thursday when I am working from home.

I will update the "changes since last release" section too.

Hopefully, given that the changes are small (and the rough outline of what
I intend to say is provided in the attached email) we can still vote to
release the UCR doc a week tomorrow.

BR, Jeremy

---------- Forwarded message ---------
From: Jeremy Tandy <jeremy.tandy@gmail.com>
Date: Sat, 6 Feb 2016 at 16:32
Subject: Cleaning up the requirements in the UCR doc
To: public-csv-wg@w3.org <public-csv-wg@w3.org>

Hi. You'll see that I've updated ISSUE #539
<https://github.com/w3c/csvw/issues/539>, identifying which candidate
requirements should be accepted, and those which should be marked as

Please shout if you think my classification should be changed :-)

I intend to update the UCR document over the next few days in time for the
call on Wednesday (hoping I can find time around the edges of another WG
meeting I'm attending next week!)

I will not be able to make the call myself ...

FWIW, I've included my notes relating to each requirement so that you will
get an idea of the amendments I'm planning to make.



*CSVW Requirements cross reference*

3. Requirements

   - 3.1 Accepted Requirements
      - 3.1.1 Requirements relating to parsing of CSV


*Ability to determine that a CSV should be rendered using RTL column
ordering and RTL text direction in cells.*


[It is possible to set the column direction using the tableDirection
<http://w3c.github.io/csvw/metadata/#tableDirection> property and the text
direction on columns using the textDirection
<http://w3c.github.io/csvw/metadata/#cell-textDirection> property, as
defined in [tabular-metadata

   - 3.2 Candidate Requirements
      - 3.2.1 Requirements relating to applications


*Ability to validate a CSV for conformance with a specified metadata


[comment on validating tables; table compatibility (correct number of
non-virtual columns, matching names/titles for those columns where
specified in header row), primary key uniqueness, missing foreign key
references, cell validation]

[comment on validating cells; parsing cells (_datatype_ parsing), length
constraints and value constraints]


*Ability to transform a CSV into RDF*


[comment that [CSV2RDF] specifies the transformation of an annotated table
to RDF; providing both _minimal mode_, where RDF output includes triples
derived from the data within the annotated table, and _standard mode_,
where RDF output additionally includes triples describing the structure of
the annotated table.]

[comment that built-in types are limited to those defined in
[tabular-data-model] 4.6 Datatypes; geo:wktLiteral and other types from
[geosparql] are not supported natively.]


*Ability to transform a CSV into JSON*


[comment that [CSV2JSON] specifies the transformation of an annotated table
to JSON; providing both _minimal mode_, where JSON output includes objects
derived from the data within the annotated table, and _standard mode_,
where JSON output additionally includes objects describing the structure of
the annotated table. Built-in datatypes from the annotated table, as
specified in [tabular-data-model] 4.6 Datatypes, are converted to JSON
primitive types.]


*Ability to transform a CSV into XML*


[The charter of the Working Group (
http://www.w3.org/2013/05/lcsv-charter.html) includes a work item for CSV
to XML conversion. Given that there is only a single use case providing
motivation for this requirement, and that the Working Group was unable to
find XML experts to assist in delivery of this work item, the Working Group
was forced to abandon this deliverable.]


*Ability to transform CSV conforming to the core tabular data model yet
lacking further annotation into an object / object graph serialisation*


[comment that an annotated table is always generated by applications
implementing this specification when processing tabular data; albeit that
those annotations are limited. The _titles_ annotation may be populated
from column headings provided within the tabular data file. Transformations
to both RDF and JSON operate on the annotated table, and are, therefore,
unaffected by the use of a tabular metadata file to provide additional


*Ability to publish metadata independently from the tabular data resource
it describes*


[comment that [tabular-metadata] specifies the format and structure of a
metadata file that may be used to provide annotations on an annotated table
or group of tables.]


*Ability to define a property-value pair for inclusion in each row.*


[comment that to meet this requirement a _virtual column_ must be specified
for the additional property-value pair that is to be included in each row.
The _default_ annotation may be used to provide the value for every row, or
the _value URL_ annotation may be used to specify a URI Template, as
defined in [RFC6570], that is evaluated for each row]
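
A minimal sketch of the _default_ case described above, assuming column descriptions are held as plain Python dictionaries. The property names `virtual`, `default` and `name` come from [tabular-metadata]; the column names, the values and the helper `add_virtual_values` are made up for illustration.

```python
# Hypothetical column descriptions; only the last one is a virtual column.
columns = [
    {"name": "country"},
    {"name": "population"},
    {"name": "source", "virtual": True, "default": "census-2011"},
]

def add_virtual_values(row, columns):
    """Return a copy of row with the default of each virtual column added."""
    out = dict(row)
    for col in columns:
        if col.get("virtual") and "default" in col:
            out[col["name"]] = col["default"]
    return out

rows = [
    {"country": "AD", "population": "84000"},
    {"country": "AE", "population": "4975593"},
]
enriched = [add_virtual_values(r, columns) for r in rows]
```

Every entry in `enriched` carries the same `source` value, while the original `rows` are left unchanged.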


*Ability to apply conditional processing based on the value of a specific


[comment on use of _transformation definitions_ that define how a script or
template may be used to provide such conditional processing; also that the
output from JSON or RDF transformation may be subjected to post-processing
to achieve the desired outcome. Details of these transformation scripts /
templates and post processing are outside the scope of this specification]


*Ability to identify comment lines within a CSV file and skip over them
during parsing, format conversion or other processing*

[DEFERRED … non-normative]

[use of _comment prefix_ as specified within a _dialect description_;
default is “#” … a _dialect description_ provides ‘hints’ to parsers about
how to process the tabular data file]
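
As a rough illustration of how a parser might act on the _comment prefix_ hint, here is a sketch using Python's standard `csv` module. The function name `read_rows` and the sample data are hypothetical. The sketch filters physical lines before parsing, so a quoted cell that spans several lines and contains a line starting with the prefix would be handled incorrectly.

```python
import csv
import io

def read_rows(text, comment_prefix="#"):
    """Yield parsed rows, skipping lines that start with comment_prefix."""
    lines = (
        line for line in io.StringIO(text)
        if not line.startswith(comment_prefix)
    )
    yield from csv.reader(lines)

sample = "# generated 2016-02-06\ncode,name\n# interim figures\nAD,Andorra\n"
rows = list(read_rows(sample))
```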

   - 3.2.2 Non-functional requirements


*Ability to add supplementary metadata to an existing CSV file without
requiring modification of that file*


[comment on use of complementary metadata document containing annotations
for tabular data; as specified in REC-metadata … Applications MAY provide
alternative mechanisms to gather the annotations on an _annotated table_ or
_group of tables_]

   - 3.2.3 Data Model Requirements


*Ability to parse internal data structure within a cell value*

[ACCEPTED? … only lists]

[comment that support is provided for validating the format of cell values
… R-SyntacticTypeDefinition:

   - _Parsing Cells_: formats for numeric types (decimalChar, groupChar,
   pattern), formats for booleans, formats for dates and times, formats for
   - formats for other types (e.g. html, json, xml and well known text
   literals ‘WKT’) can be validated using a regular expression for the string
   values, with syntax and processing defined by [ECMASCRIPT

[comment that only limited support is provided for extracting values from
structured data within cells; the parsing of html, json and xml etc. to
extract structured data is not supported; lists of values provided in a
single cell are processed into arrays wherein each array item is considered
to be of consistent type]

[comment that list items in a given cell value are separated by the
_separator_ character specified in the _dialect description_]
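
The list handling described above could be sketched as follows. The helper `parse_list_cell` is hypothetical, the separator is simply passed in as an argument, and treating an empty cell as an empty list is an assumption of this sketch.

```python
def parse_list_cell(value, separator, convert=str):
    """Split a cell's string value into a list and convert every item
    with the same function, so all items share one type."""
    if value == "":
        return []
    return [convert(item) for item in value.split(separator)]

numbers = parse_list_cell("1;2;3", ";", int)
words = parse_list_cell("red green blue", " ")
```

A conversion failure on any item raises `ValueError` here; a real validator would more likely record an error against the cell and carry on.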


*Ability to parse tabular data with cell delimiters other than comma (,)*

[DEFERRED … non-normative]

[use of _delimiter_ as specified within a _dialect description_; default is
“,” … a _dialect description_ provides ‘hints’ to parsers about how to
process the tabular data file]
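
A small sketch of a parser honouring the _delimiter_ hint, assuming the dialect description is available as a Python dictionary. The function `read_table` and the sample data are hypothetical; the fallback to "," mirrors the default noted above.

```python
import csv
import io

def read_table(text, dialect=None):
    """Parse text into rows using the delimiter from a dialect description."""
    delimiter = (dialect or {}).get("delimiter", ",")
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

semicolon_rows = read_table("code;name\nAD;Andorra\n", {"delimiter": ";"})
default_rows = read_table("code,name\nAD,Andorra\n")
```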


*Ability to determine the primary key for rows within a tabular data file*


[comment on use of the _primaryKey_ annotation; a primary key may be
compiled from multiple cell values in a given row]
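
To illustrate a primary key compiled from several cells, here is a sketch of a uniqueness check. It assumes rows are dictionaries keyed by column name; `duplicate_keys` and the sample data are hypothetical.

```python
from collections import Counter

def duplicate_keys(rows, primary_key):
    """Return the key tuples that occur in more than one row.

    primary_key is a list of column names; the key of a row is the
    tuple of its values in those columns.
    """
    counts = Counter(tuple(row[col] for col in primary_key) for row in rows)
    return [key for key, n in counts.items() if n > 1]

rows = [
    {"country": "AD", "year": "2014", "population": "84000"},
    {"country": "AD", "year": "2015", "population": "85000"},
    {"country": "AD", "year": "2014", "population": "83999"},
]
clashes = duplicate_keys(rows, ["country", "year"])
```

Here `country` alone would not be unique, but the check only reports the pair that is repeated.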


*Ability to cross reference between CSV files*


[comment on use of the _foreign key_ annotation on an annotated table for
validation purposes; any cell value in a column referenced by the foreign
key statement must have a unique value in the column of the referenced

[comment that references between resources, irrespective of whether the
resource is listed elsewhere in another table, may be created by
converting local identifiers into URIs using URI templates; the _value
URL_ annotation can be used to refer to a resource and the _about URL_
annotation used to identify a resource. Referenced resources do not need to
be specified in an annotated table at all]


*Ability to add annotation and supplementary information to CSV file*


[comment that any annotation may be used in addition to _core annotations_
specified in this specification, such as title, author, license etc.; these
are referred to as _common properties_; see 5.8 Common Properties for more

[comment on use of _notes_ annotation for tables and groups of tables;
these may be used to provide any number of additional annotations for a
table or group of tables; such annotations are interpreted in the same way
as _common properties_]

[comment that the Web Annotation Working Group
<http://www.w3.org/annotation/> is developing a vocabulary for expressing
annotations; for example, see CSV2RDF 7.2 Example with single table and
rich annotations]


*Ability to associate a code value with externally managed definition*


[comment that an identifier referenced in a cell value may either be mapped to
a URL that can be resolved to provide a definition for the identified
resource, or a foreign key reference can be asserted to another table
published in the same group of tables where the definition associated with
the identifier could be provided]


*Ability to assert how a single CSV file is a facet or subset of a larger


[comment that this specification does not provide any description of the
relationship between tables beyond their membership in a given _group of
tables_; other specifications such as [RDF Data Cube] and [VoID] provide
mechanisms to describe subsets of data that may be of use in meeting this
requirement. Such descriptions can be included as metadata annotations in
the form of _notes_ or _common properties_]


*Ability to declare syntactic type for cells within a specified column.*


[comment that syntactic type for a cell value is defined using the
_datatype_ annotation; built-in datatypes include those defined in [
plus number, binary, datetime, any, xml, html and json. Datatypes can be
derived from the built-in datatypes using further annotations; refer to
5.11.2 Derived datatypes for further details]


*Ability to declare semantic type for cells within a specified column.*


[comment that the identifier for the semantic type associated with a given
cell value can be specified using the _property URL_ annotation (a URI
template property); this is normally specified for the column and inherited
by all the cells within that column]


*Ability to declare a "missing value" token and, optionally, a reason for
the value to be missing*


[comment that the string (or strings) representing missing values in an
annotated table is defined using the _null_ annotation]
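
A sketch of how the _null_ annotation might be applied when reading cells. The helper `apply_null` and the tokens "NA" and "-" are hypothetical; mapping a missing value to Python's `None` is a choice made for this example.

```python
def apply_null(value, null_tokens=("",)):
    """Return None when the cell's string value is one of the null tokens."""
    return None if value in null_tokens else value

raw = ["12.5", "NA", "-", "7"]
cleaned = [apply_null(v, null_tokens=("NA", "-")) for v in raw]
```

Note that this covers only the token itself; it says nothing about the reason a value is missing.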


*Ability to map cell values within a given column into corresponding URI*


[comment that a URI Template, as defined in [RFC6570], can be specified to
map the value of a cell to a URI using the _value URL_ annotation]
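
For illustration, a sketch of the simplest form of URI Template expansion, in which `{name}` is replaced by a percent-encoded value. The function `expand_simple`, the template and the variable names are hypothetical, and the operators defined in [RFC6570] (such as `{#var}` or `{/var}`) are deliberately not handled.

```python
import re
from urllib.parse import quote

def expand_simple(template, variables):
    """Replace each {name} with its percent-encoded value.

    Names without a value expand to the empty string.
    """
    def replace(match):
        return quote(str(variables.get(match.group(1), "")), safe="")
    return re.sub(r"\{(\w+)\}", replace, template)

template = "http://example.org/country/{code}"
uris = [expand_simple(template, {"code": c}) for c in ["AD", "New York"]]
```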


*Ability to identify/express the unit of measure for the values reported in a
given column.*

[<< requirement needs additional description >>]


[comment that this specification provides no native mechanism for
expressing the unit of measurement associated with values of cells in a
column; for example, stating that the floating-point numbers in a column
with name “distance” are provided in kilometers. However, annotations may
be used to provide this additional information. The [CSVW Primer] provides
examples of how this may be achieved (
from providing descriptive metadata to enabling transformation of cell
values to structured data with unit of measurement statements. The [RDF
Data Cube vocabulary] provides another alternative for annotations;
structural metadata is used to provide metadata to interpret data values -
such as the unit of measurement.]


*Ability to group multiple data tables into a single package for


[comment that _group of tables_ (
is a first class entity within the tabular data model; comprising a set of
annotated tables and a set of annotations that relate to that group of


*Ability for a metadata description to explicitly cite the tabular dataset
it describes*


[comment that in addition to providing mechanisms to locate metadata
relating to a tabular data file, see [tabular-data-model]
(#locating-metadata), the table annotation _url_ allows the URL of the
source of the data in the annotated table to be defined; for example,
referring to a specific CSV file]


*Ability to declare a locale / language for content in a specified column*


[comment that the annotation _lang_ may be used to express the code for the
expected language for values of cells in a particular column, expressed in
a format defined by [BCP47
Furthermore, the annotation _titles_ allows for any number of
human-readable titles to be given for a column, each of which may have an
associated language code as defined by [BCP47


*Ability to provide multiple values of a given property for a single entity
described within a tabular data file*


[comment that within an annotated table, the values of cells can be
considered as RDF subject-predicate-object triples [rdf11-concepts
The annotation _about URL_ may be used to define the subject of the triple
derived from a cell, and, where the same _about URL_ annotation is used for
every cell within a row, the resource identified by the _about URL_
annotation can be considered to be the subject of the row. The same _about
URL_ annotation may be used to describe cells in more than one row.
Similarly, the _property URL_ annotation may be used to define the
predicate of the triple. The same _property URL_ annotation may be used to
describe multiple columns, meaning that multiple values of a property for
a single entity may be provided from a series of columns.]

[comment that arrays of values may be supplied within a cell value; values
in the array are delimited using the character specified using the
_separator_ annotation within a _dialect description_]

   - 3.3 Deferred requirements
      - 3.3.1 Requirements relating to parsing of CSV


*Ability to determine that a CSV is syntactically well formed*



*Ability to handle headings spread across multiple initial rows, as well as
to distinguish between single column headings and file headings.*



*Ability to transform data that is published in a normalized form into
tabular data.*


   - 3.3.2 Requirements relating to applications


*Ability to access and/or extract part of a CSV file in a non-sequential

Received on Tuesday, 9 February 2016 16:50:32 UTC
