Re: Formats, schemas, vocabularies, data models and section 7.4 of the Best Practices document from Laufer on 2015-01-22 (public-dwbp-wg@w3.org from January 2015)

From: Laufer <laufer@globo.com>
Date: Thu, 22 Jan 2015 06:53:51 -0200
To: Antoine Isaac <aisaac@few.vu.nl>
Cc: "public-dwbp-wg@w3.org" <public-dwbp-wg@w3.org>
Message-ID: <CA+pXJigzFeLfRmu_tr2+p5FKaYMXAKF6otT=tOe79exdsq+P9g@mail.gmail.com>
+1 to Antoine

Best Regards,
Laufer

On Thursday, 22 January 2015, Antoine Isaac <aisaac@few.vu.nl>
wrote:

> Dear Joao Paulo, Carlos,
>
> I agree with your concerns. They have been voiced many times, and the
> 'technology neutral' focus does make them more visible but, I think, doesn't
> change much. The situation is a bit messy in the Linked Data world alone.
>
> I won't discuss data formats now, because that's not the point of 7.4 (it
> may be very useful to have the discussion for other sections though; I just
> don't have the time).
>
> My issue with what Joao Paulo describes as 'schemas' (and what he suggests
> calling 'data models') is that it misses a part of what is called
> 'vocabularies' in the Linked Data world (and in other communities). Using
> the same point as in an earlier email: do you think that the ISO language
> codes are a schema (or a data model) of their own?
>
> In a previous group I was involved in, on Library Linked Data, we faced a
> similar problem of naming things. We ended up with 'metadata element sets'
> for schemas/ontologies and 'controlled vocabularies' for thesauri, code
> lists etc.
> http://www.w3.org/2005/Incubator/lld/XGR-lld-vocabdataset/
> Note that we then faced the need to be a bit more technology
> neutral: these 'controlled vocabularies' existed well before RDF
> (porting them into RDF was actually why the SKOS 'schema' was created).
>
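
A minimal sketch of the distinction above, assuming Python 3 and nothing
beyond the standard library. The element set, the record fields and the
function name are hypothetical; the three language codes are a small excerpt
of ISO 639-1. The element set says which fields a record may have, while the
controlled vocabulary says which values one of those fields may take.

```python
# Hypothetical 'metadata element set': the fields a record may carry.
ELEMENT_SET = {"title", "creator", "language"}

# Small excerpt of a 'controlled vocabulary' (ISO 639-1 language codes).
LANGUAGE_CODES = {"en": "English", "pt": "Portuguese", "nl": "Dutch"}


def check_record(record):
    """Return a list of problems found in a metadata record (a dict)."""
    problems = []
    # Element-set check: every field name must be a known element.
    for field in record:
        if field not in ELEMENT_SET:
            problems.append(f"unknown element: {field}")
    # Controlled-vocabulary check: the value must be a listed code.
    language = record.get("language")
    if language is not None and language not in LANGUAGE_CODES:
        problems.append(f"unknown language code: {language}")
    return problems


good = {"title": "Exemplo", "language": "pt"}
bad = {"title": "Example", "colour": "red", "language": "xx"}
```

Here check_record(good) finds nothing, while check_record(bad) reports one
element-set problem and one vocabulary problem, which is why the code list is
hard to call a schema of its own.
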
> Now, we may decide to rule 'vocabularies that don't qualify as data
> models' (like ISO language codes) out of the best practices. I would find
> that a pity, because these are valuable artefacts, as the past decade of
> Linked Data has shown. And our current best practices apply to them, too.
>
> Back to the BP document now. From my past experience, we won't have time
> to fix this in two days. There are much easier and more urgent issues to fix -
> *once* we have noted this vocabulary issue down for future resolution of
> course.
> Also, and because I've seen these discussions before, we probably won't
> find a good solution, i.e. we'll always have to exemplify the term we
> choose, as in "this section is about 'X', which gathers ontologies, schemas,
> relational models, etc".
>
> So what I suggest is to create an issue saying that the section needs
> terminological discussion and input, and maybe go as far as removing the
> 'controlled vocabularies' from the picture. Is that all right?
>
> Best,
>
> Antoine
>
> On 1/22/15 5:01 AM, Joao Paulo Almeida wrote:
>
>> Dear All,
>>
>> I think that we have reached a crucial point in the discussions around
>> the Best Practices document.
>>
>> Many have raised the concern that the term "vocabulary" may be a problem
>> in the document, in part because of its lack of precision and in part
>> because it is biased towards the RDF(S)/OWL(S) technological space.
>>
>> I completely agree with that, and we need to do our best to ensure
>> precision and to be agnostic with respect to the various technological
>> spaces.
>>
>> The problem has also appeared in the discussion surrounding the term
>> "format", which I also believe is problematic if not properly defined and
>> qualified. (and also the term "schema" and the other terms used in section
>> 7.4 of the BP document).
>>
>> So, this is a call for the group to settle on some concepts (and
>> ultimately terms) that should help us to structure our discussions, give
>> us a basis to communicate and help our audience to understand us.
>>
>> I offer here a sketchy initial attempt; I'm hoping (fingers crossed) not
>> to incite a terminological debate, but a conceptual one... As long as we
>> agree on the concepts, we can always adjust the terms to make this more
>> intuitive to the majority of the people in our audience.
>>
>> Some of it is inspired by [1] to avoid re-inventing the wheel. (I wanted
>> to, but did not manage to touch upon the "metadata" and "ontology" terms. I
>> also did not manage to link OWL and SKOS into this.) And remember, this is
>> just a starting point.
>>
>> regards,
>> João Paulo
>>
>> ----
>>
>> By "data representation" we mean any convention for the arrangement of
>> symbols in such a way as to enable information to be encoded by a data
>> producer and later decoded by data consumers.
>>
>> A particular convention for data representation is often referred to as a
>> "data format".
>>
>> Adapted from [1]:
>>
>>     In existing computer systems there is typically a long chain of
>> relations connecting the physical phenomena by which data are represented
>> with the data being represented. Each link in the chain connects two layers
>> of representation: each layer organizes information available at the next
>> lower level into structures at a higher (or at least different) layer of
>> abstraction, and in this way provides information used in turn by the next
>> higher level in the representation.
>>
>>     For example, the representation of an email message may involve the
>> following layers:
>>
>>     Physical layer: holes in cards or tape, magnetic charges, color
>> changes on optical disks or scan codes, tones on a telephone connection, or
>> similar phenomena are interpreted as representing sequences of bits.
>>
>>     Bit layer: those sequences of bits may be interpreted as
>> representations of other different sequences of bits (for example five bits
>> may be written to the physical medium to represent four bits of data, in
>> such a way as to guarantee a minimum and maximum amount of space between
>> magnetic flux events in the media).
>>
>>     Byte / octet layer: the sequences of bits read from the storage
>> device are grouped into octets: units of eight bits often referred to as
>> bytes.
>>
>>     Character layer: an octet sequence may be interpreted as a sequence
>> of characters as defined by the appropriate character-set standard.
>>
>>     Application-specific data structure layer: the email reader will read
>> the character stream and distinguish the mail header from the message body,
>> and may distinguish multiple alternative representations of the message and
>> attachments within the message body. Within the mail header, mail software
>> will distinguish important fields like date, sender, and addressee.
>>
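
A minimal sketch of the upper layers in the email example, assuming Python 3
and its standard "email" module. The message content and addresses are
invented for illustration; the physical and bit layers are out of reach of
application code and are not shown.

```python
from email import message_from_bytes

# Byte / octet layer: an invented message as a raw octet sequence.
raw_octets = (
    b"From: alice@example.org\r\n"
    b"To: bob@example.org\r\n"
    b"Date: Thu, 22 Jan 2015 06:53:51 -0200\r\n"
    b"\r\n"
    b"Body text.\r\n"
)

# Character layer: the octets are read as characters under a character-set
# standard (ASCII here, since the invented message is plain ASCII).
characters = raw_octets.decode("ascii")

# Octets and characters are different units: in UTF-8 the four characters
# of this name take five octets.
name = "Jo\u00e3o"
name_octets = name.encode("utf-8")

# Application-specific data structure layer: the mail software separates
# the header from the body and picks out individual header fields.
message = message_from_bytes(raw_octets)
header_fields = {key: message[key] for key in ("From", "To", "Date")}
body = message.get_payload()
```

Each variable holds the same information one layer further up: raw_octets,
then characters, then header_fields and body.
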
>>
>> We assume here that this group is concerned with data representation
>> beyond the octet layer, not concerning itself with "data formats" for
>> physical, bit and byte level data representation. Data representation at
>> the lowest level in this context is thus the octet sequence.
>>
>> Different applications will almost always have different
>> application-specific data structures. The variety of applications and uses
>> of the data on the web leads to an unbounded number of data formats.
>>
>> The need to support the definition of suitable data formats for data
>> interchange on the web has led to the development of languages and
>> frameworks for families of formats, examples of which include XML, SGML,
>> JSON and RDF.
>>
>> (Here it is important to note that we should avoid saying that data is
>> represented in XML or RDF - but instead, we should say that data is
>> represented in an XML-based format, or in an RDF-based format. So XML data
>> is data represented in an XML-based format, RDF data is data represented in
>> an RDF-based format.)
>>
>> These languages and frameworks ultimately establish conventions to
>> encode data into sequences of octets. These conventions are often called
>> "serialization formats" or "serialization syntaxes" (e.g., [TURTLE],
>> [RDF11-XML], [JSON-LD]). In addition, these languages often establish a
>> "data model" or "abstract syntax" (e.g., [RDF11-CONCEPTS]) which defines the
>> structure of data independent of a particular serialization format.
>>
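
A minimal sketch of one data model with two serializations, assuming Python 3
and the standard library only. The triples, the example.org URI and all
function names are hypothetical, and the line-based syntax is only loosely
modelled on N-Triples (no escaping, literals only as objects); it is not a
conforming RDF serializer or parser.

```python
import json

# Abstract data model: a set of (subject, predicate, object) triples.
triples = {
    ("http://example.org/doc", "http://purl.org/dc/terms/language", "pt"),
    ("http://example.org/doc", "http://purl.org/dc/terms/title", "Exemplo"),
}


def to_lines(ts):
    """Serialize triples as one '<s> <p> "o" .' line per triple."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in sorted(ts))


def from_lines(text):
    """Parse the simplified line syntax back into a set of triples."""
    parsed = set()
    for line in text.splitlines():
        s, p, rest = line.split(" ", 2)
        o = rest.rsplit(" ", 1)[0]  # drop the trailing " ."
        parsed.add((s[1:-1], p[1:-1], o[1:-1]))  # strip <> and quotes
    return parsed


def to_json(ts):
    """Serialize the same triples as a JSON array of 3-element arrays."""
    return json.dumps(sorted(list(t) for t in ts))


def from_json(text):
    """Parse the JSON array back into a set of triples."""
    return {tuple(t) for t in json.loads(text)}


as_lines = to_lines(triples)
as_json = to_json(triples)
```

The two character sequences differ, yet both decode to the same set of
triples, which is the sense in which the data model is independent of the
serialization format.
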
>> Some of these families of formats are accompanied by languages or
>> (meta-)formats to specify a format, to enable some level of automation for
>> processing data in the format. For example, an XML-based format can be
>> specified with a "schema document" in the XML Schema Definition language,
>> enabling XML documents to be checked for conformance to the format defined
>> in the schema document [XML-SCHEMA]. Likewise, an RDF-based format can be
>> specified using RDF Schema [RDF11-SCHEMA].
>>
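
A minimal sketch of checking a document for conformance to a format, assuming
Python 3 and the standard library only. The SCHEMA dictionary is a
hypothetical, hand-rolled stand-in for a schema document; it is not the XML
Schema Definition language, which the standard library does not validate.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a schema document: the expected root element
# and the child elements every conforming document must contain.
SCHEMA = {"root": "dataset", "required_children": ["title", "language"]}


def conforms(xml_text, schema):
    """Return True if xml_text is well-formed and matches the schema."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not even well-formed XML
    if root.tag != schema["root"]:
        return False
    present = {child.tag for child in root}
    return all(name in present for name in schema["required_children"])


good_doc = "<dataset><title>Exemplo</title><language>pt</language></dataset>"
bad_doc = "<dataset><title>Exemplo</title></dataset>"
```

Because the format is described as data rather than only in prose, the check
can be automated: conforms(good_doc, SCHEMA) holds and
conforms(bad_doc, SCHEMA) does not.
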
>> These "schemas" are often used as a means to anchor natural language
>> descriptions to guide humans in the interpretation of data produced using
>> the format. Often, labels are used in these schemas to convey intuitive
>> meaning and guide interpretation, in which case these labels serve the role
>> of "terms" in communication. The collection of terms as used in the schema
>> is then referred to as a "vocabulary".
>>
>>
>> Some requirements (adapted from [1]):
>>
>> Any data representation relied on for interoperability must have clear,
>> well written, published documentation. If the format is not documented, the
>> likelihood that the information it represents can be recovered without loss
>> is small.
>>
>> The specification documents for data formats should be controlled by
>> public bodies, preferably consensus-based organizations in the
>> international standardization system or by relevant industry consortia.
>>
>>
>> [1] C. M. Sperberg-McQueen, David Dubin. "Data Representation", DH
>> Curation Guide: a community resource guide to data curation in the digital
>> humanities, http://guide.dhcuration.org/representation/.
>>
>> [RDF11-SCHEMA]
>> Dan Brickley, R. V. Guha. RDF Schema 1.1. W3C Recommendation, 25 February
>> 2014. URL: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/. The
>> latest published version is available at http://www.w3.org/TR/rdf-schema/.
>>
>> [RDF11-XML]
>> Fabien Gandon, Guus Schreiber. RDF 1.1 XML Syntax. W3C Recommendation, 25
>> February 2014. URL:
>> http://www.w3.org/TR/2014/REC-rdf-syntax-grammar-20140225/. The latest
>> published version is available at http://www.w3.org/TR/rdf-syntax-grammar/.
>>
>> [TURTLE]
>> Eric Prud'hommeaux, Gavin Carothers. RDF 1.1 Turtle: Terse RDF Triple
>> Language. W3C Recommendation, 25 February 2014. URL:
>> http://www.w3.org/TR/2014/REC-turtle-20140225/. The latest edition is
>> available at http://www.w3.org/TR/turtle/
>>
>> [OWL2-OVERVIEW]
>> W3C OWL Working Group. OWL 2 Web Ontology Language Document Overview
>> (Second Edition). 11 December 2012. W3C Recommendation. URL:
>> http://www.w3.org/TR/owl2-overview/
>>
>> [JSON-LD]
>> Manu Sporny, Gregg Kellogg, Markus Lanthaler, Editors. JSON-LD 1.0. 16
>> January 2014. W3C Recommendation. URL: http://www.w3.org/TR/json-ld/
>>
>> [XML-SCHEMA]
>> XML Schema: Primer
>> World Wide Web Consortium. XML Schema Part 0: Primer Second Edition, ed.
>> Priscilla Walmsley and David C. Fallside. W3C Recommendation, 28 October
>> 2004. See http://www.w3.org/TR/xmlschema-0/
>>
>
>

-- 
.  .  .  .. .  .
.        .   . ..
.     ..       .
Received on Thursday, 22 January 2015 08:54:24 UTC