Some comments on the UCR document

Hi guys,

now that the draft UCR document is almost published, I took some time this morning to read through it more thoroughly. I found a number of minor issues, 99% editorial; I list them below. None of these are really serious (i.e., no reason to bother for tomorrow's publication), but you may want to take care of those for the next release.

As an overall comment, though, I am a little bit bothered by one thing: the very Anglo-Saxon oriented nature of the use cases. I would like to be sure that these use cases do not hide additional issues that may come up when CSV is used in other cultures. I could not put my finger on this, but I did ask myself questions like: we are talking about column headers, but we refer to that as the first column from the left; what happens with CSV files produced in Arabic, Hebrew, and other right-to-left writing systems? How do they do this in practice? Another question is whether, for writing systems that use vertical writing, the roles of the rows and the columns are naturally transposed. We should remember that there are languages (e.g., Mongolian) where vertical writing is THE writing mode, not an option as it is in CJK languages like Chinese or Japanese. I realize that these languages are in a small minority, but nevertheless... Also, the "," character is not part of Arabic or CJK languages; the character that looks like a comma is actually a different code point. Do they use "," nevertheless?
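
To make the code point remark concrete, here is a minimal Python sketch. The list of comma-like characters is my own selection for illustration and is certainly not exhaustive, and the helper names are hypothetical. It prints the code point and Unicode name of each character, and shows that Python's csv module, with its default delimiter, splits only on U+002C.

```python
import csv
import unicodedata

# Comma-like characters from a few writing systems (illustrative selection only).
COMMA_LIKE = [",", "\u060C", "\u3001", "\uFF0C"]


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def split_default(line):
    """Parse one line with csv.reader and its default ',' delimiter."""
    return next(csv.reader([line]))


for ch in COMMA_LIKE:
    print(describe(ch))

# The Arabic comma is treated as ordinary cell content, not as a separator.
print(split_default("a\u060Cb,c"))
```

Whether real-world files from those regions use "," or one of the other characters is exactly the question that use cases would have to answer; the sketch only shows that a parser will not treat them as equivalent by default.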

I think the way to answer this is to try to get some use cases from Arabic, Hebrew, Chinese, Japanese, or Indian sources. Even if those are structurally simple, they may reveal some specifics. Then again, they may not, and I will be proven wrong, but at least we checked! W3C has connections (hosts, offices) in some of those places that may be able to help if we find the right people. You guys may have other connections, too. I think it would be due diligence to try...

And... it is a great document! :-)

I attach the minor issues I found below.

Cheers

Ivan

- Section 3.3., bullet points after example 2, "e.g. 1901.04 – equivalent to January, 1901": shouldn't that be April 1901?

- In the examples (say, example 7 or 8) it is not absolutely clear where the beginning of the data set is; this is an artefact of the styling. E.g., in Example 7, is 'Post Unique reference', etc., the _first_ row in the CSV file, or are there (is it allowed to have) empty rows beforehand? The answer is obviously 'yes, it is the first row' in this case, but that may not always be 100% obvious (e.g., Example 1: how many empty lines are there?). I guess, CSS-wise, a thin border around the data, or adding row numbers, or something similar, may help in avoiding any ambiguity.

- Section 3.6., first paragraph: isn't there a full stop missing after "Public Library of Science"?

- Section 3.6.: isn't it correct that this use case also requires "CsvAsSubsetOfLargerDataset"? At least this is what the second bullet item seems to suggest.

- Section 3.6.: (I am not sure it is really relevant) one of the text fields is actually not pure text, but an HTML snippet. What this tells me is that type information making that clear may be useful (note that RDF has an HTML data type for such purposes). Maybe worth noting as a non-obvious micro syntax/format (i.e., we are not only talking about numbers or dates).

- I know this may be controversial: the title of section 3.7 uses the word 'Analyses'. According to http://www.tysto.com/uk-us-spelling-list.html, this is British spelling. However, the official spelling for W3C documents should be American English, so shouldn't that be Analyzes? I am a bit out of my comfort zone here because, for a foreigner, the intricacies of British vs. American spellings are a mystery sometimes, so I may be wrong on that example, but I am sure about the overall statement on American English spelling for W3C documents. (B.t.w., the title of 3.8 uses "Analyzing" but uses "analyses" in the text:-)

- Section 3.7, after the bullet items following example 13: "data therein contained" -> "data contained therein" (I think)

- Section 3.9, second paragraph, "saved as csv files for each line": I guess we should use CSV here (and elsewhere) to be consistent (I have not checked the file for other occurrences of "csv" as opposed to "CSV")

- Section 3.10, first bullet item: "Eurozone in 2007, the implying currency is problematic" sounds a bit strange English-wise; should the "the" be dropped? Also "necessary to explicit the currency of each column" -> "necessary to make the currency of each column explicit"

- Section 3.10, second bullet item: "preferrable" -> "preferable"

- Section 3.16, first paragraph, has both NetCDF and netCDF. I am not sure which should be the canonical format, but we should be consistent
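
For the two consistency items above (csv vs. CSV, NetCDF vs. netCDF), a quick check along the following lines might save some manual reading. This is only a sketch: the function name is hypothetical, and the sample sentence is made up for illustration rather than taken from the UCR document.

```python
import re
from collections import Counter


def casing_variants(text, term):
    """Count each distinct casing of `term` that appears as a whole word in `text`."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return Counter(match.group(0) for match in pattern.finditer(text))


# Made-up sample text; in practice one would pass the text of the document.
sample = "Data are saved as csv files; each CSV file has NetCDF or netCDF metadata."

print(casing_variants(sample, "csv"))
print(casing_variants(sample, "netcdf"))
```

A term that comes back with more than one key is used inconsistently and needs a decision on the canonical form.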


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
FOAF: http://www.ivan-herman.net/foaf

Received on Wednesday, 26 March 2014 12:32:00 UTC