QT4 CG draft minutes 041, 11 July 2023 from Norm Tovey-Walsh on 2023-07-11 (public-xslt-40@w3.org from July 2023)

From: Norm Tovey-Walsh <norm@saxonica.com>
Date: Tue, 11 Jul 2023 17:28:50 +0100
To: public-xslt-40@w3.org
CC: Matthew Patterson <matt@saxonica.com>
Message-ID: <m2o7kiv2a7.fsf@saxonica.com>

Hello folks,

Here are the minutes. Also attached is a copy of Matt’s slide deck.

  https://qt4cg.org/meeting/minutes/2023/07-11.html

QT4 CG Meeting 041 Minutes 2023-07-11

Table of Contents

     * [1]Draft Minutes
     * [2]Summary of new and continuing actions [0/5]
     * [3]1. Administrivia
          + [4]1.1. Roll call [11/11]
          + [5]1.2. Accept the agenda
               o [6]1.2.1. Status so far...
          + [7]1.3. Approve minutes of the previous meeting
          + [8]1.4. Next meeting
          + [9]1.5. Review of open action items [1/6]
          + [10]1.6. Review of open pull requests and issues
     * [11]2. Technical Agenda
          + [12]2.1. PR #533: 413: Spec for CSV parsing with
            fn:parse-csv()
     * [13]3. Any other business?
     * [14]4. Adjourned

   [15]Agenda index / [16]QT4CG.org / [17]Dashboard / [18]GH Issues /
   [19]GH Pull Requests

Draft Minutes

Summary of new and continuing actions [0/5]

     * [ ] QT4CG-002-10: BTW to coordinate some ideas about improving
       diversity in the group
     * [ ] QT4CG-016-08: RD to clarify how namespace comparisons are
       performed.
     * [ ] QT4CG-026-01: MK to write a summary paper that outlines the
       decisions we need to make on "value sequences"
          + This is related to PR #368: Issue 129 - Context item
            generalized to context value and subsequent discussion.
     * [ ] QT4CG-029-07: NW to open the next discussion of #397 with a
       demo from DN See PR [20]#449
     * [ ] QT4CG-039-01: NW to schedule discussion of issue [21]#52, Allow
       record(*) based RecordTests

1. Administrivia

1.1. Roll call [11/11]

     * [X] Reece Dunn (RD)
     * [X] Sasha Firsov (SF)
     * [X] Christian Grün (CG)
     * [X] Joel Kalvesmaki (JK) [0:05-]
     * [X] Michael Kay (MK)
     * [X] John Lumley (JL)
     * [X] Dimitre Novatchev (DN)
     * [X] Ed Porter (EP)
     * [X] C. M. Sperberg-McQueen (MSM)
     * [X] Norm Tovey-Walsh (NW). Scribe. Chair.
     * [X] Matt Patterson (MP)

1.2. Accept the agenda

   Proposal: Accept [22]the agenda.

   Accepted.

1.2.1. Status so far...

   issues-open-2023-07-11.png

   Figure 1: "Burn down" chart on open issues

   issues-by-spec-2023-07-11.png

   Figure 2: Open issues by specification

   issues-by-type-2023-07-11.png

   Figure 3: Open issues by type

1.3. Approve minutes of the previous meeting

   Proposal: Accept [23]the minutes of the previous meeting.

   Accepted.

1.4. Next meeting

   The next meeting [24]is scheduled for Tuesday, 18 July 2023.

   No regrets heard.

   Reminder: the CG will take a vacation for four weeks in August. We will
   not meet on 1, 8, 15, or 22 August.

1.5. Review of open action items [1/6]

     * [ ] QT4CG-002-10: BTW to coordinate some ideas about improving
       diversity in the group
     * [ ] QT4CG-016-08: RD to clarify how namespace comparisons are
       performed.
     * [ ] QT4CG-026-01: MK to write a summary paper that outlines the
       decisions we need to make on "value sequences"
          + This is related to PR #368: Issue 129 - Context item
            generalized to context value and subsequent discussion.
     * [X] QT4CG-029-01: RD+DN to draft spec prose for the "divide and
       conquer" approach outlined in issue #399
          + Overtaken by events.
     * [ ] QT4CG-029-07: NW to open the next discussion of #397 with a
       demo from DN See PR [25]#449
     * [ ] QT4CG-039-01: NW to schedule discussion of issue [26]#52, Allow
       record(*) based RecordTests

1.6. Review of open pull requests and issues

   The following editorial or otherwise minor PRs were open when this
   agenda was prepared. The chair proposes that these can be merged
   without discussion.
     * PR [27]#597 : Editorial fixes from #566 (fn:parse-uri)
          + Check for technical comments from CG
     * PR [28]#595 : 588: (Editorial, XSLT) minor clarifications regarding
       xsl:sort
     * PR [29]#594 : 592: (XSLT, Editorial) Add missing description of
       exponent-separator
     * PR [30]#593 : 591: [XSLT, editorial] Add defaults to XSLT element
       syntax summaries
     * PR [31]#590 : 343: make $collation uniformly optional
     * PR [32]#587 : 365: Allow braces in switch and typeswitch
       expressions
     * PR [33]#586 : 585: [Editorial] Rearrange text (and grammar) for
       dynamic function calls
     * PR [34]#584 : Editorial: Correction to map:filter examples
     * PR [35]#578 : fn:format-integer: $lang -> $language
     * PR [36]#577 : Editorial: improve generator for keyword tests
     * PR [37]#555 : 464: Revised narrative of normalization steps for
       serialization
     * PR [38]#547 : Action QT4CG-036-02: Further elaboration of the rules
       for function identity

   After discussion, #598 removed.

   Proposal: Accept these PRs.

   Accepted.

   It has been proposed that the following issues be [39]closed without
   action.
     * Issue [40]#539 FLOWR where clause with a "do when false" option

   Proposal: Close these issues.

   Accepted.

2. Technical Agenda

2.1. PR #533: 413: Spec for CSV parsing with fn:parse-csv()

     * See PR [41]#533
     * MP introduces the changes proposed with a slide deck
          + ... (Walks through slide deck)
     * RD: Why is there only a record for the top level?
          + MP: So it fits on a single slide; also I have questions about
            how to define nested records. Also, I have some questions
            about where record types are shared.
     * MP continues...
     * MSM: Trim trims only leading and trailing whitespace, I assume?
     * MP: Yes.
     * MP continues...
          + ... Extract column names from the first row: boolean or a map
            from integer to string to specify headers for the columns.
          + ... There's an option to filter columns.
          + ... You can specify that the number of columns can be fixed.
            They're padded or truncated.
     * MSM: If I say nothing?
     * MP: Then you get what you get?
     * JL: Is there an argument for filter rows?
     * MP: There isn't, and I haven't thought of a use for it beyond
       removing say the first "n" rows. You probably want to evaluate each
       row programmatically. Columns are relatively fixed, unlike rows.
     * JL: I might just want to test on the first 25 or 40 rows. Some
       mechanism that allowed me to truncate parsing might be handy.
     * MP: Yes, I think one of the reasons for using a sequence of rows is
       that it's easier to generate lazily. And we have a large number of
       good tools for extracting "n" rows from a sequence.
     * DN: Whenever I see arguments for options, my question is always, is
       this a mandatory argument? If it's not, what are the defaults?
     * MP: The default is to extract column names from the first row, to
       not filter columns, and not to restrict the number of columns.
     * MSM: I'd like a way to specify the default behaviors explicitly.
     * MP: I'm not sure I have the notation correct, but that's what
       you're supposed to be able to do here.
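
   As a rough illustration of the options discussed above (column names
   taken from the first row or supplied as a map, column filtering, and
   a fixed column count with padding or truncation), here is a minimal
   Python sketch. It is not the fn:parse-csv() design from PR #533: the
   function name parse_rows, its parameter names, the order in which
   the options are applied, and the use of an empty string as padding
   are all assumptions made for this example.

```python
import csv
import io


def parse_rows(text, header=True, filter_columns=None, number_of_columns=None):
    """Parse CSV text into (column_names, rows).

    header: True takes names from the first row; a dict maps 1-based
        output column positions to names; False means no names.
    filter_columns: optional list of 1-based positions to keep.
    number_of_columns: optional fixed width; rows are padded with ""
        or truncated to fit.
    All of these names and behaviours are hypothetical.
    """
    rows = list(csv.reader(io.StringIO(text, newline="")))

    # Assumed order: fix the width first, then filter.
    if number_of_columns is not None:
        rows = [(r + [""] * number_of_columns)[:number_of_columns] for r in rows]

    if filter_columns is not None:
        # A position beyond the end of a short row yields "".
        rows = [[r[i - 1] if i <= len(r) else "" for i in filter_columns]
                for r in rows]

    if header is True:
        names, rows = (rows[0], rows[1:]) if rows else ([], [])
    elif isinstance(header, dict):
        width = max(header, default=0)
        names = [header.get(i, "") for i in range(1, width + 1)]
    else:
        names = []
    return names, rows
```

   Called with no options, the sketch returns ragged rows exactly as
   they appear in the input, which corresponds to "you get what you
   get". Passing a map for header leaves the first row in the data.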

   Some discussion of the possible details around specifying defaults,
   with enumerated values for example. Whether a keyword is necessary or
   if an empty sequence suffices is something of an open question.
     * DN: I would like to see exactly these cases in examples in the
       spec.
     * MP: Yes, exactly.
     * MP continues...
          + You can supply column names reliably even if the data doesn't
            include them.
     * JL: I think it's important that if the boolean is false, the first
       row becomes a header row. That needs to be explicit.
     * MP: Yes.
     * MP continues...
          + filter-columns and number-of-columns ...
          + MP discusses the example on the slide titled "Using
            csv-to-xdm()'s response".
          + ... I have questions about how best to deal with namespaces
            and cross references.
     * JL: The rows are all siblings of each other, but their position
       isn't the same as the row position. Having a rows wrapper would
       make it more straightforward.
     * MP: That makes sense.
     * EP: You can supply a boolean or a map. Can you override the
       headers? So you want to specify "true" but also specify your own
       set.
     * MP: Yes. I'm not sure. I think there's an argument that you can
       handle that the same way you'd handle the not uncommon case where
       there are several rows of header-like data. But maybe there needs
       to be another option...
     * MSM: I like the idea of saying just apply tail to the sequence of
       rows in that case.
     * EP: Yes, that would work. I was just pointing out that the way the
       option is specified, you can't do both.
     * MP continues with csv-to-xml()
          + In a namespace?
          + RD: I wonder if it should use the fn: namespace to be similar
            to how analyze-string works.
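
   The two row-level ideas raised earlier in this discussion (looking
   at only the first "n" rows, and applying tail to the sequence of
   rows so that your own column names replace the ones in the file)
   can be sketched with ordinary sequence tools. This is a Python
   illustration only: lazy_rows and own_names are hypothetical names,
   and itertools.islice stands in for the XPath sequence functions.

```python
import csv
import io
from itertools import islice


def lazy_rows(lines):
    """Yield one parsed row at a time; input is read only on demand."""
    yield from csv.reader(lines)


text = "id,name\r\n1,ann\r\n2,bob\r\n3,cy\r\n"

# Equivalent of "tail": skip the header row, then apply our own names.
own_names = ["key", "label"]
body = islice(lazy_rows(io.StringIO(text, newline="")), 1, None)
records = [dict(zip(own_names, row)) for row in body]

# Equivalent of "first n rows": stop consuming input after two rows.
first_two = list(islice(lazy_rows(io.StringIO(text, newline="")), 2))
```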

   Some discussion of how this compares to JSON. Consensus: there's a
   clear precedent, use the `fn:` namespace.
     * MP: The last question is about how to map between fields and column
       headers. Either you have id/ref pairs, or you can use the column.
     * JK: Why can't we just rely on position?
     * MP: You could rely on positionality, but if you have a CSV with 50
       or 60 columns and you want the ones with the "name" and the
       "amount" then names are better than "columns 2 and 35".
     * NW: My preference is the id/ref version.
     * MSM: I don't understand why. My gut reaction is "what I'm used to
       and what I'm happy with is to have the column names used as element
       names." That makes processing the result feel a lot more
       convenient.

   Some discussion of whether or not column names are likely to contain
   strings that don't match cleanly to attribute or element names.
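
   To make the naming concern concrete, the following Python sketch
   builds one row in two ways: with column names used as element
   names, and with column names carried in an attribute. The element
   names row and field, the attribute name column, and the simplified
   ASCII-only name check are assumptions for illustration; none of
   them is taken from the PR.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, ASCII-only approximation of an XML NCName (an assumption;
# the real production allows many more characters).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def row_with_named_elements(names, values):
    """Column names become element names; rejects non-name strings."""
    row = ET.Element("row")
    for name, value in zip(names, values):
        if not NCNAME.match(name):
            raise ValueError(f"not usable as an element name: {name!r}")
        ET.SubElement(row, name).text = value
    return row


def row_with_column_attribute(names, values):
    """Column names are attribute values, so any string is acceptable."""
    row = ET.Element("row")
    for name, value in zip(names, values):
        ET.SubElement(row, "field", {"column": name}).text = value
    return row
```

   A header such as "Unit price (USD)" works in the attribute form but
   is rejected by the element-name form unless it is rewritten first.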
     * MP: The other argument is that if you have large, long column names
       then you're adding a lot more data into each row.
     * MSM: I think relying on position would make sense if people are
       worried about data size. The added indirection of having to keep a
       table and look up the name from the ID doesn't appeal to me.
     * MK: (in the chat window) I think the id/idref approach is an
       unnecessary extra level of indirection.
     * MP: My go-to would be to work with the XDM directly, so maybe my
       opinion isn't as relevant.
     * MK: I also think if you're worried about space, the number of
       attributes is probably more significant than the length of them.
     * MP continues with "fn:parse-csv() output"
          + It handles quoting and delimiters. You can build anything you
            want from that without having to reimplement the parsing
            constraints.
     * JL: Isn't there an argument that says this one gives you the header
       rows?
     * MP: Yes.
     * JL: Then the example could be clearer.
     * MP: Yes.
     * RD: Given that fn:parse-csv is now simple, would it make sense to
       have the inverse, "serialize-csv"?
     * MP: Yes, I'm hoping to add that. My rough thinking is that you want
       a function that generates the field values with quoting and the
       rows.
          + ... The record on the "Input options" slide is what you'd hand
            to these functions.
          + All the information you'd need to generate them is in there.
     * MP continues with "fn:parse-csv data input"
          + The problem with unparsed-text-lines is that it strips the
            line endings. We can't be sure there's a 1:1 correspondence
            between a row and a line in a file.
     * CG: We have parse-json and json-doc, maybe it would be reasonable
       to have parse-csv and csv-doc for that purpose?
     * MP: Yep.
     * MK: (in the chat window) Can't we just let the optimizer cope with
       streaming the combination of unparsed-text() => parse-csv() ?
     * MK: Maybe. I don't know.
     * MSM: I think I understand what MP is driving at, but I'm a little
       confused by some details.
          + If I'm understanding correctly, in the simple case, the lines
            of the CSV file and the records in the result are 1:1, but
            that's not always the case.
     * MP: Yes.
     * MSM: And the case in which that's not true is the case where there
       may be multiple lines. It's 1:n not m:1. Right?
     * MP: Yes.
     * MSM: So if we want unparsed-text-lines() to be usable this way, we
       have to be able to specify that you can begin a multi-line quote in
       one string and finish it later.

   Some discussion of the problems associated with multi-line fields. If
   the line ending is stripped away by the unparsed-text-lines() function,
   then you'll lose information. It might be important that the embedded
   line ending was CR/LF and not just LF.
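
   The information loss can be shown with a small Python sketch. Here
   str.splitlines() is used as a stand-in for unparsed-text-lines(),
   and Python's csv module as a stand-in for a CSV parser; both
   substitutions are assumptions made for illustration.

```python
import csv
import io

text = 'id,note\r\n1,"line one\r\nline two"\r\n2,plain\r\n'

# Parsing the whole text keeps the embedded CR/LF inside the quoted field.
whole = list(csv.reader(io.StringIO(text, newline="")))

# Splitting into lines first discards the line endings and yields
# four lines for three records.
lines = text.splitlines()

# Rejoining with LF lets the parser recover the record, but the
# original CR/LF inside the field has become a bare LF.
rejoined = list(csv.reader(io.StringIO("\n".join(lines), newline="")))
```

   In this input one record spans two lines, so the correspondence
   between records and lines is 1:n, and the rejoined field no longer
   matches the original.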
     * MSM: I'm willing to say that is a corner case that may arise and
       when it does, you'll want to parse it yourself.
     * MP: There's a larger question of dealing with error handling.
     * JL: We know that parse-csv() is doing something internally that is
       like unparsed-text-lines(). So you don't gain anything by using
       unparsed-text-lines().
     * MSM: I'm guessing about what the JSON parsing functions do.
     * RD: It would be useful to add these corner cases as tests in the
       test suite.
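
   One way such corner cases might be exercised, together with the
   "serialize-csv" idea raised earlier, is a round-trip check. The
   sketch below uses Python's csv module; the helper names serialize
   and parse and the chosen cases are hypothetical and are not drawn
   from the QT4 test suite.

```python
import csv
import io


def serialize(rows):
    """Write rows as CSV text, quoting fields only where needed."""
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\r\n").writerows(rows)
    return out.getvalue()


def parse(text):
    """Parse CSV text into a list of rows of strings."""
    return list(csv.reader(io.StringIO(text, newline="")))


corner_cases = [
    ["plain", "with,comma"],            # delimiter inside a field
    ['with "quote"', "multi\r\nline"],  # embedded quote and CR/LF
    ["", " padded "],                   # empty field, outer whitespace
]

round_tripped = parse(serialize(corner_cases))
```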

3. Any other business?

   None heard.

4. Adjourned

References

   1. https://qt4cg.org/meeting/minutes/2023/07-11.html#minutes
   2. https://qt4cg.org/meeting/minutes/2023/07-11.html#new-actions
   3. https://qt4cg.org/meeting/minutes/2023/07-11.html#administrivia
   4. https://qt4cg.org/meeting/minutes/2023/07-11.html#roll-call
   5. https://qt4cg.org/meeting/minutes/2023/07-11.html#agenda
   6. https://qt4cg.org/meeting/minutes/2023/07-11.html#so-far
   7. https://qt4cg.org/meeting/minutes/2023/07-11.html#approve-minutes
   8. https://qt4cg.org/meeting/minutes/2023/07-11.html#next-meeting
   9. https://qt4cg.org/meeting/minutes/2023/07-11.html#open-actions
  10. https://qt4cg.org/meeting/minutes/2023/07-11.html#open-pull-requests
  11. https://qt4cg.org/meeting/minutes/2023/07-11.html#technical-agenda
  12. https://qt4cg.org/meeting/minutes/2023/07-11.html#pr-533
  13. https://qt4cg.org/meeting/minutes/2023/07-11.html#any-other-business
  14. https://qt4cg.org/meeting/minutes/2023/07-11.html#adjourned
  15. https://qt4cg.org/meeting/minutes/
  16. https://qt4cg.org/
  17. https://qt4cg.org/dashboard
  18. https://github.com/qt4cg/qtspecs/issues
  19. https://github.com/qt4cg/qtspecs/pulls
  20. https://qt4cg.org/dashboard/#pr-449
  21. https://github.com/qt4cg/qtspecs/issues/52
  22. https://qt4cg.org/meeting/agenda/2023/07-11.html
  23. https://qt4cg.org/meeting/minutes/2023/07-11.html
  24. https://qt4cg.org/meeting/agenda/2023/07-18.html
  25. https://qt4cg.org/dashboard/#pr-449
  26. https://github.com/qt4cg/qtspecs/issues/52
  27. https://qt4cg.org/dashboard/#pr-597
  28. https://qt4cg.org/dashboard/#pr-595
  29. https://qt4cg.org/dashboard/#pr-594
  30. https://qt4cg.org/dashboard/#pr-593
  31. https://qt4cg.org/dashboard/#pr-590
  32. https://qt4cg.org/dashboard/#pr-587
  33. https://qt4cg.org/dashboard/#pr-586
  34. https://qt4cg.org/dashboard/#pr-584
  35. https://qt4cg.org/dashboard/#pr-578
  36. https://qt4cg.org/dashboard/#pr-577
  37. https://qt4cg.org/dashboard/#pr-555
  38. https://qt4cg.org/dashboard/#pr-547
  39. https://github.com/qt4cg/qtspecs/labels/Propose%20Closing%20with%20No%20Action
  40. https://github.com/qt4cg/qtspecs/issues/539
  41. https://qt4cg.org/dashboard/#pr-533

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica

Attachments

application/pdf attachment: parse-csv-update-2023-07-11.pdf

Received on Tuesday, 11 July 2023 16:30:56 UTC