Re: [community-group] Split units and values in type definitions (#121) from Romain Menke via GitHub on 2022-12-11 (public-design-tokens-log@w3.org from December 2022)

From: Romain Menke via GitHub <sysbot+gh@w3.org>
Date: Sun, 11 Dec 2022 22:01:31 +0000
To: public-design-tokens-log@w3.org
Message-ID: <issue_comment.created-1345668396-1670796090-sysbot+gh@w3.org>
I would strongly advise against using regexp for this, not even for a reference implementation.
I cannot stress this enough: a parser is not a regexp and a regexp is not a parser.

_If I had a dime for each bug that existed because someone used regexp where they should have used a true tokenizer and parser :)_

---------

The example with `10 horses` was meant to illustrate a true string that kind of looks like a value and a unit but most definitely is not.

White space must not be allowed between a value and a unit.
Allowing this is highly unusual.

---------

> I agree that changing numbers from scientific notation is an extra step. I think we have to consider tradeoffs here: how often is someone copying and pasting numbers into a token file from a tool that produces scientific notation? My estimation is that it is pretty rare. If we are making files harder to write and/or harder to parse to accommodate this workflow, is it worth it?

> Here's an updated proposal for the format based on these ideas:
> 
> The $value of a token that has units will always take the following format:

I think it is confusing that there are arbitrary differences between a JSON number and a number that is part of a dimension.

Why can't we adopt the definition of a number from JSON?

https://www.json.org/json-en.html
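
To make the JSON definition concrete, here is a small sketch that probes `JSON.parse` with a few number-like strings. The helper name `isJsonNumber` is made up for this illustration and is not part of any proposal.

```js
// Hypothetical helper: true when `text` is exactly one JSON number.
// JSON.parse tolerates surrounding white space, so that is rejected separately.
function isJsonNumber(text) {
  if (text !== text.trim()) {
    return false;
  }
  try {
    return typeof JSON.parse(text) === 'number';
  } catch (err) {
    return false;
  }
}

// Accepted by the JSON grammar, scientific notation included.
console.log(['10', '-10', '0.5', '1e3', '1.5E-2'].map(isJsonNumber));

// Rejected by the JSON grammar: leading plus, bare dot, trailing dot, leading zero, hex.
console.log(['+10', '.5', '10.', '010', '0x10'].map(isJsonNumber));
```

The first list yields only `true` and the second only `false`, so numbers copied from a tool that emits scientific notation would be valid without any rewriting.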

-----

A parsing algorithm can use look ahead because values are contained within JSON fields.

This algorithm assumes that the token type is `dimension`, `duration` or any of the other value + unit tuples.
It will not produce correct results if the token type can be anything else.

It requires that the allowed units are known.

1. compare the end of the string value with the known units
2. if the string value does not end in any of the known units
  2.a. this is a parsing error
3. trim the unit from the string value
4. parse the remaining string value as JSON
5. if parsing as JSON fails
  5.a. this is a parsing error
6. if the parsed value is not a number
  6.a. this is a parsing error
7. return the parsed value as the `value` and the found unit as `unit`

```js
function parseUnitAndValue(input, allowedUnits) {
 // sort a copy, longest unit first, so the caller's array is not mutated; can be optimized by sorting outside this function
 const sortedUnits = [...allowedUnits].sort((a, b) => b.length - a.length);

 // 1. compare the end of the string value with the known units
 const unit = sortedUnits.find((candidate) => {
  return input.endsWith(candidate);
 });

 // 2. if the string value does not end in any of the known units
 if (!unit) {
  // 2.a. this is a parsing error
  throw new Error('Parse error');
 }

 // 3. trim the unit from the string value
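 const inputWithoutUnit = input.slice(0, input.length - unit.length);

 // JSON.parse tolerates surrounding white space, which must not be allowed between a value and a unit
 if (inputWithoutUnit.trim() !== inputWithoutUnit) {
  throw new Error('Parse error');
 }

 let value;
 try {
  // 4. parse the remaining string value as JSON
  value = JSON.parse(inputWithoutUnit);
 } catch (err) { // 5. if parsing as JSON fails
  // for debugging maybe?
  console.log(err);
  // 5.a. this is a parsing error
  throw new Error('Parse error');
 }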

 // 6. if the parsed value is not a number
 if (typeof value !== 'number') {
  // 6.a. this is a parsing error
  throw new Error('Parse error');
 }

 // 7. return the parsed value as the `value` and the found unit as `unit`
 return {
  value: value,
  unit: unit,
 };
}

console.log(parseUnitAndValue('-10rem', ['em', 'rem', 'px']));
```

_Implementers might prefer to use a regexp to find and trim the unit from the end._
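
A rough sketch of that variant, covering only steps 1 to 3. The helper names (`escapeForRegExp`, `buildUnitPattern`, `splitTrailingUnit`) are invented for this example, and the number itself should still go through a JSON parser afterwards (steps 4 to 6), not through a regexp.

```js
// Hypothetical helper: escape regexp metacharacters so every unit is matched literally.
function escapeForRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: one pattern, anchored at the end of the input, for all allowed units.
// Units are sorted longest first, mirroring the function above.
function buildUnitPattern(allowedUnits) {
  const alternatives = [...allowedUnits]
    .sort((a, b) => b.length - a.length)
    .map(escapeForRegExp)
    .join('|');
  return new RegExp(`(${alternatives})$`);
}

// Steps 1 to 3: find the unit at the end of the input and trim it off.
function splitTrailingUnit(input, unitPattern) {
  const match = unitPattern.exec(input);
  if (!match) {
    // 2.a. this is a parsing error
    throw new Error('Parse error');
  }
  return {
    rest: input.slice(0, match.index), // still needs to be parsed as JSON
    unit: match[1],
  };
}

// The pattern only needs to be built once per set of units.
const unitPattern = buildUnitPattern(['em', 'rem', 'px']);
console.log(splitTrailingUnit('-10rem', unitPattern)); // rest is '-10', unit is 'rem'
```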

-----

An algorithm of this kind has these benefits:
- simple to implement
- parity with regular numbers in JSON
- implementations do not need to write complete tokenizers and parsers

The downside is that it can only be applied if you already know that something is a unit and value tuple and what the allowed units are.

-----

If a field in a composite token allows multiple types then it requires a second algorithm.

For this theoretical composite type:

- type `composite-example`
- one field `foo`
- `foo` allows `string`, `dimension`, `number` values

valid:

```json
{
  "alpha": {
    "$type": "composite-example",
    "$value": {
      "foo": 10
    }
  },
  "beta": {
    "$type": "composite-example",
    "$value": {
      "foo": "10"
    }
  },
  "delta": {
    "$type": "composite-example",
    "$value": {
      "foo": "10px"
    }
  }
}
```

1. if value is a number
  1.a. return this number
2. if value is a string
  2.a. try to parse as a `dimension`
  2.b. if parsing succeeds
    2.b.i. return the value/unit tuple
  2.c. return the string value

This works well up until a new unit is added to `dimension`.
Any tools that haven't been updated will parse new tokens as strings.
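
The fallback steps and the drawback can be sketched roughly like this. `parseDimension` is a compact stand-in for the dimension parser from earlier in this comment, `parseFoo` is a made-up name for the field parser, and `vw` is only an illustrative unit that this particular tool does not know about.

```js
// Compact stand-in for the dimension parser shown earlier in this comment.
function parseDimension(input, allowedUnits) {
  const unit = [...allowedUnits]
    .sort((a, b) => b.length - a.length)
    .find((candidate) => input.endsWith(candidate));
  if (!unit) {
    throw new Error('Parse error');
  }
  const rest = input.slice(0, input.length - unit.length);
  // JSON.parse tolerates surrounding white space, so reject it up front.
  if (rest !== rest.trim()) {
    throw new Error('Parse error');
  }
  const value = JSON.parse(rest); // throws on anything that is not valid JSON
  if (typeof value !== 'number') {
    throw new Error('Parse error');
  }
  return { value, unit };
}

// Hypothetical parser for the `foo` field of `composite-example`.
function parseFoo(value, allowedUnits) {
  // 1. if value is a number, return this number
  if (typeof value === 'number') {
    return value;
  }
  // 2. if value is a string
  if (typeof value === 'string') {
    try {
      // 2.a. and 2.b. try to parse as a `dimension`
      return parseDimension(value, allowedUnits);
    } catch (err) {
      // 2.c. return the string value
      return value;
    }
  }
  throw new Error('Parse error');
}

const knownUnits = ['px', 'rem'];
console.log(parseFoo(10, knownUnits));     // the number 10
console.log(parseFoo('10', knownUnits));   // the string '10'
console.log(parseFoo('10px', knownUnits)); // value 10 with unit 'px'
console.log(parseFoo('10vw', knownUnits)); // silently falls back to the string '10vw'
```

The last call is the problem case: nothing is reported, the token just changes type.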

--------------

This parsing algorithm doesn't have any issues for `dimension` or `duration` but it complicates error handling.

It fails to detect syntactically valid microsyntax with unknown units.

This affects the parsing of composite tokens.

------

A different approach would be to require implementers to write a custom version of the JSON number parsing algorithm.

The number grammar can be found here: https://www.json.org/json-en.html

1. parse a number exactly like JSON
2. if parsing a number failed
  2.a. this is a parsing error
3. trim the parsed number from the input
4. if the remainder is not one of the allowed units
  4.a. this is a parsing error
5. return the parsed value as the `value` and the found unit as `unit`

This places a much higher burden on implementers but it is able to distinguish these cases:
- **valid** `<number><unit>` microsyntax and **known** unit
- **valid** `<number><unit>` microsyntax but **unknown** unit
- **invalid** `<number><unit>` microsyntax

Being able to make that distinction makes it much easier to extend composite types in the future (in the specification, not in implementations).
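
A minimal sketch of this number-first approach, written as a hand-rolled scanner instead of a regexp. The function names and the `kind` labels are invented for this example. The rule for what counts as a syntactically valid unit (one or more ASCII letters, or `%`) is purely an assumption, since no unit grammar has been defined in this thread.

```js
// Returns the length of the JSON number at the start of `input`, or 0 if there is none.
function scanJsonNumber(input) {
  const isDigit = (ch) => ch !== undefined && ch >= '0' && ch <= '9';
  let i = 0;

  // optional minus sign
  if (input[i] === '-') i++;

  // integer part: a single `0`, or a non-zero digit followed by more digits
  if (input[i] === '0') {
    i++;
  } else if (isDigit(input[i])) {
    while (isDigit(input[i])) i++;
  } else {
    return 0;
  }

  // optional fraction: `.` followed by at least one digit
  if (input[i] === '.' && isDigit(input[i + 1])) {
    i++;
    while (isDigit(input[i])) i++;
  }

  // optional exponent: only consumed when a digit follows, so `1em` keeps `em` as its unit
  if (input[i] === 'e' || input[i] === 'E') {
    let j = i + 1;
    if (input[j] === '+' || input[j] === '-') j++;
    if (isDigit(input[j])) {
      while (isDigit(input[j])) j++;
      i = j;
    }
  }

  return i;
}

// Assumed unit syntax: `%`, or one or more ASCII letters.
function isUnitLike(text) {
  if (text === '%') return true;
  if (text.length === 0) return false;
  for (const ch of text) {
    const isLetter = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    if (!isLetter) return false;
  }
  return true;
}

function classifyDimension(input, allowedUnits) {
  // 1. and 2. parse a number exactly like JSON
  const length = scanJsonNumber(input);
  if (length === 0) return { kind: 'invalid' };

  // 3. trim the parsed number from the input
  const value = JSON.parse(input.slice(0, length));
  const unit = input.slice(length);

  // 4. and 5. compare the remainder with the allowed units
  if (allowedUnits.includes(unit)) return { kind: 'known-unit', value, unit };
  if (isUnitLike(unit)) return { kind: 'unknown-unit', value, unit };
  return { kind: 'invalid' };
}

console.log(classifyDimension('1.5e2px', ['px', 'rem']).kind);   // known-unit
console.log(classifyDimension('10vw', ['px', 'rem']).kind);      // unknown-unit
console.log(classifyDimension('10 horses', ['px', 'rem']).kind); // invalid
```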

--------

Simply splitting the unit and value has none of these issues, challenges or drawbacks.

-- 
GitHub Notification of comment by romainmenke
Please view or discuss this issue at https://github.com/design-tokens/community-group/issues/121#issuecomment-1345668396 using your GitHub account

Received on Sunday, 11 December 2022 22:01:33 UTC