- From: Ronald Mansveld <ronald@ronaldmansveld.nl>
- Date: Fri, 01 Nov 2013 17:47:42 +0000
- To: Doug Schepers <schepers@w3.org>
- Cc: <public-webplatform-tests@w3.org>, Janet Swisher <jswisher@mozilla.com>
It's OK. I've run into Jean-Yves here at the London office, and he put me in contact with some of his American colleagues. A bug has been filed to make the data available as JSON, but it seems their raw data really are the HTML tables, so either way that data has to be parsed, on their side or on ours. There has been talk of extracting the data into a machine-readable format, but for now that is likely still some way off.

As for the MediaWiki extension: can you send me an example or spec of the precise JSON format it expects?

What might be a solution for now:
- Use MediaWiki and JSON (and all benefits) when CIU/H5T data is available
- Bypass MediaWiki and show the MDN table if no CIU/H5T data is available

That way a lot of properties would still use the MW extension; only the entries we don't have CIU/H5T data for would have to resort to the MDN fallback. By looking at the analytics data for the pages, we can always decide to manually provide MDN JSON for pages with high request rates, until a good parser has been written.

I've come a long way parsing the MDN data to JSON. The main problem is that some of the key data is lumped together in a single table cell, so it's hard to extract it correctly. I am trying, however; I'm just not sure about the right way to do it. (Part of the current solution is replacing each <br> with a text node containing a specific string, so I have a textual marker in the nodeValue on which I can split the text.) Parsing this data really does feel like clutching at straws at times...

Let me know whether the fallback option would be feasible (I'm not too familiar with the current set-up of the servers etc., so I can't really make a call on that one), or whether I should continue parsing the table to JSON.

Ronald

Doug Schepers wrote on 2013-11-01 17:32:
> Hi, Ronald–
>
> Thanks for the update!
>
> First, I didn't realize that Janet Swisher (Mozilla, and one of the
> founders of this project) didn't know you were working on MDN data, or
> she would have introduced you to someone at Mozilla. Maybe that's the
> contact you made already. In any case, she can confirm that.
>
> I'm okay with keeping the data as tables for now, if that makes your
> life easier. But I do want to note that it would be much better to
> have it as JSON, because that's what the MediaWiki extension is
> expecting.
>
> If we keep the data as tables, we will need to rewrite the MediaWiki
> extension to deal with that instead; it would also make it difficult
> (maybe prohibitively so) to make the "icon view" at the top of the
> page, since we'd need to parse and reformat the data.
>
> So, there is extra work to be done either way: either we rewrite the
> extension and lose some functionality (for now); or we find a way to
> parse the MDN tables into JSON. I don't know which would be more work.
> I do know that JSON is the final format we want the data in, so I'd
> like to shoot for that if we can.
>
> I don't want to put all the work on you, especially since you've been
> so awesome on driving this forward. How about as a next step, you
> expose the data you've collected, and someone (me?) looks at making a
> regex that normalizes it, or at least assesses which approach will be
> more work?
>
> Again, if we can get by with tables with not much work, then I agree
> we should do that.
>
> Regards-
> -Doug
>
> On 11/1/13 12:50 PM, Ronald Mansveld wrote:
>> It has been a pretty productive few days, with both ups and downs.
>>
>> The data from both CIU and H5T has been pretty easy to parse, mostly
>> because this data is already available in JSON format. MDN data is a
>> different story though.
>>
>> At this point, the MDN data is _not_ available as JSON. I can get a
>> JSON feed, but that only states that a compatibility section is
>> available. It doesn't give the data.
>> So, I had to resort to scraping.
>>
>> However, even though the data is in a table, which makes the general
>> parsing pretty easy, some of the data isn't neatly wrapped in tags.
>> For instance: the version numbers for prefixed use and non-prefixed
>> use are only separated by a line break.
>>
>> I've come a long way, but it most certainly isn't yet at the level I
>> want it to be. So I'm actually thinking of not even trying to parse
>> the MDN data, and just using the HTML table as is.
>>
>> By parsing the CIU and H5T data into tables with the same formatting,
>> we can still have a uniform layout on the site.
>>
>> I have been given a contact within MDN, so I'll try to work with them
>> to make the data available as JSON, so we can do a better integration
>> after this first phase.
>>
>> Any thoughts/comments?
>>
>> I'll continue working once I'm back in NL. If no one objects, I'd
>> like to go for the table option, which could be up and running pretty
>> soon. I don't see too many downsides, given that this is just a
>> temporary solution so we can go live with the CSS part of the site,
>> and a more future-proof solution will be built once this is up and
>> running.
>>
>> Ronald
>>
>> Doug Schepers wrote on 2013-10-30 17:59:
>>> Hi, Ronald–
>>>
>>> Thanks for the update! Looking forward to seeing it.
>>>
>>> Since we eventually plan to have tests for each assertion, and
>>> results based on running those tests against browsers (versions,
>>> OSs, etc.), it makes the most sense to expand the data from MDN to a
>>> version-range, if that's doable. That will be the most consistent
>>> with our plans.
>>>
>>> Note that in reality, there are regressions. For example, Chrome has
>>> dropped support for MathML, and other browsers have dropped features
>>> as well (e.g. some SVG stuff). But we'll deal with that once the
>>> infrastructure for reporting test results is more mature.
>>>
>>> Regards-
>>> -Doug
>>>
>>> On 10/30/13 11:29 AM, Ronald Mansveld wrote:
>>>> OK, I've come a long way so far. There is just one decision to be
>>>> made:
>>>>
>>>> MDN provides the compat data not per version, but rather as a
>>>> since-version.
>>>>
>>>> Both caniuse and html5test provide the data per version (where
>>>> available).
>>>>
>>>> What do we want to use? I can collapse the data from caniuse and
>>>> html5test to a since-version pretty easily. Expanding the data from
>>>> MDN from a since-version up to a complete version-range might be
>>>> doable as well, although I have to rely on the browser data
>>>> provided in the feeds from CIU and H5T to determine what versions
>>>> are available.
>>>>
>>>> Anyone with arguments for or against either option?
>>>>
>>>> Ronald
>>>>
>>>> Doug Schepers wrote on 2013-10-29 06:18:
>>>>> Hi, Ronald–
>>>>>
>>>>> Since we're going with this phased approach (which I fully
>>>>> support), I think we should do 2 things:
>>>>>
>>>>> 1) Use the MDN data as the baseline, since they have fairly
>>>>> complete data and a similar feature level as WPD (e.g., they have
>>>>> basically the same page names as we do); this means you'll have to
>>>>> collect this data via MDN's API;
>>>>>
>>>>> 2) Supplement that baseline data with CanIUse and HTML5Test data
>>>>> where there is an equivalent feature name (e.g. "border-radius");
>>>>> we'll have to wait for QuirksMode and MobileHTML5 data until we
>>>>> have the source for that, but we will launch an "explainer" page
>>>>> that tells about all our data sources and our timeline.
>>>>>
>>>>> Does this seem like a doable approach?
>>>>>
>>>>> Regards-
>>>>> -Doug
>>>>>
>>>>> On 10/23/13 9:24 PM, Julee wrote:
>>>>>> Thanks much, Ronald! And everyone who is sharing their data as
>>>>>> is!
>>>>>>
>>>>>> I've sent feelers out regarding a work space in London next week.
>>>>>> Will let you know if I hear anything.
>>>>>>
>>>>>> In the meantime, do you have a sense of how long it might take
>>>>>> to normalize this phase-1 data? No biggie, just looking to fill
>>>>>> out the CSS-properties schedule.
>>>>>>
>>>>>> Regards!
>>>>>>
>>>>>> Julee
>>>>>> ----------------------------
>>>>>> julee@adobe.com
>>>>>> @adobejulee
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ronald Mansveld <ronald@ronaldmansveld.nl>
>>>>>> Date: Tuesday, October 22, 2013 3:47 PM
>>>>>> To: Alex Komoroske <komoroske@google.com>
>>>>>> Cc: Niels Leenheer <info@html5test.com>, julee <julee@adobe.com>,
>>>>>> "public-webplatform-tests@w3.org" <public-webplatform-tests@w3.org>
>>>>>> Subject: Re: WebPlatform Browser Support phased approach?
>>>>>>
>>>>>>> Alex Komoroske wrote on 2013-10-22 17:48:
>>>>>>>> I strongly support a phased approach. I'm very excited about
>>>>>>>> the prospect of having a more robust system set up, but as
>>>>>>>> far as the CSS Properties launch goes, it's more important to
>>>>>>>> have _something_, even if it's just a one-time import from a
>>>>>>>> couple of sources.
>>>>>>>>
>>>>>>>
>>>>>>> I feel like there is support to do a phased approach, plus we
>>>>>>> have access to a (basic) set of data to get started. Coupled
>>>>>>> with the urgency to get CSS live (which I absolutely support,
>>>>>>> we've been in alpha long enough now ;) ), I think this is
>>>>>>> indeed the right path to follow. Plus, this buys us time to
>>>>>>> come up with a good plan and schemata for the data-exchange we
>>>>>>> want to use in the future.
>>>>>>>
>>>>>>> Next week I'll be in London. If anyone knows a place where I can
>>>>>>> work, I can start building the first scripts to parse the
>>>>>>> data. I've checked out the Mozilla Open Office, but to me it's
>>>>>>> pretty unclear whether that is still in use, and if so: if and
>>>>>>> how I can use it. Do we have any Mozilla-employees on the list?
>>>>>>> Or do we have Googlers that know if perhaps the Google office
>>>>>>> can be used? Or any Londoners that know of a place?
>>>>>>>
>>>>>>> Worst case scenario I think I can use the City Business
>>>>>>> Library, but my experience is that libraries are not always the
>>>>>>> best place to work from, especially not if you try to make full
>>>>>>> office hours.
>>>>>>>
>>>>>>> Ronald
Received on Friday, 1 November 2013 17:48:10 UTC
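
The <br>-marker workaround mentioned in the top message of this thread (swap each <br> inside a table cell for a known string, then split the cell text on that string) might look roughly like the sketch below. It is only an illustration under assumptions: the thread does not show Ronald's actual script or its language, so Python's standard `html.parser` is used here, and the `MARKER` value, the `CellTextParser` and `split_cells` names, and the sample markup are all hypothetical rather than real MDN data.

```python
from html.parser import HTMLParser

# Hypothetical marker string; any text that cannot occur in a cell works.
MARKER = "|||"


class CellTextParser(HTMLParser):
    """Collect the text of every <td>, replacing each <br> with MARKER."""

    def __init__(self):
        super().__init__()
        self.cells = []    # finished cell texts, in document order
        self._buf = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buf = []
        elif tag == "br" and self._buf is not None:
            # The line break becomes a textual marker inside the cell text.
            self._buf.append(MARKER)

    def handle_endtag(self, tag):
        if tag == "td" and self._buf is not None:
            self.cells.append("".join(self._buf))
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def split_cells(html):
    """Return one list of trimmed, non-empty text parts per table cell."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return [
        [part.strip() for part in cell.split(MARKER) if part.strip()]
        for cell in parser.cells
    ]


# Made-up sample row: prefixed and unprefixed versions share one cell.
sample = "<table><tr><td>4.0 -webkit<br>5.0</td><td>9.0</td></tr></table>"
print(split_cells(sample))
```

Each cell comes back as a list of its line-break-separated parts, so a prefixed and an unprefixed version number that share one cell can be handled as separate values before being mapped to JSON.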