Tech Report: Technologies - major.minor versions granularity #48

Open. max-ostapenko wants to merge 33 commits into main.

max-ostapenko (Contributor) commented Jan 12, 2025

Related to HTTPArchive/httparchive.org#984

As the aggregation changes, we have new schemas and new tables for the tech report.

I placed them in the reports dataset:

  • tech_crux (successor of core_web_vitals.technologies)
  • tech_report_adoption
  • tech_report_categories
  • tech_report_core_web_vitals
  • tech_report_lighthouse
  • tech_report_page_weight
  • tech_report_technologies
  • tech_report_versions

Notes:

  • removed a few columns from tech_crux (as compared to core_web_vitals.technologies):

    • category
    • origins_with_good_cwv_2023 and origins_with_good_cwv_2024, deduplicated into origins_with_good_cwv
  • removed the empty similar_technologies column from technologies

  • all the metrics have an 'ALL' version that aggregates at the technology level and is expected to match the current values:
    Screenshot 2025-01-26 at 23 55 37

  • as in the current approach, tech_report_versions has the full adoption data from crawl.pages, while tech_report_adoption has smaller absolute values because of the JOIN with CrUX.
    Screenshot 2025-01-27 at 00 05 01

  • example of the technology versions:

SELECT
  version,
  origins
FROM `reports.tech_report_versions`
WHERE technology = 'WordPress' AND
  client = 'mobile'
ORDER BY
  origins DESC

Screenshot 2025-01-26 at 23 44 13

max-ostapenko changed the title from "Major versions granularity for Tech Reports" to "Technology major versions granularity for Tech Reports" on Jan 12, 2025
max-ostapenko changed the title from "Technology major versions granularity for Tech Reports" to "Tech Report: Technologies - major versions granularity" on Jan 12, 2025
max-ostapenko marked this pull request as ready for review on January 26, 2025 22:46
definitions/output/reports/tech_crux.js: 5 review threads (outdated, resolved)
max-ostapenko (Contributor, Author) commented Jan 27, 2025

@tunetheweb @rviscomi FYI: I removed a few columns from tech_crux (compared to core_web_vitals.technologies, which is to be deprecated):

  • category
  • origins_with_good_cwv_2023 and origins_with_good_cwv_2024, deduplicated into origins_with_good_cwv

max-ostapenko changed the title from "Tech Report: Technologies - major versions granularity" to "Tech Report: Technologies - major.minor versions granularity" on Jan 30, 2025
max-ostapenko (Contributor, Author) commented Jan 30, 2025

Some test cases for the version pattern:

SELECT
  version,
  REGEXP_EXTRACT(version, r'\d+(?:\.\d+)?') AS major_minor
FROM UNNEST(['1.2.3', '01976.2.83', '0003.3.4', '0.0.1', '1.2', 'version 5.1.2', '8']) AS version

Result:

version        | major_minor
1.2.3          | 1.2
01976.2.83     | 01976.2
0003.3.4       | 0003.3
0.0.1          | 0.0
1.2            | 1.2
version 5.1.2  | 5.1
8              | 8

max-ostapenko (Contributor, Author) commented

@rviscomi @tunetheweb After expanding the pattern to major + minor, it's now obvious how messy the data is.
I've tried a stricter (semver-style) pattern, ^((?:0|[1-9]\d*)\.(?:0|[1-9]\d*))\.(?:0|[1-9]\d*), and a more relaxed one, \d+(?:\.\d+)?, but neither is great:
Screenshot 2025-01-31 at 21 45 48

Examples of technologies that are omitted by the stricter pattern:
Screenshot 2025-01-31 at 21 42 34

I was thinking of doing something like IF(stricter IS NOT NULL, stricter, relaxed).
WDYT? Any suggestions?
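
For illustration, here is a minimal Python sketch of that fallback, run on the test strings from the earlier comment. It assumes Python's re module behaves like BigQuery's RE2-based REGEXP_EXTRACT for these two patterns; the function names are hypothetical and not part of this PR.

```python
import re

# Patterns copied from the comment above. Python's `re` is a stand-in for
# BigQuery's REGEXP_EXTRACT here (an assumption, not the PR's implementation).
STRICT = re.compile(r'^((?:0|[1-9]\d*)\.(?:0|[1-9]\d*))\.(?:0|[1-9]\d*)')
RELAXED = re.compile(r'\d+(?:\.\d+)?')


def strict_major_minor(version):
    """Return major.minor only when the string starts with a semver-like x.y.z."""
    match = STRICT.search(version)
    return match.group(1) if match else None


def relaxed_major_minor(version):
    """Return the first run of digits, plus one dotted part if present."""
    match = RELAXED.search(version)
    return match.group(0) if match else None


def major_minor(version):
    """Equivalent of IF(stricter IS NOT NULL, stricter, relaxed)."""
    strict = strict_major_minor(version)
    return strict if strict is not None else relaxed_major_minor(version)


samples = ['1.2.3', '01976.2.83', '0003.3.4', '0.0.1', '1.2', 'version 5.1.2', '8']
for version in samples:
    print(version, strict_major_minor(version), major_minor(version), sep=' | ')
```

On these sample inputs the fallback returns the same value as the relaxed pattern alone, so the combined expression may not change the output much in practice.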

rviscomi (Member) commented

I think relaxed is fine - we're just echoing what the site declared its version to be. When we aggregate technologies by version, the most popular ones will bubble up to the top anyway.
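
A rough Python sketch of that aggregation, to show how noisy declared versions collapse into major.minor buckets and the popular ones rise to the top. The origins and versions below are hypothetical, not crawl data, and dropping strings without any digits is an assumption rather than something this PR specifies.

```python
import re
from collections import Counter

# Relaxed pattern from this thread; Python's `re` stands in for BigQuery here.
RELAXED = re.compile(r'\d+(?:\.\d+)?')

# Hypothetical (origin, declared version) pairs for a single technology.
detections = [
    ('https://a.example', '6.7.1'),
    ('https://b.example', '6.7.2'),
    ('https://c.example', '6.6'),
    ('https://d.example', 'version 6.7'),
    ('https://e.example', '01976.2.83'),
    ('https://f.example', 'unknown'),
]


def count_origins_by_version(rows):
    """Count distinct origins per extracted major.minor version."""
    origins_by_version = {}
    for origin, raw_version in rows:
        match = RELAXED.search(raw_version)
        if match is None:
            continue  # assumption: rows without a numeric version are skipped
        origins_by_version.setdefault(match.group(0), set()).add(origin)
    return Counter({v: len(origins) for v, origins in origins_by_version.items()})


counts = count_origins_by_version(detections)
print(counts.most_common())
```

The odd value ('01976.2') is kept as declared but ends up at the bottom of the ranking, which is the behavior described above.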

max-ostapenko (Contributor, Author) commented

Then, if no more questions, it's ready to be merged.

max-ostapenko (Contributor, Author) commented Jan 31, 2025

Or, considering we don't want to look into the long tail, maybe we limit it to the top 50 versions per technology?
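
A minimal sketch of what such a limit could look like, written in Python for illustration only. The rows and the top_versions helper are hypothetical; in BigQuery the same idea would more likely be a window function such as ROW_NUMBER() partitioned by technology and ordered by origins.

```python
from collections import defaultdict

# Hypothetical (technology, version, origins) rows; the numbers are made up.
rows = [
    ('WordPress', '6.7', 900),
    ('WordPress', '6.6', 400),
    ('WordPress', '01976.2', 1),
    ('WordPress', '5.9', 120),
    ('jQuery', '3.7', 700),
    ('jQuery', '1.12', 650),
]


def top_versions(rows, limit=50):
    """Keep the `limit` versions with the most origins for each technology."""
    by_technology = defaultdict(list)
    for technology, version, origins in rows:
        by_technology[technology].append((version, origins))
    return {
        technology: sorted(versions, key=lambda pair: pair[1], reverse=True)[:limit]
        for technology, versions in by_technology.items()
    }


# A small limit makes the truncation visible on the toy data.
top = top_versions(rows, limit=2)
print(top)
```

If the 'ALL' row lives in the same table, it would presumably need to be excluded from the ranking so that it does not take one of the 50 slots.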

3 participants