Include files array with tokens in production-snapshot blob storage definitions #176

@elaine-mattos

Feature Request: Include files with tokens in production-snapshot

Hi ClearlyDefined community!

We're building an internal tool to help our developers manage license compliance, and we'd love to leverage ClearlyDefined's data. Specifically, we want to use the production-snapshot blob storage to avoid API rate limits while fetching license files for the libraries our teams use.

The Challenge

Currently, we're unable to retrieve license files from the attachments endpoint because:

  1. The backup process uses the definitions-trimmed MongoDB collection, which doesn't include the files array
  2. Without the files array and their token properties, we can't fetch license files via the /attachments/{token} endpoint

This makes it impossible to retrieve the actual license files for components stored in the production snapshots.

Proposed Solution

We'd like to propose modifying the backup job to:

  1. Fetch data from the definitions-paged collection (which includes the files array)
  2. Filter to keep only files that have a token property (making them easily retrievable)
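The filtering step could look roughly like the sketch below. This is an illustration in Python, not the code from the PR: the helper name `keep_files_with_token` is hypothetical, and it assumes each definition read from definitions-paged is a plain dict whose `files` value is a list of objects shaped like the example in this issue.

```python
from copy import deepcopy


def keep_files_with_token(definition: dict) -> dict:
    """Return a copy of `definition` whose files list keeps only entries with a token.

    Hypothetical helper: assumes `definition["files"]` is a list of dicts,
    as described for the definitions-paged collection.
    """
    result = deepcopy(definition)
    files = result.get("files")
    if isinstance(files, list):
        # Keep an entry only if it carries a non-empty token.
        result["files"] = [
            entry for entry in files
            if isinstance(entry, dict) and entry.get("token")
        ]
    return result


# Illustrative input: one file with a token, one without.
sample = {
    "_id": "npm/npmjs/-/react-native-navigation-bar-color/2.0.2",
    "files": [
        {
            "path": "package/LICENSE",
            "license": "MIT",
            "token": "4bcebe9a76f1fbdef1ca52e59f8a97d45444ccdf6816cf4e9ce19af60b9ad6a0",
        },
        {"path": "package/index.js"},
    ],
}

trimmed = keep_files_with_token(sample)
```

Working on a copy leaves the source document untouched, so the same definition can still be used elsewhere in the job.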

Example

Current behavior - Files array is not present:

{
  "_id": "npm/npmjs/-/react-native-navigation-bar-color/2.0.2",
  "files": 16,
  "licensed": { ... }
}

Proposed behavior - Only files with tokens are included:

{
  "_id": "npm/npmjs/-/react-native-navigation-bar-color/2.0.2",
  "files": 16,
  "licensed": { ... },
  "files": [
    {
      "path": "package/LICENSE",
      "license": "MIT",
      "hashes": {
        "sha1": "8da5d6d75a66a60aedf29a5e70c07e4441b7cb13",
        "sha256": "4bcebe9a76f1fbdef1ca52e59f8a97d45444ccdf6816cf4e9ce19af60b9ad6a0"
      },
      "token": "4bcebe9a76f1fbdef1ca52e59f8a97d45444ccdf6816cf4e9ce19af60b9ad6a0"
    }
  ]
}
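With the proposed shape, a downstream tool could turn a snapshot definition into attachment URLs along these lines. This is only a sketch: the function name `attachment_urls` is hypothetical, and the base URL `https://api.clearlydefined.io` is an assumption about where the /attachments/{token} endpoint is served.

```python
from urllib.parse import quote

# Assumption: host serving the /attachments/{token} endpoint.
API_BASE = "https://api.clearlydefined.io"


def attachment_urls(definition: dict, base: str = API_BASE) -> dict:
    """Map each file path to its attachment URL, for files that carry a token.

    Hypothetical helper. Returns an empty dict when `files` is missing or is
    not a list (for example, when the snapshot only stores a file count).
    """
    files = definition.get("files")
    if not isinstance(files, list):
        return {}
    urls = {}
    for entry in files:
        token = entry.get("token") if isinstance(entry, dict) else None
        if token:
            urls[entry.get("path", "")] = f"{base}/attachments/{quote(token, safe='')}"
    return urls


# Illustrative definition using the values from the example in this issue.
definition = {
    "_id": "npm/npmjs/-/react-native-navigation-bar-color/2.0.2",
    "files": [
        {
            "path": "package/LICENSE",
            "license": "MIT",
            "token": "4bcebe9a76f1fbdef1ca52e59f8a97d45444ccdf6816cf4e9ce19af60b9ad6a0",
        }
    ],
}

urls = attachment_urls(definition)
```

The function deliberately tolerates today's snapshots, where no files array is present, by returning nothing instead of failing.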

Benefits

  • Downstream tools can fetch license files directly from ClearlyDefined without hitting API limits
  • Snapshot size won't increase significantly since we're only including files with retrievable tokens
  • Makes it clear which files are available for retrieval
  • Maintains backwards compatibility (only adds data that was previously missing)

PR Available

We've implemented this feature in a PR.
Happy to iterate based on community feedback!
