Support callback dependencies #2378

@mdekstrand

Supporting 'callback' dependencies and outputs (for lack of a better term) would let DVC control processes that store their data somewhere DVC cannot read as files, without requiring DVC to grow a new remote type for every such storage system.

This would generalize what is possible with HTTP outputs and dependencies.

An example of what this could look like in a DVC file:

deps:
- cmd: python step-status.py --my-step
  md5: <checksum>

Instead of consulting a file, as with path:, DVC would run the specified command (which should be relatively quick), and compute the MD5 hash of its output. That command could do whatever it needs to in order to get the data status.
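
A minimal sketch of the check described above, to make the proposal concrete. This is not DVC code: the function names `callback_checksum` and `is_up_to_date` are hypothetical, and the sketch assumes the checksum covers only the command's standard output and that a failing command should raise an error rather than produce a checksum.

```python
import hashlib
import shlex
import subprocess


def callback_checksum(cmd: str) -> str:
    """Run cmd and return the MD5 hex digest of its standard output."""
    result = subprocess.run(
        shlex.split(cmd),     # split the command string without invoking a shell
        capture_output=True,  # collect stdout (and stderr) as bytes
        check=True,           # assumption: a failing status command is an error
    )
    return hashlib.md5(result.stdout).hexdigest()


def is_up_to_date(cmd: str, recorded_md5: str) -> bool:
    """Compare a fresh checksum with the md5 value recorded in the DVC file."""
    return callback_checksum(cmd) == recorded_md5
```

With something like this, the dependency counts as changed whenever the command prints different output, so the command has to produce deterministic output for an unchanged data state.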

My specific use case is using DVC to control a large data import pipeline that processes raw data files, loads them into PostgreSQL, and performs some additional computations involving intermediate files. I would implement a script that extracts data status from the PostgreSQL database so that DVC can check whether a step is up-to-date with the actual data currently in the database. I could implement this with HTTP dependencies and something like PostgREST, but that would introduce additional infrastructure requirements for people using the code (namely, a running PostgREST server).
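
As a rough illustration of what such a status script could look like, here is a sketch under several assumptions. The table `import_status` and its columns are hypothetical, and `sqlite3` stands in for a PostgreSQL driver so that the example is self-contained; a real script would connect to PostgreSQL and use that driver's parameter placeholder style instead of `?`.

```python
import sqlite3  # stand-in for a PostgreSQL driver in this sketch


def step_status(conn, step: str) -> str:
    """Return a deterministic text summary of the recorded state of one step."""
    cur = conn.cursor()
    # Hypothetical status table; ORDER BY keeps the output stable between runs.
    cur.execute(
        "SELECT step, finished_at, row_count FROM import_status "
        "WHERE step = ? ORDER BY finished_at",
        (step,),
    )
    # One tab-separated line per row, so equal data always prints equal text.
    return "\n".join("\t".join(str(v) for v in row) for row in cur.fetchall())
```

The script would print this summary, and DVC would hash the printed text. The explicit ordering matters because the hash changes if the same rows come back in a different order.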
