Supporting 'callback' dependencies and outputs (for lack of a better term) would enable a number of interesting possibilities for using DVC to control processes that store data outside of DVC's file access, without requiring an explosion of remote access possibilities.
This would generalize what is possible with HTTP outputs and dependencies.
An example of what this could look like in a DVC file:
deps:
  - cmd: python step-status.py --my-step
    md5: <checksum>
Instead of consulting a file, as with `path:`, DVC would run the specified command (which should be relatively quick) and compute the MD5 hash of its output. That command could do whatever it needs to in order to get the data status.
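To make the proposed check concrete, here is a minimal sketch of how such a dependency could be evaluated. This is not existing DVC code: the function names `command_checksum` and `is_up_to_date` and the 30-second timeout are assumptions made for illustration only.

```python
import hashlib
import shlex
import subprocess


def command_checksum(cmd: str, timeout: float = 30.0) -> str:
    """Run `cmd` and return the MD5 hex digest of its standard output.

    Hypothetical helper, not part of DVC. The timeout reflects the
    expectation that status commands are relatively quick.
    """
    result = subprocess.run(
        shlex.split(cmd),     # split the command line without invoking a shell
        capture_output=True,  # collect stdout (and stderr) as bytes
        check=True,           # a failing status command raises CalledProcessError
        timeout=timeout,
    )
    return hashlib.md5(result.stdout).hexdigest()


def is_up_to_date(cmd: str, recorded_md5: str) -> bool:
    """Compare the current output hash with the checksum stored in the DVC file."""
    return command_checksum(cmd) == recorded_md5
```

With something like this, a stage would count as changed whenever `is_up_to_date("python step-status.py --my-step", "<checksum>")` returns `False`, mirroring how a changed file checksum invalidates a `path:` dependency.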
My specific use case is using DVC to control a large data import process that processes raw data files, loads them into PostgreSQL, and performs some additional computations involving intermediate files. I would implement a script that extracts data status from the PostgreSQL database so that DVC can check whether a step is up-to-date with the actual data currently in the database. I could implement this with HTTP dependencies and something like PostgREST, but that would introduce additional infrastructure requirements for people using the code (namely, a running PostgREST server).
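As a rough illustration of the status script described above, the sketch below prints a deterministic summary of what is loaded, so that its hash changes only when the data does. The table `import_status` and its columns are hypothetical, and `sqlite3` is used purely as a stand-in so the example runs anywhere; a real script would connect to PostgreSQL through a driver such as psycopg2.

```python
import sqlite3  # stand-in for a PostgreSQL driver, for demonstration only

# Hypothetical bookkeeping table; ORDER BY keeps the output stable across runs.
STATUS_QUERY = "SELECT step, row_count, loaded_at FROM import_status ORDER BY step"


def data_status(conn) -> str:
    """Return one tab-separated line per import step, in a fixed order."""
    cur = conn.cursor()
    cur.execute(STATUS_QUERY)
    return "".join(
        f"{step}\t{row_count}\t{loaded_at}\n"
        for step, row_count, loaded_at in cur.fetchall()
    )


if __name__ == "__main__":
    # Demo with an in-memory database and made-up rows.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE import_status (step TEXT, row_count INTEGER, loaded_at TEXT)"
    )
    conn.execute("INSERT INTO import_status VALUES ('ratings', 100, '2019-01-02')")
    conn.execute("INSERT INTO import_status VALUES ('authors', 42, '2019-01-01')")
    print(data_status(conn), end="")
```

The important property is that the output depends only on the database contents, not on query order or the current time; otherwise the computed MD5 would change on every run and the step would never be considered up to date.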