PII, data models and metadata #1889
Replies: 1 comment
-
I love the idea of tags being metadata that becomes available in other places that reference the Data Model.
But it's probably worth being really explicit that particular keyword tags (like #PII) could/should have direct effects on the underlying data and/or infra, e.g. automatically applying obfuscation transformations, automatically applying encryption in the database, automatically setting retention policy in the database, etc. This seems to be where tags become more "Moose-y" and do some real heavy lifting for the user, in terms of abstracting away infra complexity. I'm also wondering if/how the solution could cover short tags like "PII", but also longer-form metadata like descriptions?
-
There are certain types of data that are required, for various reasons, to be treated differently. One of the key examples of that is Personally Identifiable Information (PII).
Tag-based solution
Within the Moose / Boreal system, the life-cycle of data is largely managed with respect to Data Models. These objects are the primitive by which data is ingested, stored, and processed through streaming functions, and efforts are being made to include the data model primitive in the block and egress APIs.
One way of dealing with PII or other types of sensitive data could be to attach “tags”.
This could be as coarse as a single boolean on the data model as a whole, signifying that the model contains PII, or as granular as a boolean on each “column”.
The presence of these tags could be used to restrict how the data is treated (e.g. it can only be egressed in certain ways without encryption, etc.).
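To make the boolean version concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `Column`, `DataModel`, `can_egress` and the "PII may only leave encrypted" rule are illustrative names and policy, not existing Moose / Boreal APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    pii: bool = False  # column-level tag (hypothetical)


@dataclass(frozen=True)
class DataModel:
    name: str
    columns: tuple[Column, ...]
    pii: bool = False  # model-level tag (hypothetical)

    def contains_pii(self) -> bool:
        # The model counts as PII if it is tagged as a whole
        # or if any single column carries the tag.
        return self.pii or any(c.pii for c in self.columns)


def can_egress(model: DataModel, encrypted: bool) -> bool:
    # Illustrative policy only: PII-bearing models may leave
    # the system only through an encrypted channel.
    return encrypted or not model.contains_pii()


user_activity = DataModel(
    name="UserActivity",
    columns=(Column("user_email", pii=True), Column("page_url")),
)
page_view = DataModel(name="PageView", columns=(Column("page_url"),))
```

With these definitions, `can_egress(page_view, encrypted=False)` is allowed, while `user_activity` passes only when `encrypted=True`, because one of its columns is tagged.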
Complex tags
However, if we are implementing tags, there is no reason to make them as limited as a boolean for PII.
First: PII could be modelled as something more complex than a yes/no flag (e.g. with a view to differential privacy: each field carries a “severity”, and once a source has enough fields of a high enough severity, the source as a whole becomes PII).
Second: there isn’t really a reason not to allow arbitrary key-value pairs of metadata. Why not tag internally sensitive data, or add tags that relate to retention policy, etc.?
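A rough sketch of both ideas together, again in Python. The tag keys (`pii_severity`, `internal_only`, `description`), the additive scoring and the threshold of 3 are all assumptions made up for illustration; this is a simple severity sum, not an implementation of differential privacy.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    # Arbitrary key/value metadata. The keys used below are illustrative only.
    tags: dict = field(default_factory=dict)


def is_pii(columns, threshold=3):
    # Sum the per-column severities; untagged columns contribute 0.
    # The source is treated as PII once the total reaches the threshold.
    total = sum(int(c.tags.get("pii_severity", 0)) for c in columns)
    return total >= threshold


columns = [
    Column("zip_code", {"pii_severity": 1}),
    Column("birth_date", {"pii_severity": 1}),
    Column("gender", {"pii_severity": 1}),
    Column("margin", {"internal_only": True, "description": "Gross margin per order"}),
]
```

Here no single column is severe enough on its own, but `is_pii(columns)` is true because the three low-severity fields add up to the threshold. Dropping any one of them brings the total below it. The `margin` column shows that the same `tags` dict can carry unrelated metadata such as an internal-sensitivity flag or a longer description.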
What should we build
I’d love to hear perspectives, but it seems that a simple metadata system attached to the data model, plus a way of referring to the data model’s metadata in transformations, ClickHouse policy, etc., would prove out this model.
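As a starting point for discussion, here is a sketch of what "referring to the metadata" might look like from a transformation and from a storage policy. The shape of `MODEL_METADATA`, the table and column names, and both helper functions are hypothetical. The generated statement uses ClickHouse's `ALTER TABLE ... MODIFY TTL` form, but nothing here executes it.

```python
import hashlib

# Hypothetical metadata attached to a data model.
MODEL_METADATA = {
    "table": "UserActivity",
    "timestamp_column": "event_time",
    "retention_days": 30,
    "columns": {
        "user_email": {"pii": True},
        "page_url": {},
    },
}


def obfuscate(record, metadata):
    # Replace every value whose column is tagged as PII with a SHA-256 digest.
    # An unsalted hash is only pseudonymisation; it stands in here for
    # whatever obfuscation the platform would really apply.
    out = {}
    for key, value in record.items():
        if metadata["columns"].get(key, {}).get("pii"):
            out[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        else:
            out[key] = value
    return out


def ttl_statement(metadata):
    # Build a ClickHouse TTL statement from the retention tag.
    return (
        f"ALTER TABLE {metadata['table']} "
        f"MODIFY TTL {metadata['timestamp_column']} "
        f"+ INTERVAL {metadata['retention_days']} DAY"
    )


masked = obfuscate(
    {"user_email": "a@example.com", "page_url": "/home"}, MODEL_METADATA
)
```

The point of the sketch is that the user only declares `pii` and `retention_days` once on the model; the masking step and the retention policy are both derived from that declaration rather than written by hand.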
Request for discussion