PII, data models and metadata #1889

oatsandsugar · 2024-11-20T01:26:35Z

oatsandsugar
Nov 20, 2024
Maintainer

Certain types of data must, for various reasons, be treated differently. A key example is Personally Identifiable Information (PII).

Tag-based solution

Within the Moose / Boreal system, the life-cycle of data is largely managed with respect to Data Models. These objects are the primitive by which data is ingested, stored, and processed through streaming functions, and efforts are being made to include the data model primitive in the block and egress APIs.

One way of dealing with PII or other types of sensitive data could be to attach “tags”.

This could be as simple as a single boolean on the data model as a whole, signifying that the model contains PII, or as granular as a tag on each “column”.
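
To make the two granularities concrete, here is a minimal sketch in plain Python dataclasses. It is not an existing Moose API: `UserActivity`, `contains_pii`, the `"pii"` metadata key, and `pii_columns` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import ClassVar


@dataclass
class UserActivity:
    # Coarse option: one boolean for the whole model (hypothetical attribute).
    contains_pii: ClassVar[bool] = True

    event_id: str
    # Granular option: tag individual columns via dataclass field metadata.
    email: str = field(metadata={"pii": True})
    page: str = field(metadata={"pii": False})


def pii_columns(model) -> list[str]:
    """Return the names of the columns tagged as PII on a data model."""
    return [f.name for f in fields(model) if f.metadata.get("pii", False)]


print(UserActivity.contains_pii)
print(pii_columns(UserActivity))
```

The column-level form can always be rolled up into the model-level flag (any tagged column implies the model contains PII), which is one argument for starting with the granular version.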

The presence of these tags could then restrict how the data is treated (e.g. it can only be egressed in certain ways unless it is encrypted).
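
A rough sketch of such a restriction, assuming column-level tags stored in dataclass field metadata. `Customer`, `EgressNotAllowed`, and `check_egress` are hypothetical names, and the rule itself (PII requires encrypted egress) is just one possible policy.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Customer:
    customer_id: str
    email: str = field(metadata={"pii": True})  # hypothetical "pii" tag


class EgressNotAllowed(Exception):
    """Raised when a tagged model is egressed in a disallowed way."""


def check_egress(model, encrypted: bool) -> None:
    """Refuse unencrypted egress for any model with PII-tagged columns."""
    tagged = [f.name for f in fields(model) if f.metadata.get("pii")]
    if tagged and not encrypted:
        raise EgressNotAllowed(
            f"{model.__name__} has PII columns {tagged}; encrypted egress required"
        )


check_egress(Customer, encrypted=True)  # passes silently
```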

Complex tags

However, if we are implementing tags, there is no reason to limit them to a boolean for PII.

First: PII could be modeled as something more complex than a flag. For example, with a view to differential privacy, each field in the class could carry a “severity”, and once a source has enough fields of a high enough severity, the source as a whole becomes PII.
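
One way to picture the severity idea is the sketch below. The `"pii_severity"` key, the additive scoring rule, and the threshold value are all assumptions made for illustration; this is a simple heuristic, not a differential privacy mechanism.

```python
from dataclasses import dataclass, field, fields

PII_THRESHOLD = 5  # assumed cut-off; a real value would come from policy


@dataclass
class Visit:
    # Each column carries a hypothetical severity score.
    zip_code: str = field(metadata={"pii_severity": 2})
    birth_date: str = field(metadata={"pii_severity": 2})
    gender: str = field(metadata={"pii_severity": 1})
    page: str = field(metadata={"pii_severity": 0})


def pii_score(model) -> int:
    """Sum the per-column severities (untagged columns count as 0)."""
    return sum(f.metadata.get("pii_severity", 0) for f in fields(model))


def is_pii(model, threshold: int = PII_THRESHOLD) -> bool:
    """Treat the whole source as PII once the combined severity is high enough."""
    return pii_score(model) >= threshold


print(pii_score(Visit), is_pii(Visit))
```

No single column here would be flagged on its own, but the combination crosses the threshold, which is the behavior a plain boolean cannot express.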

Second: there isn’t really a reason not to allow arbitrary key-value pairs of metadata. Why not include a tag for internally sensitive data, or tags that relate to retention policy, etc.?
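
As a sketch of what arbitrary key-value metadata might look like, again using plain dataclass field metadata rather than any existing Moose feature. The keys (`description`, `retention_days`, `internal_sensitive`) and the `columns_with` helper are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class Invoice:
    invoice_id: str = field(metadata={"description": "Primary key"})
    customer_email: str = field(metadata={"pii": True, "retention_days": 90})
    margin: float = field(metadata={"internal_sensitive": True})


def columns_with(model, key: str) -> dict[str, Any]:
    """Map column name to metadata value for every column that sets `key`."""
    return {f.name: f.metadata[key] for f in fields(model) if key in f.metadata}


print(columns_with(Invoice, "retention_days"))
print(columns_with(Invoice, "internal_sensitive"))
```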

What should we build

I’d love to hear perspectives, but it seems that a simple metadata system attached to the data model, plus a way to refer to the data model’s metadata in transformations, ClickHouse policy, etc., would prove out this model.
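
A minimal sketch of both halves, under stated assumptions: metadata lives on a plain Python dataclass, a transformation reads it to drop tagged columns, and a retention value is rendered into a ClickHouse `ALTER TABLE ... MODIFY TTL` statement. `PageView`, `model_metadata`, `strip_pii`, `retention_ttl_sql`, and the table and column names are all hypothetical, and nothing here is an existing Moose API.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import ClassVar


@dataclass
class PageView:
    # Model-level metadata (hypothetical attribute name).
    model_metadata: ClassVar[dict] = {"retention_days": 30}

    viewed_at: str
    url: str
    ip_address: str = field(metadata={"pii": True})


def strip_pii(record) -> dict:
    """Transformation step: drop every column tagged as PII."""
    tagged = {f.name for f in fields(record) if f.metadata.get("pii")}
    return {k: v for k, v in asdict(record).items() if k not in tagged}


def retention_ttl_sql(model, table: str, time_column: str) -> str:
    """Render the model's retention metadata as a ClickHouse table TTL statement."""
    days = int(model.model_metadata["retention_days"])
    return f"ALTER TABLE {table} MODIFY TTL {time_column} + INTERVAL {days} DAY"


view = PageView("2024-11-20 01:26:35", "/home", "203.0.113.7")
print(strip_pii(view))
print(retention_ttl_sql(PageView, "page_views", "viewed_at"))
```

The point of the sketch is that both consumers read the same metadata, so the person tagging the model does not need to touch the transformation or the database policy.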

Request for discussion

What types of protection do you need for sensitive data?
What types of data do you classify as PII?
How do you currently manage metadata?
How does your legal team set its policy? Is there a translation to code? Would you be amenable to having the policy interpreted by LLMs for enactment in a data system?
Any thoughts about implementation, especially about ergonomics where there may be multiple users (one building the data model, one tagging it as "PII")?

03cranec · 2024-11-20T17:32:44Z

03cranec
Nov 20, 2024
Maintainer

I love the idea of tags being metadata that becomes available in other places that reference the Data Model:

Other Moose primitives like Functions/Blocks/APIs that reference the Data Model primitive (via Python/TS conventions?)
Other services, like the Moose CLI, Boreal Cloud, or a third party data catalog (via API?)

But it's probably worth being really explicit that particular keyword tags (like #PII) could/should have direct effects on the underlying data and/or infra, e.g. automatically applying obfuscation transformations, automatically applying encryption in the database, automatically setting retention policy in the database, etc. This seems to be where tags become more "Moose-y" and do some real heavy lifting for the user, in terms of abstracting away infra complexity.
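
To illustrate the "keyword tag triggers an effect" idea, here is a small sketch in which a `PII` tag automatically applies an obfuscation step. The `TAG_EFFECTS` registry, the `"tags"` metadata key, `Signup`, and `apply_tag_effects` are hypothetical, and SHA-256 hashing stands in for whatever obfuscation would actually be chosen (an unsalted hash alone is not strong anonymization).

```python
import hashlib
from dataclasses import asdict, dataclass, field, fields


def hash_value(value: str) -> str:
    """Stand-in obfuscation: replace the value with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical registry: keyword tag -> effect applied to tagged columns.
TAG_EFFECTS = {"PII": hash_value}


@dataclass
class Signup:
    plan: str
    email: str = field(metadata={"tags": ["PII"]})


def apply_tag_effects(record) -> dict:
    """Apply the registered effect of every keyword tag on each column."""
    out = asdict(record)
    for f in fields(record):
        for tag in f.metadata.get("tags", []):
            effect = TAG_EFFECTS.get(tag)
            if effect is not None:
                out[f.name] = effect(str(out[f.name]))
    return out


print(apply_tag_effects(Signup(plan="pro", email="a@example.com")))
```

Tags with no registered effect are simply carried along as metadata, which leaves room for purely descriptive tags next to the ones that change behavior.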

I'm also wondering if/how the solution could cover short tags like "PII", but also longer form metadata like descriptions?
