varchar, int, timestamp, decimal, etc. But, people reason over higher level data types. A varchar might be a name, or phone number. An integer might be a daily active user metric.
SDF has the ability to annotate columns and tables with user defined types and then automatically propagate those types to downstream assets while respecting aggregations, or functional transformations of data. This
unique capability fosters the creation of a dynamic semantic layer which adapts as you build out your data warehouse.
Classifiers are first-class citizens of the SDF ecosystem and are an added layer of metadata on top of SQL models. They are completely compatible with all dialects and databases that SDF supports.
You can think of them like rich types in a language like Typescript. They can be defined, reused, transformed, and propagated programatically by SDF.
SDF uses the term Classifier and SQL Type interchangeably. Any column or table may have 0 or more classifiers.
Using SQL Types & classifiers
What are popular examples of SQL Types?- Personally Identifiable Information like phone numbers, addresses, etc.
- Metrics
- Financial Information
- Unique Identifiers
- Retention Requiremenets
- Time Grains
- Write Reports
- Write Checks
- Audit
- Prevent Business Logic Mismatch
- Ensure Consistency Across Metrics
- Customer Segmentation
Basic Propagation
Propagation is the mechanism of inferring the downstream column types from the upstream column types. The inference process is based on the semantics of the underlying transformation.In the following example, the
lineage sample workspace is used. If this workspace is not already
set up, it can be created with an sdf new --sample lineage command.Defining a Classifier Taxonomy
To work with classifiers and propagation we must first define all the relevant classifier domains. Here is an example of a simple PII classifier domain, defined in a separate section of the globalworkspace.sdf.yml config file.
workspace.sdf.yml
PII with 4 labels: PII.Phone,
PII.Address, PII.SSN, and PII.UID, where the last four labels denote
special classes of the first label. (I.e. if some data is labeled with
PII.Address, it is also, implicitly, labeled with PII.)
Classifier domains can also be defined in separate
sdf.yml files as long as
these files are included by the workspace.sdf.yml file as one of the
paths specified in includes.Attaching Classifiers to Tables
Once a classifier domain is defined, we can manually attach classifiers to tables. Once root tables are annotated with the relevant classifiers, SDF automatically propagates classifiers downstream. Let’s start by labeling oursource. We add a new file (models/source.sdf.yml) containing our type:
models/source.sdf.yml
/models directory. Alternatively, the same configuratin can be included
directly into the workspace.sdf.yml file as follows:
workspace.sdf.yml
Inspecting Propagation Results
To see the classifer, and how it was propagated to all tables downstream of oursource table, run sdf compile --show all.
middle and sink, but not to knis – because it
doesn’t have any columns derived from the PII columns upstream.
Advanced Propagation
By default classifiers propagate through functions unchanged. For example, if a substring is extracted from a column labeled asPII.Phone, the result
will also be labeled with PII.Phone by default.
However, the result is no longer a phone number - maybe it’s just the area code!
Let’s imagine the result of the substring expression is only area codes. We can use a function block in our sdf.yml files to define the behavior of a classifier in response to the function being called.
In our current example, we’d want to reclassify PII.Phone to PII.AreaCode. Here’s the function overload required to do so:
workspace.sdf.yml
PII.Phone, it will be reclassified to PII.AreaCode.
Another common case is to prevent a classifier from propagating through an aggregation. For example, if a column user_id is labeled with the classifier PII.UID and we SELECT COUNT(DISTINCT user_id) from the table containing this column, we don’t want PII.UID to propagate since the COUNT DISTINCT is not a PII.UID. It’s a number representing the unique count of PII.UIDs. We can use the same reclassify block to prevent this propagation. Here’s an example:
workspace.sdf.yml
to value in the reclassify block, we remove the PII.UID classifier downstream and achieve our desired behavior.
Function reclassifications work across all queries by default. In this way, developers have full programmatic control over how their classifiers propagate and interact with functions across their entire data warehouse.
reclassify block. Here’s an example below:
models/phone_trimmed.sdf.yml
phone_first_three_digits will be reclassified to PII.AreaCode if it is derived from a column with the classifier PII.Phone.
Classifier States
Using labels, classifiers can also exist in states that communicate how the data is currently represented. Here is how a classifier state could be defined in YML:workspace.sdf.yml
PII_STATE.clear_text is meant to represent human-readable PII, and anonymized is meant to represent anonymized PII. We can then use these states to define the effect of functions (including User Defined Functions (UDFs)) on classifiers like the examples above. For example, we can define the effect of an md5 hash as follows:
workspace.sdf.yml
anonymize which anonymizes PII no matter its current state. For this, we can use a glob pattern (e.g. *) to tell SDF to reclassify a PII_STATE classifier no matter its current state. Here’s an example:
workspace.sdf.yml
PII_STATE.* in the from field of the reclassify block. This tells SDF to reclassify any PII_STATE classifier, no matter its current state, to PII_STATE.anonymized.
Wondering why we also defined the
parameters and returns in this function block? This is because this function is a User Defined Function (UDF).Organizing Classifiers
When organizing classifiers, it is recommended to create a folder to host all Classfiers created by you and your team. Setting up a location within the working directory calledclassifiers is the recommended setting.
Once created, this location will need to be ammended to the workspace.sdf.yml so that it is included.
workspace.sdf.yml