Data Classifiers

SQL only supports primitive types like varchar, int, timestamp, decimal, etc. But, people reason over higher level data types. A varchar might be a name, or phone number. An integer might be a daily active user metric. SDF has the ability to annotate columns and tables with user defined types and then automatically propagate those types to downstream assets while respecting aggregations, or functional transformations of data. This unique capability fosters the creation of a dynamic semantic layer which adapts as you build out your data warehouse. Classifiers are first-class citizens of the SDF ecosystem and are an added layer of metadata on top of SQL models. They are completely compatible with all dialects and databases that SDF supports. You can think of them like rich types in a language like Typescript. They can be defined, reused, transformed, and propagated programatically by SDF.

SDF uses the term Classifier and SQL Type interchangeably. Any column or table may have 0 or more classifiers.

Using SQL Types & classifiers

What are popular examples of SQL Types?

Personally Identifiable Information like phone numbers, addresses, etc.
Metrics
Financial Information
Unique Identifiers
Retention Requiremenets
Time Grains

What Can You Do With SQL Types?

Write Reports
Write Checks
Audit
Prevent Business Logic Mismatch
Ensure Consistency Across Metrics
Customer Segmentation

Higher order types have many advantages.

Basic Propagation

Propagation is the mechanism of inferring the downstream column types from the upstream column types. The inference process is based on the semantics of the underlying transformation.

In the following example, the lineage sample workspace is used. If this workspace is not already set up, it can be created with an sdf new --sample lineage command.

Defining a Classifier Taxonomy

To work with classifiers and propagation we must first define all the relevant classifier domains. Here is an example of a simple PII classifier domain, defined in a separate section of the global workspace.sdf.yml config file.

workspace.sdf.yml

workspace:
  edition: "1.3"
  includes:
    - path: models/
---
classifier:
  name: PII
  labels:
    - name: Phone
    - name: Address
    - name: SSN
    - name: UID

This configuration defines five A top level type PII with 4 labels: PII.Phone, PII.Address, PII.SSN, and PII.UID, where the last four labels denote special classes of the first label. (I.e. if some data is labeled with PII.Address, it is also, implicitly, labeled with PII.)

Classifier domains can also be defined in separate sdf.yml files as long as these files are included by the workspace.sdf.yml file as one of the paths specified in includes.

Attaching Classifiers to Tables

Once a classifier domain is defined, we can manually attach classifiers to tables. Once root tables are annotated with the relevant classifiers, SDF automatically propagates classifiers downstream. Let’s start by labeling our source. We add a new file (models/source.sdf.yml) containing our type:

models/source.sdf.yml

table:
  name: source
  columns:
    - name: user_id
      classifiers:
        - PII.UID
    - name: phone
      classifiers:
        - PII.Phone

This file will automatically be picked up by SDF because the workspace already includes the /models directory. Alternatively, the same configuratin can be included directly into the workspace.sdf.yml file as follows:

workspace.sdf.yml

workspace:
  edition: "1.3"
  includes:
    - path : models/
---
classifier:
  name: PII
  labels:
    - name: Phone
    - name: Address
    - name: SSN
    - name: UID
    - name: AreaCode
---
table:
  name: source
  columns:
    - name: user_id
      classifiers:
        - PII.UID
    - name: phone
      classifiers:
        - PII.Phone

Notice - all .sdf.yml files are composable! You may include one or more configuration in any .sdf.yml file, as long as the schema is respected.

Inspecting Propagation Results

To see the classifer, and how it was propagated to all tables downstream of our source table, run sdf compile --show all.

sdf compile --show all

Schema chain.pub.source
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| user_id     | bigint    | false       | PII.UID    |
| phone       | varchar   | false       | PII.Phone  |
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Schema chain.pub.middle
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| user_id     | bigint    | false       | PII.UID    |
| phone       | varchar   | false       | PII.Phone  |
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Schema chain.pub.knis
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Schema chain.pub.sink
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| uid         | bigint    | false       | PII.UID    |
| phone       | varchar   | false       | PII.Phone  |
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Observe how the classifiers attached to the upstream table automatically propagated downstream to middle and sink, but not to knis – because it doesn’t have any columns derived from the PII columns upstream.

Advanced Propagation

By default classifiers propagate through functions unchanged. For example, if a substring is extracted from a column labeled as PII.Phone, the result will also be labeled with PII.Phone by default. However, the result is no longer a phone number - maybe it’s just the area code! Let’s imagine the result of the substring expression is only area codes. We can use a function block in our sdf.yml files to define the behavior of a classifier in response to the function being called. In our current example, we’d want to reclassify PII.Phone to PII.AreaCode. Here’s the function overload required to do so:

workspace.sdf.yml

---
function:
  name: substring
  reclassify:
    - from: PII.Phone 
      to: PII.AreaCode

Great! Now every time the substring function is called on a column with the classifier PII.Phone, it will be reclassified to PII.AreaCode. Another common case is to prevent a classifier from propagating through an aggregation. For example, if a column user_id is labeled with the classifier PII.UID and we SELECT COUNT(DISTINCT user_id) from the table containing this column, we don’t want PII.UID to propagate since the COUNT DISTINCT is not a PII.UID. It’s a number representing the unique count of PII.UIDs. We can use the same reclassify block to prevent this propagation. Here’s an example:

workspace.sdf.yml

---
function:
  name: count
  parameters:
    - datatype: T
  returns:
    datatype: bigint
  reclassify:
    - from: PII.UID

By explicitly not setting a to value in the reclassify block, we remove the PII.UID classifier downstream and achieve our desired behavior.

Function reclassifications work across all queries by default. In this way, developers have full programmatic control over how their classifiers propagate and interact with functions across their entire data warehouse.

Lastly, this same reclassification mechanism can be used on a single column or table using the same reclassify block. Here’s an example below:

models/phone_trimmed.sdf.yml

table:
	name: phone_trimmed
	columns:
		- name: phone_first_three_digits
	reclassify:
		- from: PII.Phone
			to: PII.AreaCode

In this example, the column phone_first_three_digits will be reclassified to PII.AreaCode if it is derived from a column with the classifier PII.Phone.

Classifier States

Using labels, classifiers can also exist in states that communicate how the data is currently represented. Here is how a classifier state could be defined in YML:

workspace.sdf.yml

...
---
classifier:
  name: PII_STATE
  labels:
    - name: clear_text
    - name: hashed
    - name: anonymized
...

PII_STATE.clear_text is meant to represent human-readable PII, and anonymized is meant to represent anonymized PII. We can then use these states to define the effect of functions (including User Defined Functions (UDFs)) on classifiers like the examples above. For example, we can define the effect of an md5 hash as follows:

workspace.sdf.yml

---
function:
  name: md5
  reclassify:
    - from: PII_STATE.clear_text
      to: PII_STATE.hashed

Now let’s imagine we have a UDF called anonymize which anonymizes PII no matter its current state. For this, we can use a glob pattern (e.g. *) to tell SDF to reclassify a PII_STATE classifier no matter its current state. Here’s an example:

workspace.sdf.yml

---
function:
  name: anonymize
  parameters:
    - name: pii
      datatype: varchar
  returns:
    - datatype: varchar
  reclassify:
    - from: PII_STATE.*
      to: PII_STATE.anonymized

Note the usage of PII_STATE.* in the from field of the reclassify block. This tells SDF to reclassify any PII_STATE classifier, no matter its current state, to PII_STATE.anonymized.

Wondering why we also defined the parameters and returns in this function block? This is because this function is a User Defined Function (UDF).

Organizing Classifiers

When organizing classifiers, it is recommended to create a folder to host all Classfiers created by you and your team. Setting up a location within the working directory called classifiers is the recommended setting. Once created, this location will need to be ammended to the workspace.sdf.yml so that it is included.

workspace.sdf.yml

---
includes:
  - path: classifiers/

Introduction

Tutorials

Guides

Integrations

Data Classifiers

Using SQL Types & classifiers

Basic Propagation

Defining a Classifier Taxonomy

Attaching Classifiers to Tables

Inspecting Propagation Results

Advanced Propagation

Classifier States

Organizing Classifiers

Introduction

Tutorials

Guides

Integrations

​Using SQL Types & classifiers

​Basic Propagation

​Defining a Classifier Taxonomy

​Attaching Classifiers to Tables

​Inspecting Propagation Results

​Advanced Propagation

​Classifier States

​Organizing Classifiers

Using SQL Types & classifiers

Basic Propagation

Defining a Classifier Taxonomy

Attaching Classifiers to Tables

Inspecting Propagation Results

Advanced Propagation

Classifier States

Organizing Classifiers