It is important, also for regulatory reasons, to know which data is stored and what kind of data it is. It can be financial data, personal data or any other type. For GDPR reasons, its especially important to mark personal data and have a documentation ready, which data is stored and why.
This can be a challenging and annoying task if you have to do this for an existing database with a few hundred tables. Fortunately, there ways to make life easier.
Built-in Data Discovery and Classification
The easiest way is to use the built-in data discovery and classification feature that comes with Azure SQL Databases. Just go to the ‘Advanced data security’ tab and enable the feature:
By doing that, SQL Server will try to automatically detect Columns and Tables that need to be classified and you can accept or decline those changes:
You can also create a report out of it or export it.
SQL Script to classify data
If you don’t want to use that feature, but still need to classify the data, then another way is to write a SQL script which will create a table containing all classified data. The following script can be extended by columns or tables to classify those data:
SELECT schema_name(tab.schema_id) AS schema_name, tab.name AS table_name, col.name AS column_name, t.name AS data_type, -- start data classification CASE -- ignore specific columns in table WHEN CONCAT(tab.name, '.', col.name) IN ( 'my_table.ignore_this_column' ) -- ignore all columns in these tables OR tab.name IN ( 'table1', 'table2' ) THEN '' -- personal data -- match column names WHEN col.name IN ( 'first_name', 'middle_name', 'last_name', 'date_of_birth', 'email' ) -- match specific columns (table_name.column_name) OR CONCAT(tab.name, '.', col.name) IN ( 'user.state', 'table1.column1' ) THEN 'personal data' -- XXX Data -- match column names WHEN col.name IN ( 'column1', 'column2' ) -- match specific columns (table_name.column_name) OR CONCAT(tab.name, '.', col.name) IN ( 'users.display_name' ) THEN 'XXX data' ELSE '' END AS classification FROM sys.tables AS tab INNER JOIN sys.columns AS col ON tab.object_id = col.object_id LEFT JOIN sys.types AS t ON col.user_type_id = t.user_type_id ORDER BY schema_name, table_name, column_id
The result will be table that looks like:
Explanation of the SQL Script
This SQL Script basically contains at first, a selection of all schemes, tables and columns in the database:
SELECT schema_name(tab.schema_id) AS schema_name, tab.name AS table_name, col.name AS column_name, t.name AS data_type FROM sys.tables AS tab INNER JOIN sys.columns AS col ON tab.object_id = col.object_id LEFT JOIN sys.types AS t ON col.user_type_id = t.user_type_id ORDER BY schema_name, table_name, column_id
This is then extended by a column ‘classification’ and a simple CASE statement to e.g. classify all columns with name first_name or last_name as personal data:
CASE WHEN col.name IN ('first_name', 'last_name') THEN 'personal data' ELSE '' END AS classification
Finally, I added a few more CASEs, to allow to:
-- Do not classify specific column names (e.g.: classify all columns with name 'first_name' as personal data, BUT ignore the column in the table 'dogs') WHEN CONCAT(tab.name, '.', col.name) IN ( 'dogs.first_name' ) THEN '' -- Do not classify columns in specific tables (e.g. ignore all columns in table pets) WHEN tab.name IN ( 'pets' ) THEN '' -- Both combined in one statement WHEN CONCAT(tab.name, '.', col.name) IN ( 'dogs.first_name' ) OR tab.name IN ( 'pets' ) THEN '' -- Only match columns in a specific table (column 'display_name' is only personal data in table 'users') WHEN CONCAT(tab.name, '.', col.name) IN ( 'users.display_name' ) THEN 'personal data'
Additional information
- Azure SQL Database and SQL Data Warehouse data discovery & classificationhttps://docs.microsoft.com/en-us/azure/sql-database/sql-database-data-discovery-and-classification
- SQL Data Discovery and Classification: https://docs.microsoft.com/en-us/sql/relational-databases/security/sql-data-discovery-and-classification?view=sql-server-ver15
One response
[…] Automatic data classification for Azure SQL Server via Armin Reiter […]