I'm a recent convert to the logic and simplicity of the
ELT model of data extraction: Extract > Load > Transform. By emphasizing loading raw data in to the warehouse first, and cleaning it/modifying it second, you're afforded the flexibility of working with "renewable" data.
What I mean is:
- You don't need to identify your data structure and schemas ahead of time. Similar to the logic behind NoSQL development, you can just get the data and start working with it immediately. It's a shorter cycle from data to analytics.
- Cleaned and formatted data can be saved as views or materialized into tables on top of the raw data. So, if you need ten ways of processing one feed, you can do that! If you need to change your structure, you can rerun all of your new data through that!
- Unprocessed data means that schema and format changes from an API or feed don't immediately break your system. The data will still be collected and recorded, and you can update your views to reflect those changes afterwards. If you tried to transform it before loading it, there are situations where the flow of data could stop until the broken transformations are fixed.
Some examples of how/where you might clean data:
IN YOUR BI/ANALYTICS TOOL:
- The data modeling/cleaning can happen in your BI tool, assuming it allows you to create a data modeling layer (metadata layer?) on top of the underlying data. I'm a big fan of Metabase and use it to format, transform, and create all of my calculated metrics.
IN THE DATABASE:
- You may way to create dynamic views of the data, and adjust those views as needed, as requirements grow.
- Sometimes, you just need to remove a bunch of test data from the tables and webhooks without thinking about embedding whatever filters in all views moving forward. sometimes you just have bad, orphan records that will never be used and just need to be gone.
- Yet other times you might want to edit data - fixing names, adding comments, adjusting historic records.
Below, I'm going to focus on some quick scripts re: modifying Postgres data.
Removing Rows of Test Data
Sometimes, you have a bunch of test data sitting around in a production table. To delete these rows, first write a query that isolates the records you might want. For example, I might want to remove all records where the timestamp is empty:
SELECT * FROM "schema"."table" WHERE timestamp IS NULL;
Then, when you've isolated the data, you can delete it by replacing
SELECT * with
DELETE FROM "schema"."table" WHERE timestamp IS NULL;
Et Viola! Just check full records again to make sure it is all good. But be warned: this data can't be recovered!
Quick note on syntax:
Postgres is slightly different from standard SQL. I found the following syntax works for selecting columns with names with spaces: using double quotes for "column header" and single quotes for 'data':
DELETE FROM "schema"."table" WHERE "column" = 'data'
If you need to pattern match, use the
LIKE operator and wrap your data with
DELETE FROM "schema"."table" WHERE "column" LIKE '%data%'
First identify the data you want to modify. For example:
SELECT * FROM "schema"."table" WHERE "column"='messy data'
Then, run an update script to change the cell:
UPDATE "schema"."table" SET "column"='clean data' WHERE "column"='messy data';
It's a 1:1 bulk replacement of data.
Sometimes you need to add a column for new data added post-hoc. In my current role we've written a small tool that lets the team look at the data, identify what it is about, and then add a quick description for future reference. Here's the script to create new columns, which kicked that project off:
ALTER table "scheme"."table" ADD column notes VARCHAR default NULL;
notes is the column we created to log annotations.
Translating IDs into Names
Sometimes, the data comes in unusably raw. For example, our project management tool stores Task Behavior in one table, and User Names in a different table. Sure, we could use a join to combine the two tables, but why force a join into all of the analytics there if we can just work off of a view where names replace IDs? We hardcode those names so that even if someone changes their username, the analytics are still accurate.
For this, we can use a
CASE WHEN/ELSE script to create a new table, and then save it as a view. For example:
select *, case when assignee_id = '1234' then 'Roman' when assignee_id = '2345' then 'Amy' when assignee_id = '3456' then 'Bugsly' when assignee_id = '4567' then 'Gibson' else 'Name Undefined' end as "assignee_name" FROM "schema"."table"