.json data, and
nosql databases are great for when you're dealing with unstructured data and you don't know what the future will hold. CRMs are a great use case for this – for logging ever-increasing customer data that might be generated in a free-form way from different kinds of inputs.
But while this is great for agile development and accepting on-the-fly structural updates, it makes it difficult to work with in an analytics capacity (sorting, grouping, and analyzing columns), especially if you're used to working with tabular/columnar data such as in Excel.
The solution is to rotate (transpose) the table into a columnar structure. This can be done by:
- Identifying the key attributes you want to analyze, and then
- Creating a new view of the data.
My favorite approach is one suggested by the talented folks at Panoply: flattening out the data using CASE statements. For example, let's say your data looks like this:
A_I | ID | Key | Value 01 | 0001 | Name | John 02 | 0001 | City | New York 03 | 0001 | Gender | M 04 | 0002 | Name | Sally 05 | 0002 | City | Washington 06 | 0002 | Gender | F
... but for purposes of analytics, you want it to look like:
ID | Name | City | Gender 0001 | John | New York | M 0002 | Sally | Washington | F
Then, your solution would be:
SELECT id, MAX(CASE WHEN "key" = 'name' THEN value ELSE NULL END) name, MAX(CASE WHEN "key" = 'city' THEN value ELSE NULL END) city, MAX(CASE WHEN "key" = 'gender' THEN value ELSE NULL END) gender FROM table_name GROUP BY 1;
To break that down: you're selecting the data by the
ID field which connects the different key:value pairs, then creating each of the columns via the
Generally speaking, this case statement approach should work to map the key-value pairs you need. You will need as many cases as you have different types of keys. I've found that by the time you really need to rotate a
key:value table, you're only looking for a specific subset of the data and this can help you clear out the noise. You can also take this as an opportunity to cast certain values in correct notation (i.e., casting the date values into proper datetime format or whatnot).
Another benefit of this rotation is that you can modify the type of data during the process. For example, I tend to recast UNIX timestamps as a human-readable timestamps because that just makes my life so much easier when sharing the data with non-technical folks.
Saving the View
Last but not least, you want to make sure to save the view you've created. Just wrap your query in:
CREATE VIEW name [ ( [...your work here...] ) ] AS query
The benefit of having this as a view is that the view is linked to the original data. As things are updated or added or replaced, your view will reflect that.
You can also use the
CREATE TABLE AS syntax (https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_AS.html) but these tables will be static – great if the data isn't going to be updated, but awful if you need it to continue being refreshed.
Big thanks to the folks at Panoply who helped me work through this problem when we tried to turn Hubspot .json data into something easy to work with in Metabase.