When working with data warehouses, it’s common to have structured data stored within a table as a JSON blob. Until recently, extracting data from JSON in Redshift was extremely cumbersome. This tutorial shows you a new, easier way of working with JSON in Redshift.
For this tutorial, we’re going to assume we work for a company that teaches people how to code. When someone signs up for the company, we ask them a few questions and then recommend two languages for them. The data is stored in a table called recommendations
shown below:
SELECT * FROM recommendations
user_id | recommendations |
---|---|
1 | [{"language_id": 1, "language_name": "Python"}, {"language_id": 2, "language_name": "SQL"}] |
2 | [{"language_id": 3, "language_name": "R"}, {"language_id": 2, "language_name": "SQL"}] |
3 | [{"language_id": 2, "language_name": "SQL"}, {"language_id": 1, "language_name": "Python"}] |
4 | [{"language_id": 2, "language_name": "SQL"}, {"language_id": 3, "language_name": "R"}] |
5 | [{"language_id": 3, "language_name": "R"}, {"language_id": 1, "language_name": "Python"}] |
Up until recently, working with JSON data in Redshift was very difficult. We’d have to use two functions to extract the data:
JSON_EXTRACT_PATH_TEXT
, which extracts values from JSON key-value pairs.JSON_EXTRACT_ARRAY_ELEMENT_TEXT()
, which extracts values from JSON arrays.Let’s look at how we’d extract the data into columns using these functions:
SELECT
user_id,
JSON_EXTRACT_PATH_TEXT(
JSON_EXTRACT_ARRAY_ELEMENT_TEXT(
recommendations, 0), 'language_id') AS rec_1_id,
JSON_EXTRACT_PATH_TEXT(
JSON_EXTRACT_ARRAY_ELEMENT_TEXT(
recommendations, 0), 'language_name') AS rec_1_name,
JSON_EXTRACT_PATH_TEXT(
JSON_EXTRACT_ARRAY_ELEMENT_TEXT(
recommendations, 1), 'language_id') AS rec_2_id,
JSON_EXTRACT_PATH_TEXT(
JSON_EXTRACT_ARRAY_ELEMENT_TEXT(
recommendations, 1), 'language_name') AS rec_2_name
FROM recommendations
user_id | rec_1_id | rec_1_name | rec_2_id | rec_2_name |
---|---|---|---|---|
1 | 1 | Python | 2 | SQL |
2 | 3 | R | 2 | SQL |
3 | 2 | SQL | 1 | Python |
4 | 2 | SQL | 3 | R |
5 | 3 | R | 1 | Python |
This code is hard to write, read, and maintain! Luckily, there is now a better way.
In April 2021, Redshift announced the SUPER
type, and with it better JSON support. This brings it inline with Snowflake and Bigquery in terms of ease of use.
To work with the SUPER
type, we first have to use the (at time of writing) undocumented JSON_PARSE()
function to convert our JSON string into the SUPER
type. Once we’ve done that, we can use brackets ([]
) to retrieve and extract our JSON data. Let’s take a look:
WITH recs_super AS (
SELECT
user_id,
JSON_PARSE(recommendations) AS recommendations
FROM recommendations
)
SELECT
user_id,
recommendations[0]['language_id'] AS rec_1_id,
recommendations[0]['language_name'] AS rec_1_name,
recommendations[1]['language_id'] AS rec_2_id,
recommendations[1]['language_name'] AS rec_2_name
FROM recs_super
user_id | rec_1_id | rec_1_name | rec_2_id | rec_2_name |
---|---|---|---|---|
1 | 1 | Python | 2 | SQL |
2 | 3 | R | 2 | SQL |
3 | 2 | SQL | 1 | Python |
4 | 2 | SQL | 3 | R |
5 | 3 | R | 1 | Python |
The code is clear and easy to understand!
As you can see, this new type makes working with JSON in Redshift dramatically easier. To find out more, you can consult the Redshift documentation on the SUPER
type.