This tutorial shows you how to parse URLs using Snowflake SQL.
Snowflake is, in my opinion, the best data warehousing solution on the market. It is easy to manage and the clear separation of storage and compute make for both performant and cost effective solution. Unless there is an extremely compelling reason to chose a competitor (Redshift, Bigquery) it’s what I recommend to most of my clients.
The flavor of SQL that Snowflake supports will be familiar to those who have used Redshift (or PostgreSQL) before. In most cases, you can write SQL as you ‘know it’ and it will be accepted fine. That said, there are many extra functions and ‘hidden gems’ in Snowflake. This post is part of a series of tutorials for lesser-known Snowflake functions.
This tutorial uses regular expressions, but don’t worry if you don’t know how they work. That said, if you don’t already, it might be useful to read the introduction of the Wikipedia article to understand what they are.
Let’s say we have the following URL:
select
'https://mydomain.com/url/path?param_1=hello¶m2=there' as url
URL |
---|
https://mydomain.com/url/path?param_1=hello¶m2=there |
And our aim is to break it up into it’s components:
https
mydomain.com
/url/path
param_1=hello¶m_2=there
Using most SQL dialects, we would need to use a regular expression (regex) to extract these parts out. This is what a regular expression that extracts these components would look like
This is what this might look like in regex form:
^(.*):\/?\/?([\w\-\.]+)([^#?\s]+)\??(.*)?$
That might look like my cat just walked across my keyboard, but in fact it’s a regex that defines four capture groups (inside four sets of parentheses) which match the four components above. If you want to see how it works, I highly recommend this interactive example of the above regex.
So what would this look like in SQL? Well, we’d have to use regexp_substr()
(or a similar function, the names for regex functions vary between SQL dialects) with that long regex string for each component. Let’s take a look:
with url_pattern as (
select
'https://mydomain.com/url/path?param_1=hello¶m2=there' as url,
'^(.*):\\/?\\/?([\\w\\-\\.]+)([^#?\\s]+)\\??(.*)?$' as pattern
)
select
regexp_substr(url, pattern, 1, 1, 'e', 1) as url_scheme,
regexp_substr(url, pattern, 1, 1, 'e', 2) as url_domain,
regexp_substr(url, pattern, 1, 1, 'e', 3) as url_path,
regexp_substr(url, pattern, 1, 1, 'e', 4) as url_query
from url_pattern
URL_SCHEME | URL_DOMAIN | URL_PATH | URL_QUERY |
---|---|---|---|
https |
mydomain.com |
/url/path |
param_1=hello¶m2=there |
Note that single back slashes (\
) are escaped (\\
) in the code above.
Luckily, with Snowflake there’s an easier way! The parse_url()
function can do all of this, and more!
The parse_url
function returns a JSON object. Let’s look at what that looks like:
{
"fragment": null,
"host": "mydomain.com",
"parameters": {
"param2": "there",
"param_1": "hello"
},
"path": "url/path",
"port": null,
"query": "param_1=hello¶m2=there",
"scheme": "https"
}
We can access the components using standard JSON Snowflake accessors:
with url as (
select
'https://mydomain.com/url/path?param_1=hello¶m2=there' as url
)
select
parse_url(url):scheme::varchar as url_scheme,
parse_url(url):host::varchar as url_domain,
parse_url(url):path::varchar as url_path,
parse_url(url):query::varchar as url_query
FROM url
_SCHEME | URL_DOMAIN | URL_PATH | URL_QUERY |
---|---|---|---|
https |
mydomain.com |
url/path |
param_1=hello¶m2=there |
You also have the ability to parse out the port, any URL fragments, and the individual parameters from the JSON that parse_url()
returns.
I hope that parse_url()
is a useful addition to your Snowflake-SQL toolbox.