Some days ago, Google announced a major improvement for Google Search Console: a simple data export to BigQuery. To be honest, I haven’t tried it yet but no matter what, it is good news. This means several things:
- Companies will be able to extract easily their GSC data to their data lakes. It was already possible previously if you knew how to extract the GSC data, through the API, to BigQuery, but this new integration will be easier and includes more data points.
- I already explained in a previous post what the limitations of the API were, I hope this integration will solve at least a part of them
- SEO professionals will start using SQL more extensively. Some of them were already doing it, but this integration will increase the necessity of knowing at least the basics of this language. If you are worried about the cost: don’t! I have currently a database with 150M rows in BigQuery, and it costs me less than $5 per month to use it.
I’ve been using SQL for almost 3 years now, and the objective of this post is to explain the basics of SQL for SEO analysis and some simple, yet powerful, use cases. This is not an ultimate guide, and if you have advanced need you can have a look at the following courses: Codecademy, Sololearn, Dataquest.
Why should you learn SQL?
SQL is the most common language used to query modern databases. This includes simple website databases (such as the WordPress I’m using), but also more advanced data warehouses such as BigQuery. You can use it without knowing SQL through Connected Sheets, but you’ll still need to write SQL in some cases.
Apart from that, bear in mind that Excel and Sheets have limitations and if you have to handle many rows, you may face some issues. Why? Because these tools were not designed to hold millions of rows, when databases are specifically designed for that.
Other advantages of SQL that are worth mentioning:
- Easier learning curve than most programming languages such as Python. You can get started in a couple of hours.
- Unlike Excel, your logic will be easier to understand because it will be fully written. Forget the pain to reverse engineer the logic behind a full Excel file.
- You can run the same query several times (same task, different projects …) and save time to focus on tasks where you have more added value.
Do you need to learn SQL as an SEO?
No! SQL is not mandatory, but depending on the projects you work on, it can help you work better and/or faster. This should not be at the top of your hard skills list if you’re starting your training in this field.
SQL basics
To explain the SQL basics, I’ll be working with a GSC database including two dimensions (page & date) and four metrics (clicks, impressions, ctr, position). If you’ve opened a GSC dashboard before, you should be familiar with all of these. If that’s not the case, they are defined in the documentation.
A simple SQL query is usually composed of three parts:
- SELECT: the columns we want to get. You’re often interested in a limited number of columns, even if your table has a lot of them.
- FROM: the table we extract the data from. You’ll usually have a dataset composed of several tables. You need to specify the name; otherwise SQL won’t know which table you want to extract the data from.
- WHERE: the logic we apply to our data. In our example below, we just want the total of clicks after the 01/01/2022.
Simple, right? SELECT, FROM and WHERE are called SQL clauses. They have specific meanings (see full list) and must be used in a specific order. You cannot use the FROM before the SELECT for instance: SQL will throw an error.
As I mentioned, the objective of this article is not to explain everything to master SQL, but the basics you need to leverage your data. In the following sections, I’ll explain what the most important clauses are.
SELECT
The SELECT clause is a fundamental component of SQL that is used to retrieve data from one or more tables in a database. Some tips:
- You can specify a new name for your columns using the AS statement followed by the new name. I highly recommend doing that when you use aggregation methods (such as SUM() in my example) in BigQuery because otherwise you’ll end up with names such as f0_ which are confusing.
- If you want to select more than one column, you must separate the names by a comma. If for any reason you need to select all columns, use select *. Do not do that on huge tables because it may take a while and/or cost you a lot if you’re using BigQuery.
WHERE
As we already explained, the WHERE clause is used to apply conditions. You can apply conditions on columns that are not inside your SELECT clause, which is one of the advantages of SQL over Excel. It is actually what I did in my previous examples. Some comments on this clause:
- You can obviously combine conditions using the AND or OR instructions
- You can use the most common mathematical operators, such as = (equal to), != (not equal to) etc…
- To avoid creating countless conditions, you can use the IN to check if a column is equals to any of a list of provided values
- To check if a column starts, ends or contains a specific string, you can use the LIKE operator. With this operator, the percent symbol must be interpreted as “anything or nothing”.
- wik% will check if a string starts with “wik”
- %wik will check if a string ends with “wik”
- %wik% will check if a string contains “wik”
GROUP BY
This clause allows you to group data. As I explained, our table includes data per URL and per date: how can we get the total by date, for instance? Using GROUP BY!
Some important details:
- This operation is similar to a pivot table in Excel.
- You cannot group data by a column you didn’t add within your SELECT clause. This would throw an error.
- If you use GROUP BY with ORDER BY and LIMIT, you can answer to questions such as “What are the X most performing pages”
CASE
The CASE statement is useful when you want to create a new column in your result based on one or several conditions. For instance, let’s imagine that you would like to understand what percentage of the traffic is generated by the homepage and compare it to the rest. You can create a simple SQL CASE to do just that:
- You can include as many conditions as you want, but you always need to follow the WHEN {conditions} THEN {value} logic
- You need to end your conditions by an END and then assign a name to your new column
In an SEO context, you will often use this clause to categorize your pages, even if there are other use cases, obviously.
JOIN
The main advantage you have when you add your data from several sources in a data warehouse, is the ability to easily merge them. You can, for instance, get the number of orphan pages but generating traffic on your site using two different data sources:
- The GSC table we already used in our previous examples
- A new table, containing the data from a crawl done with Screaming Frog or Oncrawl
In Excel, if you want to merge data, you’re very likely to use the VLOOKUP formula. Its closest equivalent in SQL is JOIN, and the most common are:
- INNER JOIN: you merge data by combining rows that have matching values in the two tables
- LEFT JOIN: you merge data by keeping all data from the left table. When you use VLOOKUP in Excel, you usually perform an operation similar to this kind of JOIN.
- FULL OUTER JOIN: you merge data by keeping data from both tables
If you’re not familiar with these concepts, the following image (source) should help you understand:
Going back to our gsc and screaming_frog tables, we can for instance use the JOIN statement to get the number of clicks by depth using the following logic:
- The clicks will come from the gsc table
- The depth will come from the screaming_frog table
- These two tables are merged using the page column
The code we’d use will be the following:
select
--the data we need
--the column name starts with the name of the table followed by a dot because we're using more than one
--It is not mandatory if you don't have columns with the same names in your tables
SUM(gsc.clicks) as clicks,
screaming_frog.depth
--our left table
from gsc
--our right table and the join logic
left join screaming_frog on gsc.page = screaming_frog.URL
--rest of the SQL code
group by depth
order by depth asc
This code would return the following table. Note that we have a row with no depth information: these are the URLs found in the gsc table, but not in screaming frog. If we had performed an INNER or RIGHT join, this row would not be here.
Conclusion
Is that it? Of course not, but if you master these concepts, you’ll be able to start using SQL. Other concepts, such as subqueries, can be learned along the way when you’ve mastered the basics.
Again, SQL is not a mandatory hard skill to work in SEO, but it will definitely improve the way you work if you have to deal with a lot of data for your clients or projects.