Google Analytics packs a ton of data that helps you grow your business.
By analyzing that data, you can uncover valuable insights about your website performance and visitor behavior. However, there’s one problem that can undermine the reliability of your Google Analytics reports.
It’s none other than data sampling. 👀
In this article, we’ll show you how data sampling affects your reports and how you can avoid it. So make yourself a cup of tea, and let’s dive in.
Fast forward ⏩
- What is sampling in Google Analytics?
- Why is sampling an issue?
- How does sampling work in Google Analytics?
- Sampling in Google Analytics 4
- 5 ways to avoid sampling
- How to avoid sampling with Supermetrics
What is sampling in Google Analytics?
Let’s say you have a bag with 5,000 marbles and you want to know how many red marbles there are. Instead of spending hours counting the marbles one by one, you can randomly take some out of the bag and check how many of those are red. Based on that sample, you can easily estimate the number of red marbles in the bag.
That’s a rough description of sampling.
In statistical analysis, data sampling means taking a small slice of the whole dataset and analyzing it for trends or for verifying hypotheses.
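The marble example can be sketched in a few lines of Python. This is only an illustration: the bag contents (1,500 red marbles out of 5,000), the sample size, and the function name `estimate_red_marbles` are assumptions made up for the example.

```python
import random


def estimate_red_marbles(bag, sample_size, rng):
    """Estimate the number of red marbles in `bag` from a random sample."""
    sample = rng.sample(bag, sample_size)  # draw without replacement
    red_share = sample.count("red") / sample_size
    return round(red_share * len(bag))


# Hypothetical bag: 1,500 red marbles among 5,000.
bag = ["red"] * 1500 + ["other"] * 3500
rng = random.Random(42)  # fixed seed so the run is repeatable

print(estimate_red_marbles(bag, 500, rng))   # an estimate near 1,500
print(estimate_red_marbles(bag, 5000, rng))  # whole bag counted: exactly 1500
```

Counting 500 marbles instead of 5,000 gets you close to the real number, but only counting the whole bag gets you the exact one.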
Since Google Analytics is the most widely used web analytics tool, it has to process an enormous quantity of data relatively quickly. To balance speed and accuracy, Google randomly samples a portion of your traffic data.
The biggest advantages of sampling are, of course, time-saving and cost-saving. Google can deal with a much smaller and manageable sample yet still produce similar results.
But if the results are almost the same, why shouldn’t you just work with sampled data? Why bother getting unsampled data?
Here’s why. 👇
Why sampling can hide crucial insights from your data
Remember the marble example? There are some scenarios where your calculations may go wrong:
- The sample size is too small. For example, if you only count 20 out of 5,000 marbles.
- The distribution of red marbles is unequal. For example, you accidentally scooped up more red marbles in your sample because they were clustered together.
In both cases, the sample doesn’t represent the entire picture. And that’s also the problem with sampling: it leads to uncertainty and distrust in your reports.
While a smaller dataset is much easier to handle, it doesn’t guarantee statistical significance. Your sample may or may not reflect the true nature of your data.
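To put a rough number on the "sample is too small" problem, here is a sketch of the margin of error for a proportion. It uses the standard normal-approximation formula, 1.96 * sqrt(p * (1 - p) / n), and ignores the finite size of the bag. The 30% share of red marbles is an assumed figure.

```python
import math


def margin_of_error(share, sample_size, z=1.96):
    """Approximate 95% margin of error for an observed proportion."""
    return z * math.sqrt(share * (1 - share) / sample_size)


for n in (20, 500):
    moe = margin_of_error(0.30, n)
    print(f"n={n}: 30% +/- {moe * 100:.1f} percentage points")
```

With 20 marbles, the estimate can be off by about 20 percentage points. With 500 marbles, the margin shrinks to about 4 points.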
Let’s say you’re running two campaigns, A and B. Campaign A has a 10.5% conversion rate; campaign B has an 8.3% conversion rate. It may seem obvious that campaign A is the clear winner. In reality, however, the sample you analyze may not be big enough to represent the whole population, and there may be no real difference between the two campaigns.
Psst! If you’d like to learn more about statistical significance or how to improve your statistical thinking, we recommend checking out two episodes of our podcast, the Marketing Analytics Show, where we had Khrystyna Grynko and Tim Wilson as guests.
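One common way to check a difference like this is a two-proportion z-test. The sketch below is a minimal version of it. The sample sizes (1,000 and 10,000 sessions per campaign) are hypothetical, since the example above doesn’t state any, and the function name is our own.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


# Same 10.5% vs. 8.3% rates, two hypothetical sample sizes per campaign.
print(two_proportion_p_value(105, 1_000, 83, 1_000))       # above 0.05
print(two_proportion_p_value(1_050, 10_000, 830, 10_000))  # far below 0.05
```

The conversion rates are identical in both calls. Only the larger sample gives a p-value below the usual 0.05 cut-off, so the same 2.2-point gap can be noise in one case and a real difference in the other.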
This kind of ambiguity is the opposite of how we expect analytics to work. The whole reason why we even decided to use Google Analytics was to get accurate numbers on our traffic and users.
In her talk about sampling and sample size, Jess Spate says, “There’s no actual threshold where the sample size starts reflecting the behavior of the whole dataset. Because that depends on the size of the data and sample, and the variation within the sample.”
To ensure your data is meaningful and reliable, you’ll first need to get your hands on the unsampled data. But before we show you some workarounds to sampling, let’s take a look at how sampling works in Google Analytics.
How sampling works in Google Analytics
How do you know if your data is sampled?
You can immediately tell if your data is sampled by looking at the shield icon at the top of your report.
Green means your report is created with all your data.
Yellow means you’re looking at sampled data. If you hover your mouse over the yellow shield icon, you’ll see the message: “The report is based on X% of sessions.” It tells you how big the sample size is.
To adjust the sample size, simply click on the drop-down menu and switch between ‘Faster response’ and ‘Greater precision’.
When you select ‘Faster response’, Google Analytics will take a smaller sample from your dataset to create reports. That way, you’ll get your data much faster.
‘Greater precision’ means your report is created with the largest sample size possible. So unless you want to speed up the processing time, you should stick with ‘Greater precision’.
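To get a feel for what “based on X% of sessions” means for the numbers in a report, here is a simplified model of extrapolating from a sample to the full dataset. It is an illustration of the arithmetic, not a description of Google’s exact algorithm, and the figures are made up.

```python
def scale_up(sampled_value, sample_percent):
    """Extrapolate a metric counted in a sample to the full dataset."""
    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be in (0, 100]")
    return sampled_value * 100 / sample_percent


# Hypothetical: 300 conversions counted in a sample of 25% of sessions.
print(scale_up(300, 25))  # 1200.0
```

The smaller the percentage, the more every counted conversion gets multiplied, and the more any quirk in the sample gets multiplied with it.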
When does Google Analytics sample data?
To create your reports, Google Analytics first collects raw data in visit tables. Then, it aggregates the data and stores it in default (or standard) reports. This process lets Google Analytics quickly retrieve your data without sampling. There are five types of default reports:
All default reports are unsampled.
Ad hoc reports
Occasionally, you’ll want to modify the default reports to get more insights. For example, you may want to add a secondary dimension, a new filter, a new segment, or even create a custom report. Whenever customization happens, Google Analytics will first check the default report to see if the data you request is available.
If the relevant data is unavailable, Google Analytics will check the sessions in the visit tables. If there are too many sessions, Google Analytics will sample the data to deliver your report.
So does that mean ad hoc reports are always sampled? Not really.
Query limits, cardinality, and thresholds
Creating ad hoc reports or custom reports doesn’t always trigger sampling. Sampling only happens once a certain threshold is crossed.
As mentioned before, Google Analytics samples your reports based on the number of sessions. Each version of Google Analytics has a different session limit.
For Universal Analytics, sampling kicks in when your ad hoc reports have 500,000 sessions at the property level for any chosen date range.
Using Google Analytics 360 won’t save you from sampling entirely, although it does raise the sampling threshold. You won’t have to worry about sampling unless your ad hoc reports hit 100 million sessions for any chosen date range.
Google Analytics has a query limit of one million rows for a report, regardless of the date range. That’s why when your report crosses this threshold, Google will automatically aggregate the excess rows into a single row called ‘other’.
‘Other’ entries result from high-cardinality dimensions. Cardinality is the number of unique values a dimension can contain. For example, the ‘Mobile’ dimension in Google Analytics has only two values, Yes or No, which means its cardinality is two.
There are certain cardinality limits:
- Daily processed tables. In Universal Analytics, the limit is 50,000 rows. In Google Analytics 360, it’s 75,000 rows.
- Multi-day processed tables. In Universal Analytics, the limit is 100,000 rows. In Google Analytics 360, it’s 150,000 rows.
When you surpass the limit, Google Analytics will group all excess data into ‘other’. High-cardinality dimensions (dimensions that include many unique values) are likely to cross the line.
For example, the dimension ‘Page’ can contain multiple values, depending on the URLs on your website. To name a few:
If your report has a high-cardinality dimension, crucial data may end up hidden under the ‘other’ entry, and you won’t know it unless you get your hands on the raw data.
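Here is a small sketch of what rolling rows up into ‘other’ looks like. It assumes the rows with the highest counts are kept and the rest are merged, which is a simplification. The URLs, pageview counts, and the tiny row limit of 2 are made up for the example.

```python
def roll_up_other(page_counts, row_limit):
    """Keep the top `row_limit` rows and merge the rest into 'other'."""
    ranked = sorted(page_counts.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:row_limit])
    overflow = sum(count for _, count in ranked[row_limit:])
    if overflow:
        kept["other"] = overflow
    return kept


# Hypothetical pageviews per URL, with a tiny limit of 2 rows.
pageviews = {"/": 900, "/pricing": 400, "/blog/a": 30, "/blog/b": 20}
print(roll_up_other(pageviews, 2))
# {'/': 900, '/pricing': 400, 'other': 50}
```

The total is still correct, but you can no longer tell which blog post got 30 pageviews and which got 20.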
What about other reports like multi-channel funnel reports and flow-visualization reports?
They behave similarly to ad hoc reports. Multi-channel funnel reports will be sampled when you make any changes to the report, for example, adding a new segment, adding a new metric, or changing the lookback window. Note that when any customization happens, Google Analytics will return a maximum sample of 1,000,000 conversions.
Since Google Analytics creates flow-visualization reports based on a maximum of 100,000 sessions (for any chosen date range), they’re more likely to be sampled than other reports.
Data sampling in Google Analytics 4
If you’re looking to switch to Google Analytics 4, it may be worth checking out the sampling situation there as well.
Similar to Universal Analytics, sampling can also occur in Google Analytics 4.
The default reports (under the ‘Reports snapshot’ tab) are never sampled. You’re free to add any secondary dimensions, segments, or filters. The reports will remain unsampled.
However, sampling may occur when you create an advanced analysis, such as cohort analysis, exploration, segment overlap, funnel analysis, etc.
Whenever your query covers more than 10,000,000 events and the report you create is not a duplicate of the default report, sampling will kick in.
5 simple workarounds to data sampling in Google Analytics
1. Use default reports or simplify your custom ones
The default reports have a pretty good amount of data, and they’re unsampled. So unless you really need custom reports, you should use the default reports as much as you can.
In case you’re modifying the reports, you can simplify your query to avoid sampling. For example, instead of adding multiple dimensions simultaneously, you can remove the ones you don’t need first.
2. Change your date range
Another quick and easy way to avoid sampling is to shorten your date range. For example, instead of looking at a 6-month period (or whenever your report hits the 500,000 sessions threshold), you can look at a 2-month period.
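If you still need the longer period, you can cut it into shorter ranges and look at each one separately. The helper below is a minimal sketch of that idea. The function name, the dates, and the 30-day chunk size are our own choices for the example.

```python
from datetime import date, timedelta


def split_date_range(start, end, days_per_chunk):
    """Split [start, end] (inclusive) into consecutive shorter ranges."""
    if days_per_chunk < 1:
        raise ValueError("days_per_chunk must be at least 1")
    chunks = []
    chunk_start = start
    while chunk_start <= end:
        chunk_end = min(chunk_start + timedelta(days=days_per_chunk - 1), end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end + timedelta(days=1)
    return chunks


for chunk_start, chunk_end in split_date_range(date(2022, 1, 1), date(2022, 3, 31), 30):
    print(chunk_start, "to", chunk_end)
```

Each shorter range contains fewer sessions, so it’s less likely to cross the sampling threshold on its own.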
3. Use Google Analytics 360
With the paid version of Google Analytics, your report is free from sampling if it has fewer than 100 million sessions.
In case you hit the limit, you can still bypass sampling by:
- Downloading an unsampled report. You can export your report and request unsampled reports in Google Analytics 360. However, if your data surpasses three million rows, Google will aggregate the excess rows into a single row called ‘other’.
- Creating custom tables. You can choose which metrics, dimensions, or segments you want to include in your report, and Google will pull unsampled data. In Google Analytics 360, you can set up as many as 100 custom tables per property.
4. Export your data to Google Sheets
The Google Analytics API lets you manually pull data into Google Sheets. You can try to export your data in a shorter time frame and assemble and aggregate it later in your spreadsheet.
However, patching your data together is very time-consuming. To make matters worse, you may end up copying the wrong data into the wrong cells here and there.
Later on, we’ll discuss a much more efficient way to get unsampled Google Analytics data to Google Sheets.
5. Use a data warehouse
In case your data is growing rapidly and a spreadsheet can no longer store and process your data, you should think about getting a data warehouse. With a data warehouse, you can easily store granular data from different sources. You can also load your Google Analytics data into a data warehouse to avoid sampling.
If you’re using Google Analytics 360, there’s a native integration you can use to load your data directly into Google BigQuery.
Don’t worry if you don’t use BigQuery. We’ll show you how to load Google Analytics data into any data warehouse in a minute.
How to avoid Google Analytics sampling with Supermetrics
Data partitioning is a great way to bypass sampling. It is the process of dividing data into smaller and more manageable portions. It helps improve query processing performance and scalability. However, it may take a lot of time to partition your data, especially if you’re dealing with an enormous quantity.
That’s where Supermetrics comes in handy.
When we pull data directly from the Google Analytics API, we break up the queries into smaller chunks to avoid sampling. Then, we aggregate the data before generating your report.
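The general idea of querying in small chunks and adding the results back together can be sketched like this. It is an illustration of the technique, not Supermetrics’ actual code. `fetch_report` is a hypothetical stand-in that returns fixed numbers where a real implementation would call the reporting API.

```python
from collections import Counter
from datetime import date, timedelta


def fetch_report(day):
    """Hypothetical stand-in for one API request covering a single day.

    Returns sessions per channel; a real implementation would call the
    reporting API here.
    """
    return {"organic": 120, "paid": 45}


def chunked_report(start, end, fetch=fetch_report):
    """Query one day at a time and add up the per-day results."""
    totals = Counter()
    day = start
    while day <= end:
        totals.update(fetch(day))  # Counter.update adds the counts together
        day += timedelta(days=1)
    return dict(totals)


print(chunked_report(date(2022, 1, 1), date(2022, 1, 7)))
# {'organic': 840, 'paid': 315}
```

One caveat: adding chunks together only works for metrics that are additive, such as sessions or pageviews. Unique users can’t simply be summed, because the same person may show up in several chunks.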
Getting unsampled Google Analytics data into spreadsheets
If you’d like to analyze Google Analytics data in a spreadsheet, here’s how you can get unsampled data with Supermetrics. We’re going to demonstrate the tutorial with Google Sheets. However, you can follow the same steps for Excel.
Once you launch the sidebar, you can create your query by filling in these fields:
- Data source
- Select views
- Select dates
- Select metrics
- Split by
To avoid sampling, click on ‘Options’, then choose ‘Add note to query results showing whether Google has used sampling’ and ‘Try to avoid Google’s data sampling’.
Once you’re happy with your query, click on ‘Get data to table’ and wait for Supermetrics to load your data to Google Sheets.
Getting unsampled Google Analytics data into Google Data Studio
If you’d like to get your unsampled Google Analytics data directly to Data Studio, you can also use Supermetrics.
To begin, search for a connector called ‘Google Analytics by Supermetrics’ in the connector gallery.
Then sign in to your Google account and give Supermetrics all the necessary permissions.
After setting up the view where you want to pull data from, click on ‘Try to avoid Google’s data sampling’. Supermetrics will then break your query into multiple sub-queries to avoid sampling. But don’t worry, it’ll automatically patch together the data before it shows up on your Data Studio dashboard.
After that, click on ‘Connect’ to continue.
Getting unsampled data into data warehouses and data lakes
If you use Supermetrics to load Google Analytics data into a data warehouse or a data lake, you’ll usually get unsampled data. Since Supermetrics fetches data one day at a time, your data will remain unsampled for all days with fewer than 500,000 sessions.
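If you want to know which days could still be sampled, you can compare your daily session counts against that threshold. This is a quick sketch with made-up dates and session counts. The function name is our own.

```python
SESSION_THRESHOLD = 500_000  # the Universal Analytics figure quoted above


def days_at_risk(daily_sessions, threshold=SESSION_THRESHOLD):
    """Return the days whose session count reaches the sampling threshold."""
    return [day for day, sessions in daily_sessions.items() if sessions >= threshold]


# Hypothetical daily session counts.
daily_sessions = {"2022-11-24": 310_000, "2022-11-25": 640_000, "2022-11-26": 280_000}
print(days_at_risk(daily_sessions))  # ['2022-11-25']
```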
We currently support these destinations:
Your turn 👊
The truth is, sampling is here to stay. You should always beware of it when you’re analyzing your data in Google Analytics.
We hope this post has helped you answer the questions you have about data sampling in Google Analytics. And whenever you need help with getting unsampled data, remember that you can always start your free 14-day trial of Supermetrics.
The post What is Google Analytics sampling and how to avoid it appeared first on Supermetrics.