How to fill missing data in BigQuery using LAST_VALUE function

LAST_VALUE function explained

In this article, you will learn about analytic functions and also about how to use function LAST_VALUE to fill missing data in BigQuery.

A few words about an analytic function

An analytic function takes a bunch of rows to create a group in order to compute a single value for each row.
This is where this kind of function differs from an aggregate one, which returns a single result for a group of rows.

Let’s learn by doing…

Let’s suppose we have a database that contains information about products and the corresponding prices for each day.
Well, from time to time there is an issue uploading the data and we can observe empty prices for our products:

As we can see on the above screenshot, we do not have a price for mango for 20210205 and 20210208. Next to that, we are also missing price information for an apple for dates: 20210206 and 20210208.

We need all the values to be filled for our report that shows prices for each product per day. It would be satisfying for us to fill the missing values with the first, not empty preceding values. How can we achieve that?

Here’s where the LAST_VALUE function comes in handy!

Take a look at the following query.

SELECT
*
FROM (
SELECT
product,
date,
price,
LAST_VALUE(price IGNORE NULLS)
OVER
(PARTITION BY product
ORDER BY date
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS last_value
FROM
product_prices
GROUP BY
1,
2,
3
ORDER BY
product,
date DESC)

Let’s break the above query into parts.

  • `LAST_VALUE (price IGNORE NULLS)`:

Function’s LAST_VALUE syntax definition:

LAST_VALUE (value_expression [{RESPECT | IGNORE} NULLS])

We use this function to get the last value in the specified window (group).
Please note that by using IGNORE NULLS we are choosing the last filled value (according to our sorting configuration).

  • `PARTITION BY product`:

The above statement informs the query engine that we want to execute our function for each defined partition: we do want to run the LAST_VALUE function for each of our products.

  • `ORDER BY date`:

A very important part of the above query. Without sorting the list of dates for our product we are not able to tell that easily which row is representing the closest (time-wise) filled price.

Let’s have a look one more time on our database before applying the function:

We can tell immediately which price we want to have in row 2 for example. That’s right: 2.3 which is the price from the previous day.

  • RANGE statement
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

We use the keyword RANGE to limit our partitions referencing the values. Unlike as with the keyword ROWS where we reference rows. In the above example, we say “please look at all preceding rows to define the last value”.

Result

As you can see on the above screenshot, the last column named last_value represents the nearest filled value.
Please note that in case of the row 1 where we do not have a preceding value, the function returns the current value.

Data Engineer

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store