Time Series

October 12, 2025
7 min read

Many advertising measures can be treated as time series data:

  1. Impressions over time
  2. Click-through rate (CTR)
  3. Ad spend over time
  4. Conversion rates over time

What is a Time Series?

A sequence of data points $X_1, X_2, \ldots, X_t$ collected over time $t$.

  1. Time series analysis: decompose the series into its components (see the sketch after this list):

    $$X_t = T_t + S_t + Y_t$$

    where:

    • $T_t$: Trend component (long-term movement)
    • $S_t$: Seasonal component (regular patterns)
    • $Y_t$: Irregular component (stationary noise at time $t$)
    • $C_t$: Cyclical component (long-term cycles) - not shown
  2. Time series forecasting: predict future values

    $$X_t, \quad t \in \{T+1, T+2, \ldots\}$$
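As a quick illustration of the decomposition above, here is a minimal sketch using statsmodels' seasonal_decompose on a synthetic daily impressions series; the weekly period and all numbers are assumptions for the example, not part of the original notes.

```python
# A minimal sketch of additive decomposition X_t = T_t + S_t + Y_t,
# assuming a daily series with weekly seasonality (period=7).
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
t = np.arange(365)
impressions = pd.Series(
    100 + 0.5 * t                        # trend component T_t
    + 10 * np.sin(2 * np.pi * t / 7)     # weekly seasonal component S_t
    + rng.normal(0, 3, size=t.size),     # irregular component Y_t
    index=pd.date_range("2025-01-01", periods=t.size, freq="D"),
)

result = seasonal_decompose(impressions, model="additive", period=7)
print(result.trend.dropna().head())      # estimated T_t
print(result.seasonal.head(7))           # repeating weekly pattern S_t
print(result.resid.dropna().head())      # leftover irregular part Y_t
```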

Characteristics of Time Series Data

  1. Stochastic Process: A collection of random variables indexed by time.

    $$\{X_t : t \in T\}$$
  2. Dependency: There is dependency between random variables at different time points.

    • Hence, we need to consider joint distributions, not only marginal distributions.
    • Auto-covariance and auto-correlation functions are used to measure this dependency.
  3. Stationarity: A time series is stationary if its statistical properties (mean, variance, autocovariance) do not change over time.

    • For non-stationary series, we can often transform them to stationary series using differencing, detrending, nonlinear transformations, etc.
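A minimal sketch of this in practice, using the augmented Dickey-Fuller test from statsmodels to check stationarity before and after differencing; the random-walk series is synthetic and used only for illustration.

```python
# A minimal sketch: ADF test on a non-stationary random walk,
# then on its first difference, which should look stationary.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
x = np.cumsum(rng.normal(size=500))      # random walk: non-stationary

adf_stat, p_value, *_ = adfuller(x)
print(f"level series:     p-value = {p_value:.3f}")   # typically large

dx = np.diff(x)                          # first difference
adf_stat, p_value, *_ = adfuller(dx)
print(f"first difference: p-value = {p_value:.3f}")   # typically small
```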
Note

Strict Stationarity

A time series $X_t$ is said to be strictly stationary if the joint distribution of $(X_{t_1}, X_{t_2}, \ldots, X_{t_k})$ is the same as that of $(X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_k+h})$ for all $t_1, t_2, \ldots, t_k$ and all $h$; that is, the joint distribution is invariant under time shifts. This is a strong condition that is rarely met in practice.

Weak Stationarity

A time series $X_t$ is said to be weakly stationary if its mean and variance are constant over time, and the covariance between $X_t$ and $X_{t+h}$ depends only on the lag $h$ and not on the actual time $t$.

In practice, weak stationarity is a more common assumption than strict stationarity, as many time series exhibit constant mean and variance but may not have identical joint distributions under time shifts.

Mathematical Properties

Lag-$l$ autocovariance of $X_t$

$$\gamma(l) = Cov(X_t, X_{t-l})$$

where:

  • $\gamma(l)$: lag-$l$ autocovariance
  • $\gamma(0)$: variance of $X_t$
  • $\gamma(l) = \gamma(-l)$: symmetry property

Lag-$l$ autocorrelation of $X_t$

$$\rho(l) = \frac{Cov(X_t, X_{t-l})}{\sqrt{Var(X_t)\, Var(X_{t-l})}} = \frac{Cov(X_t, X_{t-l})}{Var(X_t)} = \frac{\gamma(l)}{\gamma(0)}$$

where we use the property $Var(X_t) = Var(X_{t-l})$ of a weakly stationary process.

High autocorrelation at lag $l$ indicates a strong linear relationship between $X_t$ and $X_{t-l}$, meaning past values have a significant influence on current values.

In practice, the sample lag-$l$ autocorrelation of $X_t$ is estimated as:

$$\hat{\rho}(l) = \frac{\sum_{t=l+1}^{T} (X_t - \bar{X})(X_{t-l} - \bar{X})}{\sum_{t=1}^{T} (X_t - \bar{X})^2}$$

where $0 \leq l \leq T-1$ and $\bar{X}$ is the sample mean of $X_t$.
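The sample autocorrelation can be computed directly from this formula; a minimal numpy sketch on a synthetic AR(1)-style series (the coefficient 0.7 is an assumption for the example):

```python
# A minimal sketch of the sample lag-l autocorrelation defined above.
import numpy as np

def sample_acf(x, max_lag):
    x = np.asarray(x, dtype=float)
    x_bar = x.mean()
    denom = np.sum((x - x_bar) ** 2)              # sum_t (X_t - X_bar)^2
    acf = []
    for l in range(max_lag + 1):
        num = np.sum((x[l:] - x_bar) * (x[:len(x) - l] - x_bar))
        acf.append(num / denom)
    return np.array(acf)

rng = np.random.default_rng(2)
x = np.zeros(1000)
for t in range(1, 1000):
    x[t] = 0.7 * x[t - 1] + rng.normal()          # AR(1) with a_1 = 0.7

print(sample_acf(x, 5).round(3))   # roughly 1, 0.7, 0.49, 0.34, ...
```

In practice, statsmodels.tsa.stattools.acf computes the same quantity.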

Examples of Time Series Models

Choose a model based on the characteristics of the time series data.

  • Autoregressive (AR) model
  • Moving Average (MA) model
  • Integrated (I) model
  • ARMA model
  • ARIMA model
  • Seasonal ARIMA (SARIMA) model
  • Fractional ARIMA (FARIMA) model

Autoregressive (AR) Model

An AR model expresses the variable as a linear regression on its own past values.

For an AR model of order $p$ (AR($p$)):

$$\hat{X}_t = a_0 + a_1 X_{t-1} + a_2 X_{t-2} + \ldots + a_p X_{t-p}$$

where $\hat{X}_t$ is the best estimate of $X_t$ given past values, and $a_0, a_1, \ldots, a_p$ are the model parameters to be estimated from the data.

  • Error: $e_t = X_t - \hat{X}_t$
  • Model: $X_t = a_0 + a_1 X_{t-1} + e_t$ (for AR(1))
  • Minimize the sum of squared errors (SSE): $SSE = \sum_{t=1}^{T} e_t^2$

Assumptions

  1. Linearity: The relationship between the current value and past values is linear.
  2. Normal, independent and identically distributed (i.i.d.) errors: The error terms $e_t$ are normally distributed with mean zero and constant variance (i.e. no autocorrelation).
  3. Additive errors: The error terms are added to the linear combination of past values.
  4. Stationarity: The time series is stationary, meaning its statistical properties do not change over time (the most important assumption!).

Backward Shift Operator

The backward shift operator $B$ is defined as:

$$\begin{aligned} B X_t &= X_{t-1} \\ B^2 X_t &= B(B X_t) = B X_{t-1} = X_{t-2} \\ &\ \vdots \\ B^p X_t &= X_{t-p} \end{aligned}$$

Then the AR($p$) model can be written as:

$$\begin{aligned} X_t &= a_0 + a_1 B X_t + a_2 B^2 X_t + \ldots + a_p B^p X_t + e_t \\ \Rightarrow\ & X_t - a_1 B X_t - a_2 B^2 X_t - \ldots - a_p B^p X_t = a_0 + e_t \\ \Rightarrow\ & \phi(B) X_t = a_0 + e_t \end{aligned}$$

where $\phi(B) = 1 - a_1 B - a_2 B^2 - \ldots - a_p B^p$ is the characteristic polynomial of the AR($p$) model.
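One standard use of the characteristic polynomial (a background fact, not stated above): the AR($p$) model is stationary exactly when all roots of $\phi(z) = 0$ lie outside the unit circle. A minimal sketch of that check with numpy:

```python
# A minimal sketch: check AR stationarity via the roots of the
# characteristic polynomial phi(z) = 1 - a_1 z - ... - a_p z^p.
import numpy as np

def is_stationary_ar(coeffs):
    """coeffs = [a_1, ..., a_p]; True if all roots satisfy |z| > 1."""
    # polyroots expects coefficients ordered from lowest to highest degree
    poly = np.r_[1.0, -np.asarray(coeffs, dtype=float)]
    roots = np.polynomial.polynomial.polyroots(poly)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_stationary_ar([0.5]))        # AR(1), a_1 = 0.5 -> True
print(is_stationary_ar([1.2]))        # AR(1), a_1 = 1.2 -> False (explosive)
print(is_stationary_ar([0.5, 0.3]))   # AR(2)            -> True
```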

How to estimate coefficients

The coefficients are estimated by minimizing the sum of squared errors (SSE) using multiple linear regression:

$$SSE = \sum e_t^2 = \sum (X_t - \hat{X}_t)^2 = \sum (X_t - a_0 - a_1 X_{t-1} - a_2 X_{t-2} - \ldots - a_p X_{t-p})^2$$

Minimizing the SSE gives the optimal coefficients $a_0, a_1, \ldots, a_p$; this is a multiple linear regression problem.
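A minimal sketch of that regression: build the lagged design matrix and solve the least-squares problem with numpy. The AR(2) series and its true coefficients are synthetic assumptions for the example.

```python
# A minimal sketch: estimate AR(2) coefficients by ordinary least squares.
import numpy as np

rng = np.random.default_rng(3)
n, a0, a1, a2 = 2000, 1.0, 0.6, 0.2
x = np.zeros(n)
for t in range(2, n):
    x[t] = a0 + a1 * x[t - 1] + a2 * x[t - 2] + rng.normal()

# Design matrix: a column of ones plus the lag-1 and lag-2 values
X = np.column_stack([np.ones(n - 2), x[1:-1], x[:-2]])
y = x[2:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef.round(3))   # approximately [a0, a1, a2] = [1.0, 0.6, 0.2]
```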

Determining the order of the AR model

By using the partial autocorrelation function (PACF).

Note

PACF is the autocorrelation between $X_t$ and $X_{t-l}$ after removing the effects of the intermediate lags $1, 2, \ldots, l-1$. It is the conditional correlation between $X_t$ and $X_{t-l}$ given the values of $X_{t-1}, X_{t-2}, \ldots, X_{t-l+1}$.

Example:

The first order partial autocorrelation is defined to equal the first order autocorrelation.

The second order (lag) partial autocorrelation is:

$$\frac{Cov(X_t, X_{t-2} \mid X_{t-1})}{\sqrt{Var(X_t \mid X_{t-1})\, Var(X_{t-2} \mid X_{t-1})}}$$

where $\hat{X}_t = a_0 + a_1 X_{t-1}$ and $\hat{X}_{t-2} = a_0 + a_1 X_{t-1}$ are the linear regression estimates of $X_t$ and $X_{t-2}$ on the intermediate value $X_{t-1}$, and the conditional covariance and variances are computed from the residuals $X_t - \hat{X}_t$ and $X_{t-2} - \hat{X}_{t-2}$.

The third order (lag) partial autocorrelation is:

$$\frac{Cov(X_t, X_{t-3} \mid X_{t-1}, X_{t-2})}{\sqrt{Var(X_t \mid X_{t-1}, X_{t-2})\, Var(X_{t-3} \mid X_{t-1}, X_{t-2})}}$$

where $\hat{X}_t$ and $\hat{X}_{t-3}$ are the linear regression estimates of $X_t$ and $X_{t-3}$ on the intermediate values $X_{t-1}$ and $X_{t-2}$, and the conditional moments are again computed from the corresponding residuals.
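A minimal sketch of using the sample PACF (here via statsmodels' pacf function) to pick the AR order on a synthetic AR(2) series; for an AR(2) process the PACF should cut off after lag 2.

```python
# A minimal sketch: the sample PACF of a simulated AR(2) series
# is clearly non-zero at lags 1-2 and near zero afterwards.
import numpy as np
from statsmodels.tsa.stattools import pacf

rng = np.random.default_rng(4)
n = 2000
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] + 0.2 * x[t - 2] + rng.normal()

print(np.round(pacf(x, nlags=5), 3))
# lag 0 is 1 by convention; lags 1 and 2 stand out; lags 3+ are close to zero
```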

Moving Average (MA) Model

An MA model expresses the variable as a linear combination of past error terms.

For an MA model of order $q$ (MA($q$)):

$$\hat{X}_t = b_0 + b_1 e_{t-1} + b_2 e_{t-2} + \ldots + b_q e_{t-q}$$

For order 1 (MA(1)):

$$\hat{X}_t = b_0 + b_1 e_{t-1}$$

MA(0) = AR(0) = white noise: $e_t = X_t - a_0$

where $a_0$ is the mean of the time series.

Backward Shift Operator

Using the backward shift operator $B$, the MA($q$) model can be written as:

$$\begin{aligned} X_t - a_0 &= b_1 B e_t + b_2 B^2 e_t + \ldots + b_q B^q e_t + e_t \\ \Rightarrow\ X_t - a_0 &= (1 + b_1 B + b_2 B^2 + \ldots + b_q B^q) e_t \\ \Rightarrow\ X_t - a_0 &= \psi(B) e_t \end{aligned}$$

Autocorrelation

For MA(1):

$$\rho(l) = \begin{cases} 1 & l = 0 \\ \dfrac{b_1}{1 + b_1^2} & l = 1 \\ 0 & l > 1 \end{cases}$$

The autocorrelation of MA($q$) is non-zero only for lags up to $q$.

Determining the order of the MA model

The order of the last significant autocorrelation $\rho(l)$ determines the order of the MA model.
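A minimal sketch checking both the MA(1) autocorrelation formula and this cutoff rule on a simulated series; the coefficient $b_1 = 0.8$ and the sample size are assumptions for the example.

```python
# A minimal sketch: sample ACF of a simulated MA(1) series versus the
# theoretical value rho(1) = b1 / (1 + b1^2); higher lags are near zero.
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(5)
b1 = 0.8
e = rng.normal(size=100_000)
x = e[1:] + b1 * e[:-1]                # MA(1): X_t = e_t + b1 * e_{t-1}

print(np.round(acf(x, nlags=3), 3))    # roughly [1, 0.488, 0, 0]
print(round(b1 / (1 + b1**2), 3))      # theoretical rho(1) = 0.488
```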

Integrated (I) Model

Used for non-stationary time series data to make it stationary by differencing.

White noise: $X_t - a_0 = e_t$, where the mean is independent of time.

If the series has a linear trend, one level of differencing yields white noise (I(1)): $(1 - B) X_t = X_t - X_{t-1} = a_0 + e_t$.

If the time series is parabolic, the second difference can be modeled as white noise (I(2)), i.e. two levels of differencing:

$$\begin{aligned} (X_t - X_{t-1}) - (X_{t-1} - X_{t-2}) &= a_0 + e_t \\ (1 - B)^2 X_t &= a_0 + e_t \end{aligned}$$
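A minimal sketch of two levels of differencing on a synthetic parabolic series; after applying $(1 - B)^2$ the result fluctuates around a constant mean with constant variance.

```python
# A minimal sketch: a quadratic-trend series becomes (roughly) white
# noise with a constant mean after second-order differencing.
import numpy as np

rng = np.random.default_rng(6)
t = np.arange(300)
x = 0.05 * t**2 + rng.normal(0, 1, size=t.size)   # parabolic + noise

d2 = np.diff(x, n=2)               # (1 - B)^2 X_t
print(x[:5].round(2))              # clearly trending upwards
print(round(d2.mean(), 3), round(d2.std(), 3))   # stable mean and spread
```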

ARMA and ARIMA Models

It is possible to combine AR, MA, and I components into a single model.

ARMA(p, q) Model

Combines AR(p) and MA(q) models for stationary time series:

$$\phi(B) X_t = a_0 + \psi(B) e_t$$

Where:

  • $\phi(B) = 1 - a_1 B - a_2 B^2 - \ldots - a_p B^p$ (AR part)
  • $\psi(B) = 1 + b_1 B + b_2 B^2 + \ldots + b_q B^q$ (MA part)

ARIMA(p, d, q) Model

For non-stationary time series, we can use differencing to make it stationary and then apply ARMA:

$$\begin{aligned} (1 - B)^d X_t &= Y_t \\ \phi(B) Y_t &= a_0 + \psi(B) e_t \\ \Rightarrow\ \phi(B) (1 - B)^d X_t &= a_0 + \psi(B) e_t \end{aligned}$$

Where:

  • $d$: order of differencing
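
A minimal sketch of fitting an ARIMA model with statsmodels; the order (1, 1, 1) and the synthetic series are assumptions for illustration, not a recommendation for real ad data.

```python
# A minimal sketch: fit ARIMA(1, 1, 1) to a synthetic integrated series
# and produce a short forecast.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
e = rng.normal(size=500)
x = np.cumsum(0.2 + e[1:] + 0.5 * e[:-1])   # drifting, non-stationary series

model = ARIMA(x, order=(1, 1, 1))           # (p, d, q)
fitted = model.fit()
print(fitted.params)                        # AR, MA and noise-variance estimates
print(fitted.forecast(steps=5))             # forecasts for the next 5 points
```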