Logo

Hyperparameter Optimization

September 16, 2025
10 min read
Summary (Summary of Optimization in MMM)

Takes around 3 optimization steps

  1. Variable transformation (adstock, saturation) → requires optimization of 3 hyperparameters.
    1. Adstock = Captures the delayed effect of advertising over time. How long does the effect of an ad last?
      • hyperparameter: decay rate
    2. Saturation = Accounts for diminishing returns on ad spend. How quickly do additional ad spends yield lower returns?
      • hyperparameters: saturation point, saturation rate.
  2. Multi-objective optimization minimizing error between MMM predictions and ground truth.
  3. Optimal budget allocation.

Model Parameter

  • Configuration variable that is internal to the model and whose value can be estimated from data.
  • e.g., weights in linear regression, weights and biases in neural networks.

Characteristics

  • They are required by the model when making predictions.
  • They values define the skill of the model on your problem.
  • They are estimated or learned from data.
  • They are often not set manually by the practitioner.
  • They are often saved as part of the learned model.

Hyperparameter

  • Configuration variable that is external to the model and whose value cannot be estimated from data.
  • e.g., learning rate, number of hidden layers, number of clusters.

Characteristics

  • They are often used in the process to help estimate model parameters.
  • They can often be set using heuristics.
  • They are often tuned for a given predictive modeling problem.

Bilevel optimization

minθΩ(w(θ);Dval)s.t.w(θ)=argminwWL(w;θ,Dtrain)\begin{align*} \min_{\theta \in \Omega} \quad & \ell(w^*(\theta); \mathcal{D}_{\text{val}}) \\ \text{s.t.} \quad & w^*(\theta) = \arg\min_{w \in \mathcal{W}} L(w; \theta, \mathcal{D}_{\text{train}}) \end{align*}

Bi-level Optimization has a nested structure where one optimization problem is contained within another. It consists of an upper-level problem and a lower-level problem, with the solution of the upper-level problem depending on the solution of the lower-level problem.

  • θ\theta: Hyperparameter. The variable we want to optimize (e.g., learning rate, number of hidden units, regularization strength).
  • Ω\Omega: The search space where the hyperparameter θ\theta can exist.
  • ww: Model Parameter. The values that the model learns through training (e.g., weights and biases of a neural network).
  • D=(Dtrain,Dval)D = (D_{\text{train}}, D_{\text{val}}): The dataset, split into
    • DtrainD_{\text{train}}: Data used to train the model parameters ww.
    • DvalD_{\text{val}}: Data used to (1) evaluate the generalization performance of the trained model and (2) to find the optimal hyperparameter θ\theta.
  • LL: Training loss function. The criterion used to optimize ww during training, measuring how well the model predicts DtrainD_{\text{train}}.
  • \ell: Validation loss function. The criterion used to evaluate the quality of the hyperparameter θ\theta, measuring how well the model trained on DtrainD_{\text{train}} performs on unseen data DvalD_{\text{val}}.

Example

minθ[5,5](θ,w)=θ2+(w)2w=argminwRL(θ,w)=(θw)2s.t.w[5,5]\begin{align*} \min_{\theta \in [-5, 5]} \ell(\theta, w^*) = \theta^2 + (w^*)^2 \\ w^* = \arg\min_{w \in \mathbb{R}} L(\theta, w) = (\theta - w)^2 \\ {s.t.} \quad w \in [-5, 5] \end{align*}
  1. Select a feasible hyperparameter θ\theta from the search space Ω=[5,5]\Omega = [-5, 5]. For example, let θ=3\theta = 3.
  2. Solve the lower-level problem to find the optimal model parameter ww^* given the hyperparameter θ=3\theta = 3.
    • solve min(θw)2min(\theta - w)^2 with respect to ww.
    • The lower-level problem is to minimize the training loss L(θ,w)=(3w)2L(\theta, w) = (3 - w)^2 with respect to ww.
    • The optimal solution is w=3w^* = 3.
  3. Evaluate the upper-level objective function (θ,w)\ell(\theta, w^*) using the obtained ww^*.
    • The upper-level problem is to minimize the validation loss (θ,w)=θ2+(w)2\ell(\theta, w^*) = \theta^2 + (w^*)^2.
    • Substituting θ=3\theta = 3 and w=3w^* = 3, we get (3,3)=32+32=18\ell(3, 3) = 3^2 + 3^2 = 18.
Definition (What is Discretization?)

Discretization: The process of transforming continuous variables or functions into discrete counterparts.

  • Simplifies the search space, making it easier to explore.
  • Efficient Search Strategies
  • Computational Constraints
  • Easier to Parallelize Evaluations.

Grid Search is a method for finding the optimal hyperparameters by exhaustively trying all possible combinations based on a predefined grid of hyperparameter values.

  • Most basic hyperparameter optimization method.
  • Full factorial design
    • Evaluates all possible combinations of hyperparameters.
    • Guarantees finding the global optimum if given enough resources.
    • curse of dimensionality: Computational cost increases exponentially with the number of hyperparameters.

It is the cartesian product of hyperparameter values.

Grid={(θ1,θ2,...,θk)θiVi}=V1×V2×...×Vk\text{Grid} = \{(\theta_1, \theta_2, ..., \theta_k) | \theta_i \in V_i\} = V_1 \times V_2 \times ... \times V_k

When iterations are limited, each trial has the opportunity to explore new values for each hyperparameter, increasing the chances of finding the optimal values for the important parameters.

Random Search is a method for finding the optimal hyperparameters by randomly sampling combinations of hyperparameter values from predefined ranges.

  • Works better than Grid Search when some hyperparameters are more important than others. (Which is true in most cases)
  • But you don’t know how many samples you need to draw to find a good combination.

Bi-level Black-Box Optimization Problem

Definition (What is Black Box Optimization?)

Black Box: A system or function where the internal workings are not known or accessible. You can only observe the inputs and outputs.

Black-box optimization methods obtain the optimal results without needing to understand the internal mechanics.

  • The overall shape of the upper-level objective function (θ)\ell(\theta) is unknown.
  • Derivative-free optimization method: gradients and hessians are not needed.

Common Black-Box Optimization Methods

  • Bayesian Optimization:
    • Builds a probabilistic model of the objective function and uses it to select the most promising hyperparameters to evaluate next.
    • This can be considered as a surrogate-based derivative-free optimization algorithm.
  • Evolutionary Algorithms:
    • Mimics the process of natural selection to evolve a population of candidate solutions. (aka. Genetic Algorithms)
  • Random Search:
    • Randomly samples hyperparameter combinations from predefined distributions.
  • Surrogate-based Derivative-Free Optimization Algorithms:
    • Uses a surrogate model to approximate the objective function and guide the search for optimal hyperparameters.

Surrogate-based Derivative-Free Optimization Algorithm in Hyperparameter Optimization

  1. Prepare hyperparameters for optimization

    • Map the finite set of possible values using sequences of integers.
    • Domain denoted as Ω\Omega.
  2. Initial design

    • Create an initial experimental design with n0n_0 samples by random sampling from Ω\Omega.
    • For each sampled candidate θjΩ\theta_j \in \Omega, compute the upper level objective function value ((θj)\ell(\theta_j)) by solving the lower-level problem
    • e.g., training and validating a model with hyperparameter θj\theta_j.
  3. Set nn0n \leftarrow n_0

    • Record the number of evaluations.
  4. Adaptive sampling

    • Iteratively select new candidates θnew\theta^{new} to evaluate based on previous results.
  5. Surrogate model fitting

    • Use the data (θj,(θj))(\theta_j, \ell(\theta_j)) obtained so far to fit a surrogate model.
    • This surrogate model is a cheaper approximation of the expensive objective function.
    • methods:
      • Gaussian Processes (GP): Probabilistic model that provides uncertainty estimates.
      • Radial Basis Functions (RBF): Approximate response surface of the objective function.
  6. Acquisition function optimization

    • Optimize the acquisition function using the surrogate model.
    • This optimization process identifies the most promising candidate θnew\theta^{new} to evaluate next.
    • Exploration vs. Exploitation trade-off:
      • Exploration: Searching new areas of the hyperparameter space.
      • Exploitation: Focusing on areas known to yield good results.
  7. Lower-level problem solving

    • Solve the lower-level problem for the newly selected hyperparameter θnew\theta^{new}.
    • Obtain the objective function value (θnew)\ell(\theta^{new}).
  8. Update

    • Update nn+1n \leftarrow n + 1
    • Go back to Step 5.
  9. Stopping criterion

    • Stop if the termination condition is met.
    • e.g.: max number of evaluations, time budget, convergence criteria.

Code Practice

minx,yf(x,y)=x2+y2subject to 0x10,y(x)argmin0yx(y5)2\min_{x,y} f(x,y) = x^2 + y^2 \\ \text{subject to } 0 \leq x \leq 10, \\ y(x) \in \text{argmin}_{0 \leq y \leq x} (y - 5)^2
import random
import pyomo.environ as pyo
import numpy as np
import matplotlib.pyplot as plt
random.seed(21)
# lower level problem solve for y given x
def solve_lower_level(x):
model = pyo.ConcreteModel()
model.y = pyo.Var(within=pyo.NonNegativeReals, bounds=(0, x))
model.obj = pyo.Objective(expr=(model.y - 5)**2, sense=pyo.minimize)
solver = pyo.SolverFactory('ipopt')
solver.solve(model)
return model.y.value
# evaluate upper level optimization problem
def evaluate_upper_level(x, y):
return x**2 + y**2
# test functions
for i in [0, 5, 10]:
print(f"Optimal y for x={i}: {solve_lower_level(i)}")
for i in [0, 5, 10]:
print(f"Upper level objective for x={i}, y={solve_lower_level(i)}: {evaluate_upper_level(i, solve_lower_level(i))}")
# iterate randomly to find optimal x, y
def upper_level_random_search(num_iterations=1000):
best_x = None
best_y = None
min_objective_value = float('inf')
for i in range(num_iterations):
x_candidate = random.uniform(0, 10)
y_candidate = solve_lower_level(x_candidate)
objective_value = evaluate_upper_level(x_candidate, y_candidate)
if objective_value < min_objective_value:
min_objective_value = objective_value
best_x = x_candidate
best_y = y_candidate
print(f"Iteration {i+1}: New best found -> x = {best_x:.4f}, y = {best_y:.4f}, f(x,y) = {min_objective_value:.4f}")
print("\nOptimal solution:")
print(f" x = {best_x:.4f}")
print(f" y = {best_y:.4f}")
print(f" Minimum f(x, y) = {min_objective_value:.4f}")
return best_x, best_y, min_objective_value
upper_level_random_search()

Output:

Iteration 1: New best found -> x = 1.6495, y = 1.6495, f(x,y) = 5.4417
Iteration 11: New best found -> x = 0.0318, y = 0.0318, f(x,y) = 0.0020
Optimal solution:
x = 0.0318
y = 0.0318
Minimum f(x, y) = 0.0020

Visualization

# generate data
x = np.random.rand(100)
y = 2 * x + 1 + np.random.normal(0, 0.1, 100) # y = 2x + 1 + noise
plt.scatter(x, y)
# split data
split_index = int(0.8 * len(x))
x_train, x_val = x[:split_index], x[split_index:]
y_train, y_val = y[:split_index], y[split_index:]

Output:

scatter plot

minλ(θ,ω;Dval)s.t. 0λ1y(x)argminωWL(ω;λ,Dtrain)\min_{\lambda} \ell(\theta, \omega^*; D_{val}) \\ \text{s.t. } 0 \leq \lambda \leq 1 \\ y(x) \in \text{argmin}_{\omega \in W} L(\omega; \lambda, D_{train}) \\

Use the following assumptions to generate the dataset:

  • True model: y=2x+1y = 2x + 1
  • xx: Generate 100 random xx values.
  • yy: Generate 100 yy values with the above xx and add error with a normal distribution.
    • e.g. y=2x+1+errory = 2x + 1 + \text{error}
  • Use the same loss function on the upper and lower level.
    • Ridge regression loss: Σ(yy^)2+λ×m2,where y^=mx+b\Sigma (y - \hat{y})^2 + \lambda \times m^2, \text{where} \space \hat{y} = mx + b
def solve_lower_level(lambda_value, x_train, y_train):
model = pyo.ConcreteModel()
model.m = pyo.Var(within=pyo.Reals)
model.b = pyo.Var(within=pyo.Reals)
def obj_func(model):
return sum((y_train[i] - (model.m * x_train[i] + model.b))**2 for i in range(len(x_train))) + lambda_value * (model.m**2)
model.obj = pyo.Objective(expr=obj_func(model), sense=pyo.minimize)
solver = pyo.SolverFactory('ipopt')
solver.solve(model)
return model.m.value, model.b.value
solve_lower_level(0.1, x_train, y_train)
# Output: (1.9602579300437144, 1.013876477469167)
def upper_level_random_search(num_iterations=1000):
best_lambda = None
best_m = None
best_b = None
min_val_loss = float('inf')
for i in range(num_iterations):
lambda_candidate = random.uniform(0, 1)
m_candidate, b_candidate = solve_lower_level(lambda_candidate, x_train, y_train)
val_loss = np.mean([(y_val[j] - (m_candidate * x_val[j] + b_candidate))**2 for j in range(len(x_val))]) + lambda_candidate * (m_candidate**2)
if val_loss < min_val_loss:
min_val_loss = val_loss
best_lambda = lambda_candidate
best_m = m_candidate
best_b = b_candidate
print(f"Iteration {i+1}: New best -> λ = {best_lambda:.4f}, m = {best_m:.4f}, b = {best_b:.4f}, Val Loss = {min_val_loss:.4f}")
print("\nOptimal solution:")
print(f" lambda = {best_lambda:.4f}")
print(f" m = {best_m:.4f}")
print(f" b = {best_b:.4f}")
print(f" Minimum Validation Loss = {min_val_loss:.4f}")
return best_lambda, best_m, best_b, min_val_loss
upper_level_random_search()

Output:

Iteration 1: New best -> λ = 0.4744, m = 1.8421, b = 1.0714, Val Loss = 1.6210
Iteration 3: New best -> λ = 0.2508, m = 1.9109, b = 1.0379, Val Loss = 0.9258
Iteration 7: New best -> λ = 0.2438, m = 1.9131, b = 1.0368, Val Loss = 0.9022
Iteration 9: New best -> λ = 0.2171, m = 1.9217, b = 1.0327, Val Loss = 0.8117
Iteration 10: New best -> λ = 0.1569, m = 1.9413, b = 1.0231, Val Loss = 0.6010
Iteration 20: New best -> λ = 0.1438, m = 1.9456, b = 1.0210, Val Loss = 0.5542
Iteration 22: New best -> λ = 0.0404, m = 1.9805, b = 1.0040, Val Loss = 0.1681
Iteration 54: New best -> λ = 0.0089, m = 1.9913, b = 0.9987, Val Loss = 0.0452
Iteration 119: New best -> λ = 0.0001, m = 1.9944, b = 0.9973, Val Loss = 0.0103
Optimal solution:
lambda = 0.0001
m = 1.9944
b = 0.9973
Minimum Validation Loss = 0.0103