
Surrogate model-based Derivative-Free Optimization

September 18, 2025

Surrogate model

\ell(\theta) = m(\theta) + e(\theta)

A low-fidelity model that approximates a high-fidelity (expensive) objective function, reducing the computational cost of each evaluation.

Definition

Fidelity: The degree to which a model accurately represents the real-world process it is intended to simulate.

  • \ell(\theta): upper-level objective (actual objective, expensive)
  • m(\theta): surrogate model (approximation, cheap)
  • e(\theta): noise

Optimization steps

If some hyperparameter \theta is selected, the model can be trained to obtain the test objective function value \ell(\theta, w^*(\zeta), \mathcal{D}_{\text{test}}).

However, this process is time-consuming and subject to noise due to the training randomness \zeta.

Therefore, instead of directly optimizing \ell(\theta), the goal is to create a surrogate model based on a few computed results and use it for more efficient optimization.

  1. Approximate the objective with a surrogate model
  2. Select the next hyperparameter to evaluate

Radial Basis Functions (RBF) + Stochastic Sampling

m_{\mathrm{RBF}}(\theta) = \sum_{i=1}^{N} \lambda_i \phi(\lVert \theta - \theta_i \rVert) + p(\theta)
  • Euclidean distance is used
  • radial = distance from a center point
  • \theta_i: center points (previously evaluated hyperparameters)
  • \ell_i: average (expected) test objective function values at those points
  • \phi: function that depends only on the distance from the center (e.g., Gaussian kernel, r^3, etc.)
  • \lambda_i: coefficients that determine the influence of each center \theta_i
  • p(\theta) = b_0 + b^\top \theta: linear polynomial tail (captures the overall linear trend)

Conditions

For the surrogate model to work properly, the following two conditions must be satisfied:

  1. Interpolation Condition: The surrogate model must match the actual experimental values at all experimental points.

    m_{\mathrm{RBF}}(\theta_i) = \ell_i, \quad i = 1, \dots, N
  2. Trend Separation Condition: To prevent the RBF part from mixing with the linear trend,

    P^\top \lambda = 0

Solving for coefficients

The parameters \lambda_1, \dots, \lambda_N and b = [b_1, \dots, b_d, b_0]^\top are determined by solving the linear system:

\begin{bmatrix} \Phi & P \\ P^\top & 0 \end{bmatrix} \begin{bmatrix} \lambda \\ b \end{bmatrix} = \begin{bmatrix} \ell \\ 0 \end{bmatrix}
  • To solve for \lambda and b, the block matrix on the left must be invertible.
    • This holds if and only if \mathrm{rank}(P) = d + 1, where d is the number of hyperparameters (the dimension of \theta); a quick check is sketched at the end of this subsection.

where:

\Phi_{ij} = \phi(\lVert \theta_i - \theta_j \rVert), \quad P = \begin{bmatrix} \theta_1^\top & 1 \\ \theta_2^\top & 1 \\ \vdots & \vdots \\ \theta_N^\top & 1 \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_d \\ b_0 \end{bmatrix}, \quad \ell = \begin{bmatrix} \mathbb{E}_{\zeta}\!\left[ \ell(\theta_1, w_1^*(\zeta), \mathcal{D}_{\text{test}}) \right] \\ \mathbb{E}_{\zeta}\!\left[ \ell(\theta_2, w_2^*(\zeta), \mathcal{D}_{\text{test}}) \right] \\ \vdots \\ \mathbb{E}_{\zeta}\!\left[ \ell(\theta_N, w_N^*(\zeta), \mathcal{D}_{\text{test}}) \right] \end{bmatrix}
  • \Phi_{ij} = \phi(\lVert \theta_i - \theta_j \rVert): RBF values based on distances between points
  • P: matrix formed by appending a column of ones to the coordinates of each point
  • \ell: upper-level evaluations; vector of average objective function values obtained at each experimental point

Solving this system yields the coefficients \lambda and b needed to construct the surrogate model m_{\mathrm{RBF}}(\theta).
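
As a quick sanity check on the invertibility condition above, the following minimal sketch (the array thetas of sample points is hypothetical) verifies that rank(P) = d + 1 before the block system is solved:

import numpy as np

# Hypothetical sample: N = 5 previously evaluated hyperparameters in d = 2 dimensions.
thetas = np.random.uniform(-2, 2, size=(5, 2))

N, d = thetas.shape
P = np.hstack([thetas, np.ones((N, 1))])  # append a column of ones to the coordinates

# The block system is solvable only if rank(P) = d + 1,
# i.e., the sample points are not affinely degenerate.
assert np.linalg.matrix_rank(P) == d + 1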

Optimization Loop

  1. Choose a few hyperparameters \theta_i and compute the actual objective function values \ell_i.
  2. Solve the above system to build the m_{\mathrm{RBF}} surrogate model.
  3. Based on this model, select a new \theta.
  4. Evaluate that \theta in practice and return to step 2.

How to select the next \theta^{new} to evaluate?

The next evaluation point \theta^{new} is selected by balancing exploitation (choosing points with low predicted values) and exploration (choosing points in less explored areas).

  1. Local Search: Generate M candidates by perturbing the best point found so far by \pm (a small random value).
  2. Global Search: Randomly sample M candidates from the entire search space.
  3. For each of the 2M candidates, compute two scores:
    • RBF Score: the predicted value from the surrogate model m_{\mathrm{RBF}}(\theta).
    s_{\text{RBF}}(x_i) = \sum_{j=1}^n \hat{\lambda}_j \phi(\lVert x_i - \theta_j \rVert) + p(x_i)
    • Distance Score: the minimum distance to any previously evaluated point.
    \Delta(x_i, \Theta) = \min_{j=1,\dots,n} \lVert x_i - \theta_j \rVert
  4. Scale both scores to the range [0, 1]:
    V_\Delta(x_i) = \frac{\Delta_{\max} - \Delta(x_i,\Theta)}{\Delta_{\max} - \Delta_{\min}}, \quad V_s(x_i) = \frac{s_{\text{RBF}}(x_i) - s_{\min}}{s_{\max} - s_{\min}}
  5. Compute the weighted score V(x_i) = \omega V_\Delta(x_i) + (1-\omega) V_s(x_i).
  6. Select the candidate with the lowest weighted score as the next evaluation point \theta^{new}.

This method ensures a balance between exploring new areas and exploiting known good areas in the hyperparameter space.

  • RBF score focuses on areas predicted to be good (exploitation).
  • Distance score encourages sampling in less explored areas (exploration).
Note (Stochastic Sampling)

The \ell vector inside \mathbb{E}_\zeta[\cdot]

In practice, since the expectation \mathbb{E}_\zeta[\cdot] cannot be computed directly, the same hyperparameter \theta_i is run multiple times and the results are averaged (a minimal sketch follows the list below).

\ell_i \approx \frac{1}{M_i}\sum_{m=1}^{M_i}\ell(\theta_i, w_i^*(\zeta_{i,m}), \mathcal{D}_{\text{test}})
  • This helps reduce noise caused by randomness.
  • For important candidate points, the number of repetitions can be increased to estimate the average more accurately.
  • If the noise is severe, a “regression” approach can be used instead of “interpolation” to allow for some deviation (e.g., ridge regression).
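
The averaging step is not part of the RBF code example below, so here is a minimal sketch; the helper name averaged_objective and the repetition count M_i are illustrative, and objective is assumed to be a noisy black-box function like the one in that example:

import numpy as np

def averaged_objective(theta, objective, M_i=3):
    # Estimate E_zeta[ ell(theta, w*(zeta), D_test) ] by averaging M_i noisy runs,
    # each run corresponding to a fresh training randomness zeta.
    runs = [objective(theta) for _ in range(M_i)]
    return np.mean(runs)

# Increase M_i for promising candidate points to estimate the average more accurately.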

Code Examples

import numpy as np

# -------------------------
# Example Objective Function
# -------------------------
def objective(x):
    # quadratic function with noise
    return np.sum(np.array(x)**2) + np.random.randn()*0.1

# -------------------------
# RBF Surrogate
# -------------------------
class RBF_Surrogate:
    def __init__(self, phi="cubic"):
        if phi == "gaussian":
            self.phi = lambda r: np.exp(-(r**2))
        elif phi == "linear":
            self.phi = lambda r: r
        else:  # cubic
            self.phi = lambda r: r**3

    def fit(self, X, y):
        self.X = np.array(X)
        self.y = np.array(y)
        N, d = self.X.shape
        Phi = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                Phi[i, j] = self.phi(np.linalg.norm(self.X[i] - self.X[j]))
        P = np.hstack([self.X, np.ones((N, 1))])
        A = np.block([
            [Phi, P],
            [P.T, np.zeros((d+1, d+1))]
        ])
        b = np.concatenate([self.y, np.zeros(d+1)])
        sol = np.linalg.solve(A, b)
        self.lmbda = sol[:N]
        self.beta = sol[N:]

    def predict(self, X_new):
        X_new = np.atleast_2d(X_new)
        y_pred = np.zeros(len(X_new))
        for k, xk in enumerate(X_new):
            rbf_sum = 0
            for i in range(len(self.X)):
                r = np.linalg.norm(xk - self.X[i])
                rbf_sum += self.lmbda[i] * self.phi(r)
            trend = np.dot(self.beta[:-1], xk) + self.beta[-1]
            y_pred[k] = rbf_sum + trend
        return y_pred

# -------------------------
# Data Storage
# -------------------------
class DataStore:
    def __init__(self, dim, m_init=5, bounds=(-2, 2)):
        self.dim = dim
        self.bounds = bounds
        self.m = m_init
        self.S = np.random.uniform(bounds[0], bounds[1], size=(m_init, dim))
        self.Y = np.array([objective(s) for s in self.S])
        self.rbf = RBF_Surrogate()
        self.update_rbf()

    def update_rbf(self):
        self.rbf.fit(self.S, self.Y)

    def add_point(self, x):
        y = objective(x)
        self.S = np.vstack([self.S, x])
        self.Y = np.append(self.Y, y)
        self.m += 1
        self.update_rbf()

# -------------------------
# Candidate Selection (RBF Score + Distance Score)
# -------------------------
def RBF_score(candidates, data, omega=0.5):
    s = data.rbf.predict(candidates)
    dist = np.array([np.min(np.linalg.norm(data.S - c, axis=1)) for c in candidates])
    # 0~1 scaling
    s_scaled = (s - s.min()) / (s.max() - s.min() + 1e-8)
    d_scaled = (dist.max() - dist) / (dist.max() - dist.min() + 1e-8)
    score = omega*d_scaled + (1-omega)*s_scaled
    return score

# -------------------------
# Optimization Loop
# -------------------------
def run_rbf(dim=2, n_iter=10):
    np.random.seed(0)
    data = DataStore(dim=dim)
    for it in range(n_iter):
        best_idx = np.argmin(data.Y)
        best_x = data.S[best_idx]
        # generate candidates
        local_candidates = best_x + np.random.uniform(-0.5, 0.5, size=(20, dim))
        global_candidates = np.random.uniform(data.bounds[0], data.bounds[1], size=(20, dim))
        candidates = np.vstack([local_candidates, global_candidates])
        # calculate scores
        score = RBF_score(candidates, data, omega=0.5)
        # select next point
        next_x = candidates[np.argmin(score)]
        data.add_point(next_x)
        print(f"Iter {it}: best_y={data.Y.min():.4f}, next_x={next_x}")
    return data

# execute
if __name__ == "__main__":
    result = run_rbf(dim=2, n_iter=10)

Gaussian Process (GP) + Expected Improvement (EI)

Gaussian Process (GP) is another popular surrogate model. It uses the mean and covariance structure to model the function \ell(\theta) probabilistically.

m_{GP}(\theta) = \mu + Z(\theta)
  • \mu: mean of the stochastic process (constant)
  • Z(\theta): the GP's randomness, Z(\theta) \sim \mathcal{N}(0, \sigma^2)
  • Z(\theta) values at different locations are correlated based on distance
Note
  • Mean m_{GP}(\theta^{new}): predicted value of \ell(\theta^{new}) based on the observed data (the most likely value, i.e., the expectation).
  • Variance s^2(\theta^{new}): indicates how uncertain the prediction is (illustrated by the sketch below).
    • Near observed data points, the variance is very small (almost 0), meaning the prediction is very certain.
    • In unexplored regions far from observed data points, the variance is large, meaning the prediction is very uncertain.
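
The toy 1-D sketch below illustrates this behavior with scikit-learn's GaussianProcessRegressor (the same surrogate used in the code example at the end of this section); the data points are made up for illustration:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A few observed (theta, loss) pairs in 1-D.
X = np.array([[-1.5], [-0.5], [0.0], [1.0]])
y = (X**2).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(), random_state=0).fit(X, y)

# Predictive std is ~0 at an observed point and much larger far from the data.
_, std_near = gpr.predict(np.array([[0.0]]), return_std=True)
_, std_far = gpr.predict(np.array([[5.0]]), return_std=True)
print(std_near, std_far)  # std_near is near 0, std_far is noticeably larger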

Correlation Structure

Correlation is defined using a kernel function, based on the assumption that “close inputs will have similar outputs”:

\text{Corr}(Z(\theta_k), Z(\theta_l)) = \exp\left(-\sum_{i=1}^d \gamma_i \lvert \theta_k^{(i)} - \theta_l^{(i)} \rvert^{q_i}\right)
  • d: dimension of the hyperparameters
  • \gamma_i: scaling parameter for each dimension (estimated via maximum likelihood estimation)
  • q_i: smoothness parameter (determines the shape of the kernel, usually 1 or 2); a minimal code sketch of this kernel follows.
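
A minimal sketch of this correlation function, assuming placeholder values for \gamma and q (in practice \gamma_i is fit by maximum likelihood); the helper names correlation and correlation_matrix are illustrative:

import numpy as np

def correlation(theta_k, theta_l, gamma, q):
    # Corr(Z(theta_k), Z(theta_l)) = exp(-sum_i gamma_i * |theta_k_i - theta_l_i|**q_i)
    diff = np.abs(np.asarray(theta_k) - np.asarray(theta_l))
    return np.exp(-np.sum(gamma * diff**q))

def correlation_matrix(thetas, gamma, q):
    # n x n matrix R with R_kl = Corr(Z(theta_k), Z(theta_l))
    n = len(thetas)
    return np.array([[correlation(thetas[k], thetas[l], gamma, q) for l in range(n)]
                     for k in range(n)])

# Example with placeholder parameters for d = 2 hyperparameters.
thetas = np.array([[0.0, 0.0], [1.0, 0.5], [-0.5, 1.5]])
R = correlation_matrix(thetas, gamma=np.array([1.0, 1.0]), q=np.array([2.0, 2.0]))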

Given data \{(\theta_i, \ell_i)\}_{i=1}^n, the process mean and variance are estimated as follows:

\hat{\mu} = \frac{1^\top R^{-1} \ell}{1^\top R^{-1} 1}, \quad \hat{\sigma}^2 = \frac{(\ell - 1 \hat{\mu})^\top R^{-1} (\ell - 1 \hat{\mu})}{n}
  • R: n \times n correlation matrix with R_{kl} = \text{Corr}(Z(\theta_k), Z(\theta_l))
  • \ell: upper-level objective vector (the actual observed values)

Predictions at a new point \theta^{new}

The prediction is a weighted average of the existing observations.

m_{GP}(\theta^{new}) = \hat{\mu} + r^\top R^{-1} (\ell - 1\hat{\mu}), \quad r = \begin{bmatrix} \text{Corr}(Z(\theta^{new}), Z(\theta_1)) \\ \vdots \\ \text{Corr}(Z(\theta^{new}), Z(\theta_n)) \end{bmatrix}
  • r^\top R^{-1}: the weight part
  • r: correlation vector between \theta^{new} and the existing points
  • R: correlation matrix among the existing points

Hence, observations \ell_i at points closer to (more correlated with) \theta^{new} receive higher weight in the prediction (a code sketch covering both the prediction and the uncertainty formulas appears below).

Uncertainty estimation at a new point

s^2(\theta^{new}) = \hat{\sigma}^2 \left( 1 - r^\top R^{-1} r + \frac{(1 - 1^\top R^{-1} r)^2}{1^\top R^{-1} 1} \right)
  • r^\top R^{-1} r: measures how well \theta^{new} is explained by the existing points.
    • If \theta^{new} is very close to existing points (a well-explored area), this value approaches 1, making the overall variance s^2(\theta^{new}) close to 0 (low uncertainty).
    • Conversely, if \theta^{new} is far from the existing points (an unexplored area), this value approaches 0, leading to higher variance (high uncertainty).
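
A minimal sketch that puts the \hat{\mu}, \hat{\sigma}^2, prediction, and uncertainty formulas into code, reusing the hypothetical correlation / correlation_matrix helpers sketched above (inputs are placeholders, and a small jitter or pseudo-inverse may be needed if R is ill-conditioned):

import numpy as np

def gp_predict(theta_new, thetas, ell, gamma, q):
    # Closed-form GP prediction m_GP(theta_new) and variance s^2(theta_new).
    n = len(thetas)
    R = correlation_matrix(thetas, gamma, q)            # correlations among observed points
    r = np.array([correlation(theta_new, t, gamma, q)   # correlations to the new point
                  for t in thetas])
    one = np.ones(n)
    R_inv = np.linalg.inv(R)

    mu_hat = (one @ R_inv @ ell) / (one @ R_inv @ one)
    sigma2_hat = ((ell - mu_hat) @ R_inv @ (ell - mu_hat)) / n

    m_gp = mu_hat + r @ R_inv @ (ell - mu_hat)
    s2 = sigma2_hat * (1.0 - r @ R_inv @ r
                       + (1.0 - one @ R_inv @ r)**2 / (one @ R_inv @ one))
    return m_gp, s2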
Summary (Final optimization loop)
  1. Compute \ell(\theta) at a few initial points.
  2. Train the GP model to estimate m_{GP}(\theta) and s(\theta).
  3. Compute the Expected Improvement E[I] for candidate points \theta.
  4. Select \theta^{new} = \arg\max_\theta E[I].
  5. Evaluate it in practice and add it to the dataset → go back to step 2.

Expected Improvement (EI)

Find the optimal \theta that maximizes E[I] using genetic algorithms, etc.

To select the next evaluation point, we consider the expected improvement over the current best.

  1. Current best objective function value:

    \ell^{best} = \ell(\theta^{best})
  2. Improvement at a candidate point \theta:

    I = \max(\ell^{best} - L,\, 0), \quad L \sim \mathcal{N}(m_{GP}(\theta), s^2(\theta))
    • L: random variable reflecting the prediction uncertainty
      • a value following a normal distribution with mean m_{GP}(\theta) and variance s^2(\theta)
  3. Expected Improvement:

    E[I] = s(\theta) \left( Z \, \Phi(Z) + \phi(Z) \right) \quad \text{where} \quad Z = \frac{\ell^{best} - m_{GP}(\theta)}{s(\theta)}
    • \Phi: standard normal cumulative distribution function (CDF)
    • \phi: standard normal probability density function (PDF)

Code Example

import numpy as np
import array
from deap import base, creator, tools, algorithms
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from scipy.stats import norm

# -------------------------
# Black-box Objective Function
# -------------------------
def objective(x):
    # quadratic function with noise
    return np.sum(np.array(x) ** 2) + np.random.randn() * 0.1

# -------------------------
# Data Storage Class
# -------------------------
class DataStore:
    def __init__(self, dim, m_init=5, bounds=(-2, 2)):
        self.dim = dim
        self.bounds = bounds
        self.m = m_init  # initial number of samples
        self.S = np.random.uniform(bounds[0], bounds[1], size=(m_init, dim))
        self.Y = np.array([objective(s) for s in self.S])
        # GP surrogate
        self.gpr = GaussianProcessRegressor(kernel=RBF(), random_state=0)
        self.update_gp()

    def update_gp(self):
        # fit with 1-D targets so predict() returns 1-D mean/std
        self.gpr.fit(self.S, self.Y)

    def add_point(self, x):
        y = objective(x)
        self.S = np.vstack([self.S, x])
        self.Y = np.append(self.Y, y)
        self.m += 1
        self.update_gp()

# -------------------------
# EI function
# -------------------------
def Expected_improvement(x, data):
    x_to_predict = np.array(x).reshape(1, -1)
    mu, sigma = data.gpr.predict(x_to_predict, return_std=True)
    greater_is_better = False
    if greater_is_better:
        loss_optimum = np.max(data.Y[:data.m])
    else:
        loss_optimum = np.min(data.Y[:data.m])
    scaling_factor = (-1) ** (not greater_is_better)
    with np.errstate(divide='ignore', invalid='ignore'):
        Z = scaling_factor * (mu - loss_optimum) / sigma
        expected_improvement = scaling_factor * (mu - loss_optimum) * norm.cdf(Z) + sigma * norm.pdf(Z)
        expected_improvement[sigma == 0.0] = 0.0
    # DEAP expects a fitness tuple; EI is negated because FitnessMin minimizes
    return (-expected_improvement[0],)

# -------------------------
# DEAP Setup
# -------------------------
def run_gp_ei(dim=2, n_iter=10):
    data = DataStore(dim=dim)
    # Fitness definition
    creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
    creator.create("Individual", array.array, typecode='d', fitness=creator.FitnessMin)
    toolbox = base.Toolbox()
    # For each dimension, create attribute generator
    for i in range(dim):
        INT_MIN, INT_MAX = data.bounds
        toolbox.register(f"attr_float_{i}", np.random.uniform, INT_MIN, INT_MAX)
    # Generate individuals (pass a tuple, not a generator, so it can be reused for every individual)
    toolbox.register("individual", tools.initCycle, creator.Individual,
                     tuple(getattr(toolbox, f"attr_float_{i}") for i in range(dim)), n=1)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    # Evaluate = EI
    toolbox.register("evaluate", Expected_improvement, data=data)
    toolbox.register("mate", tools.cxTwoPoint)
    # mutUniformInt snaps mutated genes to integer values within the bounds
    toolbox.register("mutate", tools.mutUniformInt,
                     low=[data.bounds[0]]*dim, up=[data.bounds[1]]*dim, indpb=0.2)
    toolbox.register("select", tools.selTournament, tournsize=3)

    # -------------------------
    # Optimization Loop
    # -------------------------
    for it in range(n_iter):
        pop = toolbox.population(n=20)
        hof = tools.HallOfFame(1)
        algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.3,
                            ngen=15, halloffame=hof, verbose=False)
        next_x = np.array(hof[0])
        data.add_point(next_x)
        print(f"Iter {it}: best_y={data.Y.min():.4f}, next_x={next_x}")
    return data

# execute
if __name__ == "__main__":
    result = run_gp_ei(dim=2, n_iter=10)