
Surrogate model-based Derivative-Free Optimization

September 18, 2025

Surrogate model

\ell(\theta) = m(\theta) + e(\theta)

A low-fidelity model that approximates a high-fidelity (expensive) objective function, reducing the computational cost of each evaluation.

Definition

Fidelity: The degree to which a model accurately represents the real-world process it is intended to simulate.

  • \ell(\theta): upper-level objective (actual objective, expensive)
  • m(\theta): surrogate model (approximation, cheap)
  • e(\theta): noise

Optimization steps

If some hyperparameter \theta is selected, the model can be trained to obtain the test objective function value \ell(\theta, w^*(\zeta), \mathcal{D}_{\text{test}}).

However, this process is time-consuming and subject to noise due to the training randomness \zeta.

Therefore, instead of directly optimizing \ell(\theta), the goal is to create a surrogate model based on a few computed results and use it for more efficient optimization.

  1. Approximate the objective with a surrogate model
  2. Select the next hyperparameter to evaluate

Radial Basis Functions (RBF) + Stochastic Sampling

m_{\mathrm{RBF}}(\theta) = \sum_{i=1}^{N} \lambda_i \phi(\lVert \theta - \theta_i \rVert) + p(\theta)
  • Euclidean distance is used
  • radial = distance from a center point
  • \theta_i: center points (previously evaluated hyperparameters)
  • \ell_i: average (expected) test objective function values at those points
  • \phi: function that depends only on the distance from the center (e.g., Gaussian kernel, r^3, etc.)
  • \lambda_i: coefficients that determine the influence of each center \theta_i
  • p(\theta) = b_0 + b^\top \theta: linear polynomial tail (captures the overall linear trend)

Conditions

For the surrogate model to work properly, the following two conditions must be satisfied:

  1. Interpolation Condition: The surrogate model must match the actual experimental values at all experimental points.

    m_{\mathrm{RBF}}(\theta_i) = \ell_i, \quad i = 1, \dots, N
  2. Trend Separation Condition: To prevent the RBF part from mixing with the linear trend,

    P^\top \lambda = 0

Solving for coefficients

The parameters \lambda_1, \dots, \lambda_N and b = [b_1, \dots, b_d, b_0]^\top are determined by solving the linear system:

\begin{bmatrix} \Phi & P \\ P^\top & 0 \end{bmatrix} \begin{bmatrix} \lambda \\ b \end{bmatrix} = \begin{bmatrix} \ell \\ 0 \end{bmatrix}
  • To solve for \lambda and b, the block matrix on the left must be invertible.
    • This holds if and only if \mathrm{rank}(P) = d + 1, where d is the number of hyperparameters (the dimension of \theta); a quick check is sketched at the end of this subsection.

where:

\Phi_{ij} = \phi(\lVert \theta_i - \theta_j \rVert), \quad P = \begin{bmatrix} \theta_1^\top & 1 \\ \theta_2^\top & 1 \\ \vdots & \vdots \\ \theta_N^\top & 1 \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_d \\ b_0 \end{bmatrix}, \quad \ell = \begin{bmatrix} \mathbb{E}_{\zeta}\!\left[ \ell(\theta_1, w_1^*(\zeta), \mathcal{D}_{\text{test}}) \right] \\ \mathbb{E}_{\zeta}\!\left[ \ell(\theta_2, w_2^*(\zeta), \mathcal{D}_{\text{test}}) \right] \\ \vdots \\ \mathbb{E}_{\zeta}\!\left[ \ell(\theta_N, w_N^*(\zeta), \mathcal{D}_{\text{test}}) \right] \end{bmatrix}
  • \Phi_{ij} = \phi(\lVert \theta_i - \theta_j \rVert): RBF values based on distances between points
  • P: matrix formed by appending a column of ones to the coordinates of each point
  • \ell: upper-level evaluations; vector of average objective function values obtained at each experimental point

Solving this system yields the coefficients \lambda and b needed to construct the surrogate model m_{\mathrm{RBF}}(\theta).
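
As a quick sanity check on the invertibility condition above, the following minimal sketch (the array thetas of sample points is hypothetical) verifies that rank(P) = d + 1 before the block system is solved:

import numpy as np

# Hypothetical sample: N = 5 previously evaluated hyperparameters in d = 2 dimensions.
thetas = np.random.uniform(-2, 2, size=(5, 2))

N, d = thetas.shape
P = np.hstack([thetas, np.ones((N, 1))])  # append a column of ones to the coordinates

# The block system is solvable only if rank(P) = d + 1,
# i.e., the sample points are not affinely degenerate.
assert np.linalg.matrix_rank(P) == d + 1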

Optimization Loop

  1. Choose a few hyperparameters \theta_i and compute the actual objective function values \ell_i.
  2. Solve the above system to build the m_{\mathrm{RBF}} surrogate model.
  3. Based on this model, select a new \theta.
  4. Evaluate that \theta in practice and return to step 2.

How to select the next \theta^{new} to evaluate?

The next evaluation point \theta^{new} is selected by balancing exploitation (choosing points with low predicted values) and exploration (choosing points in less explored areas).

  1. Local Search: Generate M candidates by perturbing the best point found so far by \pm (a small random value).
  2. Global Search: Randomly sample M candidates from the entire search space.
  3. For each of the 2M candidates, compute two scores:
    • RBF Score: the predicted value from the surrogate model m_{\mathrm{RBF}}(\theta).
    s_{\text{RBF}}(x_i) = \sum_{j=1}^n \hat{\lambda}_j \phi(\lVert x_i - \theta_j \rVert) + p(x_i)
    • Distance Score: the minimum distance to any previously evaluated point.
    \Delta(x_i, \Theta) = \min_{j=1,\dots,n} \lVert x_i - \theta_j \rVert
  4. Scale both scores to the range [0, 1]:
    V_\Delta(x_i) = \frac{\Delta_{\max} - \Delta(x_i,\Theta)}{\Delta_{\max} - \Delta_{\min}}, \quad V_s(x_i) = \frac{s_{\text{RBF}}(x_i) - s_{\min}}{s_{\max} - s_{\min}}
  5. Compute the weighted score V(x_i) = \omega V_\Delta(x_i) + (1-\omega) V_s(x_i).
  6. Select the candidate with the lowest weighted score as the next evaluation point \theta^{new}.

This method ensures a balance between exploring new areas and exploiting known good areas in the hyperparameter space.

  • RBF score focuses on areas predicted to be good (exploitation).
  • Distance score encourages sampling in less explored areas (exploration).
Note (Stochastic Sampling)

The \ell vector inside \mathbb{E}_\zeta[\cdot]

In practice, since the expectation \mathbb{E}_\zeta[\cdot] cannot be computed directly, the same hyperparameter \theta_i is run multiple times and the results are averaged (a minimal sketch follows the list below).

\ell_i \approx \frac{1}{M_i}\sum_{m=1}^{M_i}\ell(\theta_i, w_i^*(\zeta_{i,m}), \mathcal{D}_{\text{test}})
  • This helps reduce noise caused by randomness.
  • For important candidate points, the number of repetitions can be increased to estimate the average more accurately.
  • If the noise is severe, a “regression” approach can be used instead of “interpolation” to allow for some deviation (e.g., ridge regression).
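
The averaging step is not part of the RBF code example below, so here is a minimal sketch; the helper name averaged_objective and the repetition count M_i are illustrative, and objective is assumed to be a noisy black-box function like the one in that example:

import numpy as np

def averaged_objective(theta, objective, M_i=3):
    # Estimate E_zeta[ ell(theta, w*(zeta), D_test) ] by averaging M_i noisy runs,
    # each run corresponding to a fresh training randomness zeta.
    runs = [objective(theta) for _ in range(M_i)]
    return np.mean(runs)

# Increase M_i for promising candidate points to estimate the average more accurately.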

Code Examples

import numpy as np

# -------------------------
# Example Objective Function
# -------------------------
def objective(x):
    # quadratic function with noise
    return np.sum(np.array(x)**2) + np.random.randn()*0.1

# -------------------------
# RBF Surrogate
# -------------------------
class RBF_Surrogate:
    def __init__(self, phi="cubic"):
        if phi == "gaussian":
            self.phi = lambda r: np.exp(-(r**2))
        elif phi == "linear":
            self.phi = lambda r: r
        else:  # cubic
            self.phi = lambda r: r**3

    def fit(self, X, y):
        self.X = np.array(X)
        self.y = np.array(y)
        N, d = self.X.shape
        Phi = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                Phi[i, j] = self.phi(np.linalg.norm(self.X[i] - self.X[j]))
        P = np.hstack([self.X, np.ones((N, 1))])
        A = np.block([
            [Phi, P],
            [P.T, np.zeros((d+1, d+1))]
        ])
        b = np.concatenate([self.y, np.zeros(d+1)])
        sol = np.linalg.solve(A, b)
        self.lmbda = sol[:N]
        self.beta = sol[N:]

    def predict(self, X_new):
        X_new = np.atleast_2d(X_new)
        y_pred = np.zeros(len(X_new))
        for k, xk in enumerate(X_new):
            rbf_sum = 0
            for i in range(len(self.X)):
                r = np.linalg.norm(xk - self.X[i])
                rbf_sum += self.lmbda[i] * self.phi(r)
            trend = np.dot(self.beta[:-1], xk) + self.beta[-1]
            y_pred[k] = rbf_sum + trend
        return y_pred

# -------------------------
# Data Storage
# -------------------------
class DataStore:
    def __init__(self, dim, m_init=5, bounds=(-2, 2)):
        self.dim = dim
        self.bounds = bounds
        self.m = m_init
        self.S = np.random.uniform(bounds[0], bounds[1], size=(m_init, dim))
        self.Y = np.array([objective(s) for s in self.S])
        self.rbf = RBF_Surrogate()
        self.update_rbf()

    def update_rbf(self):
        self.rbf.fit(self.S, self.Y)

    def add_point(self, x):
        y = objective(x)
        self.S = np.vstack([self.S, x])
        self.Y = np.append(self.Y, y)
        self.m += 1
        self.update_rbf()

# -------------------------
# Candidate Selection (RBF Score + Distance Score)
# -------------------------
def RBF_score(candidates, data, omega=0.5):
    s = data.rbf.predict(candidates)
    dist = np.array([np.min(np.linalg.norm(data.S - c, axis=1)) for c in candidates])
    # 0~1 scaling
    s_scaled = (s - s.min()) / (s.max() - s.min() + 1e-8)
    d_scaled = (dist.max() - dist) / (dist.max() - dist.min() + 1e-8)
    score = omega*d_scaled + (1-omega)*s_scaled
    return score

# -------------------------
# Optimization Loop
# -------------------------
def run_rbf(dim=2, n_iter=10):
    np.random.seed(0)
    data = DataStore(dim=dim)
    for it in range(n_iter):
        best_idx = np.argmin(data.Y)
        best_x = data.S[best_idx]
        # generate candidates
        local_candidates = best_x + np.random.uniform(-0.5, 0.5, size=(20, dim))
        global_candidates = np.random.uniform(data.bounds[0], data.bounds[1], size=(20, dim))
        candidates = np.vstack([local_candidates, global_candidates])
        # calculate scores
        score = RBF_score(candidates, data, omega=0.5)
        # select next point
        next_x = candidates[np.argmin(score)]
        data.add_point(next_x)
        print(f"Iter {it}: best_y={data.Y.min():.4f}, next_x={next_x}")
    return data

# execute
if __name__ == "__main__":
    result = run_rbf(dim=2, n_iter=10)

Gaussian Process (GP) + Expected Improvement (EI)

Gaussian Process (GP) is another popular surrogate model. It uses the mean and covariance structure to model the function \ell(\theta) probabilistically.

m_{GP}(\theta) = \mu + Z(\theta)
  • \mu: mean of the stochastic process (constant)
  • Z(\theta): the GP's randomness, Z(\theta) \sim \mathcal{N}(0, \sigma^2)
  • Z(\theta) values at different locations are correlated based on distance
Note
  • Mean m_{GP}(\theta^{new}): predicted value of \ell(\theta^{new}) based on the observed data (the most likely value, i.e., the expectation).
  • Variance s^2(\theta^{new}): indicates how uncertain the prediction is (illustrated by the sketch below).
    • Near observed data points, the variance is very small (almost 0), meaning the prediction is very certain.
    • In unexplored regions far from observed data points, the variance is large, meaning the prediction is very uncertain.
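
The toy 1-D sketch below illustrates this behavior with scikit-learn's GaussianProcessRegressor (the same surrogate used in the code example at the end of this section); the data points are made up for illustration:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A few observed (theta, loss) pairs in 1-D.
X = np.array([[-1.5], [-0.5], [0.0], [1.0]])
y = (X**2).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(), random_state=0).fit(X, y)

# Predictive std is ~0 at an observed point and much larger far from the data.
_, std_near = gpr.predict(np.array([[0.0]]), return_std=True)
_, std_far = gpr.predict(np.array([[5.0]]), return_std=True)
print(std_near, std_far)  # std_near is near 0, std_far is noticeably larger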

Correlation Structure

Correlation is defined using a kernel function, based on the assumption that “close inputs will have similar outputs”:

\text{Corr}(Z(\theta_k), Z(\theta_l)) = \exp\left(-\sum_{i=1}^d \gamma_i \lvert \theta_k^{(i)} - \theta_l^{(i)} \rvert^{q_i}\right)
  • d: dimension of the hyperparameters
  • \gamma_i: scaling parameter for each dimension (estimated via maximum likelihood estimation)
  • q_i: smoothness parameter (determines the shape of the kernel, usually 1 or 2); a minimal code sketch of this kernel follows.
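
A minimal sketch of this correlation function, assuming placeholder values for \gamma and q (in practice \gamma_i is fit by maximum likelihood); the helper names correlation and correlation_matrix are illustrative:

import numpy as np

def correlation(theta_k, theta_l, gamma, q):
    # Corr(Z(theta_k), Z(theta_l)) = exp(-sum_i gamma_i * |theta_k_i - theta_l_i|**q_i)
    diff = np.abs(np.asarray(theta_k) - np.asarray(theta_l))
    return np.exp(-np.sum(gamma * diff**q))

def correlation_matrix(thetas, gamma, q):
    # n x n matrix R with R_kl = Corr(Z(theta_k), Z(theta_l))
    n = len(thetas)
    return np.array([[correlation(thetas[k], thetas[l], gamma, q) for l in range(n)]
                     for k in range(n)])

# Example with placeholder parameters for d = 2 hyperparameters.
thetas = np.array([[0.0, 0.0], [1.0, 0.5], [-0.5, 1.5]])
R = correlation_matrix(thetas, gamma=np.array([1.0, 1.0]), q=np.array([2.0, 2.0]))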

Given data \{(\theta_i, \ell_i)\}_{i=1}^n, the process mean and variance are estimated as follows:

\hat{\mu} = \frac{1^\top R^{-1} \ell}{1^\top R^{-1} 1}, \quad \hat{\sigma}^2 = \frac{(\ell - 1 \hat{\mu})^\top R^{-1} (\ell - 1 \hat{\mu})}{n}
  • R: n \times n correlation matrix with R_{kl} = \text{Corr}(Z(\theta_k), Z(\theta_l))
  • \ell: upper-level objective vector (the actual observed values)

Predictions at a new point \theta^{new}

The prediction is a weighted average of the existing observations.

m_{GP}(\theta^{new}) = \hat{\mu} + r^\top R^{-1} (\ell - 1\hat{\mu}), \quad r = \begin{bmatrix} \text{Corr}(Z(\theta^{new}), Z(\theta_1)) \\ \vdots \\ \text{Corr}(Z(\theta^{new}), Z(\theta_n)) \end{bmatrix}
  • r^\top R^{-1}: the weight part
  • r: correlation vector between \theta^{new} and the existing points
  • R: correlation matrix among the existing points

Hence, observations \ell_i at points closer to (more correlated with) \theta^{new} receive higher weight in the prediction (a code sketch covering both the prediction and the uncertainty formulas appears below).

Uncertainty estimation at a new point

s^2(\theta^{new}) = \hat{\sigma}^2 \left( 1 - r^\top R^{-1} r + \frac{(1 - 1^\top R^{-1} r)^2}{1^\top R^{-1} 1} \right)
  • r^\top R^{-1} r: measures how well \theta^{new} is explained by the existing points.
    • If \theta^{new} is very close to existing points (a well-explored area), this value approaches 1, making the overall variance s^2(\theta^{new}) close to 0 (low uncertainty).
    • Conversely, if \theta^{new} is far from the existing points (an unexplored area), this value approaches 0, leading to higher variance (high uncertainty).
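
A minimal sketch that puts the \hat{\mu}, \hat{\sigma}^2, prediction, and uncertainty formulas into code, reusing the hypothetical correlation / correlation_matrix helpers sketched above (inputs are placeholders, and a small jitter or pseudo-inverse may be needed if R is ill-conditioned):

import numpy as np

def gp_predict(theta_new, thetas, ell, gamma, q):
    # Closed-form GP prediction m_GP(theta_new) and variance s^2(theta_new).
    n = len(thetas)
    R = correlation_matrix(thetas, gamma, q)            # correlations among observed points
    r = np.array([correlation(theta_new, t, gamma, q)   # correlations to the new point
                  for t in thetas])
    one = np.ones(n)
    R_inv = np.linalg.inv(R)

    mu_hat = (one @ R_inv @ ell) / (one @ R_inv @ one)
    sigma2_hat = ((ell - mu_hat) @ R_inv @ (ell - mu_hat)) / n

    m_gp = mu_hat + r @ R_inv @ (ell - mu_hat)
    s2 = sigma2_hat * (1.0 - r @ R_inv @ r
                       + (1.0 - one @ R_inv @ r)**2 / (one @ R_inv @ one))
    return m_gp, s2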
Summary (Final optimization loop)
  1. Compute \ell(\theta) at a few initial points.
  2. Train the GP model to estimate m_{GP}(\theta) and s(\theta).
  3. Compute the Expected Improvement E[I] for candidate points \theta.
  4. Select \theta^{new} = \arg\max_\theta E[I].
  5. Evaluate it in practice and add it to the dataset → go back to step 2.

Expected Improvement (EI)

Find the optimal \theta that maximizes E[I] using genetic algorithms, etc.

To select the next evaluation point, we consider the expected improvement over the current best.

  1. Current best objective function value:

    \ell^{best} = \ell(\theta^{best})
  2. Improvement at a candidate point \theta:

    I = \max(\ell^{best} - L,\, 0), \quad L \sim \mathcal{N}(m_{GP}(\theta), s^2(\theta))
    • L: random variable reflecting the prediction uncertainty
      • a value following a normal distribution with mean m_{GP}(\theta) and variance s^2(\theta)
  3. Expected Improvement:

    E[I] = s(\theta) \left( Z \, \Phi(Z) + \phi(Z) \right) \quad \text{where} \quad Z = \frac{\ell^{best} - m_{GP}(\theta)}{s(\theta)}
    • \Phi: standard normal cumulative distribution function (CDF)
    • \phi: standard normal probability density function (PDF)

Code Example

import numpy as np
import array
from deap import base, creator, tools, algorithms
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from scipy.stats import norm

# -------------------------
# Black-box Objective Function
# -------------------------
def objective(x):
    # quadratic function with noise
    return np.sum(np.array(x) ** 2) + np.random.randn() * 0.1

# -------------------------
# Data Storage Class
# -------------------------
class DataStore:
    def __init__(self, dim, m_init=5, bounds=(-2, 2)):
        self.dim = dim
        self.bounds = bounds
        self.m = m_init  # initial number of samples
        self.S = np.random.uniform(bounds[0], bounds[1], size=(m_init, dim))
        self.Y = np.array([objective(s) for s in self.S])
        # GP surrogate
        self.gpr = GaussianProcessRegressor(kernel=RBF(), random_state=0)
        self.update_gp()

    def update_gp(self):
        # fit with 1-D targets so predict() returns 1-D mean/std
        self.gpr.fit(self.S, self.Y)

    def add_point(self, x):
        y = objective(x)
        self.S = np.vstack([self.S, x])
        self.Y = np.append(self.Y, y)
        self.m += 1
        self.update_gp()

# -------------------------
# EI function
# -------------------------
def Expected_improvement(x, data):
    x_to_predict = np.array(x).reshape(1, -1)
    mu, sigma = data.gpr.predict(x_to_predict, return_std=True)
    greater_is_better = False
    if greater_is_better:
        loss_optimum = np.max(data.Y[:data.m])
    else:
        loss_optimum = np.min(data.Y[:data.m])
    scaling_factor = (-1) ** (not greater_is_better)
    with np.errstate(divide='ignore', invalid='ignore'):
        Z = scaling_factor * (mu - loss_optimum) / sigma
        expected_improvement = scaling_factor * (mu - loss_optimum) * norm.cdf(Z) + sigma * norm.pdf(Z)
        expected_improvement[sigma == 0.0] = 0.0
    # DEAP expects a fitness tuple; EI is negated because FitnessMin minimizes
    return (-expected_improvement[0],)

# -------------------------
# DEAP Setup
# -------------------------
def run_gp_ei(dim=2, n_iter=10):
    data = DataStore(dim=dim)
    # Fitness definition
    creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
    creator.create("Individual", array.array, typecode='d', fitness=creator.FitnessMin)
    toolbox = base.Toolbox()
    # For each dimension, create attribute generator
    for i in range(dim):
        INT_MIN, INT_MAX = data.bounds
        toolbox.register(f"attr_float_{i}", np.random.uniform, INT_MIN, INT_MAX)
    # Generate individuals (pass a tuple, not a generator, so it can be reused for every individual)
    toolbox.register("individual", tools.initCycle, creator.Individual,
                     tuple(getattr(toolbox, f"attr_float_{i}") for i in range(dim)), n=1)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    # Evaluate = EI
    toolbox.register("evaluate", Expected_improvement, data=data)
    toolbox.register("mate", tools.cxTwoPoint)
    # mutUniformInt snaps mutated genes to integer values within the bounds
    toolbox.register("mutate", tools.mutUniformInt,
                     low=[data.bounds[0]]*dim, up=[data.bounds[1]]*dim, indpb=0.2)
    toolbox.register("select", tools.selTournament, tournsize=3)

    # -------------------------
    # Optimization Loop
    # -------------------------
    for it in range(n_iter):
        pop = toolbox.population(n=20)
        hof = tools.HallOfFame(1)
        algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.3,
                            ngen=15, halloffame=hof, verbose=False)
        next_x = np.array(hof[0])
        data.add_point(next_x)
        print(f"Iter {it}: best_y={data.Y.min():.4f}, next_x={next_x}")
    return data

# execute
if __name__ == "__main__":
    result = run_gp_ei(dim=2, n_iter=10)