The POD-Hierarchical-Kriging model is built on the Kriging model. The Kriging model treats the unknown function as a realization of a stationary stochastic process composed of a regression part and a correlation part; the correlation part can be regarded as a realization of a Gaussian random process, which gives the model a good fit for highly nonlinear problems. The Kriging model is constructed dynamically from the sample information at known points and has both global and local statistical characteristics. The POD algorithm [12–14] is used to reduce the dimension of the original samples and reconstruct them, which lowers the computational cost, extracts the core information, and filters out marginal information. The data processed by the POD algorithm are then used as the sample input to the Hierarchical-Kriging model. Together, these steps constitute the POD-Hierarchical-Kriging algorithm.
3.1.1 POD method
The core of the POD method is to find a set of "optimal" orthogonal bases \(\{{{U}}_{1},{{U}}_{2},{{U}}_{3},\cdots ,{{U}}_{n}\}\) of the n-dimensional field space \(\{{\mathbf{x}}_{\text{n}}\in {\Omega }\}\). Assuming that \({\mathbf{x}}_{\text{n}}\) can be expressed as a linear combination of orthogonal basis vectors, it can be approximated by this set of "optimal" orthogonal bases:
$${{x}}_{n}=\sum _{i=1}^{n}{{\alpha }_{i}{U}}_{i}$$
1
Where \({{U}}_{i}\) is the ith eigenvector (basis vector) of \({{x}}_{n}\) and \({\alpha }_{i}\) is the corresponding coefficient of \({{U}}_{i}\). If the dimension of the orthogonal space spanned by the "optimal" orthogonal bases is smaller than that of the original space, the above approximation amounts to a dimension reduction of the sample space followed by an approximate reconstruction of the sample:
$${{x}}_{n}=\sum _{i=1}^{r}{\alpha }_{i}{{U}}_{i} (r<n)$$
2
Therefore, the sample data are first centralized, giving the new centralized sample data set:
$$\tilde{{X}}=\left\{{\tilde{{x}}}_{1},{\tilde{{x}}}_{2},{\tilde{{x}}}_{3},\cdots ,{\tilde{{x}}}_{n}\right\}=\left\{{{x}}_{1}-{\bar{{x}}}_{1},{{x}}_{2}-{\bar{{x}}}_{2},{{x}}_{3}-{\bar{{x}}}_{3},\cdots ,{{x}}_{n}-{\bar{{x}}}_{n}\right\}$$
3
This is followed by singular value decomposition (SVD):
$$\tilde{{X}}={U}{\varSigma }{{V}}^{T}$$
4
The value of \(r\) should be less than n so that the scale of the feature vector space is reduced, while the truncated space still approximates the original data space as accurately as possible. The selection of \(r\) can be determined as follows:
$$I=\sum _{i=1}^{r}{\left({\lambda }_{i}\right)}^{2}/\sum _{i=1}^{n}{\left({\lambda }_{i}\right)}^{2}$$
5
\(I\) is the energy information capacity (energy ratio). The closer \(I\) is to 1, the more completely the feature vector space contains the original information and the closer it is to the original function space; \(I>95\%\) is usually considered sufficient. Here \({\lambda }_{i}\) are the eigenvalues of the covariance matrix of \(\tilde{{X}}\), arranged from largest to smallest so that \({\lambda }_{1}>{\lambda }_{2}>\cdots >{\lambda }_{n}\). The final POD model is then:
$${{X}}_{RE}=\bar{{X}}+\sum _{i=1}^{r}{{\alpha }_{i}{U}}_{i}$$
6
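The POD step above can be summarized in a short numerical sketch. The listing below is a minimal illustration (not the authors' implementation), assuming the snapshots are stored as columns of a matrix X and that the mean field over snapshots is used for the centering of Eq. (3); the name pod_reduce and the 95% energy default are placeholders.

```python
import numpy as np

def pod_reduce(X, energy=0.95):
    """Minimal POD sketch: centre the snapshot matrix (Eq. 3), take its SVD (Eq. 4),
    keep the leading r modes reaching the requested energy I (Eq. 5), and
    reconstruct the samples from the truncated basis (Eq. 6)."""
    x_mean = X.mean(axis=1, keepdims=True)                  # mean field over snapshots
    X_tilde = X - x_mean                                    # centred data, Eq. (3)
    U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)  # Eq. (4)
    I = np.cumsum(s**2) / np.sum(s**2)                      # cumulative energy ratio, Eq. (5)
    r = int(np.searchsorted(I, energy)) + 1                 # smallest r reaching the threshold
    U_r = U[:, :r]                                          # truncated "optimal" basis
    alpha = U_r.T @ X_tilde                                 # modal coefficients alpha_i
    X_re = x_mean + U_r @ alpha                             # reconstruction, Eq. (6)
    return U_r, alpha, X_re
```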
3.1.2 Basic Kriging model
The Kriging model is an unbiased estimation model. It has local estimation characteristics and fits highly nonlinear problems well. The derivation below follows the theory of the Kriging method [16, 17]. In the regression analysis, the model assumes that the true relationship between the system response and the independent variables can be expressed in the following form:
$${y}\left({x}\right)={f}({x}{)}^{T}{\beta }+{z}({x})$$
7
Where \({x}\) is the input parameter; \({f}\left({x}\right)={\left[{f}_{1}\left({x}\right),{f}_{2}\left({x}\right),\cdots ,{f}_{p}\left({x}\right)\right]}^{\text{T}}\); \({\beta }\) is the vector of regression constants, \({\beta }={\left[{\beta }_{1},{\beta }_{2},\dots ,{\beta }_{p}\right]}^{T}\), where \(p\) is the number of polynomial terms and depends on the form of the polynomial; \(y\left({x}\right)\) is the predicted value of combustion efficiency \(\eta\)(%) or total pressure loss \(\varDelta p\)(%), \({y}\left({x}\right)=[\eta ,\varDelta p]\); \({z}\left({x}\right)\) is a random process with the following statistical properties:
$$\left\{\begin{array}{c}E\left[{z}\left({x}\right)\right]=0\\ \text{Var}\left[{z}\left({x}\right)\right]={\sigma }^{2}\\ \text{Cov}\left[{z}\left({{x}}^{i}\right),{z}\left({{x}}^{j}\right)\right]={\sigma }^{2}{\rho }_{ij}\end{array}\right.$$
8
Where:
$${R}=\left[\begin{array}{cccc}{\rho }_{11}& {\rho }_{12}& \cdots & {\rho }_{1m}\\ {\rho }_{21}& {\rho }_{22}& \cdots & {\rho }_{2m}\\ \vdots & \vdots & \ddots & \vdots \\ {\rho }_{m1}& {\rho }_{m2}& \cdots & {\rho }_{mm}\end{array}\right]$$
9
Where \({R}\) is the correlation matrix; \({\rho }_{ij}\) is the value of the correlation function, representing the correlation between the ith and jth sample points; m is the sample size. The correlation function is chosen a priori, and the Gaussian function is commonly used. In this work the Gaussian function is adopted as the correlation function, in the following form:
$${\rho }_{ij}=\exp\left(-\sum _{h=1}^{n}{\theta }_{h}{\left|{x}_{h}^{i}-{x}_{h}^{j}\right|}^{2}\right),\quad i,j=1,2,\dots ,m$$
10
Where the unknown parameter vector is \({\theta }=\left[{\theta }_{1},{\theta }_{2},\dots ,{\theta }_{n}\right]\), whose dimension n equals the dimension of a sample point. \({x}_{h}^{i}\) is the hth variable of the ith sample and can be any of the input parameters.
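As a hedged illustration of Eqs. (9)–(10), the correlation matrix for a set of samples can be assembled as below; the function name gauss_correlation_matrix and the array layout (samples as rows) are assumptions for this sketch.

```python
import numpy as np

def gauss_correlation_matrix(X, theta):
    """Gaussian correlation matrix of Eqs. (9)-(10).
    X: (m, n) array of m sample points with n variables; theta: (n,) positive weights."""
    diff = X[:, None, :] - X[None, :, :]              # pairwise differences, shape (m, m, n)
    return np.exp(-np.sum(theta * diff**2, axis=-1))  # rho_ij = exp(-sum_h theta_h |x_h^i - x_h^j|^2)
```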
A prediction model, formula (11), is introduced to approximate formula (7):
$$\stackrel{\prime }{{y}}\left({x}\right)={{c}}^{T}\left({x}\right){Y}$$
11
Where \({Y}={\left[{y}_{1},{y}_{2},\dots ,{y}_{m}\right]}^{T}\) is the known sample response vector; \({c}={\left[{c}_{1}\left({x}\right),{c}_{2}\left({x}\right),\dots ,{c}_{m}\left({x}\right)\right]}^{T}\), where each \({c}_{i}\) depends on the point \({x}\) to be predicted: different prediction points \({x}\) yield different \({c}_{i}\).
The prediction error of the model is:
$$\stackrel{\prime }{{y}}\left({x}\right)-{y}\left({x}\right)={{c}}^{T}{z}-{z}+{\left({{F}}^{T}{c}-{f}\left({x}\right)\right)}^{T}{\beta }$$
12
Where \({z}={\left[{z}\left({{x}}_{1}\right),{z}\left({{x}}_{2}\right),\cdots ,{z}\left({{x}}_{m}\right)\right]}^{\text{T}}\), \({F}={\left[{{f}}^{T}\left({{x}}_{1}\right),{{f}}^{T}\left({{x}}_{2}\right),\dots ,{{f}}^{T}\left({{x}}_{m}\right)\right]}^{T}\).
Considering that \(\stackrel{\prime }{{y}}\left({x}\right)\) is the best linear unbiased estimate of \({y}\left({x}\right)\), Eq. (13) follows from the unbiasedness condition and Eq. (14) from minimizing the mean square error.
$${{F}}^{T}{c}={f}\left({x}\right)$$
13
$$MSE={{\sigma }}^{2}\left(1+{{c}}^{T}{R}{c}-2{{c}}^{T}{r}\right)$$
14
Where \({r}\) is the correlation vector formed by the correlation function values between the point \({x}\) to be predicted and the original sample points.
By constructing a Lagrange function, the response model expression can be obtained:
$$\stackrel{\prime }{{y}}\left({x}\right)={f}({x}{)}^{T}{{\beta }}^{\ast }+{r}({x}{)}^{T}{{\gamma }}^{\ast }$$
15
Where \({{\beta }}^{\ast }={\left({{F}}^{T}{{R}}^{-1}{F}\right)}^{-1}{{F}}^{T}{{R}}^{-1}{Y}\) and \({{\gamma }}^{\ast }={{R}}^{-1}\left({Y}-{F}{{\beta }}^{\ast }\right)\). \({F}\) and \({Y}\) can be obtained from the given samples, and R contains only the parameter \({\theta }\), so \({{\beta }}^{\ast }\), \({{\gamma }}^{\ast }\) and \({r}\left({x}\right)\) can be computed once the unknown parameter \({\theta }\) is found.
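As a minimal numerical sketch (not from the original paper), assuming the regression matrix F, correlation matrix R, and response vector Y are already assembled as NumPy arrays, the closed-form coefficients of Eq. (15) and the resulting predictor can be written as:

```python
import numpy as np

def kriging_coefficients(F, R, Y):
    """Closed-form beta* and gamma* of Eq. (15) for a correlation matrix R built with a fixed theta.
    F: (m, p) regression matrix, R: (m, m) correlation matrix, Y: (m,) responses."""
    R_inv_F = np.linalg.solve(R, F)                        # R^{-1} F
    R_inv_Y = np.linalg.solve(R, Y)                        # R^{-1} Y
    beta = np.linalg.solve(F.T @ R_inv_F, F.T @ R_inv_Y)   # beta* = (F^T R^{-1} F)^{-1} F^T R^{-1} Y
    gamma = np.linalg.solve(R, Y - F @ beta)               # gamma* = R^{-1} (Y - F beta*)
    return beta, gamma

def kriging_predict(f_x, r_x, beta, gamma):
    """BLUP prediction of Eq. (15): y'(x) = f(x)^T beta* + r(x)^T gamma*."""
    return f_x @ beta + r_x @ gamma
```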
Since the sample points are not mutually independent, the joint probability density is:
$$L\left({\beta },{{\sigma }}^{2},{\theta }\right)=\frac{1}{{\left(2\pi \right)}^{m/2}{\left({{\sigma }}^{2}\right)}^{m/2}{\left|\mathbf{R}\right|}^{1/2}}\exp\left[-\frac{1}{2{{\sigma }}^{2}}{\left({Y}-{F}{\beta }\right)}^{T}{\mathbf{R}}^{-1}\left({Y}-{F}{\beta }\right)\right]$$
16
Taking the logarithm:
$$\ln L\approx -\frac{1}{2}\left(m\,\ln {\stackrel{\prime }{{\sigma }}}^{2}+\ln \left|\mathbf{R}\left({\theta }\right)\right|\right)$$
17
Where \(\ln L\) contains only the parameter \({\theta }\), so the parameter training is transformed into the solution of a nonlinear optimization problem, which in turn yields R. Then \({{\beta }}^{\ast }\), \({{\gamma }}^{\ast }\) and \({r}\left({x}\right)\) are similarly available, and the final model is determined.
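A hedged sketch of this parameter training is given below: the concentrated negative log-likelihood of Eq. (17) is minimized over θ with a quasi-Newton optimizer from SciPy. It reuses the hypothetical gauss_correlation_matrix and kriging_coefficients sketches above; the small nugget and the bounds on θ are numerical-stability assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, X, F, Y):
    """Negative of Eq. (17) with beta and sigma^2 concentrated out, so only theta remains."""
    m = X.shape[0]
    R = gauss_correlation_matrix(X, theta) + 1e-10 * np.eye(m)  # small nugget (assumed)
    beta, _ = kriging_coefficients(F, R, Y)
    resid = Y - F @ beta
    sigma2 = resid @ np.linalg.solve(R, resid) / m               # MLE of sigma^2
    return 0.5 * (m * np.log(sigma2) + np.linalg.slogdet(R)[1])  # -ln L up to a constant

# hypothetical usage: search theta within assumed bounds
# theta0 = np.ones(X.shape[1])
# res = minimize(neg_log_likelihood, theta0, args=(X, F, Y),
#                method="L-BFGS-B", bounds=[(1e-3, 1e2)] * X.shape[1])
# theta_opt = res.x
```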
3.1.3 Hierarchical-Kriging model
The difference between Hierarchical-Kriging and basic Kriging is that Hierarchical-Kriging adopts a multi-layer basic Kriging structure [1, 15]. The output of the first layer serves as a global approximation reference for the second layer, the output of the second layer serves as a reference for the third layer, and so on. In practical applications, Hierarchical-Kriging has a great advantage in making full use of sample information [16, 17]. Basic Kriging is sensitive to noise in the training samples and requires highly reliable training samples; it easily overfits training samples of low reliability. Hierarchical-Kriging instead adopts a hierarchical strategy, training the first layer with low-confidence samples and the second layer with high-confidence samples, so as to fully exploit the potential of the training samples. Take a two-layer Hierarchical-Kriging as an example:
To construct the two-layer Kriging, the first-layer Kriging is built with the low-confidence training sample points:
$${y}_{1\text{f}}\left({x}\right)={\beta }_{1\text{f}}+{z}_{1\text{f}}\left({x}\right)$$
18
The polynomial in the first-layer model is a constant: because the reliability of the training samples is low, a higher-order polynomial would easily fit the noise and degrade the prediction accuracy of the final model. The first-layer predictor is:
$${\stackrel{\prime }{y}}_{1\text{f}}\left({x}\right)={\beta }_{1\text{f}}+{{r}}_{1\text{f}}^{T}\left({x}\right){\mathbf{R}}_{1\text{f}}^{-1}\left({\mathbf{y}}_{1\text{f}}-{\beta }_{1\text{f}}1\right)$$
19
Where \({\beta }_{1\text{f}}={\left({1}^{T}{\mathbf{R}}_{1\text{f}}^{-1}1\right)}^{-1}{1}^{T}{\mathbf{R}}_{1\text{f}}^{-1}{\mathbf{y}}_{1\text{f}}\), \({\mathbf{R}}_{1\text{f}}\in {\mathbb{R}}^{{n}_{1\text{f}}\times {n}_{1\text{f}}}\), \(1\in {\mathbb{R}}^{{n}_{1\text{f}}}\), \({{r}}_{1\text{f}}\in {\mathbb{R}}^{{n}_{1\text{f}}}\). This is the result obtained by training the first layer on the low-confidence sample points.
The second-layer structure is constructed on the basis of the first layer, and the high-reliability sample points are used to train the second layer:
$${y}_{2\text{f}}\left({x}\right)=\beta {\stackrel{\prime }{y}}_{1\text{f}}\left({x}\right)+z\left({x}\right)$$
20
Following the same derivation as for the first layer, it can be shown that:
$${\stackrel{\prime }{y}}_{2\text{f}}\left({x}\right)=\beta {\stackrel{\prime }{y}}_{1\text{f}}\left({x}\right)+{{r}}_{2\text{f}}^{T}\left({x}\right){\mathbf{R}}_{2\text{f}}^{-1}\left({\mathbf{y}}_{2\text{f}}-\beta {F}\right)$$
21
Where \({F}={\left[{\stackrel{\prime }{y}}_{1\text{f}}\left({{x}}_{1}\right),\dots ,{\stackrel{\prime }{y}}_{1\text{f}}\left({{x}}_{m}\right)\right]}^{T}\); \({\mathbf{R}}_{2\text{f}}\) and \({{r}}_{2\text{f}}\) are the correlation matrix and correlation vector built from the high-confidence sample points (in the same form as in basic Kriging); \({\mathbf{y}}_{2\text{f}}\) is the vector of responses at those sample points; and \(\beta ={\left({{F}}^{T}{\mathbf{R}}_{2\text{f}}^{-1}{F}\right)}^{-1}{{F}}^{T}{\mathbf{R}}_{2\text{f}}^{-1}{\mathbf{y}}_{2\text{f}}\).
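To make the two-layer procedure of Eqs. (18)–(21) concrete, the following is a minimal sketch (not the authors' code) that reuses the hypothetical gauss_correlation_matrix and kriging_coefficients sketches above. The function names, the nugget, and the θ bounds are assumptions for illustration; X_lf/y_lf denote the low-confidence samples and X_hf/y_hf the high-confidence ones.

```python
import numpy as np
from scipy.optimize import minimize

def _fit_layer(X, y, trend):
    """Fit one Kriging layer whose single regression column is `trend`
    (a constant column for Eq. (19), the first-layer prediction for Eq. (21)),
    maximising the concentrated likelihood of Eq. (17) over theta."""
    m, n = X.shape
    F = trend.reshape(-1, 1)
    def nll(theta):
        R = gauss_correlation_matrix(X, theta) + 1e-10 * np.eye(m)  # nugget assumed
        beta, _ = kriging_coefficients(F, R, y)
        resid = y - F @ beta
        sigma2 = resid @ np.linalg.solve(R, resid) / m
        return 0.5 * (m * np.log(sigma2) + np.linalg.slogdet(R)[1])
    theta = minimize(nll, np.ones(n), method="L-BFGS-B",
                     bounds=[(1e-3, 1e2)] * n).x
    R = gauss_correlation_matrix(X, theta) + 1e-10 * np.eye(m)
    beta, gamma = kriging_coefficients(F, R, y)
    return theta, beta, gamma

def _cross_corr(Xq, X, theta):
    """Correlation vectors r(x) between query points and training points (Eq. 10)."""
    diff = Xq[:, None, :] - X[None, :, :]
    return np.exp(-np.sum(theta * diff**2, axis=-1))

def fit_hierarchical_kriging(X_lf, y_lf, X_hf, y_hf):
    """Two-layer sketch: a constant-trend layer on low-confidence samples (Eqs. 18-19)
    whose prediction becomes the trend of the layer trained on the high-confidence
    samples (Eqs. 20-21)."""
    th1, b1, g1 = _fit_layer(X_lf, y_lf, np.ones(len(y_lf)))
    predict_lf = lambda Xq: b1[0] + _cross_corr(Xq, X_lf, th1) @ g1                   # Eq. (19)
    th2, b2, g2 = _fit_layer(X_hf, y_hf, predict_lf(X_hf))
    predict_hf = lambda Xq: b2[0] * predict_lf(Xq) + _cross_corr(Xq, X_hf, th2) @ g2  # Eq. (21)
    return predict_hf
```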