Kernel Density Estimation — zfit 0.20.0 documentation (2024)

Authors:: Marc Steiner; Jonas Eschle

An introduction to Kernel Density Estimations, explanations to all methods implemented in zfit and a throughoutcomparison of the performance can be found inPerformance of univariate kernel density estimation methods in TensorFlowby Marc Steiner from which many parts here are taken.

Estimating the density of a population can be done in a non-parametric manner. The simplest form is to create adensity histogram, which is notably not so precise.

A more sophisticated non-parametric method is the kernel density estimation (KDE), which can be looked at as a sort ofgeneralized histogram. In a kernel density estimation each data point is substituted with a so called kernel functionthat specifies how much it influences its neighboring regions. This kernel functions can then be summed up to get anestimate of the probability density distribution, quite similarly as summing up data points inside bins.

However, sincethe kernel functions are centered on the data points directly, KDE circumvents the problem of arbitrary bin positioning.Since KDE still depends on kernel bandwidth (a measure of the spread of the kernel function) instead of bin width,one might argue that this is not a major improvement. Upon closer inspection, one finds that the underlying PDFdoes depend less strongly on the kernel bandwidth than histograms do on bin width and it is much easier to specifyrules for an approximately optimal kernel bandwidth than it is to do so for bin width.

Given a set of \(n\) sample points \(x_k\) (\(k = 1,\cdots,n\)), an exact kernel densityestimation \(\widehat{f}_h(x)\) can be calculated as

(1)#\[\widehat{f}_h(x) = \frac{1}{nh} \sum_{k=1}^n K\Big(\frac{x-x_k}{h}\Big)\]

where \(K(x)\) is called the kernel function, \(h\) is the bandwidth of the kernel and \(x\) is thevalue for which the estimate is calculated. The kernel function defines the shape and size of influence of a singledata point over the estimation, whereas the bandwidth defines the range of influence. Most typically a simpleGaussian distribution (\(K(x) :=\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}\)) is used as kernel function.The larger the bandwidth parameter :math:h the larger is the range of influence ofa single data point on the estimated distribution.

Exact KDEs#

SummaryAn exact KDE calculates the true sum of the kernels around the input points without approximating thedataset, leaving possibilities for the choice of a bandwidth.The computational complexity – especially the peak memory consumption – is proportional to the sample size.Therefore, this kind of PDF is better used for smaller datasets and a Grid KDE is preferred for larger data.

Explanation

Exact KDEs implement exactly Eq. (1) without any approximation and therefore no loss of information.

The computational complexity of the exact KDE above is given by \(\mathcal{O}(nm)\) where \(n\)is the number of sample points to estimate from and \(m\) is the number of evaluation points(the points where you want to calculate the estimate).

Due to this cost, the method is most often used for smaller datasamples.

There exist several approximative methods to decrease this complexity and therefore decrease the runtime as well.

Implementation

In zfit, the exact KDE KDE1DimExact takes an arbitrary kernel, which is aTensorFlow-Probability distribution.

import osos.environ["ZFIT_DISABLE_TF_WARNINGS"] = "1"import zfitfrom zfit import zimport numpy as npimport matplotlib.pyplot as plt

obs = zfit.Space('x', (-5, 5))gauss = zfit.pdf.Gauss(obs=obs, mu=0, sigma=2)data = gauss.sample(1000)kde = zfit.pdf.KDE1DimExact(data, # obs, bandwidth, kernel, # padding, weights, name )x = np.linspace(-5, 5, 200)plt.plot(x, kde.pdf(x), label='Exact KDE')plt.plot(x, gauss.pdf(x), label='True PDF')plt.legend()

<matplotlib.legend.Legend at 0x7f2629113830>

Grid KDEs#

Summary To reduce the computational complexity, the input data can be finely binned into a histogram, where eachbin serves as the sample point to a kernel. This is well suited for larger datasets. Like the exact KDE, thisleaves the possibility to choose a bandwidth.

Explanation

The most straightforward way to decrease the computational complexity is by limiting the number of sample points.This can be done by a binning routine, where the values at a smaller number of regular grid points are estimatedfrom the original larger number of sample points.Given a set of sample points \(X = \{x_0, x_1, ..., x_k, ..., x_{n-1}, x_n\}\) with weights :math:w_k and a set ofequally spaced grid points \(G = \{g_0, g_1, ..., g_l, ..., g_{n-1}, g_N\}\) where \(N < n\)we can assign an estimate(or a count) \(c_l\) to each grid point \(g_l\) and use the newly found \(g_l\) to calculatethe kernel density estimation instead.

(2)#\[\widehat{f}_h(x) = \frac{1}{nh} \sum_{l=1}^N c_l \cdot K\Big(\frac{x-g_l}{h}\Big)\]

This lowers the computational complexity down to \(\mathcal{O}(N \cdot m)\).Depending on the number of grid points \(N\) there is tradeoff between accuracy and speed.However as we will see in the comparison chapter later as well, even for ten million sample points, a grid of size\(1024\) is enough to capture the true density with high accuracy. As described in the extensive overviewby Artur Gramacki[@gramacki2018fft], simple binning or linear binning can be used, although the last is oftenpreferred since it is more accurate and the difference in computational complexity is negligible.

Implementation

The implementation of Grid KDEs is similar to the exact KDEs, except that the former first bins the incomming data anduses this as a grid for the kernel. Therefore, it also takes parameters for the binning, such as the number of binsand the method.

data = gauss.sample(100_000)kde = zfit.pdf.KDE1DimGrid(data, # obs, bandwidth, kernel, num_grid_points, # binning_method, padding, weights, name )plt.plot(x, kde.pdf(x), label='Grid KDE')plt.plot(x, gauss.pdf(x), label='True PDF')plt.legend()

Bandwidth#

Summary Bandwidth denotes the width of the kernel and aims usually at reducing the integrated squared error.There are two main distinction, a global and local bandwidths. The formeris the same width for all kernels while the latter uses different bandwidth for each kernel and therefore can, ingeneral, better reflect the local density, often at a computational cost.

Explanation

The optimal bandwidth is often defined as the one that minimizes themean integrated squared error (\(MISE\)) between the kernel densityestimation \(\widehat{f}_{h,norm}(x)\) and the true probabilitydensity function \(f(x)\), where \(\mathbb{E}_f\) denotes theexpected value with respect to the sample which was used to calculatethe KDE.

(3)#\[MISE(h) = \mathbb{E}_f\int [\widehat{f}_{h,norm}(x) - f(x)]^2 dx\]

To find the optimal bandwidth it is useful to look at the second orderderivative \(f^{(2)}\) of the unknown distribution as it indicateshow many peaks the distribution has and how steep they are. For adistribution with many narrow peaks close together a smaller bandwidthleads to better result since the peaks do not get smeared together to asingle peak for instance.

An asymptotically optimal bandwidth\(h_{AMISE}\) which minimizes a first-order asymptotic approximationof the \(MISE\) is then given by

(4)#\[`h_{AMISE}(x) = \Big( \frac{1}{2N\sqrt{\pi} \| f^{(2)}(x)\|^2}\Big)^{\frac{1}{5}}`\]

where \(N\) is the number of sample points (or grid points ifbinning is used).

Implementation

zfit implements a few different bandwidth methods, of which not all are available for all KDEs.

Rule of thumb: Scott and Silvermann both proposed rule of thumb for the kernel bandwidth. These are approximate estimatesthat are a global parameter.
Adaptive methods: These methods calculate a local density parameter that is approximately \(/propto f( x ) ^ {-1/2}\),where \(f(x)\) is an initial estimate of the density. This can be calculated for example by usinga rule of thumb to obtain an initial estimate.The main disadvantage is a higher computational complexity; the initial overhead as well asfor the evaluation of the PDF. Most notably the memory consumption can be significantly higher.

FFT KDEs#

Summary By rewriting the KDE as a discrete convolution and using the FFT, the density can beapproximated interpolating between the discetized values.

Another technique to speed up the computation is rewriting the kerneldensity estimation as convolution operation between the kernel functionand the grid counts (bin counts) calculated by the binning routine givenabove.

By using the fact that a convolution is just a multiplication in Fourierspace and only evaluating the KDE at grid points one can reduce thecomputational complexity down to\(\mathcal{O}(\log{N} \cdot N)\)

Using Eq. (2) from above only evaluated at gridpoints gives us

(5)#\[\widehat{f}_h(g_j) = \frac{1}{nh} \sum_{l=1}^N c_l \cdot K\Big(\frac{g_j-g_l}{h}\Big) = \frac{1}{nh}\sum_{l=1}^N k_{j-l} \cdot c_l\]

where \(k_{j-l} = K(\frac{g_j-g_l}{h})\).

If we set \(c_l = 0\) for all \(l\) not in the set\(\{1, ..., N\}\) and notice that \(K(-x) = K(x)\) we can extendEq. (5) to a discrete convolution as follows

\[\widehat{f}_h(g_j) = \frac{1}{nh} \sum_{l=-N}^N k_{j-l} \cdot c_l = \vec{c} \ast \vec{k}\]

By using the convolution theorem we can fourier transform\(\vec{c}\) and \(\vec{k}\), multiply them and inverse fouriertransform them again to get the result of the discrete convolution.

Due to the limitation of evaluating only at the grid pointsthemselves, one needs to interpolate to get values for the estimateddistribution at points in between.

Implementation

This is implemented in zfit as KDE1DimFFT. Itsupports similar arguments such as the grid KDEs except that thebandwidth can’t be variable.

kde = zfit.pdf.KDE1DimFFT(data, # obs, bandwidth, kernel, num_grid_points, fft_method, # binning_method, padding, weights, name )plt.plot(x, kde.pdf(x), label='FFT KDE')plt.plot(x, gauss.pdf(x), label='True PDF')plt.legend()

<matplotlib.legend.Legend at 0x7f262f4f5730>

ISJ KDEs#

Summary A different take on KDEs isa new adaptive kernel density estimator based on lineardiffusion processes. It performs especially well on data that do not followa normal distribution. The method also includes an estimation for the optimalbandwidth.

The method is described completely in the paper ‘Kernel densityestimation by diffusion’ by .The general idea is briefly sketched below.

As explained in Bandwidth, the optimal bandwidth is oftendefined as the one that minimizes the(\(MISE\)) as defined in Eq. (3) respectively afirst-order asymptotic approximation \(h_{AMISE}\) as defined inEq. (4).

As Sheather and Jones showed, this second order derivative can beestimated, starting from an even higher order derivative\(\|f^{(l+2)}\|^2\) by using the fact that\(\|f^{(j)}\|^2 = (-1)^j \mathbb{E}_f[f^{(2j)}(X)], \text{ } j\geq 1\)

\[h_j=\left(\frac{1+1 / 2^{j+1 / 2}}{3} \frac{1 \times 3 \times 5 \times \cdots \times(2 j-1)}{N \sqrt{\pi / 2}\left\|f^{(j+1)}\right\|^{2}}\right)^{1 /(3+2 j)} = \gamma_j(h_{j+1})\]

where \(h_j\) is the optimal bandwidth for the \(j\)-thderivative of \(f\) and the function \(\gamma_j\) defines thedependency of \(h_j\) on \(h_{j+1}\)

Their proposed plug-in method works as follows:

Compute \(\|\widehat{f}^{(l+2)}\|^2\) by assuming that \(f\)is the normal pdf with mean and variance estimated from the sampledata
Using \(\|\widehat{f}^{(l+2)}\|^2\) compute \(h_{l+1}\)
Using \(h_{l+1}\) compute \(\|\widehat{f}^{(l+1)}\|^2\)
Repeat steps 2 and 3 to compute \(h^{l}\),\(\|\widehat{f}^{(l)}\|^2\), \(h^{l-1}\), \(\cdots\) andso on until \(\|\widehat{f}^{(2)}\|^2\) is calculated
Use \(\|\widehat{f}^{(2)}\|^2\) to compute \(h_{AMISE}\)

The weakest point of this procedure is the assumption that the truedistribution is a Gaussian density function in order to compute\(\|\widehat{f}^{(l+2)}\|^2\). This can lead to arbitrarily badestimates of \(h_{AMISE}\), when the true distribution is far frombeing normal.

Therefore Botev et al. took this idea further. Giventhe function \(\gamma^{[k]}\) such that

\[\gamma^{[k]}(h)=\underbrace{\gamma_{1}\left(\cdots \gamma_{k-1}\left(\gamma_{k}\right.\right.}_{k \text { times }}(h)) \cdots)\]

\(h_{AMISE}\) can be calculated as

\[h_{AMISE} = h_{1}=\gamma^{[1]}(h_{2})= \gamma^{[2]}(h_{3})=\cdots=\gamma^{[l]}(h_{l+1})\]

By setting \(h_{AMISE}=h_{l+1}\) and using fixed point iteration tosolve the equation

\[h_{AMISE} = \gamma^{[l]}(h_{AMISE})\]

the optimal bandwidth \(h_{AMISE}\) can be found directly.

This eliminates the need to assume normally distributed data for theinitial estimate and leads to improved accuracy, especially fordensity distributions that are far from normal.According to their paper increasing \(l\) beyond\(l=5\) does not increase the accuracy in any practically meaningfulway. The computation is especially efficient if \(\gamma^{[5]}\) iscomputed using the Discrete Cosine Transform - an FFT relatedtransformation.

The optimal bandwidth \(h_{AMISE}\) can then either be used forother kernel density estimation methods (like the FFT-approach discussedabove) or also to compute the kernel density estimation directly usinganother Discrete Cosine Transform.

kde = zfit.pdf.KDE1DimISJ(data, # obs, num_grid_points, binning_method, # padding, weights, name )plt.plot(x, kde.pdf(x), label='ISJ KDE')plt.plot(x, gauss.pdf(x), label='True PDF')plt.legend()

<matplotlib.legend.Legend at 0x7f2628d590a0>

Boundary bias and padding#

KDEs have a peculiar weakness: the boundaries, as the outside has a zero density. This makes the KDEgo down at the bountary as well, as the density approaches zero, no matter what thedensity inside the boundary was.

obs = zfit.Space('x', (-2, 0.5)) # will cut of data at -2, 0.5data_narrow = gauss.sample(1000, limits=obs)kde = zfit.pdf.KDE1DimExact(data_narrow)x = np.linspace(-2, 0.5, 200)plt.plot(x, kde.pdf(x), label='Biased KDE')plt.plot(x, gauss.pdf(x, obs), label='True PDF')plt.legend()

<matplotlib.legend.Legend at 0x7f2628cd90a0>

There are two ways to circumvent this problem:

The best solution: providing a larger dataset than the default space the PDF is used in

obs_wide = zfit.Space('x', (-5, 5))data_wide = gauss.sample(1000, limits=obs_wide)kde = zfit.pdf.KDE1DimExact(data, obs=obs)plt.plot(x, kde.pdf(x), label='Wide KDE')plt.plot(x, gauss.pdf(x, obs), label='True PDF')plt.legend()

<matplotlib.legend.Legend at 0x7f2628c79b20>

To get an idea of what happened, this is actually the full PDF. Notice that it is normalized overobs.

x = np.linspace(-5, 5, 200)plt.plot(x, kde.pdf(x), label='Wide KDE')plt.plot(x, gauss.pdf(x, obs), label='True PDF')plt.legend()x = np.linspace(-2, 0.5, 200)

Another technique, as we may don’t have more data on the edges, is to mirrorthe existing data at the boundaries, which is equivalent to a boundary conditionwith a zero derivative. This is a padding technique and can improve the boundaries.

kde = zfit.pdf.KDE1DimExact(data_narrow, obs=obs, padding=0.2)plt.plot(x, kde.pdf(x), label='Padded KDE')plt.plot(x, gauss.pdf(x, obs), label='True PDF')plt.legend()

<matplotlib.legend.Legend at 0x7f2628ed5ac0>

However, one important drawback of this method is to keep in mind that this will actuallyalter the PDF to look mirrored. Plotting the PDF in a larger range makes thisclear.

x = np.linspace(-5, 5, 200)plt.plot(x, kde.pdf(x), label='Padded KDE')plt.plot(x, gauss.pdf(x, obs), label='True PDF')plt.legend()

<matplotlib.legend.Legend at 0x7f2628b881d0>

Download this tutorial notebook,script

Kernel Density Estimation — zfit 0.20.0 documentation (2024)

Exact KDEs#

Grid KDEs#

Bandwidth#

FFT KDEs#

ISJ KDEs#

Boundary bias and padding#

References