One of the main challenges in unsupervised learning is to find suitable values for the model parameters. In kernel principal component analysis (kPCA), for example, these are the number of components, the kernel, and its parameters. This paper presents a model selection criterion based on distance distributions (MDD). This criterion can be used to find the number of components and the σ² parameter of radial basis function kernels by means of a spectral comparison between information and noise. The noise content is estimated from the statistical moments of the distribution of distances in the original dataset.
This allows for a type of randomization of the dataset without actually having to permute the data points or generate artificial datasets. The eigenvalues computed from the estimated noise are compared with those of the input dataset, and the set of model parameters that maximizes the retained information is selected. In addition to the model selection criterion, this paper proposes a modification to the fixed-size method and uses the incomplete Cholesky factorization, both of which are used to solve kPCA in large-scale applications. These two approaches, together with the MDD model selection criterion, were tested on toy examples and real-life applications, and are shown to outperform other known algorithms.
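The spectral comparison described above can be illustrated with a minimal sketch. It is not the estimator used in the paper: the gamma distribution matched to the first two moments of the squared distances, the rule of counting leading eigenvalues that exceed the surrogate spectrum, and all function names are assumptions made purely for illustration.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    sq = np.sum(X**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)  # clip tiny negative values from round-off

def rbf_kernel_from_sq_dists(D, sigma2):
    """RBF kernel K_ij = exp(-d_ij^2 / (2 * sigma2))."""
    return np.exp(-D / (2.0 * sigma2))

def centered_eigenvalues(K):
    """Eigenvalues of the centered kernel matrix, in descending order."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.linalg.eigvalsh(H @ K @ H)[::-1]

def surrogate_noise_sq_dists(D, rng):
    """Assumed noise model: i.i.d. squared distances drawn from a gamma
    distribution with the same mean and variance as the observed ones.
    No data points are permuted and no artificial dataset is generated."""
    n = D.shape[0]
    iu = np.triu_indices(n, k=1)
    d = D[iu]
    mean, var = d.mean(), d.var()
    shape, scale = mean**2 / var, var / mean  # moment matching
    N = np.zeros_like(D)
    N[iu] = rng.gamma(shape, scale, size=d.size)
    return N + N.T  # symmetric, zero diagonal

def count_informative_components(X, sigma2, seed=0):
    """Number of leading kPCA eigenvalues above the surrogate noise spectrum."""
    rng = np.random.default_rng(seed)
    D = pairwise_sq_dists(X)
    data_eigs = centered_eigenvalues(rbf_kernel_from_sq_dists(D, sigma2))
    # The surrogate matrix need not be positive semidefinite; it is only
    # used as a reference spectrum.
    noise_eigs = centered_eigenvalues(
        rbf_kernel_from_sq_dists(surrogate_noise_sq_dists(D, rng), sigma2))
    above = data_eigs > noise_eigs
    k = len(above) if above.all() else int(np.argmin(above))
    return k, data_eigs, noise_eigs
```

Under these assumptions, one could scan a grid of σ² values and keep the value for which the data spectrum exceeds the surrogate spectrum by the largest margin; the precise quantity maximized by the MDD criterion is defined in the body of the paper, not here.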