A Low-Dimensional Manifold Representative Point Method to Estimate the Non-parametric Density for High-Dimensional Data

Abstract: Non-parametric kernel methods apply a single, uniform metric across the whole space, so they are prone to the curse of dimensionality when learning from high-dimensional samples in big-data settings. Exploiting the low-dimensional geometry embedded in the high-dimensional space helps reveal the manifold structure of the data distribution, so that a limited number of high-dimensional samples can approximate the true distribution within a low-dimensional subspace. On this basis, this paper proposes a low-dimensional manifold representative point method for non-parametric density estimation of high-dimensional data, which estimates the density by mining the geometric structure of the data distribution in the high-dimensional space. First, within each local region the method finds points that represent the principal directions of the manifold structure and uses them to compute a local covariance matrix that describes the local data distribution. Then, to account for the differing influence of data points on or near the manifold, each sample point is weighted according to its contribution to the density. In comparative experiments against traditional kernel density estimation and manifold kernel density estimation, the proposed method performs density estimation quickly and robustly and reflects the true distribution of the data.
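The two steps named in the abstract (a local covariance matrix per region, then a per-sample weight on each contribution) can be illustrated with a generic locally adaptive Gaussian kernel estimator. The sketch below is not the paper's algorithm. It assumes that plain k-nearest neighbours stand in for the "representative points", that a small ridge term `reg` keeps each covariance invertible, and that the weights default to uniform, because the abstract does not give the actual selection or weighting rules. The function names `local_covariances` and `weighted_local_kde` and all parameters are hypothetical.

```python
import numpy as np


def local_covariances(X, k=10, reg=1e-3):
    """Return one covariance matrix per sample, built from its k nearest neighbours.

    Assumption: k-NN neighbourhoods replace the paper's representative points.
    """
    n, d = X.shape
    # Pairwise squared Euclidean distances, shape (n, n).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Neighbour indices per point; the closest entry (normally the point itself) is skipped.
    idx = np.argsort(sq, axis=1)[:, 1:k + 1]
    covs = np.empty((n, d, d))
    for i in range(n):
        diffs = X[idx[i]] - X[i]
        # Scatter of the neighbourhood around x_i, plus a ridge so the matrix is invertible.
        covs[i] = diffs.T @ diffs / k + reg * np.eye(d)
    return covs


def weighted_local_kde(X, queries, k=10, reg=1e-3, weights=None):
    """Evaluate a weighted mixture of Gaussians, one per sample, at the query points.

    Assumption: `weights` is supplied by the caller; uniform weights are used otherwise.
    """
    n, d = X.shape
    covs = local_covariances(X, k, reg)
    if weights is None:
        weights = np.full(n, 1.0 / n)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the estimate integrates to one

    inv = np.linalg.inv(covs)               # batched inverse, shape (n, d, d)
    _, logdet = np.linalg.slogdet(covs)     # log-determinants, shape (n,)
    diff = queries[:, None, :] - X[None, :, :]                # shape (m, n, d)
    maha = np.einsum("mni,nij,mnj->mn", diff, inv, diff)      # squared Mahalanobis distances
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) + logdet)      # Gaussian normalisers
    return (weights * np.exp(log_norm - 0.5 * maha)).sum(axis=1)
```

Because each kernel is stretched along the directions in which its neighbours spread, a query point lying off the data manifold receives far less density than one lying on it, which is the behaviour a fixed isotropic bandwidth cannot reproduce in high dimensions. The pairwise distance matrix makes this sketch quadratic in the number of samples, so it is meant for small illustrative data sets only.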

     
