A New Social Media Topic Mining Method Based on Co-word Network

  • Abstract: In-depth mining of the text data contained in social media supports effective subsequent spatiotemporal analysis. This paper proposes a new topic mining method for social media data based on a co-word network and community detection. The method uses term frequency-inverse document frequency (TF-IDF) analysis to automatically select the keywords related to the topics. A co-word network is then constructed with microblogs as nodes, in which microblogs are linked according to whether they contain the same keywords. The Louvain community detection algorithm is applied to this network to group the microblogs into topic clusters. The proposed method is unsupervised and does not require the number of clusters to be specified in advance. Experiments show that the method outperforms the commonly used latent Dirichlet allocation (LDA) topic model in both precision and recall. As a case study, the method is applied to microblogs containing keywords collected during the 2012 Beijing rainstorm, and the resulting topics are used for spatiotemporal analysis. The results indicate that the proposed method is effective in real-world applications.
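The pipeline described in the abstract (TF-IDF keyword selection followed by a co-word network whose nodes are microblogs) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy tokenized posts, the function names `top_keywords` and `build_coword_network`, the choice of summing TF-IDF scores over all posts, and the use of the shared-keyword count as an edge weight are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def top_keywords(docs, k):
    """Return the k terms with the highest TF-IDF score summed over all docs.

    Assumption: the abstract does not say how scores are aggregated,
    so summation over documents is used here purely for illustration.
    """
    n = len(docs)
    # Document frequency: number of docs in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            idf = math.log(n / df[term])  # 0 for terms found in every doc
            scores[term] += (count / len(doc)) * idf
    return {term for term, _ in scores.most_common(k)}


def build_coword_network(docs, keywords):
    """Build a network whose nodes are document indices.

    Two documents are linked when they share at least one keyword.
    Assumption: the edge weight is the number of shared keywords.
    """
    keysets = [set(doc) & keywords for doc in docs]
    edges = {}
    for i, j in combinations(range(len(docs)), 2):
        shared = len(keysets[i] & keysets[j])
        if shared:
            edges[(i, j)] = shared
    return edges


# Hypothetical, already-tokenized microblogs (not data from the paper).
docs = [
    ["rain", "flood", "subway", "today"],
    ["flood", "subway", "closed", "today"],
    ["airport", "flight", "delay", "today"],
    ["flight", "delay", "rain", "today"],
]

keywords = top_keywords(docs, 7)
edges = build_coword_network(docs, keywords)
```

In this toy data the word "today" occurs in every post, so its inverse document frequency is zero and it is not selected as a keyword. The resulting weighted edge list could then be passed to a Louvain implementation, for example `louvain_communities` in NetworkX, to obtain topic clusters without fixing their number in advance.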

     
