Social Media Information Classification of Earthquake Disasters Based on a BERT Transfer Learning Model

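  • Summary: Social media data are timely, fast-spreading, information-rich, low-cost, and voluminous, and have become an important information source for analyzing sudden disaster events. However, they also suffer from uneven quality, redundancy combined with incompleteness, uneven coverage, a lack of unified standards, and privacy and security risks that are difficult to control. To use social media data as a precise basis for disaster emergency response, advanced techniques are needed that can screen social media content and classify it effectively. This study applies transfer learning with bidirectional encoder representations from transformers (BERT) to build a text classification model that performs multi-label classification of Weibo posts published within the "golden" 72 h after an earthquake. Oriented toward emergency response needs, the labels are divided into five types: hazard information, loss information, rescue and relief information, public opinion information, and useless information, so that fine-grained thematic information usable for disaster analysis can be mined in a targeted way. The proposed model achieves classification accuracies of 97% and 92% on the training and test sets, respectively, effectively improving the classification accuracy of Weibo text data. The evaluation shows that the model classifies earthquake disaster label information in social media well and can be applied to rapid disaster assessment for earthquake events; this way of acquiring disaster information from social media compensates for the lag of traditional disaster information collection methods.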

     

    Abstract:
    Objectives With the rapid development of the Internet, social media has become an important source of information on emergency events. However, social media contains a large amount of duplicated, erroneous, and even malicious content, which must be classified effectively to provide more accurate information for disaster emergency response.
    Methods Deep learning has greatly improved the accuracy and efficiency of text classification. Taking earthquake disasters as an example, this paper builds a multi-label classification model based on transfer learning with bidirectional encoder representations from transformers (BERT). Over 50 000 posts about 5 earthquakes are collected as training samples from Sina Weibo, a very popular social media platform in China. Each sample is manually annotated with one or more of five labels: hazard information, loss information, rescue information, public opinion information, and useless information.
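As a rough illustration of what multi-label output means here, the sketch below turns five per-label scores into a set of labels. It assumes an independent sigmoid per label and a 0.5 decision threshold, neither of which is stated in the abstract; the short label names and the `decode_labels` helper are hypothetical, and the input logits stand in for the output of a fine-tuned BERT classifier.

```python
import math

# Label order is an assumption made for this sketch; the five categories
# themselves come from the abstract.
LABELS = ["hazard", "loss", "rescue", "public_opinion", "useless"]


def sigmoid(x):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_labels(logits, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold.

    Unlike single-label (softmax) classification, each label is decided
    independently, so a post can receive zero, one, or several labels.
    """
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    probabilities = [sigmoid(z) for z in logits]
    return [name for name, p in zip(LABELS, probabilities) if p >= threshold]


# Hypothetical logits for a post that reports both shaking and damage.
print(decode_labels([2.0, 1.5, -3.0, -0.2, -4.0]))
```

Deciding each label independently is what allows a single post to be tagged, for example, as both hazard and loss information.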
    Results After fine-tuning, the classification accuracies of the proposed model on the training and test datasets reach 97% and 92%, respectively. The area under the curve (AUC) score of each label ranges from 0.952 to 0.998.
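For readers unfamiliar with per-label AUC, the following minimal sketch computes the area under the ROC curve for one label from its pairwise-ranking definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The `auc` function and the toy data are illustrative assumptions and do not reproduce the paper's evaluation code or results.

```python
def auc(y_true, scores):
    """Area under the ROC curve for one binary label.

    y_true holds 0/1 ground-truth flags and scores holds the model's
    predicted probabilities for the same samples.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example for a single label: one positive is ranked below one negative.
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3]))
```

In a multi-label setting this computation is repeated once per label, which yields a range of scores across labels such as the one reported above.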
    Conclusions The results indicate that multi-label classification using BERT transfer learning is highly reliable. The proposed model can be applied to emergency management services for earthquake events, supporting rapid disaster rescue and relief.

     
