Computer science
Named-entity recognition
Task (project management)
Embedding
Word (group theory)
Word embedding
Natural language processing
Artificial intelligence
Domain (mathematical analysis)
F1 score
Set (abstract data type)
Mathematics
Economics
Management
Programming language
Philosophy
Mathematical analysis
Linguistics
Authors
Smita Srivastava, Biswajit Paul, Deepa Gupta
Identifier
DOI:10.1016/j.procs.2023.01.027
Abstract
The vast majority of cyber security information is in the form of unstructured text, and machine-assisted analysis of such information is a much-needed task. Named Entity Recognition (NER) provides a vital step towards this conversion. However, cyber security named entities are not restricted to classical entity types such as person, location, organisation, and miscellaneous, but comprise a large set of domain-specific entities. Word embeddings have emerged as the dominant choice for the initial transfer of semantics to downstream NLP tasks, and the choice of embedding affects performance. Although several word embeddings learned from general-purpose large corpora such as Google News and Wikipedia are available as pre-trained embeddings and have shown good performance on NER tasks, this trend does not hold consistently for domain-specific NER. This work explores the relative performance and suitability of prominent word embeddings for the cyber security NER task. The embeddings considered include both general-purpose pre-trained word embeddings (non-contextual and contextual) available in the public domain and task-adapted embeddings generated by fine-tuning these pre-trained embeddings on a task-specific supervised dataset. The results indicate that, among pre-trained embeddings for cyber security NER, fastText performs better than GloVe and BERT. However, when the embeddings are further fine-tuned for the cyber-NER task, the performance of all the fine-tuned embeddings improves by 2-7%. Furthermore, the BERT embedding fine-tuned using a position-wise FFN (feed-forward network) produced a state-of-the-art F1-score of 0.974 on the cyber security NER dataset.
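To make the fine-tuning setup concrete, below is a minimal PyTorch sketch of the kind of architecture the abstract describes: contextual BERT token embeddings passed through a position-wise FFN (the same two-layer network applied independently at every token position) and a linear layer that emits per-token NER labels. This is not the authors' code; the model name, FFN width, residual connection, and label count are illustrative assumptions.

```python
# Hedged sketch of BERT-embedding fine-tuning with a position-wise FFN
# head for token-level NER. Hyperparameters and the example sentence are
# assumptions, not details taken from the paper.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertFFNTagger(nn.Module):
    def __init__(self, model_name="bert-base-cased", num_labels=9, ffn_dim=2048):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Position-wise FFN: applied to each token representation
        # independently, so sequence length is irrelevant to its weights.
        self.ffn = nn.Sequential(
            nn.Linear(hidden, ffn_dim),
            nn.ReLU(),
            nn.Linear(ffn_dim, hidden),
        )
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state          # (batch, seq_len, hidden)
        h = h + self.ffn(h)                # residual position-wise FFN
        return self.classifier(h)          # per-token label logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = BertFFNTagger()
batch = tokenizer(["Emotet spreads via phishing emails."], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # (1, seq_len, num_labels)
```

Training this end to end on a labelled cyber security NER corpus (cross-entropy over the per-token logits) is one plausible reading of "fine-tuning the pre-trained embedding with a position-wise FFN"; the paper itself should be consulted for the exact configuration.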