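Computer science
Embedding
Metadata
Domain (mathematical analysis)
Support vector machine
Dimension (graph theory)
Artificial intelligence
Information retrieval
Data mining
Machine learning
Pattern recognition (psychology)
World Wide Web
Mathematics
Mathematical analysis
Pure mathematics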
Authors
Majid Ghasemi Gol, Jay Pujara, Pedro Szekely
Identifier
DOI:10.1109/icdm.2019.00033
Abstract
There is a large amount of data on the web in tabular form, such as Excel sheets, CSVs, and web tables. Often, tabular data is meant for human consumption, using data layouts that are difficult for machines to interpret automatically. Previous work uses the stylistic features of tabular cells (e.g. font size, border type, background color) to classify tabular cells by their role in the data layout of the document (top attribute, data, metadata, etc.). In this paper, we propose a method to embed the semantic and contextual information about tabular cells in a low-dimensional cell embedding space. We then propose an RNN-based classification technique that uses these cell vector representations, combining them with stylistic features introduced in previous work, in order to improve the performance of cell type classification in complex documents. We evaluate the performance of our system on three datasets containing documents with various data layouts, in two settings: in-domain and cross-domain training. Our evaluation results show that our proposed cell vector representations, in combination with our RNN-based classification technique, significantly improve cell type classification performance.