Free keywords:
Automatic data extraction, Polymer data, Table, XML documents, Polymer-name recognition
Abstract:
Automatic extraction of polymer data from tables in scientific articles was examined using table matrix structures. XML documents of the articles were used to reproduce the tables accurately by constructing their matrix structures in plain text. By utilizing the XML tags that systematically manage content in XML documents, simple tables, complicated tables with column and row spans, and fused tables were all accurately reproduced. After table reproduction, four processes were performed: data formatting for machine readability, polymer-name recognition, property-name recognition, and polymer data extraction. Polymer-name recognition used our original recognizer, which was prepared through automatic annotation, using our rule-based program based on typical character patterns of polymer full names and abbreviations, and deep neural network learning of polymer names. Property-name recognition used partial string matching with polymer property index terms and stop words. In this study, glass transition temperature (Tg), melting temperature (Tm), and decomposition temperature (Td) were selected as the target polymer properties. Through these five processes (table reproduction plus the four subsequent processes), 2,043 data points for Tg, 1,436 for Tm, and 2,183 for Td were extracted from approximately 18,000 scientific articles published by Elsevier, and the F scores for the extraction were 0.896, 0.876, and 0.837, respectively. These results indicate that the automatic extraction system created in this study can efficiently and accurately collect large amounts of polymer data from a large number of scientific articles.
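The table reproduction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified, hypothetical XML schema (`table`, `row`, and `cell` elements with `rowspan` and `colspan` attributes) rather than the actual publisher markup, and the function name `table_to_matrix` and the sample values are invented for illustration. The idea is to copy the text of each spanning cell into every matrix position it covers, so that each data cell lines up with its full column and row headers.

```python
import xml.etree.ElementTree as ET

# Hypothetical table markup; element/attribute names and values are
# illustrative assumptions, not taken from the study.
EXAMPLE = """
<table>
  <row><cell rowspan="2">Polymer</cell><cell colspan="2">Thermal properties</cell></row>
  <row><cell>Tg</cell><cell>Tm</cell></row>
  <row><cell>PolymerA</cell><cell>100</cell><cell>200</cell></row>
</table>
"""


def table_to_matrix(xml_text):
    """Expand a table with spanning cells into a rectangular matrix of strings."""
    root = ET.fromstring(xml_text)
    grid = {}  # maps (row, column) to cell text
    for r, row in enumerate(root.iter("row")):
        c = 0
        for cell in row.findall("cell"):
            # Skip positions already filled by a row-spanning cell from above.
            while (r, c) in grid:
                c += 1
            text = "".join(cell.itertext()).strip()
            rowspan = int(cell.get("rowspan", "1"))
            colspan = int(cell.get("colspan", "1"))
            # Copy the text into every position the cell spans.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


if __name__ == "__main__":
    for line in table_to_matrix(EXAMPLE):
        print("\t".join(line))
```

In the resulting matrix, the header "Thermal properties" appears above both "Tg" and "Tm", and "Polymer" fills both header rows of the first column, which is what makes a later column-wise lookup of property names and values straightforward.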