Automatic Identification of Research Articles from Crawled Documents

Cornelia Caragea, Jian Wu, Kyle Williams, Sujatha G. Das, Madian Khabsa, Pradeep Teregowda, C. Lee Giles

January 2014

Abstract

Online digital libraries that store and index research articles not only make it easier for researchers to search for scientific information, but have also proven to be powerful resources in many data mining, machine learning, and information retrieval applications that require high-quality data. The quality of the data available in digital libraries depends heavily on the quality of the classifier that identifies research articles among a set of crawled documents, which in turn depends, among other things, on the choice of feature representation. The commonly used “bag of words” representation for document classification can result in prohibitively high-dimensional input spaces and may not capture the specifics of research articles. In this paper, we propose novel features that result in effective and efficient classification models for automatic identification of research articles. Experimental results on two datasets compiled from the CiteSeerX digital library show that our models outperform strong baselines while using significantly fewer features.

Type

Conference paper

Publication

Web-Scale Classification: Classifying Big Data from the Web