Automatic document metadata extraction using support vector machines (Citations: 118)
Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a Support Vector Machine classification-based method for extracting metadata from the header part of research papers and show that it outperforms other machine learning methods on the same task. The method first classifies each line of the header into one or more of 15 classes. An iterative convergence procedure then improves the classification of each line by using the class labels predicted for its neighboring lines in the previous round. Further metadata extraction is done by seeking the best chunk boundaries within each line. We found that discovering and using the structural patterns of the data, together with domain-based word clustering, can improve metadata extraction performance. Appropriate feature normalization also greatly improves classification performance. Our metadata extraction method was originally designed to improve the metadata extraction quality of the digital libraries Citeseer [17] and EbizSearch [24]. We believe it can be generalized to other digital libraries.
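The iterative convergence procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the rule-based `toy_classifier` is a hypothetical stand-in for the trained SVM, each line receives a single label rather than one or more of 15 classes, and the label names, the function `iterative_line_labels`, and the sample header lines are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# A classifier maps (line text, label of the line above, label of the line
# below) to a label. Neighbor labels come from the PREVIOUS round.
Classifier = Callable[[str, str, str], str]

NO_LABEL = "none"  # placeholder when no neighbor label is available


def iterative_line_labels(
    lines: Sequence[str], classify: Classifier, max_rounds: int = 10
) -> List[str]:
    """Relabel lines round by round until the labels stop changing."""
    # Round 0: classify every line independently, without context.
    labels = [classify(line, NO_LABEL, NO_LABEL) for line in lines]
    for _ in range(max_rounds):
        new_labels = []
        for i, line in enumerate(lines):
            above = labels[i - 1] if i > 0 else NO_LABEL
            below = labels[i + 1] if i + 1 < len(lines) else NO_LABEL
            new_labels.append(classify(line, above, below))
        if new_labels == labels:  # converged: no label changed this round
            break
        labels = new_labels
    return labels


def toy_classifier(line: str, above: str, below: str) -> str:
    """Hypothetical stand-in for the SVM; the rules are illustrative only."""
    text = line.lower()
    if "@" in text:
        return "email"
    if "university" in text or "department" in text:
        return "affiliation"
    if text.startswith("abstract"):
        return "abstract"
    # Contextual rule: an otherwise unmarked line directly above an
    # affiliation line is treated as an author line.
    if below == "affiliation":
        return "author"
    return "title" if above == NO_LABEL else "other"


# Made-up header lines, not taken from any real paper.
header = [
    "A Sample Paper Title About Metadata",
    "A. Author and B. Author",
    "Department of Examples, Example University",
    "a.author@example.org",
    "Abstract",
]
labels = iterative_line_labels(header, toy_classifier)
```

In this sketch the second line is labeled `title` in the context-free first pass and becomes `author` once the affiliation label of the line below is available, which is the kind of correction the neighbor-label rounds are meant to provide.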
    • ...The header parser [17] extracts document title, author, abstract and affiliation information...

    Pradeep Teregowda et al. Cloud Computing: A Digital Libraries Perspective

    • ...According to studies, the existing approaches achieve excellent accuracy, significantly above 90%, sometimes close to 100% [1, 2, 3]. However, all existing approaches for extracting titles from PDF files have two shortcomings...
    • ...Then, titles from the same PDFs were extracted with a Support Vector Machine from Cite-Seer [1] to compare results...

    Jöran Beel et al. SciPlore Xtract: Extracting Titles from Scientific PDF Documents by An...

    • ...Manual [2], semiautomatic [9] [10] [15], and automatic techniques [1] [5] [6] [7] [8] [14] were proposed to accommodate changes of web pages and to reduce the cost for developing and maintaining the wrappers...

    Yaw-Huei Chen et al. Extracting Topics Information from Conference Web Pages Using Page Seg...

    • ...Automatic metadata extraction methodologies can be classified into two main categories: machine learning methods [4][5][7][11] and other methods which based on rules combined with dictionaries and ontology [8][10][12]...
    • ...According to [5], machine learning for information extraction include symbolic learning, inductive logic programming, grammar induction, Support Vector Machine, Hidden Markov models (HMMs), and statistical methods...
    • ...In paper [5], authors suggested using SVM for automatic metadata extraction...
    • ...In [7], authors also suggested automatic metadata extraction by using CRF (Conditional Random Fields) and their approach gave a comparable result with SVM in [5]...
    • ...In [5][7] the Precision is from 86 % to 99%, the Recall is from 45% to 100%, the Accuracy is from 96% to 100% (depends on various metadata)...
    • ...Similar to [5], we define these measures as following...

    Tin Huynh et al. GATE framework based metadata extraction from scientific papers

    • ...In addition to harvesting existing information, Machine learning technologies have been used for automatic metadata extraction, authors in [6] proposed a method to conduct metadata extraction from header part of scientific research papers...

    Sahar Changuel et al. A General Learning Method for Automatic Title Extraction from HTML Pag...
