A Consistent and Efficient Estimator for Data-Oriented Parsing

A Consistent and Efficient Estimator for Data-Oriented Parsing,Journal of Automata, Languages and Combinatorics,Andreas Zollmann,Khalil Sima'an

A Consistent and Efficient Estimator for Data-Oriented Parsing   (Citations: 8)
BibTex | RIS | RefWorks Download
Given a sequence of samples from an unknown probability distribution, a statistical esti- mator aims at providing an approximate guess of the distribution by utilizing statistics from the samples. One crucial property of a 'good' estimator is that its guess ap- proaches the unknown distribution as the sample sequence grows large. This property is called consistency. This paper concerns estimators for natural language parsing under the Data- Oriented Parsing (DOP) model. The DOP model specifles how a probabilistic grammar is acquired from statistics over a given training treebank, a corpus of sentence-parse pairs. Recently, Johnson (15) showed that the DOP estimator (called DOP1) is biased and inconsistent. A second relevant problem with DOP1 is that it sufiers from an overwhelming computational ine-ciency. This paper presents the flrst (nontrivial) consistent estimator for the DOP model. The new estimator is based on a combination of held-out estimation and a bias toward parsing with shorter derivations. To justify the need for a biased estimator in the case of DOP, we prove that every non-overfltting DOP estimator is statistically biased. Our choice for the bias toward shorter derivations is justifled by empirical experience, mathematical convenience and e-ciency considerations. In support of our theoretical results of consistency and computational e-ciency, we also report experimental results with the new estimator.
Journal: Journal of Automata, Languages and Combinatorics - JALC , vol. 10, no. 2/3, pp. 367-388, 2005
Cumulative Annual
View Publication
The following links allow you to view full publications. These links are maintained by other sources not affiliated with Microsoft Academic Search.
Sort by: