Views: 12 | Downloads: 11
In language technologies, syntactic parsing represents one of the possible intermediate
steps of text analysis in the applications such as machine translation, information
extraction, question answering, etc. Syntactic trees are often used to demonstrate the
structure of text. In the last decades, the dependency framework became a popular
syntactic representation, describing the relations among the constituents of the sentence
with dependency trees.
Parsing algorithms are used to infer dependency trees from text. In the data driven
approach to parsing, a corpus of text, manually annotated with the dependency trees,
serves for training and testing the algorithms. To estimate the accuracy of parsing, the
test sentences are parsed and the output is then compared to the manually created trees.
We propose a new algorithm for PArsing with Clause and Intraclausal coordination
Detection - PACID, which includes machine learning and the use of heuristic rules to
incorporate language specifics. The algorithm is used as an upgrade of the existing parsing
algorithms. For training and evaluating the algorithm, Slovene Dependency Treebank
(SDT) was used, a corpus of Slovene text annotated with dependency trees. The algorithm
constists of two phases:
1. Detection and reduction of clauses and intraclausal coordinations. First, the
candidates for reduction are detected by heuristic rules. Then, the algorithm
filters out false positives by machine learning classifiers. At the end, the remaining
candidates are reduced to meta tokens. This procedure iterates until no more clauses
and intraclausal coordinations can be retrieved.
2. Dependency tree construction. The sequences of text reduced in the first phase
are parsed by three dierent base parsing models and a newly developed rule-based
parser. At the end, the resulting trees are merged to form the final dependency tree
of the sentence.
The experiments have shown that the use of the algorithm PACID raises the parsing
accuracy by 1.27 percentage points (6.4% relative error reduction) compared to the plain
MSTP parser and by 1.91 percentage points compared to the plain Malt parser (9.2%
relative error reduction). The time complexity of the reduction algorithm equals O(n),
n being the number of tokens in the sentence, compared to the complexity O(n2) of the
MSTP parser and O(n) of the Malt parser, which were used to build the base parsing
models. Additional time consumption thus seems acceptable.
In summary, we have shown that decomposing large complex parsing problems to
smaller ones is beneficial in terms of improving overall parsing accuracy and that
additional information provided by the richly inflected languages can improve parsing
results.