Repository > Results

Doctoral dissertation

Automatic text parsing aided by clause splitting and intra-clausal coordination detection

Author(s): Domen Marinčič (Author), Matjaž Gams (Supervisor), Tomaž Šef (Co-Supervisor)

Thesis defense date: 24.12.2008

Organization: MPŠ - Mednarodna podiplomska šola Jožefa Stefana

PID: 20.500.12556/ReVIS-13515

Download main file (14.7 MB)

Views: 12 | Downloads: 11

Abstract

In language technologies, syntactic parsing represents one of the possible intermediate
steps of text analysis in the applications such as machine translation, information
extraction, question answering, etc. Syntactic trees are often used to demonstrate the
structure of text. In the last decades, the dependency framework became a popular
syntactic representation, describing the relations among the constituents of the sentence
with dependency trees.
Parsing algorithms are used to infer dependency trees from text. In the data driven
approach to parsing, a corpus of text, manually annotated with the dependency trees,
serves for training and testing the algorithms. To estimate the accuracy of parsing, the
test sentences are parsed and the output is then compared to the manually created trees.
We propose a new algorithm for PArsing with Clause and Intraclausal coordination
Detection - PACID, which includes machine learning and the use of heuristic rules to
incorporate language specifics. The algorithm is used as an upgrade of the existing parsing
algorithms. For training and evaluating the algorithm, Slovene Dependency Treebank
(SDT) was used, a corpus of Slovene text annotated with dependency trees. The algorithm
constists of two phases:
1. Detection and reduction of clauses and intraclausal coordinations. First, the
candidates for reduction are detected by heuristic rules. Then, the algorithm
filters out false positives by machine learning classifiers. At the end, the remaining
candidates are reduced to meta tokens. This procedure iterates until no more clauses
and intraclausal coordinations can be retrieved.
2. Dependency tree construction. The sequences of text reduced in the first phase
are parsed by three dierent base parsing models and a newly developed rule-based
parser. At the end, the resulting trees are merged to form the final dependency tree
of the sentence.
The experiments have shown that the use of the algorithm PACID raises the parsing
accuracy by 1.27 percentage points (6.4% relative error reduction) compared to the plain
MSTP parser and by 1.91 percentage points compared to the plain Malt parser (9.2%
relative error reduction). The time complexity of the reduction algorithm equals O(n),
n being the number of tokens in the sentence, compared to the complexity O(n2) of the
MSTP parser and O(n) of the Malt parser, which were used to build the base parsing
models. Additional time consumption thus seems acceptable.
In summary, we have shown that decomposing large complex parsing problems to
smaller ones is beneficial in terms of improving overall parsing accuracy and that
additional information provided by the richly inflected languages can improve parsing
results.

Metadata

Work Type	Doctoral dissertation
Language	Slovenian
Organization	MPŠ - Mednarodna podiplomska šola Jožefa Stefana
PID	20.500.12556/ReVIS-13515
COBISS ID	22364967
UDK	502/504:502.3/.7
Thesis defense date	24.12.2008

Attachments

Attachment - academic_work_attachments/Domen_Marinci… (14.7 MB) MD5: 078ae445a55884652dbb65659ac24d18

Cite this work

Citation style:

Back to Search View in ReVIS View in COBISS

REPOSITORY > RESULTS

Automatic text parsing aided by clause splitting and intra-clausal coordination detection

Abstract

Metadata

Attachments

Cite this work