MPŠ

Jožef Stefan
International
Postgraduate School

Jamova 39
SI-1000 Ljubljana
Slovenia

Phone: +386 1 477 31 00
Fax: +386 1 477 31 10
Email: info@mps.si

Course Description

Advanced Language Technologies

Program

Information and Communication Technologies, third-level study programme

Lecturers:

prof. dr. Tomaž Erjavec

Goals:

Language technologies comprise methods and applications of computer processing of natural language.

Students will gain a basic theoretical understanding of, and practical experience with, language technologies and computational linguistics, both of which are prerequisites for effective work on the computer processing of language data.

The course objectives are to (a) introduce the basics of language technologies, (b) present the encoding and annotation of language resources, and (c) present selected methodologies and techniques used in language technologies. The course focuses on the processing of the Slovene language.

Students will master the basics of language technologies and will be able to use selected methods and tools in practice.

Content:

Introduction:
Development of linguistics and computational linguistics, complexity of language, levels of linguistic analysis, overview of applications and methods.

Language corpora:
Purpose, history and typology, annotation, use cases, computer coding, specific examples.

Methods of computer processing:
Regular expressions and finite state automata, phrase-structure grammars, statistical methods, machine learning.

Corpus analysis with machine learning methods:
Relevant methods of machine learning, use cases: automatic morphological, syntactic and semantic annotation.

Encoding standards:
History of standardisation, character encoding, XML, Text Encoding Initiative (TEI), ISO standards, evaluation methods.

Applications:
Information retrieval and extraction, machine translation, speech technologies, digital libraries, etc.

Course literature:

Selected chapters from the following books:

• D. Jurafsky, and J.H. Martin, Speech and Language Processing, 2nd Edition. Prentice-Hall, 2008. ISBN 978-0131873216
• R. Mitkov, Ed. The Oxford Handbook of Computational Linguistics. Oxford University Press, 2003. ISBN 978-0-19-823882-9
• C. Manning, and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999. ISBN 0-262-13360-1

Significant publications and references:

• T. Erjavec, The IMP historical Slovene language resources. Language resources and evaluation, 23 pp., doi: 10.1007/s10579-015-9294-7, 2015.
• T. Erjavec, MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. Language resources and evaluation, vol. 46, no. 1, pp. 131-142, 2012.
• N. Ljubešić, and T. Erjavec, hrWaC and slWaC: compiling web corpora for Croatian and Slovene. Text, speech and dialogue: proceedings, (Lecture notes in computer science, ISSN 0302-9743, Lecture notes in artificial intelligence, 6836). Berlin; Heidelberg: Springer, vol. 6836, pp. 395-402, 2011.
• N. Logar, M. Grčar, M. Brakus, T. Erjavec, Š. Arhar Holdt, and S. Krek, Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba, (Zbirka Sporazumevanje). Ljubljana: Trojina, zavod za uporabno slovenistiko: Fakulteta za družbene vede, 208 pp., 2012.
• M. Juršič, I. Mozetič, T. Erjavec, and N. Lavrač, LemmaGen: multilingual lemmatisation with induced Ripple-Down rules. Journal of universal computer science, vol. 16, no. 9, pp. 1190-1214, 2010.

Examination:

Written or oral exam (50%)
Seminar work with oral defense (50%)

Student obligations:

Written or oral exam
Seminar work with oral defense

Links: