
Jožef Stefan
Postgraduate School

Jamova 39
SI-1000 Ljubljana

Phone: +386 1 477 31 00
Fax: +386 1 477 31 10


Course Description

Language Technologies


Information and Communication Technologies, second-level study programme


prof. dr. Tomaž Erjavec


The goal of this course is to introduce language technologies, i.e. the methods and applications of computer processing of natural language. The course covers the history and basic concepts of linguistics, the main applications of language technologies, and the computational methods they use. Particular attention is given to language corpora, large datasets of annotated texts that serve as the basic infrastructure for researching and processing individual languages. The analysis of language corpora with machine learning methods is also discussed. The course focuses on the processing of the Slovene language.

Students will gain a basic theoretical understanding and practical knowledge of language technologies and of computational and corpus linguistics, both of which are prerequisites for effective work on the computer processing of language data.


Development of linguistics and computational linguistics, complexity of language, levels of linguistic analysis, overview of applications and methods.

Language corpora:
Purpose, history and typology, annotation, use cases, computer coding, examples.

Methods of computer processing:
Regular expressions and finite state automata, phrase-structure grammars, statistical methods, machine learning.

Corpus analysis with machine learning methods:
Relevant methods of machine learning, use cases: automatic morphological, syntactic and semantic annotation.

Encoding standards:
History of standardisation, coding of characters, XML, Text Encoding Initiative, MULTEXT, ISO, evaluation methods.

Information retrieval and extraction, machine translation, speech technologies, digital libraries, etc.

Course literature:

• D. Jurafsky, and J.H. Martin, Speech and Language Processing, 2nd Edition. Prentice-Hall, 2008. ISBN 978-0131873216
• R. Mitkov, Ed. The Oxford Handbook of Computational Linguistics. Oxford University Press, 2003. ISBN 978-0-19-823882-9
• C. Manning, and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999. ISBN 0-262-13360-1

Significant publications and references:

• T. Erjavec, The IMP historical Slovene language resources. Language resources and evaluation, 23 pp., doi: 10.1007/s10579-015-9294-7, 2015.
• T. Erjavec, MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. Language resources and evaluation, vol. 46, no. 1, pp. 131-142, 2012.
• N. Ljubešić, and T. Erjavec, hrWaC and slWaC: compiling web corpora for Croatian and Slovene. Text, speech and dialogue: proceedings, (Lecture notes in computer science, ISSN 0302-9743, Lecture notes in artificial intelligence, 6836). Berlin; Heidelberg: Springer, vol. 6836, pp. 395-402, 2011.
• N. Logar, M. Grčar, M. Brakus, T. Erjavec, Š. Arhar Holdt, and S. Krek, Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba, (Zbirka Sporazumevanje). Ljubljana: Trojina, zavod za uporabno slovenistiko: Fakulteta za družbene vede, 208 pp., 2012.
• M. Juršič, I. Mozetič, T. Erjavec, and N. Lavrač, LemmaGen: multilingual lemmatisation with induced Ripple-Down rules. Journal of universal computer science, vol. 16, no. 9, pp. 1190-1214, 2010.


Seminar and oral exam (100%)

Student obligations:

Seminar and oral exam