Statistical Analysis of Corpus Data with R is an online course by Marco Baroni and Stefan Evert. It is based on a number of previous courses on similar topics taught together by the authors, in particular the course on R Programming for (Computational) Linguists given at the DGfS Fall School in Computational Linguistics (Potsdam, 2007).

News: the SIGIL course is currently being restructured – a new Web page will be launched in Spring 2011.

Temporary Downloads

Notice: please install version 0.4-3 of the corpora package (source code). Binary versions are available on CRAN for R 2.15.0 and newer.

You may want to get the ZIP archive with most data sets (1.2 MiB) instead of downloading each file separately.

Course Materials

Data Sets

Download ZIP archive with all data sets (1.2 MB).

* These files contain Unicode strings with accented characters. If you are running R on a Windows computer, specify the option encoding="UTF-8" when loading the files with read.delim() in order to handle such strings correctly.
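The note above can be illustrated with a minimal sketch. The file written here is a hypothetical stand-in for one of the course data sets (the column names `word` and `freq` and the two rows are invented for illustration); only the `encoding="UTF-8"` option passed to read.delim() comes from the note itself.

```r
# Hypothetical stand-in for a course data file: a small tab-delimited table
# whose first column contains accented characters.
lines <- c("word\tfreq", "caf\u00e9\t12", "na\u00efve\t3")

# Write the strings as raw UTF-8 bytes, independent of the current locale.
tmp <- tempfile(fileext = ".tbl")
writeLines(lines, tmp, useBytes = TRUE)

# encoding = "UTF-8" makes R mark the strings it reads as UTF-8, so that
# accented characters are handled correctly (this matters on Windows).
words <- read.delim(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)

print(words)
print(Encoding(words$word))  # shows how R has marked each string
```

With a real data set, replace `tmp` with the path to the downloaded file. On systems whose native encoding is already UTF-8 the option is normally harmless, so it should be safe to leave in scripts that are shared across platforms.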

Exercises