Statistical Analysis of Corpus Data with R
A Gentle Introduction for Computational Linguists and Similar Creatures
Statistical Analysis of Corpus Data with R is an online course by Marco Baroni and Stefan Evert. It is based on a number of previous courses on similar topics taught together by the authors, in particular the course on R Programming for (Computational) Linguists given at the DGfS Fall School in Computational Linguistics (Potsdam, 2007).
News: The SIGIL course is currently being restructured; a new Web page will be launched in Spring 2011.
Temporary Downloads
Notice: please install beta version 0.4-1 (updated on 25.04.2010) of the corpora package:
Windows – Mac OS X – source code (Linux)
You may want to get the ZIP archive with most data sets (1.2 MiB) instead of downloading each file separately.
- Unit 1: General introduction / First steps in R
- Unit 2: Corpus frequency data & statistical inference (updated on 21.10.2011)
- Unit 3: Descriptive and inferential statistics for continuous data
- Unit 4: Collocations & contingency tables
- Unit 5: Word frequency distributions and Zipf's law: Using add-on packages
- Unit 6: Regression and the general linear model
- Unit 7: Exploratory data analysis: Clustering, visualisation & machine learning
- Unit 8: The non-randomness of corpus data & generalised linear models (updated on 26.03.2010)
- Unit 9: Inter-annotator agreement
Course Materials
- Introduction (slides, handout)
- Hypothesis tests for corpus frequency data (slides, handout)
- Word frequency distributions with zipfR (slides, handout)
- Clustering and dimensionality reduction (slides, handout, data sets)
- Using statistical association measures for collocation extraction
  - Part 1: contingency tables and association scores (slides, handout)
  - Part 2: large-scale processing and evaluation (slides, handout)
- The limitations of random sampling methods (slides, handout)
- A short introduction to the mathematics of regression and linear models (slides, handout, R examples)
- Statistical models
- Collected R code (ZIP archive) from handouts
- Some other sample R scripts (ZIP archive) with detailed comments
Data Sets
- brown.stats.txt (basic type-token statistics for the Brown corpus)
- lob.stats.txt (basic type-token statistics for the LOB corpus)
- bnc_metadata.tbl* (metadata information from the British National Corpus)
- bigrams.100k.spc (frequency spectrum of bigrams from the first 100k tokens of Brown)
- bigrams.100k.tfl (type frequency list of bigrams from the first 100k tokens of Brown)
- bigrams.vgc (vocabulary growth curve of bigrams in the Brown corpus)
- comp.stats.txt* (distributional information for different types of Italian noun-noun compounds)
- brown_bigrams.tbl (bigram collocations in the Brown corpus, with full contingency tables)
- krenn_pp_verb.tbl* (German PP-verb collocations with manual MWE annotation)
- bnc_gender_small.tbl (data set for identification of author gender in the BNC)
Download ZIP archive with all data sets (1.2 MB).
* These files contain Unicode strings with accented characters. If you are running R on a Windows computer, specify the option encoding="UTF-8" when loading the files with read.delim() in order to handle such strings correctly.
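As a rough illustration of the note above, the following sketch writes a small tab-delimited file containing accented characters and reads it back with read.delim() and encoding="UTF-8". The toy file, its column names (word, freq) and its rows are made up for this example and are not taken from any of the course data sets.

```r
# Toy stand-in for a tab-delimited data file with accented characters.
# Column names and rows are hypothetical, not taken from the course files.
lines <- c("word\tfreq", "caff\u00e8\t12", "citt\u00e0\t7")
tmp <- tempfile(fileext = ".tbl")
writeLines(lines, tmp, useBytes = TRUE)  # write the UTF-8 bytes unchanged

# encoding = "UTF-8" tells R to mark the strings it reads as UTF-8
toy <- read.delim(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
print(toy$word)
print(Encoding(toy$word))
```

For one of the starred files, the same call should apply with the temporary file replaced by the downloaded file, e.g. read.delim("krenn_pp_verb.tbl", encoding="UTF-8"), assuming the file sits in the current working directory.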
Exercises