# Lecture: Statistical Natural Language Processing

## Content

Humanity generates exabytes of data every year. Most of this data is available in some form of natural language, in particular text. Hence, the inclusion of textual data sources is of growing importance in large-scale data-driven applications. A popular application scenario is personal assistants (Siri, Google Home, Cortana, etc.), which rely partly on Web pages to extract or select answers to user questions. Processing large amounts of text in a semantically sound manner, however, turns out to be rather difficult for machines. The goal of this lecture is to give students insight into approaches, based mostly on probabilistic models, that facilitate the implementation of pipelines for processing natural language text. The lecture is structured as follows:

- Finite-state automata
- Language models
- Spell checkers
- Deduplication
- Classification
- Hidden Markov Models
- Grammar and semantics
- Parsing natural language
- Word Sense Disambiguation
- Distributional semantics

## Structure

The course consists of:

- **A lecture**: 2h/week, slides uploaded after the lecture.
- **Six series of coding exercises**, evaluated automatically through an online platform. Students are required to reach at least 50% of the points and submit at least 60% of the exercises to be allowed to participate in the exam. The exercises are discussed during a bi-weekly seminar.
- **A mini-project**: The goal of the mini-project is to apply the content of the lecture to a practical problem and to implement a non-trivial solution to said problem. Groups of up to 3 persons are allowed, as long as the portion of the work carried out by each student can be identified clearly. The solution is evaluated automatically on a benchmark against a non-trivial baseline solution to the same problem. Students must outperform the baseline to be allowed to participate in the exam. Moreover, a short document (12-15 pages, written using the provided LaTeX template) explaining the implemented solution and a link to clearly commented code are required to fulfil this exam prerequisite.

## Exam

The exam lasts 90 minutes. Students are expected to answer both theoretical questions (e.g., what are the time and space complexities of a particular algorithm) and practical questions (e.g., write a regular expression to extract all occurrences of “mouse” from a piece of text).
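
As an illustration of the kind of practical question mentioned above, the following is a minimal sketch in Python. It is not an official sample solution: the sample text is invented, and the choices to match whole words only and to ignore case are assumptions about how such a question might be interpreted.

```python
import re

# Hypothetical sample text, not taken from the course material.
text = "The mouse ran. A Mouse hid; mousetrap, dormouse, and mouse again."

# \b matches at word boundaries, so "mousetrap" and "dormouse" are skipped.
# Ignoring case is an assumption; drop the flag for exact-case matching.
pattern = re.compile(r"\bmouse\b", re.IGNORECASE)

# findall returns every non-overlapping match, in order of appearance.
matches = pattern.findall(text)
print(matches)
```

If substrings inside longer words should count as occurrences too, removing both `\b` anchors would be the corresponding change.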