601.765 Machine Learning: Linguistic & Sequence Modeling
Spring 2019
Announcements
- The first day of class is Monday, Jan 28. See you there!
- Please fill out the questionnaire.
Administration
- Lectures: 3-4pm MWF, in Hackerman 320.
- Occasionally we may need to run 3-4:15pm instead, but any such change will be announced in advance.
- Instructor: Jason Eisner
- Office hours: 4-4:30pm after class, or by appointment.
- TA: Sabrina Mielke
- Office hours: Friday, 4-5pm, outside Hackerman 321
- Email: cs765-staff at cs.jhu.edu
- Discussion site: https://piazza.com/jhu/spring2019/601765/
- Class notes:
- Video lectures: via Blackboard
Homework
There are 4 homeworks:
- Homework 1: Slow general algorithms for sequence labeling
- Homework 2: Efficient finite-state methods
- Homework 3: Neural models
- Homework 4: A slightly different introduction to Deep Reinforcement Learning
Homeworks are to be submitted on Gradescope, following the instructions posted on Piazza.
Course overview
Catalog description: This course surveys formal ingredients that are used to build structured models of character and word sequences. We will unpack recent deep learning architectures that consider various kinds of latent structure, and see how they draw on earlier work in structured prediction, dimensionality reduction, Bayesian nonparametrics, multi-task learning, etc. We will also examine a range of strategies used for inference and learning in these models. Students will be expected to read recent papers and carry out a research project. [Applications or Analysis]
Prerequisites: EN.600/601.465/665 or permission. Prior coursework in statistics or machine learning is recommended. Students may wish to prepare for their choice of research project by taking EN.601.382 Deep Learning Lab at the same time.
Remarks:
- The focus of the class is on understanding the space of good options for designing probabilistic sequence models and computing with them. We will discuss the qualitative advantages and disadvantages of different options. Our goal is not to teach you exactly how today’s top-ranked system works, but rather to give you a toolbox for understanding and creating system designs.
- This class builds on the dynamic programming algorithms and log-linear models covered in NLP. We will primarily extend these to various neural (“log-nonlinear”) models, some of which still permit dynamic programming; a short illustrative sketch follows these remarks.
- As this is a graduate class, the lecture style will be a bit more improvisational than in NLP. The class is also still under development. We will probably only cover a subset of the topics on the syllabus.
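To make the connection to NLP concrete, the following is a minimal, illustrative sketch of Viterbi decoding for sequence labeling with a log-linear local score. It is not course code: the feature function, tag set, and weights are invented for illustration. Replacing score with a small neural network would give a “log-nonlinear” model of the kind discussed above, while the dynamic program itself stays the same.

    # Illustrative sketch (not course code): Viterbi decoding with a
    # log-linear local score.  Swapping `score` for a neural network
    # yields a "log-nonlinear" model; the dynamic program is unchanged.

    def features(prev_tag, tag, word):
        """Toy indicator features: one emission and one transition feature."""
        return {f"emit:{tag}:{word}": 1.0, f"trans:{prev_tag}:{tag}": 1.0}

    def score(weights, prev_tag, tag, word):
        """Log-linear local score: dot product of weights and features."""
        return sum(weights.get(name, 0.0) * value
                   for name, value in features(prev_tag, tag, word).items())

    def viterbi(words, tags, weights):
        """Return (total score, tag sequence) of the highest-scoring labeling."""
        # best[t] = (score of the best path ending in tag t, that path)
        best = {t: (score(weights, "<s>", t, words[0]), [t]) for t in tags}
        for word in words[1:]:
            new_best = {}
            for t in tags:
                candidates = [(best[p][0] + score(weights, p, t, word), best[p][1])
                              for p in tags]
                s, path = max(candidates, key=lambda c: c[0])
                new_best[t] = (s, path + [t])
            best = new_best
        return max(best.values(), key=lambda c: c[0])

    # Made-up example: should label "dogs bark" as N V.
    weights = {"emit:N:dogs": 2.0, "emit:V:bark": 2.0,
               "trans:<s>:N": 1.0, "trans:N:V": 1.0}
    print(viterbi(["dogs", "bark"], ["N", "V"], weights))

The same recursion computes marginals instead of best paths if max is replaced by (log-)sum, which is one of the maximization-vs.-summation choices discussed under "Recurring themes" below.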
Requirements (details TBA)
- Attending lectures
- Scribing? (i.e., drafting lecture notes)
- Reading papers?
- Homeworks
- Midterm exam?
- Final exam
- Final project
Topic list
Setting the stage
- Overview
- Sequence labeling as a canonical problem
Classical methods
- Notation and statistical background
- Algorithmic background: Paths in graphs
- Classical sequence labeling models
- Graphical models and belief propagation
Richer scoring functions
- Beyond dynamic programming: Approximation algorithms
- Feature / architecture engineering
- Neuralization
- Word embeddings
- Backprop and optimization methods
- Hyperparameter tuning (model selection)
- Deep generative models
Beyond sequence labeling
- Distributions over other discrete structures (trees, proofs, program runs)
- Transition systems for transduction and parsing
- Integration over hidden variables
- Reinforcement learning
- Continuous generalizations
- Kalman filters
- Poisson and Hawkes processes
- Gaussian processes
- Exchangeability
- Hierarchical modeling
- Types vs. tokens
- Infinite Gaussian mixture model
- Hierarchical Pitman-Yor language model
- Infinite HMM
Other possible topics (time permitting)
- Lambek calculus / CCG / automata / other models of grammaticality
- Spectral learning
- Structure learning
Recurring themes
In some sense, the point of the course is to explicitly show you the
collection of design choices that you face when building probabilistic
reasoning systems. Your choices will affect (1) how well your models
fit the theory and the data, (2) the computational complexity of inference
and training, and (3) the difficulty of implementing the system. A short worked illustration of two of these choices appears after the list below.
- Training objectives
- Joint vs. conditional
- Loss-infused training (train a policy) vs. loss-infused decoding (train a model)
- Forms of regularization
- Smooth vs. non-smooth objectives
- Convex vs. non-convex objectives
- End-to-end vs. pipelined training
- Inference objectives
- Maximization vs. summation; annealing
- Search and sampling
- Dual decomposition (for maximization)
- Variational approximation (for summation)
- Modeling schemes
- Global vs. local - and how local? (= lookahead vs. heuristics)
- Graph-based vs. transition-based (= subgraph features vs. history-based features)
- Tractable vs. faithful models
- Domain knowledge vs. generic architectures
- Types vs. tokens
- Model structure
- Weighting the training data
- Computational tricks of the trade and implementation know-how
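As a brief worked illustration of two of the choices above (the notation is generic, not taken from the course materials): joint training fits a distribution over inputs and outputs together, conditional training fits only the predictive distribution, and at inference time one can either maximize over outputs or sum over them.

    % Joint vs. conditional maximum-likelihood training objectives:
    \max_\theta \sum_i \log p_\theta\bigl(x^{(i)}, y^{(i)}\bigr)
        \quad\text{vs.}\quad
    \max_\theta \sum_i \log p_\theta\bigl(y^{(i)} \mid x^{(i)}\bigr)

    % Maximization vs. summation at inference time:
    \hat{y} = \arg\max_y \, p_\theta(y \mid x)
        \quad\text{vs.}\quad
    Z_\theta(x) = \sum_y \exp \mathrm{score}_\theta(x, y)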