- The first day of class is Monday, Jan 28. See you there!
- Please fill out the questionnaire.

**Lectures:** 3-4pm MWF, in Hackerman 320.

- Sometimes we may have to do 3-4:15, but this will be announced in advance.

**Instructor:** Jason Eisner

**Office hours:** 4-4:30pm after class, or by appointment.

**TA:** Sebastian Mielke

**Office hours:** Friday, 4-5pm, outside Hackerman 321

**Email:** cs765-staff at cs.jhu.edu

**Discussion site:** https://piazza.com/jhu/spring2019/601765/

**Class notes:**

- Formalisms and terminology
- Scribe Notes
- Readings (TBA)

**Video lectures:** via Blackboard

We will probably have 4 homeworks (plus a term project). Homeworks will be posted here:

- Homework 1: Slow general algorithms for sequence labeling

Homeworks are to be submitted on Gradescope, following the Piazza instructions.

Last year’s homework assignments are listed here for reference. You can expect some similar assignments this year.

- Homework 2: Efficient finite-state methods
- Homework 3: Neural models

*Catalog description*: This course surveys formal ingredients that are used to build structured models of character and word sequences. We will unpack recent deep learning architectures that consider various kinds of latent structure, and see how they draw on earlier work in structured prediction, dimensionality reduction, Bayesian nonparametrics, multi-task learning, etc. We will also examine a range of strategies used for inference and learning in these models. Students will be expected to read recent papers and carry out a research project. [Applications or Analysis]

*Prerequisites:* EN.600/601.465/665 or permission. Prior coursework in statistics or machine learning is recommended. Students may wish to prepare for their choice of research project by taking EN.601.382 Deep Learning Lab at the same time.

*Remarks:*

- The focus of the class is on understanding the space of *good options* for designing probabilistic sequence models and computing with them. We will discuss the *qualitative* advantages and disadvantages of different options. Our goal is not to teach you exactly how today’s top-ranked system works, but rather to give you a toolbox for understanding and creating system designs.
- This class builds on the dynamic programming algorithms and log-linear models covered in NLP. We will primarily extend to various neural (“log-nonlinear”) models, some of which allow dynamic programming.
- As this is a graduate class, the lecture style will be a bit more improvisational than in NLP. The class is also still under development. We will probably only cover a subset of the topics on the syllabus.

- Attending lectures
- Scribing? (i.e., drafting lecture notes)
- Reading papers?
- Homeworks
- Midterm exam?
- Final exam
- Final project

- Overview
- Sequence labeling as a canonical problem

- Notation and statistical background
- Algorithmic background: Paths in graphs
- Classical sequence labeling models
- Graphical models and belief propagation

- Beyond dynamic programming: Approximation algorithms
- Feature / architecture engineering
- Neuralization
- Word embeddings
- Backprop and optimization methods
- Hyperparameter tuning (model selection)
- Deep generative models

- Distributions over other discrete structures (trees, proofs)
- Transition systems for transduction and parsing
- Integration over hidden variables
- Reinforcement learning
- Continuous generalizations
- Kalman filters
- Poisson and Hawkes processes
- Gaussian processes
- Exchangeability
- Dirichlet processes
- Hierarchical modeling
- Types vs. tokens
- Infinite Gaussian mixture model
- Hierarchical Pitman-Yor language model
- Infinite HMM

- Lambek calculus / CCG / automata / other models of grammaticality
- Spectral learning
- Structure learning

In some sense, the point of the course is to explicitly show you the collection of design choices that you face when building probabilistic reasoning systems. Your choices will affect (1) how well your models fit the theory and the data, (2) the computational complexity of inference and training, and (3) the difficulty of implementing the system.

- Training objectives
- Joint vs. conditional
- Loss-infused training (train a policy) vs. loss-infused decoding (train a model)
- Forms of regularization
- Smooth vs. non-smooth objectives
- Convex vs. non-convex objectives
- End-to-end vs. pipelined training

- Inference objectives
- Maximization vs. summation; annealing
- Search and sampling
- Dual decomposition (for maximization)
- Variational approximation (for summation)

- Modeling schemes
- Global vs. local - and how local? (= lookahead vs. heuristics)
- Graph-based vs. transition-based (= subgraph features vs. history-based features)
- Tractable vs. faithful models
- Domain knowledge vs. generic architectures

- Types vs. tokens
- Model structure
- Weighting the training data

- Computational tricks of the trade and implementation know-how
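
To make one of the choices above concrete ("maximization vs. summation; annealing"), here is a minimal sketch. It is not taken from the course materials: the function names, the toy scores in `EMIT` and `TRANS`, and the temperature parameter are all assumptions made up for illustration. It scores tag sequences on a tiny chain model and combines the scores with a temperature-scaled log-sum-exp, once by brute-force enumeration and once by dynamic programming.

```python
import math
from itertools import product

def annealed_logsumexp(scores, temperature):
    """T * log(sum(exp(s / T))): plain log-sum-exp at T=1, approaches max(scores) as T -> 0."""
    top = max(scores)  # subtract the max for numerical stability
    return top + temperature * math.log(
        sum(math.exp((s - top) / temperature) for s in scores)
    )

def path_score(tags, emit, trans):
    """Total score of one tag sequence: emission scores plus transition scores."""
    score = emit[0][tags[0]]
    for i in range(1, len(tags)):
        score += trans[tags[i - 1]][tags[i]] + emit[i][tags[i]]
    return score

def brute_force(emit, trans, temperature):
    """Slow general algorithm: enumerate all k**n tag sequences."""
    n, k = len(emit), len(trans)
    all_scores = [path_score(tags, emit, trans) for tags in product(range(k), repeat=n)]
    return annealed_logsumexp(all_scores, temperature)

def dynamic_program(emit, trans, temperature):
    """Same quantity in O(n * k**2) time via a forward-style recurrence."""
    k = len(trans)
    alpha = list(emit[0])  # alpha[c]: combined score of all prefixes ending in tag c
    for i in range(1, len(emit)):
        alpha = [
            annealed_logsumexp([alpha[p] + trans[p][c] for p in range(k)], temperature)
            + emit[i][c]
            for c in range(k)
        ]
    return annealed_logsumexp(alpha, temperature)

# Toy scores (made up for illustration): 3 positions, 2 tags.
EMIT = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
TRANS = [[0.6, -0.2], [0.1, 0.8]]
```

At temperature 1, both functions return the log of the summed exponentiated path scores (the summation objective). As the temperature shrinks toward 0, both approach the score of the single best path (the maximization objective). The two functions agree at every temperature because addition distributes over the annealed log-sum-exp, which is what lets the dynamic program replace the enumeration.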