Overview

This page gives a high-level overview of the Pylearn2 library and describes how the various parts fit together.

Before learning Pylearn2, it is imperative that you have a good understanding of Theano. In particular, you should understand:

  • How Theano uses Variables, Ops, and Apply nodes to represent symbolic expressions.
  • What a Theano function is.
  • What Theano shared variables are and how they can make state persist between calls to Theano functions.

Once you have that under your belt, we can move on to Pylearn2 itself. Note that throughout this page we will mention several different classes and functions but not completely describe their parameters. The purpose of this page is to give you a basic idea of what is possible in Pylearn2, what kind of object is needed to accomplish various kinds of tasks, and where these objects are located. To find out exactly what each object or function does, you should read its docstring in the python code itself. As always, e-mail pylearn-dev@googlegroups.com if you find the documentation lacking, confusing, or wrong in any way.

The Model class

A good central starting point is the Model class, defined in pylearn2/models/model.py. Most of Pylearn2 is designed either to train a Model or to do some analysis on a trained Model (evaluate its classification accuracy, etc.).

A Model can be almost anything. It can be a probabilistic model that generates data. It can be a deterministic model that maps from inputs to outputs, like a support vector machine. It can be as simple as a single Gaussian distribution or as complex as a deep Boltzmann machine. The main defining feature of a Model is just that it has a set of parameters represented as Theano shared variables.

Training a Model while avoiding Pylearn2 entanglements

Part of the Pylearn2 vision is that users should be able to use only the pieces of the library that they want to. This way you don’t need to learn the entire library in order to make use of it. If you want to learn about as little of Pylearn2 as possible, you can simply implement two methods:

  • train_all(dataset): This method is passed a Pylearn2 Dataset. When called, it should do one pass through the dataset and update the parameters accordingly. For example, k-means could assign every point in the dataset to a centroid, then update each centroid to the mean of all the points that belong to it.
  • continue_learning(): This method should return True if the algorithm needs to continue learning. For example, the k-means algorithm could return True if the centroids moved more than a certain amount on the most recent pass. (If the model has not been trained at all, this should always return True.)

We’ll say more about what a Pylearn2 Dataset is later. For now, you just need to know that it is a collection of data examples.

If you have a Model and a Dataset you can train them from your own python script like this:

while model.continue_learning():
    model.train_all(dataset)
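
To make the two-method interface concrete, here is a minimal sketch of the k-means example in plain Python. It is an illustration under assumptions, not Pylearn2 code: ToyKMeans is a hypothetical class that does not subclass Model, and the dataset is a plain list of floats rather than a Pylearn2 Dataset.

```python
class ToyKMeans:
    """1-D k-means exposing train_all and continue_learning.

    Hypothetical stand-in: it does not subclass pylearn2's Model, and
    `dataset` is a plain list of floats rather than a Pylearn2 Dataset.
    """

    def __init__(self, centroids, tol=1e-6):
        self.centroids = list(centroids)
        self.tol = tol
        self.last_shift = None  # None means "never trained"

    def train_all(self, dataset):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in self.centroids]
        for x in dataset:
            nearest = min(range(len(self.centroids)),
                          key=lambda k: abs(x - self.centroids[k]))
            clusters[nearest].append(x)
        # Move each centroid to the mean of its points (keep it if empty).
        new = [sum(c) / len(c) if c else old
               for c, old in zip(clusters, self.centroids)]
        self.last_shift = max(abs(a - b)
                              for a, b in zip(new, self.centroids))
        self.centroids = new

    def continue_learning(self):
        # Always True before the first pass, then True while centroids move.
        return self.last_shift is None or self.last_shift > self.tol


dataset = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]
model = ToyKMeans(centroids=[1.0, 5.0])
while model.continue_learning():
    model.train_all(dataset)
```

The loop at the bottom is the same while loop as above; it stops once a pass leaves the centroids unchanged.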

This approach to training is simple and easy, and still allows for a lot of powerful functionality. We could imagine a more advanced while loop that inspects or modifies the model after each call to train_all (and indeed, this is what the Train object does). However, it violates some of the key goals of the Pylearn2 Vision.

Suppose that the Model is an RBM with Gaussian visible units. There are several ways to train such RBMs. They can be trained with contrastive divergence, persistent contrastive divergence, score matching, denoising score matching, and noise contrastive estimation. Trying to capture all of these with a “train_all” method is a fool’s errand. In Pylearn2, we prefer to factor learning algorithms into many different pieces. If you don’t want to work with the entire Pylearn2 ecosystem, train_all is an acceptable way to implement learning in your model. However, an ideal Pylearn2 Model will contain very little functionality of its own. We will later describe how to use TrainingAlgorithms and Costs to make a Model that can be trained in many different ways.

First, however, we’ll describe how to train a Model using the main “control center” of Pylearn2: the Train object.

Training a Model with the Train object

In the previous section, we described how to train a Model with a simple python while loop. Pylearn2 provides a convenience class for training Models using several powerful additional features. This is the pylearn2.train.Train class.

At its bare minimum, the Train class takes a dataset and a model in its constructor. Calling the class’s main_loop method then makes it do the same thing as the simple while loop:

from pylearn2.train import Train

# This function...
def train(dataset, model):
    while model.continue_learning():
        model.train_all(dataset)

# does the same thing as this function...
def train(dataset, model):
    loop = Train(dataset, model)
    loop.main_loop()

The Train class makes it easy to use several of Pylearn2’s other more powerful features. For example, by specifying the save_path and save_freq arguments, the Train object can save your model after each epoch (an “epoch” is a call to train_all). The save can be configured not to overwrite pre-existing files. Also, note that on the second iteration requiring a save and all subsequent such iterations, the Train object will need to overwrite a file that it made earlier. When doing so, it first makes a backup of the old file and removes that backup only if the write succeeds. If your job dies in the middle of the write, you can still recover the backup.

These are just a few examples of the extra convenient features that come from using the pylearn2 Train object. Later we will describe how to make it more powerful using callbacks and training algorithm plugins.
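
The overwrite-with-backup behaviour can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Train object's actual code: save_with_backup and the ".bak" suffix are hypothetical names chosen for this example.

```python
import os
import pickle


def save_with_backup(obj, path):
    """Write `obj` to `path`, keeping the old file until the write succeeds."""
    backup = path + ".bak"  # hypothetical naming scheme for the backup
    had_old = os.path.exists(path)
    if had_old:
        # Move the previous save out of the way instead of deleting it.
        os.replace(path, backup)
    with open(path, "wb") as f:
        # If the job dies here, the backup file is still on disk.
        pickle.dump(obj, f)
    if had_old:
        # The write succeeded, so the backup is no longer needed.
        os.remove(backup)
```

Calling save_with_backup twice on the same path leaves only the newest file behind, while a crash during the second write would leave the first save recoverable from the backup.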

Configuring the Train object with a YAML file

Pylearn2 has to solve many problems that are not directly related to machine learning and are, as a result, not all that interesting to Pylearn2’s developers, who are a bunch of machine learning grad students.

One of these problems is how to associate a Model with a polymorphic Dataset object and have this association persist after the Model has been serialized, for example to disk. Usually Datasets are big enough that we don’t want to serialize a whole copy of the Dataset with every Model we save. But we preprocess Datasets enough that we don’t want to store every possible preprocessed dataset and give each model a hard link to the preprocessed dataset.

Instead, we currently solve this problem by using the YAML markup language to store a description of how to make the Dataset in the Model. This is not necessarily an ideal solution, and if you have a better idea and some willingness to implement it, please let us know.

First, a brief explanation of why you want to associate a Dataset with a Model. Suppose your model is a classifier that works on the MNIST dataset. After training the Model, you might want to look at the weights it has learned. MNIST consists of 28 x 28 pixel greyscale images, but the weights are probably stored as 784-element vectors. In order for Pylearn2’s visualization functionality to render the weights correctly as an image, it will need to know that the model was trained on the MNIST dataset. This is just one example of why correct analysis of a saved model requires knowledge of the dataset.

Annotating every Model with a correctly-formatted yaml string specifying the dataset would be an annoying and error-prone thing to do from inside a python script. Instead, we try to make all of our training jobs specifiable via a yaml file. The substring of that file used to describe the dataset is then stored in the Model. If you provide a bad description of the dataset, the job doesn’t run, so the description in the Model should always be correct. The only way it can go wrong is if conditions change between when the Model is serialized and when it is deserialized (for example, if the yaml string refers to a file that gets moved, or if you deserialize the model file on a machine that has the dataset file in a different place).

Using YAML for everything is thus not something we would ideally like to have designed into the library, but for now it is the easiest way to make your life convenient.

To run an experiment using only a yaml string, the easiest thing to do is save a yaml file with these contents:

!obj:pylearn2.train.Train {
        dataset : FILL IN DESCRIPTION OF DATASET HERE
        model : FILL IN DESCRIPTION OF MODEL HERE
        }

You can then run the experiment with this command:

train.py my_train_job.yaml

Here, train.py is a script stored in pylearn2/scripts. Examples throughout the Pylearn2 documentation will assume that this is in your PATH variable.
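
In the file above, the !obj: tag names a Python class by its dotted import path, and the mapping that follows supplies the constructor's keyword arguments. As a rough sketch of that idea only (this is not Pylearn2's YAML parser, and instantiate is a hypothetical helper), the resolution step looks something like this:

```python
import importlib


def instantiate(dotted_path, **kwargs):
    """Import `package.module.ClassName` and call it with `kwargs`."""
    module_name, _, attr = dotted_path.rpartition(".")
    cls = getattr(importlib.import_module(module_name), attr)
    return cls(**kwargs)


# Stand-in for a YAML entry such as
#   !obj:fractions.Fraction { numerator: 3, denominator: 4 }
# using a standard-library class so the example runs without Pylearn2.
value = instantiate("fractions.Fraction", numerator=3, denominator=4)
```

Because a description of this kind is just text, it can be stored alongside a saved model and replayed later to rebuild the same object.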

The Monitor

The Monitor is an object that monitors training of a Model. If you use the Train object it will be set up and maintained for you automatically, so we won’t go into the details of how that works here.

The Monitor keeps track of several “channels” on various datasets. A channel might be something like a classifier’s accuracy, in which case the monitor could have one channel called “train_acc” tracking accuracy on the training set and another called “valid_acc” tracking accuracy on a held-out validation set.

The main way that you as a user add channels is by implementing the get_monitoring_channels method of various objects like Models and Costs. You’ll need to tell the training algorithm which datasets to monitor on.

It’s possible for you to view the channels of a saved model using the plot_monitor.py script. The Monitor is also an important part of many training algorithms because the algorithms have the ability to react to values in the monitor. For example, many training algorithms use “early stopping,” where an iterative training method quits after accuracy on some validation set starts to drop, in order to prevent overfitting.
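
As a minimal sketch of how channels and early stopping fit together, the following plain-Python example records a validation accuracy after each epoch and stops once the best value is a few epochs old. ToyMonitor, should_stop and the patience parameter are hypothetical names for this illustration and are not the Pylearn2 Monitor API.

```python
class ToyMonitor:
    """Hypothetical monitor: named channels holding one value per epoch."""

    def __init__(self):
        self.channels = {}

    def record(self, name, value):
        self.channels.setdefault(name, []).append(value)


def should_stop(history, patience=2):
    """Early stopping: True if the best value is at least `patience` epochs old."""
    if not history:
        return False
    best_epoch = max(range(len(history)), key=history.__getitem__)
    return len(history) - 1 - best_epoch >= patience


monitor = ToyMonitor()
stopped_at = None
# Made-up accuracies; a real job would compute them on a validation set.
for epoch, acc in enumerate([0.80, 0.85, 0.84, 0.83, 0.90]):
    monitor.record("valid_acc", acc)
    if should_stop(monitor.channels["valid_acc"]):
        stopped_at = epoch
        break
```

With these made-up numbers the loop stops at epoch 3, two epochs after the best accuracy was seen, and never reaches the final value.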

TrainingAlgorithm

A TrainingAlgorithm is an object that interacts with a Model to update the Model’s parameters given some data. If you supply a TrainingAlgorithm to the Train object, it will call the TrainingAlgorithm to do each epoch, rather than calling the Model’s train_all method.

Pylearn2 includes two basic yet powerful training algorithms: SGD (Stochastic Gradient Descent) and BGD (Batch Gradient Descent). Both of these algorithms work by minimizing some cost function of the model parameters and the training data. For example, to fit a logistic regression classifier, you could minimize -log P(Y | X) where Y is training labels and X is training input features. Both of them are configurable to perform more advanced optimization algorithms. For example, BGD can be configured to do nonlinear conjugate gradient descent. SGD supports user-provided callbacks that can provide powerful extra features. Pylearn2 uses this callback system to implement Polyak averaging.

Both of these training algorithms also accept a TerminationCriterion object that determines when to stop learning. A common practice is for the TerminationCriterion to look at channels in the monitor to determine when a learning algorithm has converged. This is how we implement early stopping as mentioned above.
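
The logistic regression example from this section can be written out as a small stochastic gradient descent loop. This is a pure-Python sketch for a single input feature, assuming a fixed learning rate and a fixed number of epochs; it does not use Pylearn2's SGD class, and the function names are hypothetical.

```python
import math
import random


def nll(w, b, data):
    """Mean of -log P(y | x) for 1-D logistic regression."""
    total = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(data)


def sgd(data, learning_rate=0.1, epochs=50, seed=0):
    """Plain stochastic gradient descent, one example per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):  # one shuffled pass
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            # Gradient of -log P(y | x) w.r.t. (w, b) is ((p - y) * x, p - y).
            w -= learning_rate * (p - y) * x
            b -= learning_rate * (p - y)
    return w, b


data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
w, b = sgd(data)
```

Comparing nll(w, b, data) with nll(0.0, 0.0, data) shows how much the cost dropped during training.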

TrainingCallback

The Train object itself also supports callbacks that run after each iteration. These allow further customization of the training procedure. For example, the OneOverEpoch callback can be used to make the SGD algorithm use a 1/t learning rate schedule instead of a fixed learning rate. You can also pass in user-defined callbacks so the sky is the limit.
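
A 1/t learning rate schedule is easy to express as a callback that runs after every epoch. The sketch below uses hypothetical ToySGD and OneOverT classes to show the idea; it is not the OneOverEpoch implementation, and its call signature is an assumption made for this example.

```python
class ToySGD:
    """Hypothetical algorithm object; only the learning rate matters here."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate


class OneOverT:
    """Callback that makes epoch t (counting from 1) use initial / t."""

    def __init__(self, initial):
        self.initial = initial
        self.epochs_done = 0

    def __call__(self, algorithm):
        self.epochs_done += 1
        # The next epoch is number epochs_done + 1.
        algorithm.learning_rate = self.initial / (self.epochs_done + 1)


algorithm = ToySGD(learning_rate=0.5)
callback = OneOverT(initial=0.5)
rates = []
for _ in range(4):
    rates.append(algorithm.learning_rate)  # rate used during this epoch
    callback(algorithm)                    # runs after the epoch finishes
```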

Cost

A Cost is an object that describes an objective function. Pylearn2 uses these to control both the SGD and BGD algorithms. A Cost describes the objective function both in terms of its value, given some inputs and a model to provide the parameters, and its gradients. If no custom implementation of the gradients is provided, the default implementation simply uses Theano’s symbolic differentiation system to provide the gradients.

However, you can accomplish more powerful functionality by returning other gradients. For example, the log likelihood of an RBM is intractable, but its gradient can be approximated with a sampling procedure. You could implement RBM training with the SGD object by returning None for the objective function value but providing sampling-based approximations to the gradient.
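
The split between an objective's value and its gradients can be sketched without Theano. In the hypothetical classes below, the default gradients method uses finite differences as a numerical stand-in for symbolic differentiation, and a subclass can return None for the value while still supplying gradients. The class and method names are assumptions for this example, not the Pylearn2 Cost interface.

```python
class ToyCost:
    """Hypothetical cost: a value plus gradients w.r.t. a list of parameters."""

    def value(self, params, data):
        raise NotImplementedError

    def gradients(self, params, data, eps=1e-6):
        # Default: central finite differences of value(), a numerical
        # stand-in for the symbolic differentiation that Theano provides.
        grads = []
        for i in range(len(params)):
            up = list(params)
            up[i] += eps
            down = list(params)
            down[i] -= eps
            diff = self.value(up, data) - self.value(down, data)
            grads.append(diff / (2 * eps))
        return grads


class MeanSquaredError(ToyCost):
    """Only value() is defined; gradients come from the default."""

    def value(self, params, data):
        (w,) = params
        return sum((w * x - y) ** 2 for x, y in data) / len(data)


class GradientOnlyCost(ToyCost):
    """No usable value, so value() returns None and gradients() is custom."""

    def value(self, params, data):
        return None

    def gradients(self, params, data, eps=None):
        (w,) = params
        # Any estimator could go here; this one is exact for squared error.
        return [sum(2 * (w * x - y) * x for x, y in data) / len(data)]
```

A training loop that only ever asks for gradients can use either class interchangeably, which is the property the RBM example relies on.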

Datasets

One of the key principles of Pylearn2 development is that we make features when we need them. One place where this is pretty obvious is the Dataset section of the library. Most people who actually use Pylearn2 use it to train complicated models like RBMs on little image patch datasets like MNIST or CIFAR-10. This is one section of the library you can really expect to grow as we move to datasets that don’t fit in memory, use more exotic data types, etc.

For now, there are two basic ways data can be viewed: as a design matrix or as a “topological view.” The topological view is basically the true underlying form of the data. A batch of image data is a 4-tensor with one axis indexing into different images, one axis indexing into image channels (red, green, blue), and two topological axes indexing the rows and columns of the image. In the design matrix view, the image is flattened into a row vector of size rows * columns * channels.
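
The relationship between the two views can be illustrated with nested lists. This sketch assumes the topological batch is indexed as [example][row][column][channel]; that axis order, and the function names, are assumptions for the example rather than something Pylearn2 guarantees.

```python
def to_design_matrix(batch):
    """Flatten a [example][row][col][channel] batch into one row per example."""
    return [[v for row in image for pixel in row for v in pixel]
            for image in batch]


def to_topological(design, rows, cols, channels):
    """Inverse of to_design_matrix for a known image shape."""
    batch = []
    for flat in design:
        it = iter(flat)
        batch.append([[[next(it) for _ in range(channels)]
                       for _ in range(cols)]
                      for _ in range(rows)])
    return batch


# Two tiny 2x2 RGB "images" whose values are simply 0..11 and 12..23.
batch = to_topological([list(range(12)), list(range(12, 24))], 2, 2, 3)
design = to_design_matrix(batch)
```

Each row of design has rows * columns * channels = 12 entries, and converting back with the same shape recovers the original batch.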

Different pieces of pylearn2 functionality may ask the dataset for different views of the same data. The Dataset object should be able to provide either view at any time. You might want to train a densely connected RBM on design matrices, or a convolutional MLP on topological views, or you might just want to display an image of a topological view to the user.

The datasets directory includes a Preprocessor class that can modify a Dataset. This should not be considered a way of implementing view conversion, since a Dataset should always be ready to provide either view of its data rapidly. Preprocessors are powerful and can modify datasets in many ways, including changing their number of examples or what kind of data they store. For example, you might want to preprocess the CIFAR-10 dataset of 50,000 32x32 images into a dataset of 200,000 6x6 image patches.
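
The patch example works out to four 6x6 patches per image (200,000 / 50,000). A minimal sketch of such a preprocessing step, assuming patches are cut at random positions and using a hypothetical extract_patches function on single-channel images, might look like this:

```python
import random


def extract_patches(images, patch_size, patches_per_image, seed=0):
    """Return random square patches cut from each 2-D image (list of rows)."""
    rng = random.Random(seed)
    patches = []
    for image in images:
        rows, cols = len(image), len(image[0])
        for _ in range(patches_per_image):
            r = rng.randrange(rows - patch_size + 1)
            c = rng.randrange(cols - patch_size + 1)
            patches.append([row[c:c + patch_size]
                            for row in image[r:r + patch_size]])
    return patches


# 5 blank 32x32 images stand in for the real data; 4 patches each gives
# 20 patches, the same ratio as 50,000 images to 200,000 patches.
images = [[[0] * 32 for _ in range(32)] for _ in range(5)]
patches = extract_patches(images, patch_size=6, patches_per_image=4)
```

Note that the output has a different number of examples and a different example shape from the input, which is why this is a preprocessing step rather than a change of view.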

Many common preprocessing operations can be easily represented by Blocks, a kind of Pylearn2 class that takes a batch of input examples, processes them independently, and returns a batch of output examples. To avoid separately implementing the same operation as both a Block and a Preprocessor, there is a generic BlockPreprocessor that can map a Block over a whole Dataset.

Preprocessors are used when you have an expensive operation that you want to do once and save. If you have cheap preprocessing that you would like to do on the fly for each batch, you can do this with the TransformerDataset.
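
The difference between preprocessing once and transforming on the fly can be sketched with a single batch-to-batch function applied in two ways. The names below are hypothetical and only mirror the roles of a Block, a BlockPreprocessor and a TransformerDataset; they are not those classes.

```python
def center(batch):
    """A 'block': maps a batch to a batch, treating each example on its own."""
    return [[v - sum(example) / len(example) for v in example]
            for example in batch]


def preprocess_once(dataset, block):
    """Eager: apply the block to everything up front and keep the result."""
    return block(dataset)


def transformed_batches(dataset, block, batch_size):
    """Lazy: apply the block to each batch only when it is requested."""
    for start in range(0, len(dataset), batch_size):
        yield block(dataset[start:start + batch_size])


dataset = [[1.0, 3.0], [2.0, 6.0], [0.0, 0.0]]
eager = preprocess_once(dataset, center)
lazy = [example
        for batch in transformed_batches(dataset, center, batch_size=2)
        for example in batch]
```

Both routes produce the same examples; the eager one pays the cost once and stores the result, while the lazy one recomputes it every time a batch is drawn.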

Datasets should be considered relatively abstract entities; in particular we should not assume that they contain a finite number of examples, or that jumping to a particular example index is easy. For example, consider the dataset of small patches drawn from a collection of large, irregularly sized images. Due to the irregular size of the images we need to know the size of each of them in advance if we want to jump to patch number 543,286,932. pylearn2.utils.iteration provides interfaces for abstractly iterating through examples. It supports iteration schemes such as continually drawing random examples from an endless stream.
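
Two iteration schemes of the kind described here can be sketched as Python generators: one finite sequential pass, and one endless stream of random batches. These are illustrative functions only and do not reproduce the interfaces in pylearn2.utils.iteration.

```python
import itertools
import random


def sequential(dataset, batch_size):
    """One finite pass over the data in order."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


def endless_random(dataset, batch_size, seed=0):
    """Never-ending stream of batches drawn uniformly with replacement."""
    rng = random.Random(seed)
    while True:
        yield [rng.choice(dataset) for _ in range(batch_size)]


dataset = list(range(10))
# The stream has no end, so the consumer decides how many batches to take.
first_three = list(itertools.islice(endless_random(dataset, batch_size=4), 3))
```

Code that consumes batches through an iterator like this never needs to know how many examples exist or how to jump to a particular index.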

Spaces and LinearTransforms

Many of our models can be described largely in terms of linear transformations from one space to another. Often changing the exact type of linear transformation can be a useful modeling operation. For example, to scale RBMs to work on large images, we can replace all the matrix multiplies in the model description with convolutions. Pylearn2 uses the TheanoLinear library to provide abstract LinearTransforms. If you write your model in terms of transforming from one Pylearn2 Space to another using a LinearTransform, then your model can easily be converted between dense, convolutional, tiled convolutional, locally connected without weight sharing, etc. just by plugging in the right LinearTransform with compatible Spaces.
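
The idea of swapping one linear transformation for another can be shown with two small classes that share a single method. This sketch is plain Python with hypothetical class names; it does not use TheanoLinear or Pylearn2 Spaces, and the lmul method name is an assumption for this example.

```python
class Dense:
    """y = W x with an explicit weight matrix (list of rows)."""

    def __init__(self, weights):
        self.weights = weights

    def lmul(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


class Conv1D:
    """Valid 1-D correlation with one shared filter: also linear in x."""

    def __init__(self, kernel):
        self.kernel = kernel

    def lmul(self, x):
        k = len(self.kernel)
        return [sum(w * v for w, v in zip(self.kernel, x[i:i + k]))
                for i in range(len(x) - k + 1)]


def hidden_activations(transform, x):
    """Model code that relies only on the shared lmul interface."""
    return [max(0.0, a) for a in transform.lmul(x)]


x = [1.0, 2.0, 3.0]
dense_out = hidden_activations(Dense([[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]), x)
conv_out = hidden_activations(Conv1D([-1.0, 1.0]), x)
```

Here hidden_activations never changes; only the object passed in decides whether the model is densely connected or convolutional.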

Analyzing saved models

A variety of scripts let you analyze your saved models:

  • plot_monitor.py lets you plot channels recorded in the model’s monitor.
  • print_monitor.py prints out the final value of each channel in the monitor.
  • summarize_model.py prints out a few statistics of the model’s parameters.
  • show_weights.py displays a visualization of the weights of your model’s first layer.