Train Deep Learning Models on AWS

by Oleg Polosin - 8 Nov 2018

A real-life example of how to train a Deep Learning model on an AWS Spot Instance using Spotty

Spotty is a tool that simplifies training of Deep Learning models on AWS.

Why will you ❤️ this tool?

  • it makes training on AWS GPU instances as simple as training on your local computer
  • it automatically manages all necessary AWS resources including AMIs, volumes and snapshots
  • it lets anyone train your model on AWS with a couple of commands
  • it detaches remote processes from SSH sessions
  • it saves you up to 70% of the costs by using Spot Instances

To show how it works, let’s take a non-trivial model and try to train it. I chose one of the implementations of Tacotron 2. It’s a speech synthesis system by Google.

Clone the repository of Tacotron 2 to your computer:

git clone https://github.com/Rayhane-mamah/Tacotron-2.git

Docker Image

Spotty trains models inside a Docker container. So we need to either find a publicly available Docker image that satisfies the model’s requirements, or create a new Dockerfile with a proper environment.

This implementation of Tacotron uses Python 3 and TensorFlow, so we could use the official TensorFlow image: tensorflow/tensorflow:latest-gpu-py3. But this image doesn’t satisfy all the requirements from the “requirements.txt” file, so we need to extend it and install the remaining libraries on top.

Create a Dockerfile in the root directory of the project:

FROM tensorflow/tensorflow:latest-gpu-py3

WORKDIR /root

# install pyaudio library
RUN apt-get update \
  && apt-get install -y python3-pyaudio \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

# install other requirements
COPY requirements.txt requirements.txt
RUN grep -v '^pyaudio' requirements.txt > requirements_updated.txt \
  && pip3 install -r requirements_updated.txt

Here we’re extending the original TensorFlow image and installing all other requirements (I couldn’t install the pyaudio library through pip, so I did it using apt).
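To see what the grep step in the Dockerfile does, here is a minimal sketch that applies the same filter to a small sample list. The package names in the sample are hypothetical; the real requirements.txt ships with the Tacotron-2 repository.

```shell
# Drop every line that starts with "pyaudio", as the Dockerfile's RUN step does.
drop_pyaudio() {
  grep -v '^pyaudio'
}

# Hypothetical sample contents, used only for illustration.
printf '%s\n' 'numpy' 'pyaudio==0.2.11' 'tqdm' | drop_pyaudio
# prints:
# numpy
# tqdm
```

The filtered list is what pip3 installs, while pyaudio itself comes from the apt package.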

Also, create the .dockerignore file with the following content:

# ignore everything
**

# allow only requirements.txt file
!/requirements.txt

Otherwise, you may get an out-of-space error, because Docker copies the entire build context (including the heavy “training_data/” directory) to the Docker daemon.
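If you are not sure what makes the build context heavy, a quick size listing helps. This is a rough sketch using standard tools; the helper name context_sizes is made up for this example.

```shell
# List top-level entries of a directory by size in KiB, largest first.
# Note: the /* glob skips hidden entries such as .git.
context_sizes() {
  du -sk "${1:-.}"/* 2>/dev/null | sort -rn
}

# Show the five largest entries of the current directory.
context_sizes . | head -n 5
```

Anything large that the image doesn’t need at build time is a candidate for the .dockerignore file.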


Spotty Configuration File

Once we have the Dockerfile, we’re ready to write a Spotty configuration file. Create the spotty.yaml file in the root directory of the project.

It consists of 3 sections: project, instance and scripts.


Section 1: Project

project:
  name: Tacotron2
  remoteDir: /workspace/project
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - training_data/*

The section contains the following parameters:

  1. Name of the project: The name will be used in names of AWS resources. For example, in the name of the S3 bucket that will be used to synchronize the project code with the instance.
  2. Remote directory: It’s a directory where the project will be stored on the instance.
  3. Synchronization filters: Filters exclude files and directories that shouldn’t be synchronized with the instance. For example, we ignore PyCharm configuration, Git files, Python cache files and training data.
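To get a feel for which paths the exclude patterns cover, here is a small sketch that checks a few sample paths against the same patterns with a shell case statement. It assumes the filters behave like the --exclude patterns of aws s3 sync, where * also matches “/”; that is an assumption about Spotty, not something this article confirms. The function name and sample paths are made up for the example.

```shell
# Return 0 if the path matches one of the exclude patterns from spotty.yaml.
# In a case statement, * matches any characters, including "/".
is_excluded() {
  case "$1" in
    .idea/*|.git/*|*/__pycache__/*|training_data/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical sample paths.
for path in train.py .git/config tacotron/__pycache__/model.pyc training_data/mels/1.npy; do
  if is_excluded "$path"; then
    echo "skip  $path"
  else
    echo "sync  $path"
  fi
done
```

Note that '*/__pycache__/*' requires a parent directory, so under this reading a “__pycache__” directory in the project root would not be matched.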

Section 2: Instance

instance:
  region: us-east-2
  instanceType: p2.xlarge
  volumes:
    - name: Tacotron2
      directory: /workspace
      size: 50
  docker:
    file: Dockerfile
    workingDir: /workspace/project
    dataRoot: /workspace/docker
  ports: [6006, 8888]

The section contains the following parameters:

  1. Region: AWS region where a Spot Instance will be launched.
  2. Instance type: Type of AWS EC2 instance.
  3. List of volumes: Each volume has a name, a directory where the volume will be mounted, and a size.

    When you start an instance for the first time, the volume is created. When you stop the instance, a snapshot of the volume is taken, and it is restored automatically on the next start.
  4. Docker: Here we set the path to our Dockerfile. An alternative approach is to build the image locally and push it to the Docker Hub Registry; then you can use the name of the image instead of a file.

    We also set a working directory, which is used by the scripts from the “scripts” section.

    Also, we can change the Docker data root directory to a directory on the attached volume. The downloaded images are then saved with the snapshot of the volume, so restoring the image takes less time on the next start.
  5. Ports: Ports to expose. In this example, we open 2 ports: 6006 for TensorBoard and 8888 for Jupyter Notebook.

Read more about other parameters in the documentation.


Section 3: Scripts

scripts:
  preprocess: |
    curl -O http://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
    tar xvjf LJSpeech-1.1.tar.bz2
    rm LJSpeech-1.1.tar.bz2
    python3 preprocess.py
  train: |
    python train.py --model='Tacotron-2'
  tensorboard: |
    tensorboard --logdir /workspace/project/logs-Tacotron-2
  jupyter: |
    /run_jupyter.sh --allow-root

Scripts are optional, but very useful. They can be run on the instance using the following command:

$ spotty run <SCRIPT_NAME>

For this project we’ve created 4 scripts:

  • preprocess: downloads the dataset and prepares it for training,
  • train: starts training,
  • tensorboard: runs TensorBoard on port 6006,
  • jupyter: starts a Jupyter Notebook server on port 8888.

That’s it! The model is ready to be trained on AWS!
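Because YAML is sensitive to indentation, it can be worth checking that the three sections sit at the top level of the file before going further. The sketch below is only a rough text check with grep, not a YAML parser, and the helper name check_sections is made up for this example.

```shell
# Report any of the three expected top-level sections missing from a config file.
# Returns 0 when all sections are present, 1 otherwise.
check_sections() {
  config="$1"
  missing=0
  for section in project instance scripts; do
    if ! grep -q "^${section}:" "$config"; then
      echo "missing top-level section: ${section}"
      missing=1
    fi
  done
  return "$missing"
}

# Run the check only if the file exists in the current directory.
if [ -f spotty.yaml ]; then
  check_sections spotty.yaml && echo "spotty.yaml has all three sections"
fi
```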


Spotty Installation

Requirements

Installation

1. Install Spotty using pip:

$ pip install -U spotty

2. Create an AMI with NVIDIA Docker. Run the following command from the root directory of your project (where the spotty.yaml file is located):

$ spotty create-ami

In several minutes you will have an AMI that can be used for all your projects within the AWS region.


Model Training

1. Start a Spot Instance with the Docker container:

$ spotty start

Once the instance is up and running, you will see its IP address. Use it to open TensorBoard and Jupyter Notebook later.

2. Download and preprocess the data for the Tacotron model. We already have a custom script in the configuration file to do that. Just run:

$ spotty run preprocess

3. Once the preprocessing is done, train the model. Run the “train” script:

$ spotty run train

On a “p2.xlarge” instance it will probably take around 8–9 days to reach 120 thousand steps. But you could use instances with more performant GPUs to make the training faster.
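The estimate above translates into a rough time per training step. This is plain arithmetic on the numbers quoted in this article, not a measurement; the helper name is made up for the example.

```shell
# Seconds per training step for a run that takes `days` days and `steps` steps.
seconds_per_step() {
  awk -v days="$1" -v steps="$2" 'BEGIN { printf "%.2f\n", days * 86400 / steps }'
}

seconds_per_step 8 120000   # 8 days for 120 thousand steps, prints 5.76
seconds_per_step 9 120000   # 9 days for 120 thousand steps, prints 6.48
```

So the estimate corresponds to roughly 6 seconds per step on a “p2.xlarge” instance.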

You can detach this SSH session using the Ctrl + b, then d key combination. The training process won’t be interrupted. To reattach that session, just run the spotty run train command again.


TensorBoard

Start TensorBoard using the “tensorboard” script:

$ spotty run tensorboard

TensorBoard will be running on port 6006. You can detach the SSH session using the Ctrl + b, then d key combination; TensorBoard will keep running.
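To open TensorBoard in a browser, combine the IP address of the instance with the port. A tiny sketch; the address below is a documentation placeholder, so replace it with the one printed when the instance starts.

```shell
# Build the URL of a service exposed by the instance.
service_url() {
  printf 'http://%s:%s\n' "$1" "$2"
}

INSTANCE_IP="203.0.113.10"   # placeholder, not a real instance address
service_url "$INSTANCE_IP" 6006   # TensorBoard
```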

Jupyter Notebook

You can use Jupyter Notebook to download trained models to your computer. Use the “jupyter” script to start it:

$ spotty run jupyter

Jupyter Notebook will be running on port 8888. Open it using the IP address of the instance and the URL that you see in the output of the command.


SSH Connection

To connect to the running Docker container via SSH, use the following command:

$ spotty ssh

It uses a tmux session, so you can always detach it using the Ctrl + b, then d key combination and reattach it later by running the spotty ssh command again.


Stop Instance

Don’t forget to stop the instance once you are done! Use the following command:

$ spotty stop

When you stop the instance, Spotty automatically creates snapshots of the volumes. The next time you start an instance, it restores the snapshots automatically.

Conclusion

Using Spotty is a convenient way to train Deep Learning models on AWS Spot Instances. It will save you not just up to 70% of the cost, but also a lot of time on setting up an environment for your models and notebooks. Once you have a Spotty configuration for your model, anyone can train it with a couple of commands.

If you enjoyed this article, please star the project on GitHub and share this article with your friends.

This article was originally published on Medium.
