Fine-tuning Google Colab

Don’t we all love it when things “just work”? Colab is exactly that. Whether you want to play around with Python in your browser or need to test your machine learning pipeline, it makes your life easier.

While using Colab over the last couple of years, I have been looking for ways to optimize my workflow. In this article, I will share tweaks, solutions to the inconveniences I ran into, and new ways of putting Colab’s power to use.

Mounting Google Drive

I hated having to mount my Google Drive every time I opened my notebook or my virtual machine (VM) crashed. I had to follow the link, copy the authorization code, and paste it back into the notebook. That’s no way to live. Fortunately, Google added a one-time mounting mechanism that lets you link your notebook with your Google Drive: you can find it in the left menu under Files > Mount Drive. Now, each time you allocate or reconnect to a VM through that notebook, Google Drive is mounted automatically.
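
For reference, this is roughly the manual mount it replaces, using the standard google.colab helper; on a fresh VM it asks you to authorize access before your files show up under /content/drive:

from google.colab import drive

# Prompts for authorization on a fresh VM, then exposes your files under /content/drive
drive.mount('/content/drive')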

Once you mount your Drive, you should not be fetching individual files across the network. A better approach is to zip your files and copy the archive to your VM on Colab: a single request fetches all your files, and from then on you work against local storage. I used this code to copy an audio dataset from my Drive to my local storage:

import os

def copy_clips_dataset(name):
  """
  Copies an archived dataset from Google Drive to local storage.

  Keyword arguments:
    name: name of the tar file (without the .tar.gz extension)
  """
  if not os.path.exists(f'/content/{name}/'):
    !cp "/content/drive/My Drive/audio-clips/{name}.tar.gz" /content/
    !tar -xzf /content/{name}.tar.gz -C /content/
  # Count the regular files in the extracted clips folder
  !ls -l /content/{name}/clips | egrep -c '^-'

This function first checks whether the path already exists, which would mean you are running the cell a second time. If it doesn’t, it copies the .tar.gz file to the storage of the current VM and decompresses it; either way, it prints the number of files inside the extracted clips folder.
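
Using it then boils down to a single call; the archive name below is just a placeholder for whatever .tar.gz you keep in your Drive’s audio-clips folder:

# Placeholder name; expects audio-clips/common-voice.tar.gz in your Drive
copy_clips_dataset('common-voice')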

Timing your code

I needed more control over timing than %timeit and %time offer (run %lsmagic for the full list of magic commands). So instead of injecting timing logic throughout my code, I use decorators. That way I avoid boilerplate and get the flexibility I want.

# Source: https://stackoverflow.com/a/27737385

from functools import wraps
from time import time

def timing(f):
    @wraps(f)
    def wrap(*args, **kw):
        ts = time()
        result = f(*args, **kw)
        te = time()
        print(f'{f.__name__}{args}, {kw} took: {te-ts:2.6f} sec')
        return result
    return wrap

and to use it, all you have to do is:

@timing
def calculate(num):
  return sum([i for i in range(num)])

Each time you execute this function, the output would be something like this:

calculate(10000000,), {} took: 0.979142 sec

Reusing Code

Often I find myself reusing portions of code across multiple notebooks. That’s a sin. So I created a repository to aggregate my utility functions. All I have to do now to set up my notebook is clone that repository. Keep in mind that your default !pwd in Colab is /content. When you clone your repository using:

!git clone "https://github.com/AYBLBD/utils.git"

you will be creating a folder inside your present working directory, at /content/utils. For you to be able to use your module like this:

from utils import utils_colab as uc
uc.save_model(model)

your folder structure should resemble this:

+-- __init__.py
+-- utils
|   +-- __init__.py
|   +-- utils_colab.py
|   +-- ...
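
As a side note, the import above works because the notebook’s working directory (/content) ends up on sys.path. If you prefer to clone the repository somewhere else, a minimal sketch like this (the path below is hypothetical) keeps the same import working:

import sys

# Hypothetical location; append the directory that contains the cloned utils folder
sys.path.append('/content/projects')

from utils import utils_colab as uc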

Using MLflow

Have you ever tried to manage your hyperparameters using an Excel file? Or worse, a text file? I know the feeling. Luckily, there are tools to do that tedious work. The industry standard is MLflow. However, new tools are emerging, such as Neptune and Comet, that provide storage, data exploration, and team collaboration. I wanted to experiment with these tools without adding another dependency to my projects. And luckily again, these tools implement interfaces that integrate with MLflow. So by using MLflow, I was able to switch between tools as they emerged, experimenting with little to no change to my code, all while staying grounded in industry standards.

In this example, I will be using MLflow linked to Neptune. We can start by installing dependencies:

!pip install mlflow==1.12.1
!pip install neptune-mlflow==0.2.5 future==0.18.2

After creating a Neptune account, you will receive a NEPTUNE_API_TOKEN to use. It’s up to you to create a project and pick its name. Once you have both, you can set them in your environment:

import os

os.environ['NEPTUNE_API_TOKEN'] = '''[Your key here]'''
os.environ['NEPTUNE_PROJECT'] = '[User name]/[Project name]'

Then in your code, do your thing with MLflow:

# witchery stuff
...
mlflow.log_metric('validation_loss', validation_loss, step=epoch)
mlflow.log_metric('validation_cer', avg_cer, step=epoch)
mlflow.log_metric('validation_wer', avg_wer, step=epoch)
...
# more witchery stuff
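
For context, here is a minimal sketch of how such a run might look end to end; the run name, the parameter values, and the train_one_epoch helper are hypothetical stand-ins, not my actual training code:

import mlflow

with mlflow.start_run(run_name='baseline'):      # hypothetical run name
    # Hyperparameters are logged once as params...
    mlflow.log_param('learning_rate', 1e-3)
    mlflow.log_param('batch_size', 32)
    for epoch in range(10):
        validation_loss = train_one_epoch()      # hypothetical training step
        # ...while metrics are logged per step
        mlflow.log_metric('validation_loss', validation_loss, step=epoch)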

Once you have those metrics, you can persist them to the storage offered by Neptune by executing:

!neptune mlflow

Et voilà! You can now access and visualize the history of your hyperparameters and models, per experiment, through the Neptune dashboard.

Using PySpark in Colab

Who doesn’t get the urge to run some Spark code on the fly? Right? The following setup is my go-to for that:

import os
import site

def install_pyspark():
  !pip install pyspark
  PACKAGES_PATH = site.getsitepackages()[0]
  os.environ["SPARK_HOME"] = f"{PACKAGES_PATH}/pyspark"

The function site.getsitepackages() returns a list of paths to the packages installed on the system. We are interested in the first one, which should be /usr/local/lib/python3.6/dist-packages. Then os.environ["SPARK_HOME"] = f"{PACKAGES_PATH}/pyspark" points the SPARK_HOME environment variable at the pyspark installation inside it.
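
Calling the function and printing the variable is a quick sanity check; the exact path depends on the Python version of the current Colab image, so treat the one in the comment as indicative:

install_pyspark()
print(os.environ["SPARK_HOME"])  # e.g. /usr/local/lib/python3.6/dist-packages/pyspark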

After that, you are ready to go! To run a Spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark in Colab") \
    .config("spark.ui.port", "4050") \
    .getOrCreate()

Then test that it works:

spark.createDataFrame([(1, "Hello"), (2, "World")], ["id", "name"]) \
     .show()

And yes, there is a way to check the Spark UI. The following code snippet is proudly stolen from CS246:

import time

def create_spark_ui_link():
  !wget -q https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
  !unzip -qq ngrok-stable-linux-amd64.zip
  get_ipython().system_raw('./ngrok http 4050 &')
  time.sleep(3) # Give the tunnel a few seconds to open
  !curl -s http://localhost:4040/api/tunnels | python -c "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

The generated link will be your access point to the Spark UI.

Conclusion

I guess a good reason for me to finally read PEP 8 is that I have yet to find a way to indent cells in Colab. But I think I would rather spend hours trying to figure out a workaround than get through that 10-minute read. This is the way.