Hyperparameters are the settings you choose before training rather than learn from data. They determine model architecture, learning speed and scope, and regularization.


The search for optimal hyperparameters requires some expertise and patience, and you’ll often find people falling back on brute-force methods like grid search and random search to find the hyperparameters that work best for their problem.
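To make that cost concrete, here is a small sketch that counts how many training runs each approach needs. The candidate values are made up for illustration and are not the ones tuned later in this tutorial.

```python
import itertools
import random

# Hypothetical search space: the candidate values are illustrative only.
grid = {
    "lr": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "batchsize": [16, 32, 64, 128],
    "momentum": [0.0, 0.5, 0.9, 0.99],
}

# Grid search trains one model per combination of values.
grid_configs = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]
print(len(grid_configs))  # 5 * 4 * 4 = 80 training runs


def sample_random_config(rng):
    """Draw one configuration uniformly from the same candidate values."""
    return {name: rng.choice(values) for name, values in grid.items()}


# Random search fixes the budget up front instead.
rng = random.Random(0)
random_configs = [sample_random_config(rng) for _ in range(20)]
print(len(random_configs))  # 20 training runs
```

Neither method uses the results of earlier runs to decide what to try next, which is the gap Bayesian optimization tries to fill.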


A quick tutorial

I’m going to show you how to implement Bayesian optimization to automatically find the optimal hyperparameter set for your neural network in PyTorch using Ax.


We’ll be building a simple CIFAR-10 classifier using transfer learning. Most of this code is from the official PyTorch beginner tutorial for a CIFAR-10 classifier.

I won’t be going into the details of Bayesian optimization, but you can study the algorithm on the Ax website, or read the original paper or the 2012 paper on its practical use.


Firstly, the usual

Install Ax using:


pip install ax-platform

Import all the necessary libraries:


import matplotlib.pyplot as plt
import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F

from ax.plot.contour import plot_contour
from ax.plot.trace import optimization_trace_single_method
from ax.service.managed_loop import optimize
from ax.utils.notebook.plotting import render
from ax.utils.tutorials.cnn_utils import train, evaluate

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Download the datasets and construct the data loaders (I would advise adjusting the training batch size to 32 or 64 later):


# Assuming that we are on a CUDA machine, this should print a CUDA device:
print(device)

# Convert images to tensors, then normalize each channel from [0, 1] to [-1, 1]
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Let’s take a look at the CIFAR-10 dataset by creating some helper functions:


def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))

Training and evaluation functions

Ax requires a function that returns a trained model, and another that evaluates a model and returns a performance metric like accuracy or F1 score. We’re only building the training function here and using Ax’s own evaluate tutorial function to test our model performance, which returns accuracy. You can check out their API to model your own evaluation function after theirs, if you’d like.

def net_train(net, train_loader, parameters, dtype, device):
  net.to(dtype=dtype, device=device)

  # Define loss and optimizer
  criterion = nn.CrossEntropyLoss()
  optimizer = optim.SGD(net.parameters(), # or any optimizer you prefer
                        lr=parameters.get("lr", 0.001), # 0.001 is used if no lr is specified
                        momentum=parameters.get("momentum", 0.9)
  )

  scheduler = optim.lr_scheduler.StepLR(
      optimizer,
      step_size=int(parameters.get("step_size", 30)),
      gamma=parameters.get("gamma", 1.0),  # default is no learning rate decay
  )

  num_epochs = parameters.get("num_epochs", 3) # Play around with epoch number
  # Train Network
  for _ in range(num_epochs):
      for inputs, labels in train_loader:
          # move data to proper dtype and device
          inputs = inputs.to(dtype=dtype, device=device)
          labels = labels.to(device=device)

          # zero the parameter gradients
          optimizer.zero_grad()

          # forward + backward + optimize
          outputs = net(inputs)
          loss = criterion(outputs, labels)
          loss.backward()
          optimizer.step()
      # advance the learning rate schedule once per epoch
      scheduler.step()
  return net
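If you’d rather write your own evaluation function, the sketch below shows one way it could look. It takes the same four arguments that Ax’s tutorial evaluate function takes (net, data_loader, dtype, device) and returns plain accuracy. The name evaluate_accuracy and the toy data are my own and only there for illustration; this is not Ax’s implementation.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate_accuracy(net, data_loader, dtype, device):
    """Return the fraction of correctly classified samples in data_loader."""
    net.eval()  # disable dropout and use running batch-norm statistics
    correct, total = 0, 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for inputs, labels in data_loader:
            inputs = inputs.to(dtype=dtype, device=device)
            labels = labels.to(device=device)
            predicted = net(inputs).argmax(dim=1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    return correct / total


# Tiny synthetic check: a fixed linear "network" on four 2-feature samples.
toy_net = nn.Linear(2, 2, bias=False)
with torch.no_grad():
    toy_net.weight.copy_(torch.eye(2))  # logits equal the inputs
toy_inputs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
toy_labels = torch.tensor([0, 1, 1, 1])  # the third label deliberately disagrees
toy_loader = DataLoader(TensorDataset(toy_inputs, toy_labels), batch_size=2)
accuracy = evaluate_accuracy(toy_net, toy_loader, torch.float, torch.device("cpu"))
print(accuracy)  # 0.75
```

Swapping in another metric, such as F1 score, only changes what gets accumulated inside the loop and what is returned at the end.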

Next, we’re writing an init_net() function that initializes the model and returns the network ready-to-train. There are many opportunities for hyperparameter tuning here. You’ll notice the parameterization argument, which is a dictionary containing the hyperparameters.

def init_net(parameterization):

    model = torchvision.models.resnet50(pretrained=True) # pretrained ResNet50

    # The depth of unfreezing is also a hyperparameter
    for param in model.parameters():
        param.requires_grad = False # Freeze feature extractor
    Hs = 512 # Hidden layer size; you can optimize this as well
    model.fc = nn.Sequential(nn.Linear(2048, Hs), # attach trainable classifier
                             nn.ReLU(), # non-linearity between the two linear layers
                             nn.Linear(Hs, 10)) # one output per CIFAR-10 class
    return model # return untrained model
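The parameterization argument is an ordinary Python dictionary, so reading a hyperparameter with a fallback is just a call to .get(). The sketch below uses made-up values, and the hidden_size key is hypothetical (the code above hard-codes Hs instead).

```python
# A parameterization as the optimizer might pass it (values are illustrative).
parameterization = {"lr": 0.003, "batchsize": 64, "momentum": 0.9}

# Keys that were swept come from the dict; anything else falls back to its default.
lr = parameterization.get("lr", 0.001)
num_epochs = parameterization.get("num_epochs", 3)
hidden_size = parameterization.get("hidden_size", 512)  # hypothetical extra hyperparameter

print(lr, num_epochs, hidden_size)  # 0.003 3 512
```

One thing to watch for: a hyperparameter whose name in the search space differs from the key your code reads will silently fall back to the default, so the optimizer would be sweeping a value that is never used.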

Lastly, we need a train_evaluate() function that the Bayesian optimizer calls on every run. The optimizer generates a new set of hyperparameters in parameterization, passes it to this function, and then analyzes the returned evaluation results.

def train_evaluate(parameterization):

    # constructing a new training data loader allows us to tune the batch size
    train_loader = torch.utils.data.DataLoader(trainset,
                                batch_size=parameterization.get("batchsize", 32),
                                shuffle=True)
    # Get neural net
    untrained_net = init_net(parameterization)
    # train
    trained_net = net_train(net=untrained_net, train_loader=train_loader,
                            parameters=parameterization, dtype=dtype, device=device)
    # return the accuracy of the model as it was trained in this run
    return evaluate(
        net=trained_net,
        data_loader=testloader,
        dtype=dtype,
        device=device,
    )


Now, just specify the hyperparameters you want to sweep across and pass that to Ax’s optimize() function:


#torch.cuda.set_device(0) #this is sometimes necessary for me
dtype = torch.float
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

best_parameters, values, experiment, model = optimize(
    parameters=[
        {"name": "lr", "type": "range", "bounds": [1e-6, 0.4], "log_scale": True},
        {"name": "batchsize", "type": "range", "bounds": [16, 128]},
        {"name": "momentum", "type": "range", "bounds": [0.0, 1.0]},
        # names must match the keys read in net_train()
        #{"name": "num_epochs", "type": "range", "bounds": [1, 30]},
        #{"name": "step_size", "type": "range", "bounds": [20, 40]},
    ],
    evaluation_function=train_evaluate,
    objective_name='accuracy',
)

print(best_parameters)
means, covariances = values
print(means)
print(covariances)
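A quick note on why the learning rate is searched on a log scale. The bounds span more than five orders of magnitude, so sampling them linearly would almost never try a small learning rate. The sketch below illustrates the idea with plain random draws; it is not how Ax works internally, and sample_log_uniform is a helper I made up for this illustration.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
low, high = 1e-6, 0.4  # the lr bounds used above

linear_draws = [rng.uniform(low, high) for _ in range(10_000)]
log_draws = [sample_log_uniform(low, high, rng) for _ in range(10_000)]

# Fraction of draws below 1e-3: tiny for linear sampling, roughly half for log sampling.
frac_linear = sum(x < 1e-3 for x in linear_draws) / len(linear_draws)
frac_log = sum(x < 1e-3 for x in log_draws) / len(log_draws)
print(frac_linear, frac_log)
```

With linear sampling, well under 1% of the draws land below 1e-3, even though that region covers three of the roughly five and a half decades in the range.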

That sure took a while, but it’s nothing compared to doing a naive grid search for all 3 hyperparameters. Let’s take a look at the results:

[INFO 09-23 09:30:44] ax.modelbridge.dispatch_utils: Using Bayesian Optimization generation strategy: GenerationStrategy(name='Sobol+GPEI', steps=[Sobol for 5 arms, GPEI for subsequent arms], generated 0 arm(s) so far). Iterations after 5 will take longer to generate due to model-fitting.
[INFO 09-23 09:30:44] ax.service.managed_loop: Started full optimization with 20 steps.
[INFO 09-23 09:30:44] ax.service.managed_loop: Running optimization trial 1...
[INFO 09-23 09:31:55] ax.service.managed_loop: Running optimization trial 2...
[INFO 09-23 09:32:56] ax.service.managed_loop: Running optimization trial 3...
...
[INFO 09-23 09:52:19] ax.service.managed_loop: Running optimization trial 18...
[INFO 09-23 09:53:20] ax.service.managed_loop: Running optimization trial 19...
[INFO 09-23 09:54:23] ax.service.managed_loop: Running optimization trial 20...

{'lr': 0.000237872310800664, 'batchsize': 117, 'momentum':
{'accuracy': 0.4912998109307719}
{'accuracy': {'accuracy': 2.2924975426156455e-09}}

It seems our optimal learning rate is 2.37e-4 when combined with a momentum of 0.99 and a batch size of 117. That’s pretty nice. The 49.1% accuracy you see here comes from a short training run inside the search, not from a fully trained model, so don’t worry!
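The last two dictionaries in the output are the means and covariances unpacked from values. As a rough reading aid, and assuming the nested entry is the variance of the optimizer’s accuracy estimate for the best parameters, you can turn it into a standard deviation like this:

```python
import math

# Values copied from the log output above.
means = {"accuracy": 0.4912998109307719}
covariances = {"accuracy": {"accuracy": 2.2924975426156455e-09}}

variance = covariances["accuracy"]["accuracy"]  # diagonal entry, read here as a variance
std = math.sqrt(variance)

print(f"accuracy = {means['accuracy']:.4f} +/- {std:.6f}")
# accuracy = 0.4913 +/- 0.000048
```

A spread that small says the optimizer is confident in its estimate at that point, not that a retrained model would land on exactly that accuracy.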

We can go even further and render some plots that show the best accuracy found after each trial (which improved as the parameterization improved), and the estimated accuracy by the optimizer as a function of two hyperparameters using a contour plot. The experiment variable is of type Experiment and you should definitely check out the docs to see all the methods it has to offer.

best_objectives = np.array([[trial.objective_mean*100 for trial in experiment.trials.values()]])

best_objective_plot = optimization_trace_single_method(
    y=np.maximum.accumulate(best_objectives, axis=1),
    title="Model performance vs. # of iterations",
    ylabel="Classification Accuracy, %",
)
render(best_objective_plot)

render(plot_contour(model=model, param_x='batchsize', param_y='lr', metric_name='accuracy'))

The rendered plots are easy to understand and interactive. The black squares in the contour plots show the coordinates that have actually been sampled.
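The optimization trace is built from a running maximum, which is what np.maximum.accumulate computes in the plotting code. Here is the same idea in plain Python, using made-up accuracies rather than results from this experiment:

```python
from itertools import accumulate

# Hypothetical per-trial accuracies in percent; not results from this experiment.
trial_accuracies = [31.0, 44.5, 40.2, 47.8, 46.1, 49.1]

# Running maximum: the best accuracy seen up to and including each trial.
best_so_far = list(accumulate(trial_accuracies, max))
print(best_so_far)  # [31.0, 44.5, 44.5, 47.8, 47.8, 49.1]
```

This is why the trace never goes down, even when an individual trial does worse than the one before it.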

Lastly, you can fetch the parameter set (something Ax calls an “arm”) that has the best mean accuracy by simply running the script below:


data = experiment.fetch_data()
df = data.df
best_arm_name = df.arm_name[df['mean'] == df['mean'].max()].values[0]
best_arm = experiment.arms_by_name[best_arm_name]
print(best_arm)

Arm(name='19_0', parameters={'lr': 0.00023787231080066353, 'batchsize': 117, 'momentum': 0.9914986635285268})

Don’t be afraid to tune anything you wish, like hidden layer number and size, dropout, activation functions, depth of unfreezing, etc.
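As a starting point, here is a sketch of a classifier head whose hidden size and dropout rate are read from the parameterization. The function build_head and the keys hidden_size and dropout are my own names for this illustration and do not appear in the code above.

```python
import torch
import torch.nn as nn


def build_head(parameterization, in_features=2048, num_classes=10):
    """Build a classifier head whose shape is read from the parameterization."""
    hidden_size = int(parameterization.get("hidden_size", 512))  # hypothetical key
    dropout = parameterization.get("dropout", 0.2)  # hypothetical key
    return nn.Sequential(
        nn.Linear(in_features, hidden_size),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_size, num_classes),
    )


head = build_head({"hidden_size": 256, "dropout": 0.5})
logits = head(torch.zeros(4, 2048))  # a batch of four ResNet50 feature vectors
print(logits.shape)  # torch.Size([4, 10])
```

To use it, you could assign model.fc = build_head(parameterization) inside init_net() and add entries with the same names to the search space, for example a choice parameter such as {"name": "hidden_size", "type": "choice", "values": [256, 512, 1024]} and a range parameter such as {"name": "dropout", "type": "range", "bounds": [0.0, 0.5]}.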


Happy optimizing!




