Using the Keras Tuner

Recently I read about the Keras Tuner. A hyperparameter tuner is a piece of software that trains models with many different combinations of hyperparameters in order to find the best-performing configuration.


The idea is simple:

  1. Make a function that builds your model from some parameters, called hyperparameters.
  2. Instantiate a Tuner with the search algorithm you want. Keras Tuner provides a few, like RandomSearch, Hyperband, and BayesianOptimization. Hyperband seems to be the recommended algorithm, but I haven't read the paper to check.
  3. Let it build a bunch of models and evaluate them.
  4. When it's done, get the one that yielded the best results.
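As a rough illustration of those four steps, here is a minimal sketch. The toy dense model, the 64x64 input size, the hyperparameter names, and the helper name find_best_model are all assumptions made for the example; only the Keras Tuner calls themselves (hp.Int, hp.Choice, RandomSearch, search, get_best_models) come from the library.

```python
def build_model(hp):
    """Step 1: build and compile a model from the hyperparameters in `hp`."""
    # Imported lazily so this file can be loaded without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(64, 64, 3)),  # assumed input size
        keras.layers.Dense(hp.Int("units", min_value=32, max_value=256, step=32),
                           activation="relu"),
        keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


def find_best_model(train_data, val_data, max_trials=5, epochs=3):
    """Steps 2 to 4: hypothetical helper wrapping the tuner workflow."""
    import kerastuner as kt  # newer releases use `import keras_tuner as kt`

    # Step 2: pick a search algorithm.
    tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=max_trials)
    # Step 3: build and evaluate one model per trial (arguments are passed on to fit).
    tuner.search(train_data, epochs=epochs, validation_data=val_data)
    # Step 4: retrieve the best model found.
    return tuner.get_best_models(num_models=1)[0]
```

Swapping RandomSearch for Hyperband or BayesianOptimization should only change the constructor call; the build function stays the same.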

Built-in Tunable Models

The documentation mentions "Built-in Tunable Models":

Keras Tuner provides two built-in tunable models: HyperResNet and HyperXception. These models search over various permutations of the ResNet and Xception architectures, respectively.

That sounds promising. So maybe I don't even have to write a model-building function and ponder what kinds of architectures to try out!

So I tried that.

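import kerastuner as kt

tuner = kt.tuners.BayesianOptimization(
  kt.applications.HyperResNet(input_shape=(target_size, target_size, 3), classes=2),
  objective='val_accuracy',  # the objective reported in the results summary
  max_trials=10)  # placeholder value
tuner.search_space_summary()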

It showed me this summary:

Search space summary
Default search space size: 6
version (Choice)
{'default': 'v2', 'conditions': [], 'values': ['v1', 'v2', 'next'], 'ordered': False}
conv3_depth (Choice)
{'default': 4, 'conditions': [], 'values': [4, 8], 'ordered': True}
conv4_depth (Choice)
{'default': 6, 'conditions': [], 'values': [6, 23, 36], 'ordered': True}
pooling (Choice)
{'default': 'avg', 'conditions': [], 'values': ['avg', 'max'], 'ordered': False}
optimizer (Choice)
{'default': 'adam', 'conditions': [], 'values': ['adam', 'rmsprop', 'sgd'], 'ordered': False}
learning_rate (Choice)
{'default': 0.01, 'conditions': [], 'values': [0.1, 0.01, 0.001], 'ordered': True}
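The "search space size: 6" in that summary appears to count hyperparameters, not configurations. A quick sketch, with the choices copied from the summary above, shows how many distinct combinations those six hyperparameters allow:

```python
import math

# Choices copied from the search space summary above.
search_space = {
    "version": ["v1", "v2", "next"],
    "conv3_depth": [4, 8],
    "conv4_depth": [6, 23, 36],
    "pooling": ["avg", "max"],
    "optimizer": ["adam", "rmsprop", "sgd"],
    "learning_rate": [0.1, 0.01, 0.001],
}

num_hyperparameters = len(search_space)
num_combinations = math.prod(len(values) for values in search_space.values())

print(num_hyperparameters)  # 6
print(num_combinations)     # 324
```

That is small enough to enumerate, but only if a single trial is cheap.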

Finally, I started the search with tuner.search, and it ran for a while:

Search: Running Trial #1

Hyperparameter    |Value             |Best Value So Far 
version           |v1                |?                 
conv3_depth       |4                 |?                 
conv4_depth       |6                 |?                 
pooling           |avg               |?                 
optimizer         |rmsprop           |?                 
learning_rate     |0.1               |?                 

Epoch 1/20
   1068/Unknown - 1378s 1s/step - loss: 0.0000e+00 - accuracy: 0.5025

Accuracy kept hovering around 0.5. That's a sign that it used categorical_crossentropy as the loss, while my labels are just 0 or 1 rather than one-hot encoded vectors like [1, 0] or [0, 1]. Gotta fix my labels. If I were using one of the built-in methods, like flow_from_directory, I could simply change the class_mode from "binary" to "categorical":

# Or just omit class_mode since "categorical" is the default.
train_generator = train_datagen.flow_from_directory(
  train_dir, target_size=(target_size, target_size), batch_size=batch_size,
  class_mode="categorical")

But I had manually built my own dataset, so I fixed my dataset generation function instead. The second issue was that Epoch 1 ran indefinitely, with the total number of steps shown as Unknown. That's because I forgot to pass the number of steps per epoch, so this time I specified steps_per_epoch and validation_steps.
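Both fixes come down to a few lines of arithmetic. The sketch below is plain Python with hypothetical helper names and made-up sample counts, not my actual dataset code; in a real pipeline, tf.keras.utils.to_categorical does the same job as one_hot, and the step counts would be passed as steps_per_epoch and validation_steps.

```python
import math


def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows, e.g. 1 -> [0.0, 1.0]."""
    return [[1.0 if i == label else 0.0 for i in range(num_classes)]
            for label in labels]


def steps_for(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


print(one_hot([0, 1, 1], num_classes=2))  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(steps_for(1000, 32))                # 32
```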

Finally, it worked as expected. However, the trials took a really long time! Each epoch took about half an hour, compared with 3 minutes for my fine-tuning model.

Trial 8 Complete [03h 51m 39s]
val_accuracy: 0.7648402452468872

Best val_accuracy So Far: 0.8333333730697632
Total elapsed time: 11h 58m 49s

Search: Running Trial #9

Hyperparameter    |Value             |Best Value So Far 
version           |next              |v1                
conv3_depth       |4                 |4                 
conv4_depth       |6                 |6                 
pooling           |max               |avg               
optimizer         |adam              |rmsprop           
learning_rate     |0.1               |0.1               

Epoch 1/5
200/200 [==============================] - 1893s 9s/step - loss: 9.1587 - accuracy: 0.5020 - val_loss: 17108.1816 - val_accuracy: 0.3059
Epoch 2/5
200/200 [==============================] - 1929s 10s/step - loss: 0.6998 - accuracy: 0.4891 - val_loss: 32.5436 - val_accuracy: 0.7922
Epoch 3/5
200/200 [==============================] - 1900s 9s/step - loss: 0.6961 - accuracy: 0.5017 - val_loss: 1.9105 - val_accuracy: 0.1667
Epoch 4/5
200/200 [==============================] - 1895s 9s/step - loss: 0.6978 - accuracy: 0.5078 - val_loss: 1.6828 - val_accuracy: 0.1667
Epoch 5/5
 85/200 [===========>..................] - ETA: 18:11 - loss: 0.6966 - accuracy: 0.5122

Looking at the code, I found out why: it doesn't take a pre-trained ResNet model and fine-tune it. Instead, it uses the same architecture but trains all the weights from scratch.

That won't work for me because the trade-off between model performance (how well it predicts) and training time is off. I'd rather accept lower performance in exchange for much faster training.
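For contrast, a builder that does reuse pre-trained weights might look roughly like this. It is a sketch under assumptions, not what HyperResNet does: the function name, the 224x224 input size, the frozen base, and the two tuned hyperparameters are all my own choices for illustration.

```python
def build_finetune_model(hp):
    """Hypothetical builder: frozen ImageNet ResNet50 base plus a small new head."""
    # Imported lazily so this file can be loaded without TensorFlow installed.
    from tensorflow import keras

    base = keras.applications.ResNet50(
        include_top=False,          # drop the 1000-class ImageNet classifier
        weights="imagenet",         # reuse the pre-trained weights
        input_shape=(224, 224, 3),  # assumed image size
        pooling=hp.Choice("pooling", ["avg", "max"]),
    )
    base.trainable = False  # only the new head gets trained

    model = keras.Sequential([
        base,
        keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Because only the final Dense layer has trainable weights, each trial should be much cheaper than training the whole network. Inputs would still need the matching keras.applications.resnet50.preprocess_input step.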

Getting the best model

At the end of the search, I looked at the summary:


Results summary
Results in ./untitled_project
Showing 10 best trials
Objective(name='val_accuracy', direction='max')

Trial summary
version: v1
conv3_depth: 4
conv4_depth: 6
pooling: avg
optimizer: rmsprop
learning_rate: 0.1
Score: 0.8333333730697632 <--

Trial summary
version: v2
conv3_depth: 4
conv4_depth: 6
pooling: avg
optimizer: rmsprop
learning_rate: 0.1
Score: 0.8333333730697632 <--

Trial summary

It found 10 models that had the exact same accuracy on the validation dataset. The confusion matrix explains why. The model blindly returns the first class for everything:

model = tuner.get_best_models(num_models=1)[0]  # presumably the top-ranked model from the tuner

plot_discrimination_thresholds(model, validation_generator)
<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[365,  73],
       [  0,   0]], dtype=int32)>

I really should have used the F1 score as the metric to optimize, but right now there's no easy way to do that. And even then, I don't have nearly enough data to get better performance than fine-tuning.
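To see how differently the two metrics judge this model, here is a small sketch that computes both from the confusion matrix above. It assumes the rows are predictions and the columns are actual labels, which is my reading of the matrix given that the model only ever predicts the first class. It also treats an undefined precision (no positive predictions) as 0, which is a convention rather than a given.

```python
def accuracy_and_f1(matrix, positive=1):
    """matrix[p][a] = samples predicted as class p whose actual class is a."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    true_positives = matrix[positive][positive]
    predicted_positive = sum(matrix[positive])
    actual_positive = sum(row[positive] for row in matrix)
    # Treat undefined ratios (zero denominators) as 0.
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return correct / total, f1


accuracy, f1 = accuracy_and_f1([[365, 73], [0, 0]])
print(round(accuracy, 4), f1)  # 0.8333 0.0
```

The accuracy matches the 0.8333 score every trial reported, while the F1 score for the second class is 0, which would have flagged the problem right away.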

Making my own model builder function

It looks like I really will have to write my own model builder function. The thing is, if I implement that and let the tuner search with some heuristic instead of being exhaustive, I won't develop an intuition for what tends to work and what doesn't. For now, I might just write nested loops to exhaustively try models, then chart their discrimination threshold graphs or confusion matrices.
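The nested loops can be flattened with itertools.product. The sketch below is only the skeleton of that idea: grid_search and fake_evaluate are hypothetical names, and fake_evaluate is a stand-in scoring function where the real version would train a model and return its validation score.

```python
import itertools


def grid_search(search_space, evaluate):
    """Evaluate every combination; return (score, config) pairs, best first."""
    names = list(search_space)
    results = []
    for values in itertools.product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        results.append((evaluate(config), config))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


def fake_evaluate(config):
    """Stand-in for 'train a model with this config and return its score'."""
    penalty = 0.0 if config["pooling"] == "avg" else 0.1
    return -abs(config["learning_rate"] - 0.001) - penalty


space = {"pooling": ["avg", "max"], "learning_rate": [0.1, 0.01, 0.001]}
results = grid_search(space, fake_evaluate)
print(len(results))   # 6
print(results[0][1])  # {'pooling': 'avg', 'learning_rate': 0.001}
```

Keeping every (score, config) pair, rather than only the winner, is what makes it possible to chart the results afterwards.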

I'll revisit this post after I experiment further.