Hyperparameter Tuning

Examples and Top 5 Techniques

What is Hyperparameter Tuning?

Hyperparameter tuning is the process of selecting the optimal set of hyperparameters for a machine learning model. It is an important step in the model development process, as the choice of hyperparameters can have a significant impact on the model's performance.

There are several approaches to machine learning model optimization, including model-centric approaches and data-centric approaches. Model-centric approaches focus on the characteristics of the model itself, such as the structure of the model or the types of algorithms used. These approaches typically involve searching for the optimal combination of hyperparameters within a predefined set of possible values.

An example of hyperparameter tuning is a grid search. In grid search, the data scientist or machine learning engineer defines a set of hyperparameter values to search over, and the algorithm tries all possible combinations of these values. For example, if the hyperparameters include the learning rate and the number of hidden layers in a neural network, grid search would try all possible combinations of these hyperparameters, such as a learning rate of 0.1 with one hidden layer, a learning rate of 0.1 with two hidden layers, and so on.
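
As a rough illustration of how such a grid is enumerated, the following sketch (using hypothetical candidate values) builds every combination of learning rate and hidden-layer count:

```python
from itertools import product

# Hypothetical candidate values for two hyperparameters.
learning_rates = [0.01, 0.1, 0.3]
hidden_layer_counts = [1, 2, 3]

# Grid search simply enumerates the Cartesian product of all candidate values.
for lr, n_layers in product(learning_rates, hidden_layer_counts):
    print(f"train a model with learning_rate={lr}, hidden_layers={n_layers}")
    # ...train and evaluate the model here, keeping the best-scoring combination...
```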

This is part of an extensive series of guides about machine learning.

Why Is Hyperparameter Tuning Important?

A well-tuned set of hyperparameters can improve model performance by optimizing the training process. Conversely, poorly chosen hyperparameters can result in models that underfit or overfit the data. Underfitting occurs when the model is too simple to capture the underlying patterns in the data, while overfitting occurs when the model is too complex and captures noise as if it were signal.

Hyperparameter tuning also enables generalization. Models that are well-tuned on training data are more likely to perform well on unseen test data. This ensures the model's predictions are reliable in real-world applications. It can also reduce the computational cost and training time.

Understanding Hyperparameter Space and Distributions

The hyperparameter space is the set of all possible combinations of hyperparameters that can be used to train a machine learning model. It is a multidimensional space, with each dimension representing a different hyperparameter. For example, if the hyperparameters include the learning rate and the number of hidden layers in a neural network, the hyperparameter space would have two dimensions: one for the learning rate and one for the number of hidden layers.

The hyperparameter distribution is the distribution of hyperparameter values within the hyperparameter space. It defines the range of values that each hyperparameter can take on, as well as the probability of each value occurring.

In order to tune hyperparameters, it is necessary to search the hyperparameter space for the combination of hyperparameters that results in the best model performance.
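
As an illustration, one common way to express a search space and its distributions in code is with scipy.stats distribution objects; the parameter names and ranges below are hypothetical:

```python
from scipy.stats import loguniform, randint

# A two-dimensional hyperparameter space: each key is one dimension,
# each value is a distribution describing which values can occur and how likely they are.
search_space = {
    # Learning rates are often sampled on a log scale, here between 1e-4 and 1e-1.
    "learning_rate": loguniform(1e-4, 1e-1),
    # Hidden-layer counts are discrete: integers from 1 to 4 (upper bound exclusive).
    "num_hidden_layers": randint(1, 5),
}

# Drawing one random point from the space:
sample = {name: dist.rvs() for name, dist in search_space.items()}
print(sample)
```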

Learn more in our detailed guide to hyperparameter grid search (coming soon)

Types and Examples of Hyperparameters

Hyperparameter Tuning for Neural Networks

Examples of hyperparameters that need to be tuned in neural networks include the following; a short code sketch follows the list:

  • Number of hidden layers: Each hidden layer can capture different levels of abstraction from the input data. More hidden layers allow the model to learn more complex representations, useful for tasks requiring high-level feature extraction. However, too many hidden layers can lead to overfitting and increased computational cost.
  • Number of nodes/neurons per layer: The number of nodes, or neurons, in each hidden layer impacts the capacity of the neural network to learn from data. More neurons provide the network with a greater ability to learn complex functions, but they also increase the risk of overfitting and the computational load.
  • Learning rate: This controls the step size at each iteration while moving towards a minimum of the loss function. A high learning rate can speed up training but may cause the model to converge to a suboptimal solution or diverge entirely. A low learning rate ensures more precise convergence but can slow down the training process.
  • Momentum: This accelerates gradient descent optimization by adding a fraction of the previous update vector to the current update. It helps smooth out the optimization path and can prevent the model from getting stuck in local minima. The momentum hyperparameter typically ranges between 0 and 1. A common starting point is 0.9, but this can be adjusted based on the behavior of the model during training. Proper tuning of momentum can lead to faster convergence and improved model accuracy.
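
The sketch below shows where these hyperparameters typically appear when defining and training a simple PyTorch network; the specific values are placeholders, not recommendations:

```python
import torch
import torch.nn as nn

# Hyperparameters to tune (placeholder values).
num_hidden_layers = 2     # number of hidden layers
hidden_size = 64          # nodes/neurons per layer
learning_rate = 0.01      # step size for gradient updates
momentum = 0.9            # fraction of the previous update carried forward

# Build a fully connected network whose depth and width come from the hyperparameters.
layers, in_features = [], 10
for _ in range(num_hidden_layers):
    layers += [nn.Linear(in_features, hidden_size), nn.ReLU()]
    in_features = hidden_size
layers.append(nn.Linear(in_features, 1))
model = nn.Sequential(*layers)

# Learning rate and momentum are passed to the optimizer, not the model itself.
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)
```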

Hyperparameter Tuning for SVMs

Examples of hyperparameters in Support Vector Machines (SVMs) include the following; a short code sketch follows the list:

  • C: This controls the trade-off between achieving a low training error and a low testing error, acting as an inverse regularization strength. A small C value makes the decision surface smooth, while a large C value aims to classify all training examples correctly, potentially at the cost of overfitting.
  • Gamma: This defines how far the influence of a single training example reaches. Low values of gamma mean that points far from the decision boundary still influence it, resulting in a smoother boundary. High values of gamma mean that only points close to the decision boundary are considered, leading to a more complex and localized model.
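
In scikit-learn, C and gamma are passed directly to the SVC estimator; the values below are placeholders chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small synthetic dataset so the example is self-contained.
X, y = make_classification(n_samples=200, random_state=0)

# Smaller C -> smoother decision surface; larger gamma -> more localized influence per point.
model = SVC(C=1.0, gamma=0.1, kernel="rbf")
model.fit(X, y)
print(model.score(X, y))
```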

Hyperparameter Tuning for XGBoost

Examples of hyperparameters that need to be tuned in XGBoost include the following; a short code sketch follows the list:

  • max_depth and min_child_weight: The max_depth parameter determines the maximum depth of a tree, impacting the model's complexity and potential to overfit. The min_child_weight parameter controls the minimum sum of instance weight (Hessian) needed in a child, acting as a regularization tool.
  • learning_rate: This parameter, also known as eta, scales the contribution of each tree. Lower values make the model more robust but require more trees. Higher values can speed up training but risk overfitting.
  • n_estimators: This defines the number of trees in the model. More trees can improve accuracy but also increase the risk of overfitting and the computational cost.
  • colsample_bytree and subsample: The colsample_bytree parameter specifies the fraction of features to be randomly sampled for each tree, while subsample controls the fraction of samples used for training each tree. Both parameters help prevent overfitting by adding randomness to the model.
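
A minimal sketch of how these parameters are set on an XGBoost classifier, assuming the xgboost and scikit-learn packages are installed; the values are placeholders that a tuner would normally search over:

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Small synthetic dataset so the example is self-contained.
X, y = make_classification(n_samples=500, random_state=0)

model = XGBClassifier(
    max_depth=6,            # maximum tree depth
    min_child_weight=1,     # minimum sum of instance weights (Hessian) per child
    learning_rate=0.1,      # eta: scales each tree's contribution
    n_estimators=200,       # number of boosted trees
    colsample_bytree=0.8,   # fraction of features sampled per tree
    subsample=0.8,          # fraction of training rows sampled per tree
)
model.fit(X, y)
print(model.score(X, y))
```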

5 Hyperparameter Optimization Techniques

Manual Search

Manual search is a method of hyperparameter tuning in which the data scientist or machine learning engineer manually selects and adjusts the hyperparameters of the model. This method is often used when the number of hyperparameters is relatively small and the model is simple, as it allows the data scientist to have fine-grained control over the hyperparameters.

To use the manual search method, the data scientist defines a set of possible values for each hyperparameter, and then manually selects and adjusts the values until the model performance is satisfactory. For example, the data scientist might start with a learning rate of 0.1 and gradually increase or decrease it until the model's accuracy is maximized.
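
In code, a manual search often amounts to a hand-written loop or a few one-off runs. The sketch below uses a hypothetical train_and_evaluate helper as a stand-in for a real training run:

```python
def train_and_evaluate(learning_rate):
    # Placeholder: in practice this trains a model with the given learning rate
    # and returns a validation metric such as accuracy.
    return 0.0

best_lr, best_score = None, float("-inf")
# The practitioner picks a handful of values by hand and inspects the results.
for lr in [0.001, 0.01, 0.1]:
    score = train_and_evaluate(lr)
    print(f"learning_rate={lr}: validation accuracy={score}")
    if score > best_score:
        best_lr, best_score = lr, score
```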

Pros and cons: Manual search requires no special tooling and lets the data scientist apply domain knowledge directly, but it can be time-consuming and may require significant trial and error to find the optimal combination of hyperparameters. It is also prone to human error, as the data scientist may overlook certain combinations of hyperparameters or may not be able to accurately assess the impact of each hyperparameter on the model's performance.

Grid Search

Grid search is a method of hyperparameter tuning that involves training a model for every possible combination of hyperparameters in a predefined set.

To use the grid search method, the data scientist or machine learning engineer defines a set of possible values for each hyperparameter, and then the algorithm generates all possible combinations of these values. For example, if the hyperparameters include the learning rate and the number of hidden layers in a neural network, the grid search algorithm would try all possible combinations of these hyperparameters, such as a learning rate of 0.1 with one hidden layer, a learning rate of 0.1 with two hidden layers, and so on.

For each combination of hyperparameters, the model is trained and evaluated using a specified metric, such as accuracy or F1 score. The combination of hyperparameters that results in the best model performance is then chosen as the optimal set.
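
A minimal sketch of this workflow using scikit-learn's GridSearchCV, with an SVM as the example model and a hypothetical parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Every combination in this grid (3 x 3 = 9 combinations) is trained and cross-validated.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```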

Pros and cons: The grid search method is computationally intensive, as it requires training a separate model for each combination of hyperparameters. It is also limited by the predefined set of possible values for each hyperparameter, which may not include the optimal values. Despite these limitations, the grid search method is widely used due to its simplicity and effectiveness, particularly for smaller and less complex models.

Random Search

Random search is a method of hyperparameter tuning that involves randomly selecting a combination of hyperparameters from a predefined set and training a model using those hyperparameters.

To use the random search method, the data scientist or machine learning engineer defines a set of possible values for each hyperparameter, and then the algorithm randomly selects a combination of these values. For example, if the hyperparameters include the learning rate and the number of hidden layers in a neural network, the random search algorithm might randomly select a learning rate of 0.1 and two hidden layers.

The model is then trained and evaluated using a specified metric, such as accuracy or F1 score. The process is repeated a predefined number of times, and the combination of hyperparameters that results in the best model performance is chosen as the optimal set.
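
The same workflow can be sketched with scikit-learn's RandomizedSearchCV, which draws a fixed number of random combinations from the defined distributions; the ranges here are illustrative:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Distributions to sample from, rather than a fixed grid of values.
param_distributions = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e0)}

# n_iter controls how many random combinations are tried.
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```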

Pros and cons: Random search is cheap to run and often covers large hyperparameter spaces more efficiently than grid search, because it is not forced to train a model for every combination. However, because the sampling is not guided by previous results, it is less systematic and may miss the optimal set of hyperparameters unless enough iterations are run, particularly for larger and more complex models. Despite these limitations, the random search method is widely used due to its simplicity and ease of implementation.

Bayesian Optimization

Bayesian optimization is a method of hyperparameter tuning that uses a probabilistic model of the objective function to find the optimal combination of hyperparameters for a machine learning model.

Bayesian optimization works by building a probabilistic model of the objective function (in this case, the performance of the machine learning model) based on the hyperparameter values that have been tried so far. This model is used to predict the next set of hyperparameters to try, based on the expected improvement in model performance. The process is repeated iteratively until a satisfactory set of hyperparameters is found or the evaluation budget is exhausted.

One key advantage of Bayesian optimization is that it can make use of any available information about the objective function, including previous evaluations of the model performance and constraints on the hyperparameter values. This allows it to more efficiently explore the hyperparameter space and find the optimal combination of hyperparameters.
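
One common open-source implementation is Optuna, whose default TPE sampler is a form of Bayesian optimization. The objective below is a toy stand-in for a real training-and-validation run:

```python
import optuna

def objective(trial):
    # The sampler proposes hyperparameters based on a probabilistic model of past trials.
    learning_rate = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)
    num_layers = trial.suggest_int("num_hidden_layers", 1, 4)
    # Placeholder score; replace with real model training and validation.
    return -((learning_rate - 0.01) ** 2) - 0.1 * num_layers

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```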

Pros and cons: Bayesian optimization is a more complex method of hyperparameter tuning than grid search or random search, and it requires more computational resources. However, it can be more effective at finding the optimal set of hyperparameters, particularly for larger and more complex models. It is also well-suited to situations where the objective function is noisy or expensive to evaluate.

Learn more in our detailed guide to Bayesian Hyperparameter Optimization

Hyperband

Hyperband is a method of hyperparameter tuning that uses a bandit-based approach to efficiently search the hyperparameter space.

Hyperband builds on successive halving: it starts many trials with a small resource budget (for example, a few training epochs), evaluates them with a specified metric such as accuracy or F1 score, discards the worst-performing configurations, and allocates more of the budget to the survivors. It runs several such "brackets", each trading off the number of configurations against the budget given to each one, and returns the best configuration found.
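
One way to use Hyperband in practice is through Optuna's HyperbandPruner, which stops poorly performing trials early as training progresses. The objective below is a toy stand-in for a real per-epoch training loop:

```python
import optuna

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)
    score = 0.0
    # Simulate training for up to 100 "resource" steps (e.g., epochs).
    for step in range(100):
        score += learning_rate * 0.01  # placeholder for a real per-epoch validation metric
        trial.report(score, step)
        # Hyperband decides whether this configuration should be stopped early.
        if trial.should_prune():
            raise optuna.TrialPruned()
    return score

pruner = optuna.pruners.HyperbandPruner(min_resource=1, max_resource=100, reduction_factor=3)
study = optuna.create_study(direction="maximize", pruner=pruner)
study.optimize(objective, n_trials=30)
print(study.best_params)
```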

Pros and cons: One key advantage of Hyperband is that it can quickly eliminate unpromising configurations and focus on the most promising ones, which can save time and computational resources. It is also well-suited to situations where the objective function is noisy or expensive to evaluate.

Learn more in our detailed guides to hyperparameter optimization (coming soon)

Hyperparameter Tuning Management with Run:ai

The Run:ai platform takes the complexity out of distributed computing and provides unlimited compute power. It achieves this by pooling compute resources and leveraging them flexibly with elastic GPU clusters. Additional features such as a Kubernetes-based scheduler ensure training is never disrupted and that no machines are left idle. Together with HPO tools, these capabilities enable highly efficient tuning.

In addition, using our fractional GPU capabilities, experiments with a smaller hyperparameter space, which require less compute power, can utilize less GPU memory, freeing up additional GPU space and allowing more experiments to run in parallel (as opposed to using an entire GPU for each experiment). Combining Run:ai’s scheduling and fractional capabilities, experimentation can be sped up by 10x or more.

In one customer example, the Run:ai platform was able to spin up 6,000 HPO runs, each using one GPU. This ensured that at any given moment, 30 HPO runs were executing simultaneously. The tuning was accomplished via Run:ai’s advanced scheduling features, built on top of Kubernetes. This solution also considerably reduced management overhead by eliminating the need for Python scripts, loops to ensure containers were up and running, and code to handle failures and manage errors.

Get started with Run:ai today!