Choose Training Configurations for LSTM Using Bayesian Optimization

This example shows how to create a deep learning experiment to find optimal network hyperparameters and training options for long short-term memory (LSTM) networks using Bayesian optimization. In this example, you use Experiment Manager to train LSTM networks that predict the remaining useful life (RUL) of engines. The experiment uses the Turbofan Engine Degradation Simulation Data Set described in [1] (see References). For more information on processing this data set for sequence-to-sequence regression, see Sequence-to-Sequence Regression Using Deep Learning.

Bayesian optimization provides an alternative strategy to sweeping hyperparameters in an experiment. You specify a range of values for each hyperparameter and select a metric to optimize, and Experiment Manager searches for a combination of hyperparameters that optimizes your selected metric. Bayesian optimization requires Statistics and Machine Learning Toolbox™. For more information, see Tune Experiment Hyperparameters by Using Bayesian Optimization.

RUL captures how many operational cycles an engine can make before failure. To focus on the sequence data from when the engines are close to failing, preprocess the data by clipping the responses at a specified threshold. This preprocessing operation allows the network to focus on predictor data behaviors close to failing by treating instances with higher RUL values as equal. For example, this figure shows the first response observation and the corresponding clipped response with a threshold of 150.
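
In code, the clipping is a simple per-sequence assignment. This minimal sketch assumes, as elsewhere in this example, that YTrain is a cell array of 1-by-T response vectors:

rulThreshold = 150;
for i = 1:numel(YTrain)
    % Treat all RUL values above the threshold as equal to the threshold
    YTrain{i}(YTrain{i} > rulThreshold) = rulThreshold;
end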

When you train a deep learning network, how you preprocess data, the number of layers and hidden units, and the initial learning rate in the network can affect the training behavior and performance of the network. Choosing the depth of an LSTM network involves balancing speed and accuracy. For example, deeper networks can be more accurate but take longer to train and converge [2].

By default, when you run a built-in training experiment for regression, Experiment Manager computes the loss and root mean squared error (RMSE) for each trial in your experiment. This example compares the performance of the network in each trial by using a custom metric that is specific to the problem data set. For more information on using custom metric functions, see Evaluate Deep Learning Experiments by Using Metric Functions.

Open Experiment

First, open the example. Experiment Manager loads a project with a preconfigured experiment. To open the experiment, in the Experiment Browser, double-click the name of the experiment (SequenceRegressionExperiment).

Built-in training experiments consist of a description, a table of hyperparameters, a setup function, and a collection of metric functions to evaluate the results of the experiment. Experiments that use Bayesian optimization include additional options to limit the duration of the experiment. For more information, see Configure Built-In Training Experiment.

The Description field contains a textual description of the experiment. For this example, the description is:

Sequence-to-sequence regression to predict the remaining useful life (RUL) of engines. This experiment compares network performance using Bayesian optimization when changing data thresholding level, LSTM layer depth, the number of hidden units, and the initial learn rate.

The Hyperparameter Table specifies the strategy (Bayesian Optimization) and hyperparameter values to use for the experiment. For each hyperparameter, specify these options:

  • Range — Enter a two-element vector that gives the lower bound and upper bound of a real- or integer-valued hyperparameter, or a string array or cell array that lists the possible values of a categorical hyperparameter.

  • Type — Select real (real-valued hyperparameter), integer (integer-valued hyperparameter), or categorical (categorical hyperparameter).

  • Transform — Select none (no transform) or log (logarithmic transform). For log, the hyperparameter must be real or integer and positive. With this option, the hyperparameter is searched and modeled on a logarithmic scale.

When you run the experiment, Experiment Manager searches for the best combination of hyperparameters. Each trial uses a new combination of the hyperparameter values based on the results of the previous trials. This example uses these hyperparameters:

  • Threshold sets all response data above the threshold value to be equal to the threshold value. To prevent uniform response data, use threshold values greater than or equal to 150. To limit the set of allowable values to 150, 200, and 250, the experiment models Threshold as a categorical hyperparameter.

  • LSTMDepth indicates the number of LSTM layers used in the network. Specify this hyperparameter as an integer between 1 and 3.

  • NumHiddenUnits determines the number of hidden units, or the amount of information stored at each time step, used in the network. Increasing the number of hidden units can result in overfitting the data and in a longer training time. Decreasing the number of hidden units can result in underfitting the data. Specify this hyperparameter as an integer between 50 and 300.

  • InitialLearnRate specifies the initial learning rate used for training. If the learning rate is too low, then training takes a long time. If the learning rate is too high, then training can reach a suboptimal result or diverge. The best learning rate depends on your data as well as the network you are training. The experiment models this hyperparameter on a logarithmic scale because the range of values (0.001 to 0.1) spans several orders of magnitude. A programmatic sketch of this search space appears after this list.
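
For comparison, this illustrative sketch expresses the same search space programmatically using optimizableVariable from Statistics and Machine Learning Toolbox. In Experiment Manager itself, you enter these ranges, types, and transforms in the hyperparameter table rather than in code.

% Illustrative only: the experiment's search space as optimizableVariable objects
vars = [
    optimizableVariable("Threshold",{'150','200','250'},Type="categorical")
    optimizableVariable("LSTMDepth",[1 3],Type="integer")
    optimizableVariable("NumHiddenUnits",[50 300],Type="integer")
    optimizableVariable("InitialLearnRate",[1e-3 1e-1],Transform="log")];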

Under Bayesian Optimization Options, you can specify the duration of the experiment by entering the maximum time (in seconds) and the maximum number of trials to run. To best use the power of Bayesian optimization, perform at least 30 objective function evaluations.

The Setup Function configures the training data, network architecture, and training options for the experiment. The input to the setup function is a structure with fields from the hyperparameter table. The setup function returns four outputs that you use to train a network for sequence-to-sequence regression problems. In this example, the setup function has three sections.

  • Load and Preprocess Data downloads and extracts the Turbofan Engine Degradation Simulation Data Set from https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/ [3]. This section of the setup function also filters out constant-valued features, normalizes the predictor data to have zero mean and unit variance, clips the response data by using the numerical value of the hyperparameter Threshold, and randomly selects training examples to use for validation.

dataFolder = fullfile(tempdir,"turbofan");
if ~exist(dataFolder,"dir")
    mkdir(dataFolder);
    oldDir = cd(dataFolder);
    filename = "CMAPSSData.zip";
    websave(filename,"https://ti.arc.nasa.gov/c/6/", ...
        weboptions("Timeout",Inf));
    unzip(filename,dataFolder);
    cd(oldDir);
end

filenameTrainPredictors = fullfile(dataFolder,"train_FD001.txt");
[XTrain,YTrain] = processTurboFanDataTrain(filenameTrainPredictors);

XTrain = helperFilter(XTrain);
XTrain = helperNormalize(XTrain);

thr = str2double(params.Threshold);
for i = 1:numel(YTrain)
    YTrain{i}(YTrain{i} > thr) = thr;
end

for i = 1:numel(XTrain)
    sequence = XTrain{i};
    sequenceLengths(i) = size(sequence,2);
end

[~,idx] = sort(sequenceLengths,"descend");
XTrain = XTrain(idx);
YTrain = YTrain(idx);

idx = randperm(numel(XTrain),10);
XValidation = XTrain(idx);
XTrain(idx) = [];
YValidation = YTrain(idx);
YTrain(idx) = [];
  • Define Network Architecture defines the architecture for an LSTM network for sequence-to-sequence regression. The network consists of LSTM layers followed by a fully connected layer of size 100 and a dropout layer with a dropout probability of 0.5. The hyperparameters LSTMDepth and NumHiddenUnits specify the number of LSTM layers and the number of hidden units for each layer.

numResponses = size(YTrain{1},1);
featureDimension = size(XTrain{1},1);
LSTMDepth = params.LSTMDepth;
numHiddenUnits = params.NumHiddenUnits;

layers = sequenceInputLayer(featureDimension);

for i = 1:LSTMDepth
    layers = [layers;lstmLayer(numHiddenUnits,OutputMode="sequence")];
end

layers = [layers
    fullyConnectedLayer(100)
    reluLayer()
    dropoutLayer(0.5)
    fullyConnectedLayer(numResponses)
    regressionLayer];
  • Specify Training Options defines the training options for the experiment. Because deeper networks take longer to converge, the number of epochs is set to 300 to ensure all network depths converge. This example validates the network every 30 iterations. The initial learning rate equals the InitialLearnRate value from the hyperparameter table and drops by a factor of 0.2 every 15 epochs. With the training option ExecutionEnvironment set to "auto", the experiment runs on a GPU if one is available. Otherwise, Experiment Manager uses the CPU. Because this example compares network depths and trains for many epochs, using a GPU speeds up training time considerably. Using a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For more information, see GPU Support by Release (Parallel Computing Toolbox).

maxEpochs = 300;
miniBatchSize = 20;

options = trainingOptions("adam", ...
    ExecutionEnvironment="auto", ...
    MaxEpochs=maxEpochs, ...
    MiniBatchSize=miniBatchSize, ...
    ValidationData={XValidation,YValidation}, ...
    ValidationFrequency=30, ...
    InitialLearnRate=params.InitialLearnRate, ...
    LearnRateDropFactor=0.2, ...
    LearnRateDropPeriod=15, ...
    GradientThreshold=1, ...
    Shuffle="never", ...
    Verbose=false);

To inspect the setup function, under Setup Function, click Edit. The setup function opens in MATLAB Editor. In addition, the code for the setup function appears in Appendix 1 at the end of this example.

The Metrics section specifies optional functions that evaluate the results of the experiment. Experiment Manager evaluates these functions each time it finishes training the network. To inspect a metric function, select the name of the metric function and click Edit. The metric function opens in MATLAB Editor.

The prediction of the RUL of an engine requires careful consideration. If the prediction underestimates the RUL, engine maintenance might be scheduled before it is necessary. If the prediction overestimates the RUL, the engine might fail while in operation, resulting in high costs or safety concerns. To help mitigate these scenarios, this example includes a metric function MeanMaxAbsoluteError that identifies networks that underpredict or overpredict the RUL.

The MeanMaxAbsoluteError metric calculates the maximum absolute error, averaged across the entire training set. This metric calls the predict function to make a sequence of RUL predictions from the training set. Then, after calculating the maximum absolute error between each training response and predicted response sequence, the function computes the mean of all maximum absolute errors. This metric identifies the maximum deviations between the actual and predicted responses. The code for the metric function appears in Appendix 3 at the end of this example.
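
As a toy illustration of this computation (with made-up values, not data from this example), each sequence contributes its largest absolute error, and the metric is the mean of those maxima:

YTrue = {[10 9 8],[5 4 3]};   % true RUL sequences
YPred = {[12 9 7],[5 6 3]};   % predicted RUL sequences
maxAbsErrors = cellfun(@(t,p) max(abs(t-p)),YTrue,YPred);  % returns [2 2]
metricOutput = mean(maxAbsErrors)                          % returns 2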

Run Experiment

When you run the experiment, Experiment Manager searches for the best combination of hyperparameters with respect to the chosen metric. Each trial in the experiment uses a new combination of hyperparameter values based on the results of the previous trials.

Training can take some time. To limit the duration of the experiment, you can modify the Bayesian Optimization Options by reducing the maximum running time or the maximum number of trials. However, note that running fewer than 30 trials can prevent the Bayesian optimization algorithm from converging to an optimal set of hyperparameters.

By default, Experiment Manager runs one trial at a time. If you have Parallel Computing Toolbox™, you can run multiple trials at the same time or offload your experiment as a batch job in a cluster.

  • To run one trial of the experiment at a time, on the Experiment Manager toolstrip, under Mode, select Sequential and click Run.

  • To run multiple trials at the same time, under Mode, select Simultaneous and click Run. If there is no current parallel pool, Experiment Manager starts one using the default cluster profile. Experiment Manager then executes multiple simultaneous trials, depending on the number of parallel workers available. For best results, before you run your experiment, start a parallel pool with as many workers as GPUs (see the sketch after this list). For more information, see Use Experiment Manager to Train Networks in Parallel and GPU Support by Release (Parallel Computing Toolbox).

  • To offload the experiment as a batch job, under Mode, select Batch Sequential or Batch Simultaneous, specify your Cluster and Pool Size, and click Run. For more information, see Offload Experiments as Batch Jobs to Cluster.
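
A minimal sketch of the pool setup mentioned above, assuming you have Parallel Computing Toolbox and want one worker per available GPU:

% Start a parallel pool with one worker per GPU, if none is already running
numGPUs = gpuDeviceCount("available");
if numGPUs > 0 && isempty(gcp("nocreate"))
    parpool(numGPUs);
end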

A table of results displays the metric function values for each trial. Experiment Manager highlights the trial with the optimal value for the selected metric. For example, in this experiment, the 23rd trial produces the smallest maximum absolute error.

While the experiment is running, click Training Plot to display the training plot and track the progress of each trial. The elapsed time for a trial to complete training increases with network depth.

Evaluate Results

In the table of results, the MeanMaxAbsoluteError value quantifies how much the network underpredicts or overpredicts the RUL. The Validation RMSE value quantifies how well the network generalizes to unseen data. To find the best result for your experiment, sort the table of results and select the trial that has the lowest MeanMaxAbsoluteError and Validation RMSE values.

  1. Point to the MeanMaxAbsoluteError column.

  2. Click the triangle icon.

  3. Select Sort in Ascending Order.

Similarly, find the trial with the smallest validation RMSE by opening the drop-down menu for the Validation RMSE column and selecting Sort in Ascending Order.

If no single trial minimizes both values, opt for a trial that ranks well for both metrics. For instance, in these results, trial 23 has the smallest mean maximum absolute error and the seventh smallest validation RMSE. Among the trials with a lower validation RMSE, only trial 29 has a comparable mean maximum absolute error. Which of these trials is preferable depends on whether you favor a lower mean maximum absolute error or a lower validation RMSE.

To record observations about the results of your experiment, add an annotation.

  1. In the results table, right-click the MeanMaxAbsoluteError cell of the best trial.

  2. Select Add Annotation.

  3. In the Annotations pane, enter your observations in the text box.

  4. Repeat the previous steps for the Validation RMSE cell.

To test the best trial in your experiment, export the trained network and display the predicted response sequence for several randomly chosen test sequences.

  1. Select the best trial in your experiment.

  2. On the Experiment Manager toolstrip, click Export > Trained Network.

  3. In the dialog window, enter the name of a workspace variable for the exported network. The default name is trainedNetwork.

  4. Use the exported network and the Threshold value of the network as inputs to the helper function plotSequences, which is listed in Appendix 4 at the end of this example. For instance, in the MATLAB Command Window, enter:

plotSequences(trainedNetwork,200)

The function plots the true and predicted response sequences of unseen test data.

Close Experiment

In the Experiment Browser, right-click the name of the project and select Close Project. Experiment Manager closes all of the experiments and results contained in the project.

Appendix 1: Setup Function

This function configures the training data, network architecture, and training options for the experiment.

Input

  • params is a structure with fields from the Experiment Manager hyperparameter table.

Output

  • XTrain is a cell array containing the training data.

  • YTrain is a cell array containing the regression values for training.

  • layers is a layer graph that defines the neural network architecture.

  • options is a trainingOptions object.

function [XTrain,YTrain,layers,options] = SequenceRegressionExperiment_setup1(params)

dataFolder = fullfile(tempdir,"turbofan");
if ~exist(dataFolder,"dir")
    mkdir(dataFolder);
    oldDir = cd(dataFolder);
    filename = "CMAPSSData.zip";
    websave(filename,"https://ti.arc.nasa.gov/c/6/", ...
        weboptions("Timeout",Inf));
    unzip(filename,dataFolder);
    cd(oldDir);
end

filenameTrainPredictors = fullfile(dataFolder,"train_FD001.txt");
[XTrain,YTrain] = processTurboFanDataTrain(filenameTrainPredictors);

XTrain = helperFilter(XTrain);
XTrain = helperNormalize(XTrain);

thr = str2double(params.Threshold);
for i = 1:numel(YTrain)
    YTrain{i}(YTrain{i} > thr) = thr;
end

for i = 1:numel(XTrain)
    sequence = XTrain{i};
    sequenceLengths(i) = size(sequence,2);
end

[~,idx] = sort(sequenceLengths,"descend");
XTrain = XTrain(idx);
YTrain = YTrain(idx);

idx = randperm(numel(XTrain),10);
XValidation = XTrain(idx);
XTrain(idx) = [];
YValidation = YTrain(idx);
YTrain(idx) = [];

numResponses = size(YTrain{1},1);
featureDimension = size(XTrain{1},1);
LSTMDepth = params.LSTMDepth;
numHiddenUnits = params.NumHiddenUnits;

layers = sequenceInputLayer(featureDimension);

for i = 1:LSTMDepth
    layers = [layers;lstmLayer(numHiddenUnits,OutputMode="sequence")];
end

layers = [layers
    fullyConnectedLayer(100)
    reluLayer()
    dropoutLayer(0.5)
    fullyConnectedLayer(numResponses)
    regressionLayer];

maxEpochs = 300;
miniBatchSize = 20;

options = trainingOptions("adam", ...
    ExecutionEnvironment="auto", ...
    MaxEpochs=maxEpochs, ...
    MiniBatchSize=miniBatchSize, ...
    ValidationData={XValidation,YValidation}, ...
    ValidationFrequency=30, ...
    InitialLearnRate=params.InitialLearnRate, ...
    LearnRateDropFactor=0.2, ...
    LearnRateDropPeriod=15, ...
    GradientThreshold=1, ...
    Shuffle="never", ...
    Verbose=false);

end

Appendix 2: Filter and Normalize Predictive Maintenance Data

The helper function helperFilter filters the data by removing features with constant values. Features that remain constant for all time steps can negatively impact the training.

function [XTrain,XTest] = helperFilter(XTrain,XTest)

m = min([XTrain{:}],[],2);
M = max([XTrain{:}],[],2);
idxConstant = M == m;

for i = 1:numel(XTrain)
    XTrain{i}(idxConstant,:) = [];
    if nargin > 1
        XTest{i}(idxConstant,:) = [];
    end
end

end

The helper function helperNormalize normalizes the training and test predictors to have zero mean and unit variance.

function [XTrain,XTest] = helperNormalize(XTrain,XTest)

mu = mean([XTrain{:}],2);
sig = std([XTrain{:}],0,2);

for i = 1:numel(XTrain)
    XTrain{i} = (XTrain{i} - mu) ./ sig;
    if nargin > 1
        XTest{i} = (XTest{i} - mu) ./ sig;
    end
end

end

Appendix 3: Compute Mean of Maximum Absolute Errors

This metric function calculates the maximum absolute error of the trained network, averaged over the training set.

function metricOutput = MeanMaxAbsoluteError(trialInfo)

net = trialInfo.trainedNetwork;
thr = str2double(trialInfo.parameters.Threshold);

filenamePredictors = fullfile(tempdir,"turbofan","train_FD001.txt");
[XTrain,YTrain] = processTurboFanDataTrain(filenamePredictors);

XTrain = helperFilter(XTrain);
XTrain = helperNormalize(XTrain);

for i = 1:numel(YTrain)
    YTrain{i}(YTrain{i} > thr) = thr;
end

YPred = predict(net,XTrain,MiniBatchSize=1);

maxAbsErrors = zeros(1,numel(YTrain));
for i = 1:numel(YTrain)
    absError = abs(YTrain{i}-YPred{i});
    maxAbsErrors(i) = max(absError);
end

metricOutput = mean(maxAbsErrors);

end

Appendix 4: Plot Predictive Maintenance Sequences

This function plots the true and predicted response sequences to allow you to evaluate the performance of your trained network. This function uses the helper functions helperFilter and helperNormalize, which are listed in Appendix 2.

function plotSequences(net,threshold)

filenameTrainPredictors = fullfile(tempdir,"turbofan","train_FD001.txt");
filenameTestPredictors = fullfile(tempdir,"turbofan","test_FD001.txt");
filenameTestResponses = fullfile(tempdir,"turbofan","RUL_FD001.txt");

[XTrain,YTrain] = processTurboFanDataTrain(filenameTrainPredictors);
[XTest,YTest] = processTurboFanDataTest(filenameTestPredictors,filenameTestResponses);

[XTrain,XTest] = helperFilter(XTrain,XTest);
[~,XTest] = helperNormalize(XTrain,XTest);

for i = 1:numel(YTrain)
    YTrain{i}(YTrain{i} > threshold) = threshold;
    YTest{i}(YTest{i} > threshold) = threshold;
end

YPred = predict(net,XTest,MiniBatchSize=1);

idx = randperm(100,4);
figure
for i = 1:numel(idx)
    subplot(2,2,i)
    plot(YTest{idx(i)},"--")
    hold on
    plot(YPred{idx(i)},".-")
    hold off
    ylim([0 threshold+25])
    title("Test Observation "+idx(i))
    xlabel("Time Step")
    ylabel("RUL")
end
legend(["Test Data" "Predicted"],Location="southwest")

end

References

[1] Saxena, Abhinav, Kai Goebel, Don Simon, and Neil Eklund. "Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation." 2008 International Conference on Prognostics and Health Management (2008): 1–9.

[2] Jozefowicz, Rafal, Wojciech Zaremba, and Ilya Sutskever. "An Empirical Exploration of Recurrent Network Architectures." Proceedings of the 32nd International Conference on Machine Learning (2015): 2342–2350.

[3] Saxena, Abhinav, and Kai Goebel. "Turbofan Engine Degradation Simulation Data Set." NASA Ames Prognostics Data Repository, https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/, NASA Ames Research Center, Moffett Field, CA.
