1 Introduction

The aim of this project is to develop a complete methodology for training artificial neural networks (ANNs) using genetic algorithms (GAs) and evolutionary algorithms (EAs) as alternatives to the classical backpropagation approach. These optimization methods, inspired by natural evolution, iteratively improve a population of candidate solutions in order to find the best possible one.

In this project, we will explore the use of GAs and EAs to optimize ANNs in various ways, including the evolution of connection weights, architectures, hyperparameters, activation functions, and learning rules. While these techniques have several advantages over traditional derivative-based optimization methods, they can be computationally expensive and the error landscape for ANNs can be complex and noisy.

To evaluate the performance of each method, we will work with synthetic data, which allows us to control the sample size, the amount of noise, and the difficulty of the problem. Specifically, we will create a classification problem and compare the results obtained with each method.

2 Setup

These are the libraries we will use throughout the project:

library(GA)
## Loading required package: foreach
## Loading required package: iterators
## Package 'GA' version 3.2.3
## Type 'citation("GA")' for citing this R package in publications.
## 
## Attaching package: 'GA'
## The following object is masked from 'package:utils':
## 
##     de
library(cmaesr)
## Loading required package: ParamHelpers
## Loading required package: BBmisc
## 
## Attaching package: 'BBmisc'
## The following object is masked from 'package:base':
## 
##     isFALSE
## Loading required package: checkmate
## Loading required package: smoof
library(nnet)
library(ggplot2)
library(mlbench)
library(tictoc)
library(caret)
## Loading required package: lattice

We will also set the seed:

set.seed(1)

3 Synthetic data

In the following code snippets, we show how synthetic data can be generated with mlbench.simplex. We fix the sample size and the dimension, and vary the sd parameter, which controls the amount of noise:

data <- mlbench.simplex(n=1000,d=2, sd=0.25)
plot(data)


data <- mlbench.simplex(n=1000,d=2, sd=0.5)
plot(data)


data <- mlbench.simplex(n=1000,d=2, sd=0.1)
plot(data)

4 Split data (train, validation, test)

The following function receives a dataframe and the proportions we use to split the data into train, validation and test sets. We will use 0.6 / 0.2 / 0.2 (i.e. 60%, 20% and 20%).

split_data <- function(df, perc_train, perc_val, perc_test) {

  #compute sample size
  ssTrain <- floor(perc_train*nrow(df))
  ssVal <- floor(perc_val*nrow(df))
  ssTest <- floor(perc_test*nrow(df))
  
  #create random indices separation
  idTrain <- sort(sample(seq_len(nrow(df)), size=ssTrain))
  idnotTrain <- setdiff(seq_len(nrow(df)), idTrain)
  idVal <- sort(sample(idnotTrain, size=ssVal))
  idTest <- setdiff(idnotTrain, idVal)
  
  splittedData <- list("train" = df[idTrain,], "validation" = df[idVal,], "test" = df[idTest,])
  
  return (splittedData)
}

#USAGE
data <- mlbench.simplex(n=1000,d=2, sd=0.25)
data <- as.data.frame(data)
splittedData <- split_data(data, 0.6, 0.2, 0.2)
train_data <- splittedData$train
validation_data <- splittedData$validation
test_data <- splittedData$test

summary(train_data)
##       x.1                 x.2           classes
##  Min.   :-1.190644   Min.   :-1.01688   1:148  
##  1st Qu.:-0.388814   1st Qu.:-0.37066   2:154  
##  Median : 0.021266   Median :-0.12154   3:148  
##  Mean   : 0.006736   Mean   :-0.01402          
##  3rd Qu.: 0.393153   3rd Qu.: 0.39190          
##  Max.   : 1.071177   Max.   : 1.19497
summary(validation_data)
##       x.1                x.2           classes
##  Min.   :-1.11032   Min.   :-0.90915   1:49   
##  1st Qu.:-0.38421   1st Qu.:-0.38130   2:45   
##  Median :-0.04684   Median :-0.02484   3:56   
##  Mean   :-0.02996   Mean   : 0.05155          
##  3rd Qu.: 0.31637   3rd Qu.: 0.52201          
##  Max.   : 0.99123   Max.   : 1.09855
summary(test_data)
##       x.1                  x.2           classes
##  Min.   :-1.0663598   Min.   :-0.93273   1:53   
##  1st Qu.:-0.3300275   1st Qu.:-0.41094   2:51   
##  Median : 0.0131126   Median :-0.20340   3:46   
##  Mean   : 0.0007551   Mean   :-0.04815          
##  3rd Qu.: 0.3745331   3rd Qu.: 0.39461          
##  Max.   : 0.8543368   Max.   : 1.03875

5 Back-propagation approach

This is an implementation of the classical back-propagation technique to train artificial neural networks. The function returns a list with the trained model, its accuracy on the training, validation and test sets, the confusion matrix and the training time.


#Back propagation

train_BP <- function(dt, dimensions, n_units, max_it) {

  #one hot encoder
  dummy <- dummyVars(" ~ .", data=dt)
  dt <- data.frame(predict(dummy, newdata=dt))
  
  #split data
  splittedData <- split_data(dt, 0.6, 0.2, 0.2)
  train_dt <- splittedData$train
  val_dt <- splittedData$validation
  test_dt <- splittedData$test
  
  target <- dimensions+1
  input <- dimensions

  tic('train') #init timer
  #train nn with backpropagation technique
  nn <- nnet(train_dt[,1:input], train_dt[,target:ncol(train_dt)], size = n_units, maxit = max_it, softmax=T, trace=F)
  toc(log = T, quiet=T) #stop timer
  log.lst <- tic.log(format = FALSE)
  tic.clearlog()
  
  #model evaluation
  evaluate.CM <- function(true, pred) {
    true <- max.col(true)
    pred <- max.col(pred)
    table(true, pred)
  }

  pred<-predict(nn, test_dt[,1:input])
  confusionMat <- evaluate.CM(test_dt[,target:ncol(test_dt)], pred)
  test_accuracy <- mean(max.col(test_dt[,target:ncol(test_dt)])==max.col(pred))
  pred<-predict(nn, train_dt[,1:input])
  train_accuracy <- mean(max.col(train_dt[,target:ncol(train_dt)])==max.col(pred))
  pred<-predict(nn, val_dt[,1:input])
  val_accuracy <- mean(max.col(val_dt[,target:ncol(val_dt)])==max.col(pred))
  
  results <- list('nn' = nn, 'trainAccuracy' = train_accuracy, 'valAccuracy' = val_accuracy, 'testAccuracy' = test_accuracy, 'cm' = confusionMat, 'time' = unlist(lapply(log.lst, function(x) x$toc - x$tic))[1])
  
} 

#USAGE
dim <- 3
data <- mlbench.simplex(n=1000, d=dim, sd=0.3)

plot(data)

dt <- as.data.frame(data)
#results <- train_BP(dt, dim, 50, 1000)

Accuracy of the algorithm for each data portion:

Confusion matrix:

Time:

6 GA approach

This is an implementation of the GA technique to train artificial neural networks. Here is an overview of how genetic algorithms work (a toy illustration using the GA package follows the list):

  1. Initialize the population: The population of solutions, also known as chromosomes, is typically initialized randomly, with each solution representing a potential solution to the optimization problem.

  2. Evaluate the fitness of each solution: We compute the fitness function of each of the solutions. Our goal is to optimize such fitness function.

  3. Select the fittest solutions: The fittest solutions are selected from the population using a selection function. This function typically selects solutions with higher fitness values with higher probability.

  4. Breed new solutions: New solutions are created by combining the selected solutions using genetic operators such as crossover (combining two solutions to create a new solution) and mutation (randomly changing the values of a solution).

  5. Repeat the process: The process is repeated for a number of iterations until a satisfactory solution is found.
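
To make these steps concrete, here is a minimal toy sketch, not part of the project code, that uses the ga() function from the "GA" package [1] to maximize a simple one-dimensional function; all parameter values are purely illustrative.

#toy example: maximize f(x) = -(x - 3)^2 over [-10, 10]
toy_fitness <- function(x) -(x - 3)^2          # step 2: fitness of each candidate

toy_ga <- ga(type = "real-valued",             # step 1: random real-valued population
             fitness = toy_fitness,
             lower = -10, upper = 10,          # bounds of the search space
             popSize = 20,                     # population size
             maxiter = 50,                     # step 5: number of generations
             monitor = FALSE)                  # steps 3-4 (selection, crossover, mutation)
                                               # use the package's default operators
summary(toy_ga)                                # the best solution should be close to x = 3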

For the implementation of the genetic algorithm, we use the ga() function provided by the R package "GA" [1]. In particular, we use this algorithm to maximize a custom fitness function. This function generates an MLP with the weights and number of hidden units determined by the genetic algorithm and returns the accuracy of this network on the training data. To represent this, we set a maximum number of hidden units for the neural network architecture. This way, the size of each individual is fixed: the first gene in the chain encodes the number of hidden units, the next N genes are used as weights of the neural network, and the remaining ones are ignored. Naturally, the number N of parameters depends on the number of hidden units chosen in each iteration [3].
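
To make this encoding explicit, the following hypothetical helper (shown only for illustration; the implementation below inlines this logic in the fitness function) decodes a chromosome x into the number of hidden units and the weight vector of a single-hidden-layer nnet with n_features inputs and n_classes outputs.

#hypothetical helper illustrating the chromosome encoding described above
decode_chromosome <- function(x, n_features, n_classes) {
  n_hidden <- floor(x[1])                      # first gene: number of hidden units
  #number of nnet weights (including biases) for this architecture
  num_param <- n_features * n_hidden + n_hidden * n_classes + n_hidden + n_classes
  Wts <- x[2:(num_param + 1)]                  # next num_param genes: weights
  #any remaining genes are ignored
  list(n_hidden = n_hidden, Wts = Wts)
}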

As before, the following function returns a list with the trained model, its accuracy on the training, validation and test sets, the confusion matrix and the training time.

train_GA <- function(dt, dimensions, max_it) { 
  #one hot encoder 
  dummy <- dummyVars(" ~ .", data=dt)
  dt <- data.frame(predict(dummy, newdata=dt))
  
  #split data
  splittedData <- split_data(dt, 0.6, 0.2, 0.2)
  train_dt <- splittedData$train
  val_dt <- splittedData$validation
  test_dt <- splittedData$test
  
  target <- dimensions+1
  input <- dimensions
  
  n_features <- ncol(train_dt[,1:input])
  n_classes <- ncol(train_dt[,target:ncol(train_dt)])
  
  #fitness function based on train accuracy
  fitness <- function(x) {
    n_hidden <- floor(x[1]) # Number of neurons
    num_param <- n_features * n_hidden + n_hidden * n_classes + n_hidden + n_classes
    Wts <- x[(2:(num_param+1))] # Weights
    nn <- nnet(train_dt[,1:input], train_dt[,target:ncol(train_dt)], size = n_hidden, Wts = Wts, maxit = 0, trace=F, softmax=T)
    pred<-predict(nn, train_dt[,1:input])
    accuracy <- mean(max.col(train_dt[,target:ncol(train_dt)])==max.col(pred))
    return(accuracy)
  }
  
  # Bounds for number of neurons
  neuron_bounds = c(2,50)
  max_param <- n_features * neuron_bounds[2] + neuron_bounds[2] * n_classes + neuron_bounds[2] + n_classes
  
  # Bounds for weights
  bounds <- matrix(nrow = max_param, ncol=2)
  for(row in 1:max_param){
    bounds[row,] <- c(-100,100)
  }

  tic('train') #start timer
  #train nn with GA technique
  gann <- ga(type = "real-valued", fitness = fitness, maxFitness = 1, lower = c(neuron_bounds[1], bounds[, 1]), upper = c(neuron_bounds[2], bounds[, 2]),  popSize = 50, maxiter = max_it, run = 100, monitor=F)
  toc(log = T, quiet=T) #stop timer
  log.lst <- tic.log(format = FALSE)
  tic.clearlog()
  
  n_units <- floor(head(gann@solution, 1)[1]) # Number of neurons
  num_param <- n_features * n_units + n_units * n_classes + n_units + n_classes # Number of weights
  W <- head(gann@solution, 1)[(2:(num_param+1))] # Weights
  nn <- nnet(train_dt[,1:input], train_dt[,target:ncol(train_dt)], size = n_units, Wts = W, maxit = 0, trace=F, softmax=T) #Result NN

  #Model evaluation
  evaluate.CM <- function(true, pred) {
    true <- max.col(true)
    pred <- max.col(pred)
    table(true, pred)
  }

  pred<-predict(nn, test_dt[,1:input])
  confusionMat <- evaluate.CM(test_dt[,target:ncol(test_dt)], pred)
  test_accuracy <- mean(max.col(test_dt[,target:ncol(test_dt)])==max.col(pred))
  pred<-predict(nn, train_dt[,1:input])
  train_accuracy <- mean(max.col(train_dt[,target:ncol(train_dt)])==max.col(pred))
  pred<-predict(nn, val_dt[,1:input])
  val_accuracy <- mean(max.col(val_dt[,target:ncol(val_dt)])==max.col(pred))
  
  results <- list('ga' = gann, 'nn' = nn, 'trainAccuracy' = train_accuracy, 'valAccuracy' = val_accuracy, 'testAccuracy' = test_accuracy, 'cm' = confusionMat, 'time' = unlist(lapply(log.lst, function(x) x$toc - x$tic))[1])
}


#USAGE
dim <- 2
data <- mlbench.simplex(n=1000, d=dim, sd=0.3)

plot(data)

dt <- as.data.frame(data)
#results <- train_GA(dt, dim, 1000)

#plot(results$ga)

Accuracy of the algorithm for each data portion:

Confusion matrix:

Time:

Summary:

7 EA approach

One key difference between GAs and EAs is the way in which they represent candidate solutions. In GAs, candidate solutions are typically represented as strings of symbols, called chromosomes, which encode the potential solutions to a problem. These chromosomes can be modified through processes such as crossover (exchange of genetic material between two chromosomes) and mutation (random modification of a chromosome). In EAs, candidate solutions can be represented in a variety of ways, including as numerical values, vectors, or even complex objects such as neural networks.

Another difference between GAs and EAs is the way in which they explore the space of possible solutions. GAs typically use a fixed set of operations, such as crossover and mutation, to generate new candidate solutions. EAs, on the other hand, can use a wider range of techniques, such as mutation, crossover, selection and recombination, to generate and improve candidate solutions.

This is an implementation of the EA technique to train artificial neural networks. In this case, we use the "cmaesr" library [2]. The implementation in this section is very similar to the genetic algorithm described in the previous section. The main difference is that the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) minimizes the objective function [4]; for this reason, the fitness function here returns the error of the neural network (1 - accuracy) instead of the accuracy. As before, the function returns a list with the trained model, its accuracy on the training, validation and test sets, the confusion matrix and the training time.
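
Before the full implementation, here is a minimal toy sketch, not part of the project code, showing the cmaesr interface on a simple two-dimensional sphere function; all parameter values are purely illustrative.

#toy example: minimize the 2-dimensional sphere function with CMA-ES
sphere <- makeSingleObjectiveFunction(
  name = "sphere",
  fn = function(x) sum(x^2),
  par.set = makeNumericParamSet("x", len = 2, lower = -5, upper = 5)
)

toy_ea <- cmaes(
    sphere,
    monitor = NULL,
    control = list(
        sigma = 1.5,                           # initial step size
        lambda = 20,                           # number of offspring per generation
        stop.ons = c(
            list(stopOnMaxIters(100)),         # stop after 100 iterations
            getDefaultStoppingConditions()     # or after default stopping conditions
        )
    )
)
toy_ea$best.param                              # should be close to c(0, 0)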

train_EA <- function(dt, dimensions, max_it) { 
  dummy <- dummyVars(" ~ .", data=dt)
  dt <- data.frame(predict(dummy, newdata=dt))
  
  #split data
  splittedData <- split_data(dt, 0.6, 0.2, 0.2)
  train_dt <- splittedData$train
  val_dt <- splittedData$validation
  test_dt <- splittedData$test
  
  target <- dimensions+1
  input <- dimensions
  
  n_features <- ncol(train_dt[,1:input])
  n_classes <- ncol(train_dt[,target:ncol(train_dt)])
  
  # fitness function based on error
  fitness <- function(x) {
    n_hidden <- floor(x[1]) # Number of neurons
    num_param <- n_features * n_hidden + n_hidden * n_classes + n_hidden + n_classes
    Wts <- x[(2:(num_param+1))] # Weights
    
    nn <- nnet(train_dt[,1:input], train_dt[,target:ncol(train_dt)], size = n_hidden, Wts = Wts, maxit = 0, trace=F, softmax=T)
    pred<-predict(nn, train_dt[,1:input])
    accuracy <- mean(max.col(train_dt[,target:ncol(train_dt)])==max.col(pred))

    return(1-accuracy)
  }
  
  # Bounds for number of neurons
  neuron_bounds = c(2,50)
  max_param <- n_features * neuron_bounds[2] + neuron_bounds[2] * n_classes + neuron_bounds[2] + n_classes
  # Bounds for weights
  bounds <- matrix(nrow = max_param, ncol=2)
  for(row in 1:max_param){
    bounds[row,] <- c(-100,100)
  }
  
  # custom objective function  
  obj.fn = makeSingleObjectiveFunction(
    name = "fitness",
    fn = fitness,
    par.set = makeNumericParamSet("x", len = max_param+1, lower = c(neuron_bounds[1], bounds[, 1]), upper = c(neuron_bounds[2], bounds[, 2]))
  )

  tic('train') #start timer
  #train nn with ea technique
  ea = cmaes(
      obj.fn, 
      monitor = NULL,
      control = list(
          sigma = 1.5, # initial step size
          lambda = 50, # number of offspring
          stop.ons = c(
              list(stopOnMaxIters(max_it),stopOnOptValue(0, tol = 1e-08)), # stop after max_it iterations or when zero error is reached
              getDefaultStoppingConditions() # or after default stopping conditions
          )
      )
  )
  toc(log = T, quiet=T)# stop timer
  log.lst <- tic.log(format = FALSE)
  tic.clearlog()
  
  n_units <- floor(ea$best.param[1]) # Number of neurons
  num_param <- n_features * n_units + n_units * n_classes + n_units + n_classes
  W <- ea$best.param[(2:(num_param+1))] # Weights
  nn <- nnet(train_dt[,1:input], train_dt[,target:ncol(train_dt)], size = n_units, Wts = W, maxit = 0, trace=F, softmax=T) # result nn

  # Model evaluation
  evaluate.CM <- function(true, pred) {
    true <- max.col(true)
    pred <- max.col(pred)
    table(true, pred)
  }

  pred<-predict(nn, test_dt[,1:input])
  confusionMat <- evaluate.CM(test_dt[,target:ncol(test_dt)], pred)
  test_accuracy <- mean(max.col(test_dt[,target:ncol(test_dt)])==max.col(pred))
  pred<-predict(nn, train_dt[,1:input])
  train_accuracy <- mean(max.col(train_dt[,target:ncol(train_dt)])==max.col(pred))
  pred<-predict(nn, val_dt[,1:input])
  val_accuracy <- mean(max.col(val_dt[,target:ncol(val_dt)])==max.col(pred))
  
  results <- list('ea' = ea, 'nn' = nn, 'trainAccuracy' = train_accuracy, 'valAccuracy' = val_accuracy, 'testAccuracy' = test_accuracy, 'cm' = confusionMat, 'time' = unlist(lapply(log.lst, function(x) x$toc - x$tic))[1])
}

#USAGE
dim <- 2
data <- mlbench.simplex(n=1000, d=dim, sd=0.3)

dt <- as.data.frame(data)
#results <- train_EA(dt, dim, 1000)

Accuracy of the algorithm for each data portion:

Confusion matrix:

Time:

Summary:

8 Analysis of the performance of the methods

In this section, we compare the three techniques we have used to train ANNs. We will base our conclusions on three scores: the computational time, the accuracy obtained and the size of the architecture, i.e. the number of hidden units, which gives a rough measure of model complexity. With these, we will try to see which technique performs best and how their performance changes with the data they are given.

Firstly, though, let’s take a look at the following consideration.

8.1 Number of hidden units in BP

When using the back-propagation technique, a predetermined number of neurons for the hidden layer is required. In contrast, genetic algorithms (GAs) and evolutionary algorithms (EAs) allow the number of units to be optimized as one of the parameters. This raises the question of whether the number of hidden units chosen for BP is close to optimal, which matters if we want a fair comparison between BP and the GA or EA approaches.

The following code is devoted to computing the train, validation and test accuracy as a function of the number of hidden units.
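
A minimal sketch of such a sweep, reusing the train_BP function defined above (the data set parameters and the range of hidden units shown here are assumptions), could look as follows:

#sketch of the sweep over hidden-layer sizes (illustrative values)
dim <- 2
data <- mlbench.simplex(n=1000, d=dim, sd=0.3)
dt <- as.data.frame(data)

units_seq <- seq(from = 5, to = 100, by = 5)
acc_units <- c()

for(n_units in units_seq){
  res <- train_BP(dt, dim, n_units, 1000)
  acc_units <- rbind(acc_units, c(res$trainAccuracy, res$valAccuracy, res$testAccuracy))
}

plot(units_seq, acc_units[,1], type = 'l', ylab = "BP Accuracy", ylim = c(0.5,1), xlab = "Number of hidden units", lwd = 2)
lines(units_seq, acc_units[,2], type = 'l', col = 'green', lwd = 2)
lines(units_seq, acc_units[,3], type = 'l', col = 'red', lwd = 2)
legend("bottomright", c("Train", "Validation", "Test"),
       lty = 1, col = c("black", "green", "red"), lwd = 2)
title("Accuracy in terms of number of hidden units - BP")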

As we can see, the number of neurons in the hidden layer has some impact on accuracy, but the effect is rather noisy and relatively small. Nevertheless, we do observe a tendency to overfit as the number of neurons increases, which leads to a decrease in the validation and test accuracy. The choice of this number is therefore somewhat arbitrary, but it seems better to take a small value in order to avoid overfitting. From now on, we will work with 15 hidden units.

In the following sections, as announced, we give an overall analysis of the performance of the three techniques as the data changes: we measure the computational time, the accuracy and the optimal number of hidden units while varying the amount of data and its dispersion.

8.2 Performance in terms of data size

In the following cell, we create 15 data sets of increasing size and, for each, save the time, the three accuracies and the number of hidden neurons (in the case of GA and EA). Afterwards, we plot these quantities against the number of samples. A sketch of this loop is shown below.
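
A minimal sketch of this loop, reusing the three training functions defined above (the specific sample sizes are assumed for illustration), could look as follows:

#sketch of the data-size loop (sample sizes are illustrative)
dim <- 2
sizes <- seq(from = 200, to = 3000, by = 200)   # 15 data set sizes

time_BP <- c()
acc_BP <- c()

time_GA <- c()
acc_GA <- c()
n_hidden_GA <- c()

time_EA <- c()
acc_EA <- c()
n_hidden_EA <- c()

for(n_samples in sizes){
  data <- mlbench.simplex(n=n_samples, d=dim, sd=0.3)
  dt <- as.data.frame(data)

  # BP data
  results_BP <- train_BP(dt, dim, 15, 1000)
  time_BP <- c(time_BP, results_BP$time)
  acc_BP <- rbind(acc_BP, c(results_BP$trainAccuracy, results_BP$valAccuracy, results_BP$testAccuracy))

  # GA data
  results_GA <- train_GA(dt, dim, 1000)
  time_GA <- c(time_GA, results_GA$time)
  acc_GA <- rbind(acc_GA, c(results_GA$trainAccuracy, results_GA$valAccuracy, results_GA$testAccuracy))
  n_hidden_GA <- c(n_hidden_GA, floor(results_GA$ga@solution[1]))

  # EA data
  results_EA <- train_EA(dt, dim, 500)
  time_EA <- c(time_EA, results_EA$time)
  acc_EA <- rbind(acc_EA, c(results_EA$trainAccuracy, results_EA$valAccuracy, results_EA$testAccuracy))
  n_hidden_EA <- c(n_hidden_EA, floor(results_EA$ea$best.param[1]))
}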

8.2.1 Time

Surprisingly, while the first plot exhibits a linear trend, the other two are noisier, with fluctuating values. In all cases, however, the times tend to increase with the number of samples, which is expected. In the GA plot, the values start at around 20 seconds and, in the worst cases, reach almost ten times that amount. In the EA plot, the algorithm takes even longer, starting at 60 seconds and reaching almost 250 seconds in some cases.

Regardless of this, the time needed to train the ANN is considerably lower in the case of BP, where even the biggest data set is processed in less than a second. It is therefore clear that running time is a major drawback for both GAs and EAs.

One of the main reasons that these techniques can take a lot of time is because they involve a large number of iterations, each of which requires computation and evaluation. Depending on the size and complexity of the problem being solved, the algorithm may need to evaluate thousands or even millions of possible solutions in order to find the optimal one. Additionally, these algorithms may need to run for a long time in order to explore the entire solution space and find the global optimal solution, rather than getting stuck in a local minimum.

8.2.2 Accuracy

As we can see in the previous plots, the accuracy we obtain does not seem to be significantly affected by the size of the data, regardless of which split of the data we refer to. The last plot exhibits more variability than the other two, with some values reaching as low as 0.5, but we observe no clear trend and such fluctuations appear to be random.

Another observation from the plots is that the Train, Validation and Test lines almost overlap. This is a good sign: it means that no overfitting is taking place, i.e. the ANNs trained with the three techniques generalize well the patterns they learn.

8.2.3 Number of hidden units

As we have said, the number of neurons in the hidden layer is one of the parameters that, in our setup, GAs and EAs have to optimize. Recall that we had set bounds so that this number lies between 2 and 50.

The horizontal dotted lines are the average of all the scores. We get the following mean values:

Again, these plots exhibit a relatively random behavior. However, it is worth noting that the algorithms rarely push towards the upper bound, typically settling well below the 50 neurons they were allowed to use. This suggests that, in some cases, a high model complexity may slightly reduce performance.

8.3 Performance in terms of standard deviation

We have just analysed how the time, accuracy and optimal number of hidden units are affected by the size of the data set. We can now ask how the same variables are affected by the standard deviation. We expect the accuracy to drop significantly, since as the sd parameter increases, the classes overlap more and the data becomes much harder to classify.

dim <- 2
sequence <- seq(from = 0.1, to = 1, by = 0.1)

time_BP <- c()
acc_BP <- c()

time_GA <- c()
acc_GA <- c()
n_hidden_GA <- c()

time_EA <- c()
acc_EA <- c()
n_hidden_EA <- c()

for(st_dv in sequence){
  data <- mlbench.simplex(n=1000, d=dim, sd=st_dv)
  dt <- as.data.frame(data)
  
  # BP data
  results_BP <- train_BP(dt, dim, 15, 1000)
  time_BP <- c(time_BP, results_BP$time)
  acc_BP <- rbind(acc_BP, c(results_BP$trainAccuracy, results_BP$valAccuracy, results_BP$testAccuracy))

  # GA data
  results_GA <- train_GA(dt, dim, 1000)
  time_GA <- c(time_GA, results_GA$time)
  acc_GA <- rbind(acc_GA, c(results_GA$trainAccuracy, results_GA$valAccuracy, results_GA$testAccuracy))
  n_hidden_GA <- c(n_hidden_GA, floor(results_GA$ga@solution[1]))

  # EA data
  results_EA <- train_EA(dt, dim, 500)
  time_EA <- c(time_EA, results_EA$time)
  acc_EA <- rbind(acc_EA, c(results_EA$trainAccuracy, results_EA$valAccuracy, results_EA$testAccuracy))
  n_hidden_EA <- c(n_hidden_EA, floor(results_EA$ea$best.param[1]))
}

8.3.1 Time


# BP
plot(sequence, time_BP, type = 'l', ylab = "BP Time", col = 'red', xlab = "Standard deviation", lwd = 2)
title("Computational time in terms of standard deviation - BP")


# GA
plot(sequence, time_GA, type = 'l', ylab = "GA Time", col = 'red', xlab = "Standard deviation", lwd = 2)
title("Computational time in terms of standard deviation - GA")


# EA
plot(sequence, time_EA, type = 'l', xlab = "Standard deviation", col = 'red', lwd = 2, ylab = "EA Time")
title("Computational time in terms of standard deviation - EA")

As we saw in the previous section, the BP technique consistently requires less time than the GA and EA techniques. However, when the standard deviation of the data is high, we observe a subtle increase in the running time, especially for GA and EA. This is because the higher standard deviation makes it more difficult to label the samples accurately, leading these two approaches to use more hidden units; as a result, more time is needed for the additional weights that have to be evaluated.

8.3.2 Accuracy

# BP
plot(sequence, acc_BP[,1], type = 'l', ylab = "BP Accuracy", ylim = c(0.3,1), xlab = "Standard deviation", lwd = 2)
lines(sequence, acc_BP[,2], type = 'l', col = 'green', lwd = 2)
lines(sequence, acc_BP[,3], type = 'l', col = 'red', lwd = 2)
legend("topright", c("Train", "Validation", "Test"),
       lty = 1, col = c("black", "green", "red"), lwd = 2)
title("Accuracy in terms of standard deviation - BP")


# GA
plot(sequence, acc_GA[,1], type = 'l', ylab = "GA Accuracy", ylim = c(0.3,1), xlab = "Standard deviation", lwd = 2)
lines(sequence, acc_GA[,2], type = 'l', col = 'green', lwd = 2)
lines(sequence, acc_GA[,3], type = 'l', col = 'red', lwd = 2)
legend("topright", c("Train", "Validation", "Test"),
       lty = 1, col = c("black", "green", "red"), lwd = 2)
title("Accuracy in terms of standard deviation - GA")


# EA
plot(sequence, acc_EA[,1], type = 'l', ylab = "EA Accuracy", ylim = c(0,1), xlab = "Standard deviation", lwd = 2)
lines(sequence, acc_EA[,2], type = 'l', col = 'green', lwd = 2)
lines(sequence, acc_EA[,3], type = 'l', col = 'red', lwd = 2)
legend("topright", c("Train", "Validation", "Test"),
       lty = 1, col = c("black", "green", "red"), lwd = 2)
title("Accuracy in terms of standard deviation - EA")

As expected, the accuracy deteriorates significantly as the sd parameter increases, since the data becomes more challenging to classify. In most cases, we see a decrease from a perfect score down to around 0.5. The EA case, however, starts from a lower initial accuracy and performs relatively poorly compared to the other two.

8.3.3 Number of hidden units

plot(sequence, n_hidden_GA, col = 'red', type = 'l', ylab = "Number of hidden units", xlab = "Standard deviation", ylim = c(1,60), lwd = 2)
avg = mean(n_hidden_GA)
abline(h = avg, lty = 'dotted', col = 'red')
title("Optimal number of hidden units in terms of standard deviation - GA")


plot(sequence, n_hidden_EA, type = 'l', ylab = "Number of hidden units", xlab = "Standard deviation", ylim = c(1,60), lwd = 2, col = 'red')
avg = mean(n_hidden_EA)
abline(h = avg, lty = 'dotted', col = 'red')
title("Optimal number of hidden units in terms of standard deviation - EA")

As we can see, the optimal number of hidden units is difficult to determine from the plots, which exhibit a lot of variability. In general, however, the number of hidden units tends to fluctuate between 20 and 30, even though any value between 2 and 50 was allowed. There is also a slight tendency for the number of hidden units to increase with the standard deviation of the data, possibly due to the increased difficulty in accurately classifying the samples; in these cases, a more complex model may be necessary to achieve satisfactory results.

mean(n_hidden_GA)
## [1] 31.7
mean(n_hidden_EA)
## [1] 29.1
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ✔ purrr   1.0.0      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::accumulate() masks foreach::accumulate()
## ✖ dplyr::coalesce()   masks BBmisc::coalesce()
## ✖ dplyr::collapse()   masks BBmisc::collapse()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ purrr::lift()       masks caret::lift()
## ✖ purrr::when()       masks foreach::when()
library(ggpubr)
library(rstatix)
## 
## Attaching package: 'rstatix'
## 
## The following object is masked from 'package:stats':
## 
##     filter

9 Nemenyi test

The Nemenyi test is a non-parametric statistical procedure used to compare the mean ranks of multiple methods and determine whether there are significant differences between them. In the diagrams we produce below, each method is placed on the x-axis and its mean rank is plotted on the y-axis, together with an interval whose length equals the critical difference (CD), which is computed from the number of methods and the number of results. If the intervals of two methods do not overlap, their mean ranks are significantly different.
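
For reference, the critical difference used below follows the standard formula from Demšar [5]: CD = q_alpha * sqrt(k*(k+1)/(6*N)), where k is the number of methods compared, N is the number of results per method and q_alpha is the Nemenyi critical value; for k = 3 and alpha = 0.05, q_alpha = 2.343, which is the value used in the code below.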

9.1 Time and test accuracy plots

In this section, we create a standard data set and train our ANN with each of the three techniques. This is repeated 5 times, and each time we save the running time and the accuracy on the test split. With these values we plot two diagrams, one for time and one for test accuracy, which allow us to compare the three methods and see which is better with respect to each variable.

dim <- 2
data <- mlbench.simplex(n=1000, d=dim, sd=0.3)
dt <- as.data.frame(data)

time_BP <- c()
acc_BP <- c()

time_GA <- c()
acc_GA <- c()

time_EA <- c()
acc_EA <- c()

set.seed(NULL)
for(iter in 1:5){

  # BP data
  results_BP <- train_BP(dt, dim, 15, 1000)
  time_BP <- c(time_BP, results_BP$time)
  acc_BP <- c(acc_BP, results_BP$testAccuracy)

  # GA data
  results_GA <- train_GA(dt, dim, 1000)
  time_GA <- c(time_GA, results_GA$time)
  acc_GA <- c(acc_GA, results_GA$testAccuracy)

  # EA data
  results_EA <- train_EA(dt, dim, 500)
  time_EA <- c(time_EA, results_EA$time)
  acc_EA <- c(acc_EA, results_EA$testAccuracy)
}
df_time <- as.data.frame(NULL)
df_time <- rbind(df_time, time_BP)
df_time <- rbind(df_time, time_GA)
df_time <- rbind(df_time, time_EA)
df_time <- cbind(c("bp", "ga", "ea"), df_time)
colnames(df_time) <- c("id", "test1", "test2", "test3", "test4", "test5")
df_time
##   id  test1 test2 test3 test4 test5
## 1 bp   0.09  0.09  0.08  0.08  0.09
## 2 ga  22.61 24.75 21.50 18.51 26.19
## 3 ea 105.39 94.53 79.65 96.35 87.92

df_acc <- as.data.frame(NULL)
df_acc <- rbind(df_acc, acc_BP)
df_acc <- rbind(df_acc, acc_GA)
df_acc <- rbind(df_acc, acc_EA)
df_acc <- cbind(c("bp", "ga", "ea"), df_acc)
colnames(df_acc) <- c("id", "test1", "test2", "test3", "test4", "test5")
df_acc
##   id     test1     test2     test3     test4     test5
## 1 bp 0.9133333 0.9200000 0.9000000 0.9133333 0.9066667
## 2 ga 0.9400000 0.9466667 0.9400000 0.8800000 0.8933333
## 3 ea 0.8933333 0.9066667 0.6666667 0.7600000 0.8866667

The following function, given a dataframe like the ones we have just computed, draws the Nemenyi [5] critical-difference plot.

CDdiagram <- function(df, plot_title) {
  id <- c("bp", "ga", "ea")
  #compute ranks
  for(i in 2:ncol(df)) {       # for-loop over columns
    df[ , i] <- rank(df[, i], ties.method = "average")
  }
  
  #compute mean
  mean_rank <- rowMeans(df[2:ncol(df)])
  
  #compute critical difference
  k = nrow(df) #k different models to compare
  N = ncol(df)-1 #N results for each model
  qa = 2.343 #q alpha nemenyi value for k=3
  
  CD = qa*sqrt((k*(k+1))/(6*N))
  
  #cd-diagram as a ggplot
  data <- data.frame(
    training_method=id,
    rank=mean_rank,
    sd=rep(CD, 3)
  )
  diagram <- ggplot() + 
      geom_errorbar(data=data, aes(x=training_method, ymin=rank-(sd/2), ymax=rank+(sd/2)), width=0.2, color='blue', size=0.8) +
      geom_point(data=data, mapping=aes(x=training_method, y=rank), size=5, shape=21, fill="white") + ggtitle(plot_title)
  
  return(diagram)
}
CDdiagram(df_time, "Time")

CDdiagram(df_acc, "Test accuracy")

As we can see in the previous plots, the BP method clearly comes first in terms of time, which is something we already observed in the previous sections. The fact that its mean rank is exactly 1 tells us that, in the 5 tests we ran, this method was always the fastest. On the other hand, EA was always the most computationally expensive, with a mean rank of 3.

Since the BP and EA intervals do NOT overlap, we conclude that the mean ranks of these two methods are significantly different in terms of time. The GA interval, however, overlaps both, so GA is not significantly different from either.

Turning to accuracy, EA scored worst in all cases. Its interval overlaps with BP's but not with GA's, so EA and GA are significantly different accuracy-wise. BP and GA, on the other hand, show almost the same performance, with GA slightly better.

As a conclusion to this test, we see that, at least under the conditions and data we have worked with, EA does not give good results: it is the slowest method and the one with the lowest test accuracy. GA is slightly better than BP in terms of accuracy, but given the time it requires, BP would be the preferable option in most cases.

10 Conclusions

In conclusion, our study showed that the performance of back-propagation, genetic algorithms and evolutionary algorithms can vary significantly in terms of accuracy, computational time and number of hidden units. While BP was consistently the fastest method, EA was generally the slowest. We also observed that GA and EA tended to require more hidden units as the standard deviation of the data increased, indicating that they may be better suited to more complex problems. However, it is difficult to determine which algorithm is best overall, as this ultimately depends on the nature of the problem at hand. In this particular case, GA proved to be a strong performer, especially given its ability to optimize hyperparameters such as the number of hidden units, while BP was the top choice when speed matters.

11 References

[1] Luca Scrucca. “Package ‘GA’ manual”. In: Cran R project (October 19, 2022). https://cran.r-project.org/web/packages/GA/GA.pdf (accessed: 22.12.2022).

[2] Jakob Bossek. “Package ‘cmaesr’ manual”. In: Cran R project (October 12, 2022). https://cran.r-project.org/web/packages/cmaesr/cmaesr.pdf (accessed: 22.12.2022).

[3] David J. Montana and Lawrence Davis. “Training Feedforward Neural Networks Using Genetic Algorithms”. In: Proceedings of IJCAI-89 (1989). https://www.ijcai.org/Proceedings/89-1/Papers/122.pdf (accessed: 26.12.2022).

[4] Alejandro Baldominos, Yago Saez, Pedro Isasi. “Evolutionary convolutional neural networks: An application to handwriting recognition”. In: Neurocomputing, 283, pp. 38-52 (2018). https://www.sciencedirect.com/science/article/abs/pii/S0925231217319112?via%3Dihub (accessed: 27.12.2022).

[5] Janez Demšar. “Statistical Comparisons of Classifiers over Multiple Data Sets”. In: Journal of Machine Learning Research 7, pp. 1-30 (2006). https://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf (accessed: 27.12.2022).