Symbolic Regression With Genetic Programming

In Machine Learning, Regression Analysis is a process for modeling the relationship between variables. There are many types of regression such as Linear regression, Logistic regression, Quantile regression, Ridge regression, and more. Each type conforms to a specific model such as linear v.s. logistic. The biggest part of using regression analysis is knowing what model to use. What if our machine learning technique could figure that out for us automatically?

Enter Genetic Programming.

With Genetic Programming, we can use Symbolic regression, which is a regression without a model. It will discover the model from the data during the analysis. Compared to symbolic regression, the other classic regression techniques are more like optimizations since they are just trying to find the best coefficients to the variables that are predetermined.

How Does Symbolic Regression Work?

Using Genetic Programming, we first determine what mathematical functions we want to use, (e.g. +, -, *, /, sin(), cos(), ln(), etc.). Then, we determine the terminals, which are our variables and constants, (e.g. x, e, PI, {1..5}). These serve as our building blocks for constructing the GP parse trees.  The Genetic Programming algorithm will discover the function that best fits the data using an error function such as root mean square as part of the fitness function.

When I was first teaching myself Genetic Programming, the Symbolic regression problem is what really attracted me the most and got me to understand how GP worked. Koza’s book, “Genetic Programming: On The Programming of Computers by Means of Natural Selection” was a great first book and is widely referenced on this subject.

Symbolic Regression in Action

Let’s look at an example of symbolic regression. I’m going to use my Genetic Programming library and I already have a symbolic regression problem app I created on top of it.

The function we’ll try to solve for will be the one from Koza’s first book on GP:

x^4 + x^3 + x^2 + x

For the parameters I’ll use the following:

  • Function Set: +,-,*,/,sin,cos,ln
  • Terminal Set: X
  • Generations: 150
  • Population Size: 500
  • Crossover rate: 90%
  • Mutation rate: 10%
  • Reproduction rate: 10%
  • Fitness cases: 20 data points from the target function, evenly sampled from the interval [-1,1].

Here’s a log output of the first generation:

New Best Individual. Gen (0) :
add (add (mul (X, mul (X, X)), X), div (Sin (sub (X, X)), div (Ln (X), div (X, X)
Fitness:(Hits:Trials) : (4:20), Adjusted Fitness: 0.663302986469959, Standard 
Fitness: 0.50760666

This actually simplifies to: (X^3)+X

At generation 39, we get our best of run individual, hitting all 20 fitness cases with a very small overall error.

Best Individual. Gen (39) :
add (add (add (mul (Sin (Sin (Ln (X))), mul (X, X)), X), div (Sin (sub (X, X)),
div (X, sub (mul (X, mul (Sin (X), mul (X, X))), div (div (add (Sin (X), mul (X,
 Exp (X))), div (Cos (X), Exp (X))), Sin (Cos (Exp (X)))))))), mul (X, mul (add
(mul (Sin (Ln (X)), mul (X, X)), X), X)))
Fitness:(Hits:Trials) : (20:20), Adjusted Fitness: 0.999934421216811, Standard 
Fitness: 6.55830840476038E-05

Is this a perfect solution? No, but it’s pretty close, and we didn’t tell the algorithm what the function looked liked beforehand. We fed it only some sample data of the function output, and it determined on its own a function that matches that data, approximately.

If we ran the algorithm a few times, it is possible for it to find the 100% correct solution, but since the algorithm uses randomness, the ideal solution is never guaranteed (and that’s OK!).

What if we had data where there was no source function? Having an approximate solution at 99%+ correctness sounds great.