I am currently trying to implement a machine learning algorithm that involves the logistic loss function in MATLAB. Unfortunately, I am having some trouble due to numerical overflow.
In general, for a given input s, the value of the logistic loss function is:
log(1 + exp(s))
and the slope of the logistic loss function is:
exp(s)./(1 + exp(s)) = 1./(1 + exp(-s))
In my algorithm, s = X*beta, where X is a matrix with N data points and P features per data point (i.e. size(X) = [N P]) and beta is a vector of P coefficients, one per feature, such that size(beta) = [P 1].
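For concreteness, here is a minimal sketch of the setup (the dimensions and random data are purely illustrative):

N = 5;              % number of data points (illustrative)
P = 3;              % number of features (illustrative)
X = randn(N, P);    % design matrix, size [N P]
beta = randn(P, 1); % coefficient vector, size [P 1]
s = X*beta;         % linear scores, size [N 1]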
I am specifically interested in calculating the average value and gradient of the logistic loss for a given value of beta.
The average value of the logistic loss for a given beta is:
L = 1/N * sum(log(1+exp(X*beta)),1)
The gradient of L with respect to beta (the average of the slope weighted by the features) is:
dL = 1/N * ((exp(X*beta)./(1 + exp(X*beta)))' * X)'
Note that size(dL) = [P 1].
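Putting this together, my current (naive, vectorized) implementation looks roughly like the sketch below; logistic_objective is just an illustrative name:

% Naive implementation of the average logistic loss and its gradient.
% Overflows when entries of X*beta are large and positive.
function [L, dL] = logistic_objective(beta, X)
    N = size(X, 1);
    s = X*beta;                                 % linear scores, size [N 1]
    L  = 1/N * sum(log(1 + exp(s)), 1);         % average loss (scalar)
    dL = 1/N * ((exp(s)./(1 + exp(s)))' * X)';  % gradient w.r.t. beta, size [P 1]
end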
My issue is that these expressions keep producing numerical overflows. The problem effectively comes from the fact that, in double precision, exp(s) overflows to Inf once s exceeds roughly 709 and underflows to 0 once s drops below roughly -745, so for large positive s the value becomes Inf and the slope becomes NaN (Inf/Inf).
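For example, in double precision:

exp(710)                  % Inf  (overflow; the limit is about exp(709.78))
log(1 + exp(710))         % Inf  -- the loss blows up
exp(710)./(1 + exp(710))  % NaN  -- Inf/Inf, the slope is corrupted
1./(1 + exp(-710))        % 1    -- this form of the slope is fine for large positive s
exp(-800)                 % 0    (underflow)
log(1 + exp(-800))        % 0    -- the true value, roughly exp(-800), is lost to underflow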
I am looking for a solution such that s can take on any value in floating point arithmetic. Ideally, I would also really appreciate a solution that allows me to evaluate the value and gradient in a vectorized / efficient way.