Beavers Who Are Busy

Some Interesting Topics in Data Science and Daily Puzzling


Numerical Stability: when is a statistical algorithm safe?

When building statistical algorithms, many factors can make a computation unstable: if it is not carried out exactly (and with digital computation we almost never can compute exactly), its result may diverge from the true value in very strange ways.

To understand what makes an algorithm stable or unstable, it helps to first learn the idea of numerical significance in measurement.

The old analogy goes like this: if you measure a 6.457-inch pencil using a ruler marked only in whole inches (no marks at the halves or quarters), you can tell that it is between 6 and 7 inches long, and you might be able to tell that it is close to 6.5 inches. But any claim of higher specificity would be a guess on top of a guess, so we don't treat it as valid. With reasonably good eyeballing, the error of such a reading stays on the order of a tenth of an inch.

So in the same way that we suspend some belief when reading a measurement, we also have to be careful when converting numbers between storage formats.

Because computers store numbers in binary scientific notation (floating point), with a fixed number of significant digits, the absolute rounding error grows with the magnitude of the stored value.

This means we lose a lot of absolute precision when storing big numbers, but it matters little for relative size, since the amount lost by a stored value is approximately proportional to its magnitude.
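A quick sketch in Python makes this concrete (using the standard library's `math.ulp`, available since Python 3.9, which gives the gap between a float and the next representable float): the absolute spacing between adjacent doubles grows with magnitude, but as a fraction of the value it stays around machine precision.

```python
import math

# Near 1.0, adjacent doubles are about 2.2e-16 apart.
print(math.ulp(1.0))       # 2.220446049250313e-16

# Near 1e16, adjacent doubles are 2 apart,
# so adding 1 changes nothing at all.
print(math.ulp(1e16))      # 2.0
print(1e16 + 1.0 == 1e16)  # True

# But the *relative* spacing is about the same at both scales.
print(math.ulp(1.0) / 1.0)    # ~2.2e-16
print(math.ulp(1e16) / 1e16)  # ~2e-16
```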

One place where it might actually become a problem, though, is when we start to try to do operations on said large values.

Say, for example, that we have a=12.51 and b=12.49 as our real values, which get rounded to 13 and 12 when stored. If we want to know the difference, a-b, using our stored values gives us 1, even though the real value is 0.02: our answer is fifty times too large, a relative error of 4900%.
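This example is easy to reproduce in Python, with `round()` standing in for the coarse storage format:

```python
a, b = 12.51, 12.49

# "Store" each value with only whole-number precision.
stored_a, stored_b = round(a), round(b)   # 13 and 12

true_diff = a - b                    # ~0.02
stored_diff = stored_a - stored_b    # 1

# The per-value rounding errors (up to 0.5 each) dwarf the
# tiny true difference, so the stored result is wildly off.
print(true_diff, stored_diff)
```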

This is what is known as catastrophic cancellation, and it is a huge player in the production of numerically unstable algorithms.

One example that we can work through by hand is the computation of x^2 - y^2.

If we proceed naively, computing x^2 and y^2 will each introduce some roundoff error. When x and y are close, the subtraction can then cause catastrophic cancellation, leading to huge relative error.

However, if we factor first into (x+y)(x-y), the subtraction happens on the exact inputs rather than on rounded squares, and the error-introducing multiplication is saved for last. Since the relative error produced by a single multiplication is bounded by machine precision, regardless of the result's size, this version of the algorithm is numerically stable.
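Here is a minimal sketch of the two orderings in Python, with inputs chosen to be exactly representable in double precision so that all of the error comes from the arithmetic itself rather than from storing the inputs:

```python
x = 1.0 + 2**-30   # exactly representable in a double
y = 1.0

exact = 2**-29 + 2**-60          # true value of x**2 - y**2

naive = x*x - y*y                # square first, subtract last
factored = (x + y) * (x - y)     # subtract first, multiply last

# Squaring x rounds away the 2**-60 term, and the subtraction
# then amplifies that tiny loss into a large relative error.
print(abs(naive - exact) / exact)     # ~4.7e-10, far above machine eps

# The factored form happens to be exact for these inputs.
print(abs(factored - exact) / exact)  # 0.0
```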

One reason why this idea of numerical stability is historically important to the fields of statistics and data science is that it was surprisingly hard to come up with a stable algorithm for computing the linear correlation of two variables. It is important, though, to try to understand through illustrative examples: catastrophic cancellation was one of the reasons the problem was difficult to solve, and I hope that this post has brought you one step closer to investigating and understanding correlation.
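To make the correlation story concrete, here is a hedged sketch (not the historical algorithm, just an illustration of the same failure mode; the function names are my own): the textbook one-pass "computational" formula for the Pearson correlation subtracts two huge, nearly equal sums, while a two-pass version centers the data first and sidesteps the cancellation. The data below, with a mean that is large compared to its spread, is chosen to trigger the problem.

```python
import math

def corr_one_pass(xs, ys):
    """Textbook one-pass formula: subtracts nearly equal huge sums,
    so it cancels catastrophically when the mean dwarfs the spread."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

def corr_two_pass(xs, ys):
    """Center the data first; every subtracted quantity stays small."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den

# ys is an exact linear function of xs, so the true correlation is 1.
xs = [10_000_000.0 + i + 0.1 for i in range(100)]
ys = [2.0 * x + 5.0 for x in xs]

print(corr_one_pass(xs, ys))   # polluted by cancellation
print(corr_two_pass(xs, ys))   # essentially exactly 1
```

The two functions compute the same quantity on paper; only the order of operations differs, which is the whole point of numerical stability.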




About Me

I’m a person passionate about science, math, and their applications. I think it’s important to try to write about what you learn, in some form or another, so I’ll be doing that here. Hopefully, some of you reading my posts will find them to be just as helpful and interesting as they were for me!
