NYCJUG/2011-10-11/SmoothOperatorsOverview

Smooth Operators

A useful technique for dealing with ordered data, like a time series, is to smooth a jagged curve. One simple way to do this is by repeated local averaging, i.e. a moving average. In the example shown below, we first generate a simple sine wave, then add random noise to it to give us a jagged curve. We then apply a few different variations of a general “smoother” to recover an approximation of the general structure underlying the jagged curve. Below we see the result of plotting the original curve in blue, the jagged version of it in red, and three variations of our smoothing operator in the other three colors.

SmoothedExample.png
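
For concreteness, here is a minimal J sketch of this setup; the names, the sampling, and the noise level are illustrative assumptions rather than the code behind the plot above:

   mean=: +/ % #                      NB. arithmetic mean
   orig=: 1 o. 0.1 * i. 100           NB. a simple sine wave (1 o. y is sine y)
   jagged=: orig + 0.5 -~ ? 100 # 0   NB. add uniform noise in [-0.5,0.5)
   sm3=: 3 mean\ jagged               NB. three-item moving average (green)
   sm5=: 5 mean\ jagged               NB. five-item moving average
   sm7=: 7 mean\ jagged               NB. seven-item moving average

Note that an n-item moving average (mean\ with a left argument of n) is n-1 items shorter than its input, so the smoothed curves have to be shifted or trimmed to line up with the original when plotting.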

We cannot expect to replicate the original curve exactly, as we’ve lost information by adding randomness to it. However, we can see that each of the three smoothers provides a reasonable approximation to it. The green curve, which uses a three-item average, perhaps over-fits the noisy curve, but the other two variants, using five- and seven-item averages, give smoother results which better approximate the original. There are any number of variations on this technique but we’ll not consider those in favor of the broader question “What use is this?”

An Example Use

Say we have some timing data on runs of a model on different input files, arranged in order of increasing file size. We see from the graph that lines/second increases for larger files, though it appears to approach an asymptote. If we have repeated timings on the same set of files, so we can average out noise, we’d like to identify the files that are persistently faster or slower than they ought to be.

The trick here lies in determining what the speed ought to be, since the expected value varies according to the number of lines per file, as seen in the graph below. It would be useful to have some sort of local average of speeds to provide a basis for comparison, i.e. we need a smoother line than the one based on the actual measurements.
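
In J, such a local average is the same moving-average idiom used above. Here is a sketch assuming the measured speeds are in a list named speed, ordered by file size; the name and the five-item window width are assumptions for illustration:

   mean=: +/ % #                NB. arithmetic mean
   baseline=: 5 mean\ speed     NB. local average of speeds, five-item window
   actual=: 2 }. _2 }. speed    NB. trim two items from each end of speed
                                NB. to center-align it with baseline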

Once we have this smoothed line, we can identify the most significant outliers above or below it to tell us which instances are anomalously slow or fast. Having the smoothed version of the curve as a basis allows us not only to identify outliers but also to rank them by how anomalous each is compared to its peer group of timings. This is useful as it allows us to focus on the ones with the greatest differences, where we expect to get the most useful information.
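
Given the smoothed baseline, ranking the outliers amounts to sorting the deviations by magnitude. A sketch, continuing with the assumed names from above:

   dev=: actual - baseline      NB. measured minus expected lines/second
   rank=: \: | dev              NB. indices ordered most anomalous first
   10 {. rank                   NB. the ten most anomalous timings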

We want to consider low outliers (slower) separately from high (faster) ones. Let’s plot all the results together to make sure they look as we expect. We’re also excluding the first two points, since they tend to show up as outliers only because of the initial steep slope of the curve.
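
Separating the two groups just means sorting the signed deviations in each direction; since dev is measured minus expected lines/second, negative values are the slow runs. Again a sketch with the assumed names, where the resulting indices are relative to the trimmed list:

   dev=: 2 }. dev               NB. exclude the first two points
   slowest=: 10 {. /: dev       NB. ten most negative deviations (red dots)
   fastest=: 10 {. \: dev       NB. ten most positive deviations (blue dots)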

TwoSmoothersAndSomeOutliers-Labeled.png

We see the ten slowest and fastest outliers marked by red and blue dots, respectively. This is potentially useful information for optimizing the performance of this model.

Possible Extension: Linear Interpolation

One interesting variation on this idea of smoothing is to apply a different underlying function than the “mean” we’ve used above. One such variant uses a simple linear interpolation on groups of points rather than taking their average. Below we see an example of this applied to our timing data, though in this case we’ve applied the linear interpolation to the smoothed curve rather than to the original one.

ComparisonOfSmootherAndLinearInterpolator.png
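
Here is a sketch of how such a piecewise-linear variant might be written in J, fitting a least-squares line to successive non-overlapping groups of points; the 20-point group width and the name smoothed (the smoothed timing curve) are assumptions, and since dyadic %. gives the least-squares solution of an overdetermined system, this fits a line to each group rather than strictly interpolating:

   mp=: +/ . *                      NB. matrix product
   lfit=: ] %. 1 ,. i.@#            NB. least-squares intercept,slope vs 0 1 2 ...
   lseg=: (1 ,. i.@#) mp lfit       NB. fitted straight-line values for one group
   approx=: ; _20 <@lseg\ smoothed  NB. piecewise-linear version, by groups of
                                    NB. 20; boxing handles a short last group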

One reason to do this is that a “curve” made of straight line segments provides a potentially useful simplification of a complex curve. This is more evident if we look at an interpolative smoother applied to some time series of prices, as shown here.

Px1990LinAprox20DayWdw 20 50pct.jpg

One use of this would be to classify the prices in a given time period as generally rising or falling, and to quantify the steepness of this rise or fall by the slope of the corresponding line segment, as seen in the chart above.
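
With a piecewise fit like the one sketched above, this classification is just the sign of each segment’s slope. Assuming the prices are in a list named price and reusing the hypothetical lfit verb from the previous sketch:

   coeffs=: _20 lfit\ price     NB. intercept,slope per 20-point segment
   slopes=: {:"1 coeffs         NB. the slope column
   trend=: * slopes             NB. signum: 1 rising, _1 falling, 0 flat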