NYCJUG/2011-10-11/SmoothOperators

From J Wiki
Jump to navigation Jump to search

Smooth Operators

Smooth a jagged curve by repeated averaging:

meanSmooth=: 3 : 0
   3 meanSmooth y
:
   assert. 1=2|x
   y=. (({.y)$~-:<:|x),y,({:y)$~-:<:|x  NB. Weight end-points to move less.
   if. 0<x do. x mean\y
   else. (|x)([: mean {. , {:)\y end.   NB. Neg. window averages only endpoints.
)

[Here's a question: looking at the line which weights the end-points, is the following tacit version of that line any better?]

   y=. (-:<:|x)([|.] ,~ [# [: ({:,{.)])y NB. Weight end-points to move less.

Test by noising up a sine wave:

   pp=. (1 o. i:5j99)++/<:+:3 100?@$0
   'title Smoothing Example;pensize 2;keystyle fat;key Original "Noised Original" "3&meanSmooth^:10" "5&meanSmooth^:10" "9&meanSmooth^:10";keypos rti' plot (1 o. i:5j99),pp,(3 meanSmooth^:10] pp),(5 meanSmooth^:10]pp),:9 meanSmooth^:10]pp

width="719",height="496"

An Example Use

Say we have some timing data on runs of a model on different input files arranged in order of increasing file size. We see from the graph that lines/second increases for larger files though it appears to be asymptotic. If we have repeated timings on the same set of files so we can average out noise, we’d like to identify those that are persistently faster or slower than they ought to be.

The trick here lies in determining what the speed ought to be since the expected value varies according to the number of lines per file as seen in the graph below. It would be useful to have some sort of local average of speeds to provide a basis for comparison, i.e. we need a smoother line than the one based on the actual measurements.

Once we have this smoothed line, we can identify the most significant outliers above or below this line to tell us which instances are anomalously slow or fast.

Here’s how we might do it for a vector of averaged lines/second called “base”.

   sm1=. 5&meanSmooth"1^:20 base    NB. Smoothed basis to identify outliers.
   ord=. (0 1)-.~\:|difs=. base-sm1 NB. Don’t consider first two points.
   'loo hio'=. (10{.ord#~_1=*ord{difs);10{.ord#~1=*ord{difs

We want to consider low outliers (faster) separately from high (slower) ones. Let’s plot all the results together to ensure it looks like we expect it to. We’re also excluding the first two points since they tend to be outliers only because of the initial steep slope of the curve.

showOutliers=: 3 : 0
   'base sm0 sm1 loo hio'=. y
   pd 'title Two Smoothers and Some Outliers;pensize 2;keystyle fat'
   pd ';key Base "3 Smoother^:10" "5 Smoother^:20"'
   pd base,sm0,:sm1 [ pd 'type line'
   pd 'type point;pensize 4'
   pd loo j. loo{base [ pd hio j. hio{base
   pd 'show'
)
   sm0=. 3&meanSmooth"1^:10 base
   showOutliers base;sm0;sm1;hio;loo

TwoSmoothersAndSomeOutliers-Labeled.png

Note that this graph has been post-processed to add axis labels, include the legend entries for the outliers, and to label the X-axis with the actual number of lines per file.

Possible Extension: Linear Interpolation

One interesting variation on this idea of smoothing is to apply a different underlying function than “mean” as we’ve done above. An interesting variant is to use a simple linear interpolation on groups of points rather than taking their average. Here we see one example of this applied to our timing data though, in this case, we’ve applied the linear interpolation to the smoothed curve rather than to the original one.

EndInterpolationEG.png

To do this, we need to generalize our “smoother” to be an adverb:

generalSmooth=: 1 : 0
   3 mean generalSmooth y
:
   assert. 1=2|x
   y=. (({.y)$~-:<:|x),y,({:y)$~-:<:|x       NB. Weight end-points to move less.
   ;,x u\y
)

The illustration here shows in detail how the linear interpolator works in practice.

Some of the verbs and adverbs we used in the preceding:

linterpEnds=: 3 : '(0{y)+((<:#y)%~-/_1 0{y)*i.#y'
gmSmooth=: 4 : '(#y){.x linterpEnds generalSmooth y'
lintSmooth=: 4 : '(0{x)gmSmooth (1{x)meanSmooth^:(2{x)"1 y'

The usefulness of this verb may be more apparent with a different kind of series as shown here.

Px1990LinAprox20DayWdw 20.png

Here we see how a piecewise linear overlay to the curves gives us a way to quantify and compare one aspect of them – whether they are rising are falling and the steepness of the change in a given period.