Essays/Dendrite/Applications

From J Wiki
Jump to navigation Jump to search

Excluding Outliers

Excluding Outliers is a means of identifying "error" observations from a list of sample results based on the fact that the majority of "correct" results should be clustered closely together, whereas errors are scattered at far distances.

For details and traditional statistical treatment in regression analysis and quantative methods such as standard deviation circle, relative frequencies and confidence ellipse, see Excluding Outliers at Statsoft.

Using Dendrite

We are going to use cluster analysis dendrite method here. The input is processed to produce a list of pairs forming the tree in distance order. We then designate a cut-off point and obtain the correct points as first in this list before the cut-off. The outlying points will be those after the cut-off.

As an example we use a set of (x,y) coordinates of a sought feature on a 2D image as described in Björn Helgason's Forum message,

#!CSV ,
Point,0,1,2,3,4,5
X,328,127,159,328,327,328
Y,364,478,1,363,365,362

The problem there was:

> Lets say I use different methods to find the point and the correct answer
> should be    328 363
>
> Even if the majority are correct or close to it the mean may not be a good
> answer
>   a=.328 364;127 478;159 1;328 363;327 365;328 362
>   6%~+/>a
>266.167 322.167   NB. correct answer should be 328 363

We run this set through Dendrite jobs:
   load'stats/dendrite/jobs'

Creating a Dendrite Job; entering the input data; Run-ing the job and opening all the Results views, we obtain the following picture, switching to neato Program graph option:

Image-dendrite.png

icecalc produces the summary of dendrite analysis (locale 3 is obtained from costate for object pgraphview).

   coinsert 'pdendrite'

   DATA_3_
328 364
127 478
159   1
328 363
327 365
328 362
   icecalc DATA_3_
+---+-----------+-------+
|0 3|0 3|5|4|1|2|      1|
|3 5|0 3 5|4|1|2|      1|
|0 4|0 3 5 4|1|2|1.41421|
|1 4|0 3 5 4 1|2|229.715|
|2 5|0 3 5 4 1 2|  398.6|
+---+-----------+-------+

So we only have to determine the good points based on the cut-off value 10 and the last column as those in indices from the fist column.

   s=. 10 ((> ,@>@{:) ~.@,@# >@{.@]) icecalc DATA_3_
   (+/ % #) s{DATA_3_
327.75 363.5

Using Traditional Cluster Analysis

Another way to use cluster analysis to find outliers is suggested here. The script clusterbasic.ijs is taken from the jwiki User:Brian Schott/code/cluster. In private conversations with Oleg I have been told that using dendrite analysis is much like using the "nearest neighbor" or "single linkage" clustering criterion. A common feature of most hierarchical clustering methods is that outlier observations often find clusters to combine with very late in the hierarchy. In this case observation 159 1 = 2{a is the most outlying, and observation 127 478 = 1{a is also quite separate as shown by Oleg's 2D scatter plot.

   ]a=.>328 364;127 478;159 1;328 363;327 365;328 362
328 364
127 478
159   1
328 363
327 365
328 362

Each row of the next output represent a step in the cluster hierarchy: the first row step 1 and shows 5 clusters; the second row is step 2 and shows 4 clusters; the bottom row shows only 1 cluster. Notice that the cluster with index 2 resists combining with other clusters until there is only 1 cluster.

   SL cluster a  NB. Single Linkage (Nearest Neighbor)
+-----------+-+-+-+-+
|0 3        |1|2|4|5|
+-----------+-+-+-+-+
|0 3 5      |1|2|4| |
+-----------+-+-+-+-+
|0 3 5 4    |1|2| | |
+-----------+-+-+-+-+
|0 3 5 4 1  |2| | | |
+-----------+-+-+-+-+
|0 3 5 4 1 2| | | | |
+-----------+-+-+-+-+
   r     NB. global noun r contains last cluster results
+-----------+-+-+-+-+
|0 3        |1|2|4|5|
+-----------+-+-+-+-+
|0 3 5      |1|2|4| |
+-----------+-+-+-+-+
|0 3 5 4    |1|2| | |
+-----------+-+-+-+-+
|0 3 5 4 1  |2| | | |
+-----------+-+-+-+-+
|0 3 5 4 1 2| | | | |
+-----------+-+-+-+-+

One of the most common methods used for determining the correct number of clusters in hierarchical clustering is to observe the percentage change in the index of determination R^2 which is calculated as R2. In this case the greatest percentage change in R2 is from 3.18924e_5 to 0.237027 as documented below by 743107% . This suggests there should be 3 clusters, where two of the three are the singletons 2{a and 1{a .

   R2    NB. report R^2 values for each step
2.77325e_6 1.1093e_5 3.18924e_5 0.237027 1
   change =: 2-~/\]
   change R2
8.31976e_6 2.07994e_5 0.236995 0.762973
   centchange =: 100*change%}:
   centchange R2
300 187.5 743107 321.894

The averages 327.75 363.5 in (<2 0){mean@({&a) each r below reflect the values reported by Oleg.

   mean =: +/ % #
   mean@({&a) each r NB. cell (<2 0) equals Olegs result
+---------------+-------+-----+-------+-------+
|328 363.5      |127 478|159 1|327 365|328 362|
+---------------+-------+-----+-------+-------+
|328 363        |127 478|159 1|327 365|0 0    |
+---------------+-------+-----+-------+-------+
|327.75 363.5   |127 478|159 1|0 0    |0 0    |
+---------------+-------+-----+-------+-------+
|287.6 386.4    |159 1  |0 0  |0 0    |0 0    |
+---------------+-------+-----+-------+-------+
|266.167 322.167|0 0    |0 0  |0 0    |0 0    |
+---------------+-------+-----+-------+-------+