User:Brian Schott/code/nonhier

From J Wiki
Jump to navigation Jump to search

Statistical cluster analysis can be done on data sets using a nonhierarchical approach in which the user designates n data items as cluster seeds, and then the clustering program iteratively revises the cluster seed values by finding the center of each cluster for which the points around each seed are the closest to their seed. The clusters' centers at each iteration becomes their new seed. Typically the number of iterations is very small. The simple method described is often referred to as kmeans and has the advantage over hierarchical cluster methods that very large sample sizes can be accommodated.

To help find starting seeds which likely partition the observations well, another iterative algorithm can be constructed to find starting seeds. The algorithm shown here is similar to the one used in SAS, but is different. The SAS algorithm always assumes the first n observations are the preseeds and considers each next observation in the data set as a candidate to replace an existing seed if that candidate is farther away from all of the existing seeds than the closest two existing seeds are from one another.

The algorithm here iteratively considers all the observations at every iteration, until no viable replacement candidate exists. At each iteration all of the nonseed observations are candidates for replacing one of the two existing closest seeds. For there to exist a viable replacement, at least one of the nonseed observations must be further away from all of the existing seeds than the closest two existing seeds are from one another. If that condition exists, then the replacement seed is the nonseed that is furthest from all of the existing seeds. This algorithm as implemented here has not been found in the literature, but seems to be similar to the tradtional one and in my opinion better suits J, and seems to produce decent results in a small amount of testing. I would be happy to get suggestions for implementing the traditional algorithm.

The methodology employed here is very basic and is inspired by the statistical analysis package developed by SAS in North Carolina. However, all errors are completely mine. An (extremely simple) example data set is included in the listing: sharma is data taken from a multivariate stats text by Sharma. A slightly more substantial, well-behaved and widely studied data set is also provided here: Fisher's iris data. The rightmost column of the iris data is the known cluster ID for the observation, and is not used in the cluster analysis, but is used for validation of results.

I have little experience with nonhierarchical clustering. Please let me know if you have any recommendations for this material.


download a script: File:Nonhiercluster.ijs

First some utility verbs are defined.

mean =: +/ % #
length =: +/&.:*:
dist=:length@:-"1
sut =: </~@i. NB. strictly upper triangular matrix
isut =: I.@,@sut NB. indices of sut matrix
CP=: {@(,&<)
seedpairs =: isut@# {&, CP/~
iomin =: i.<./  NB. index of (IO) minimum
iomax =: i.>./  NB. index of (IO) maximum
newseedQ =: *./@:<  NB. x<y everywhere?

Two data sets are provided.

sharma =: _2]\5 5 6 6 15 14 16 15 25 20 30 19

iris =: 150 5 $; ". each <;._2  noun define
   50 33 14 02 1 64 28 56 22 3 65 28 46 15 2 67 31 56 24 3
   63 28 51 15 3 46 34 14 03 1 69 31 51 23 3 62 22 45 15 2
   59 32 48 18 2 46 36 10 02 1 61 30 46 14 2 60 27 51 16 2
   65 30 52 20 3 56 25 39 11 2 65 30 55 18 3 58 27 51 19 3
   68 32 59 23 3 51 33 17 05 1 57 28 45 13 2 62 34 54 23 3
   77 38 67 22 3 63 33 47 16 2 67 33 57 25 3 76 30 66 21 3
   49 25 45 17 3 55 35 13 02 1 67 30 52 23 3 70 32 47 14 2
   64 32 45 15 2 61 28 40 13 2 48 31 16 02 1 59 30 51 18 3
   55 24 38 11 2 63 25 50 19 3 64 32 53 23 3 52 34 14 02 1
   49 36 14 01 1 54 30 45 15 2 79 38 64 20 3 44 32 13 02 1
   67 33 57 21 3 50 35 16 06 1 58 26 40 12 2 44 30 13 02 1
   77 28 67 20 3 63 27 49 18 3 47 32 16 02 1 55 26 44 12 2
   50 23 33 10 2 72 32 60 18 3 48 30 14 03 1 51 38 16 02 1
   61 30 49 18 3 48 34 19 02 1 50 30 16 02 1 50 32 12 02 1
   61 26 56 14 3 64 28 56 21 3 43 30 11 01 1 58 40 12 02 1
   51 38 19 04 1 67 31 44 14 2 62 28 48 18 3 49 30 14 02 1
   51 35 14 02 1 56 30 45 15 2 58 27 41 10 2 50 34 16 04 1
   46 32 14 02 1 60 29 45 15 2 57 26 35 10 2 57 44 15 04 1
   50 36 14 02 1 77 30 61 23 3 63 34 56 24 3 58 27 51 19 3
   57 29 42 13 2 72 30 58 16 3 54 34 15 04 1 52 41 15 01 1
   71 30 59 21 3 64 31 55 18 3 60 30 48 18 3 63 29 56 18 3
   49 24 33 10 2 56 27 42 13 2 57 30 42 12 2 55 42 14 02 1
   49 31 15 02 1 77 26 69 23 3 60 22 50 15 3 54 39 17 04 1
   66 29 46 13 2 52 27 39 14 2 60 34 45 16 2 50 34 15 02 1
   44 29 14 02 1 50 20 35 10 2 55 24 37 10 2 58 27 39 12 2
   47 32 13 02 1 46 31 15 02 1 69 32 57 23 3 62 29 43 13 2
   74 28 61 19 3 59 30 42 15 2 51 34 15 02 1 50 35 13 03 1
   56 28 49 20 3 60 22 40 10 2 73 29 63 18 3 67 25 58 18 3
   49 31 15 01 1 67 31 47 15 2 63 23 44 13 2 54 37 15 02 1
   56 30 41 13 2 63 25 49 15 2 61 28 47 12 2 64 29 43 13 2
   51 25 30 11 2 57 28 41 13 2 65 30 58 22 3 69 31 54 21 3
   54 39 13 04 1 51 35 14 03 1 72 36 61 25 3 65 32 51 20 3
   61 29 47 14 2 56 29 36 13 2 69 31 49 15 2 64 27 53 19 3
   68 30 55 21 3 55 25 40 13 2 48 34 16 02 1 48 30 14 01 1
   45 23 13 03 1 57 25 50 20 3 57 38 17 03 1 51 38 15 03 1
   55 23 40 13 2 66 30 44 14 2 68 28 48 14 2 54 34 17 02 1
   51 37 15 04 1 52 35 15 02 1 58 28 51 24 3 67 30 50 17 2
   63 33 60 25 3 53 37 15 02 1
)

The verb reviseseeds is presented first and you can experiment by trying examples like the following for which the y argument is the indices of the planned seeds.

seedrevisiondemo =: noun define
sharma reviseseeds 0 1 2
sharma reviseseeds 5 4 3 2
(}:"1 iris) reviseseeds 35 41 70
35 41 70 {iris
35 89 70 {iris
(}:"1 iris) reviseseeds 0 1 2
 0  1  2 {iris
 0  1 84 {iris
)

reviseseeds =: dyad define
data =. x
seeds =. y
notseeds =. seeds -.~ i.@#data
minseedpair =: isut@# iomin@:{&,dist"1 2 &({&data)/~
closeseeds =. (minseedpair pick seedpairs)seeds

NB. distances between seeds and nonseeds (objects)
s2o =. seeds dist"1 _ &({&data) notseeds

closepairdist =. dist &({&data)/closeseeds  NB. distance between closest seeds

while. +./ closepairdist newseedQ"0 1 |:s2o   NB. if this is true ...
do.
NB. then newseed is the object with the greatest minimum dist to another seed
newseed =. notseeds{~ iomax"0 1 <./s2o
oldseed =. closeseeds {~ iomin newseed dist"1 _&({&data) closeseeds
iooldseed =. seeds i. oldseed
smoutput 'revision: ',": seeds =. newseed iooldseed}seeds
notseeds =. seeds -.~ i.@#data
closeseeds =. (minseedpair pick seedpairs)seeds

NB. distances between seeds and nonseeds (objects)
s2o =. seeds dist"1 _ &({&data) notseeds

closepairdist =. dist &({&data)/closeseeds  NB. distance between closest seeds
end.
seeds
)

Nonhierarchical clustering is very simple in J as seen below.

demo =: noun define
irs =: }:"1 iris
irs&([mean/.~ [:iomin"1 dist"1 _)^:(i. 5)irs{~ irs reviseseeds 35 41 70
irs&([mean/.~ [:iomin"1 dist"1 _)^:(i. 5)irs{~ irs reviseseeds 0 1 2
)

Contributed by Brian Schott <<DateTime>>