User:Brian Schott/code/cluster

From J Wiki
Jump to navigation Jump to search

Statistical cluster analysis can be done on small data sets using a hierarchical approach which starts with all observations as single clusters, and combining two "close" clusters iteratively at each step. The methodology employed here is very basic and is inspired by the statistical analysis package developed by SAS in North Carolina. However, all errors are completely mine. Only one (extremely simple) example data set is included in the listing: sharma is data taken from a multivariate stats text by Sharma. Nonhierarchical cluster analysis can handle large data sets, but I have not much experience with such. Please let me know if you have any recommendations for this material.

NB. clusterbasic.ijs
NB. 12/17/5
NB. revising clusoverall.ijs
NB. revising cluster.ijs
NB. also starting from scratch

NB. usage t =. <method> cluster <data>
NB. eg:   t =.     WM   cluster sharma
NB. then: t              show sharma    NB. shows cluster history
NB.   or:  sharma show~3{."1]_3{t       NB. shows history for 3 clusters
NB.  and:  R2                           NB. show Rsquared history
NB.  and:  nTies                        NB. show ties     history
NB.  and:  r                            NB. show cluster pointer history

NB. methods available:
NB. WM:Ward's
NB. CL:Complete linkage
NB. CM:Centroid
NB. AL:Average linkage
NB. SL:Single linkage (nearest neighbor)

show =: [ {&.> [: < ]    NB. to see cluster results starting at maxclus
maxclus =: 10
NB. sets the final number of clusters for which detailed statistics
NB. can be sent to an output file or printed

cluster =: dyad define
  init y
  method =. x
  dists =: dists__method currclus
  while. <:#currclus
  do.
    nTies =: nTies ,<:#ndx =. (locmins # i.@#) lfm dists
    curr =: >{:tmp =. ,tmp0 =. ndx&{"1 lfm"2 (<"0 ,: cp) dists
    if. 1&<#ndx       NB. optionally deal with ties
    do.
      dists2 =. (>{:tmp0) dist__method currclus
      ndx2 =. tmp0 FAR +/"1 dists2
      curr =: >{:tmp =. ndx2{"1 tmp0
    end.
    cluskey =: ;(>{: tmp) combine <"0 i. #clusptrs
    clusptrs =: cluskey <@;/. pclusptrs =. clusptrs
    currclus =:  cluskey <@;/.  pcurrclus =: currclus
    if. maxclus>:#currclus
    do.
      R =. Rs >each  currclus
      R2 =: R2,-.R
      r =: r, clusptrs
    end.
    dists =: curr distsch curr  dist__method pcurrclus
  end.
  r
)

init =: monad define
  clusptrs =: <"0 i. N =: #obs =. ;/y
  clusno =:  N # N
  pts =:  ({:"1 y) </. obs
  currclus =:  clusptrs </. obs
  nTies =: i. 0
  SS =: +/diag SSCP&:> obs
  R2 =: i. 0
  NB. allr_cluster_ =: i. 10 0
  r =: 0 0$a:
)

distsch =: dyad define
  t =. x
  new =. 0 t} y
  fix =. (_1{ t)} NB. this _1 must coordinate with combine
  dists =. new&fix &.|: new&fix dists
  drop =. (<<<0{ t)&{ NB. this 0 must coordinate with combine
  drop"1 drop dists
)

length =: sum &. (*:"_)
Edstnc =: length@:-
Edstnc =: ss@:- NB. more efficient
nrmlz =: ] % length
cells =: +/~ @:(#&>) NB. table of cluster sizes

NB. routines for strictly lower triangular matrices to lists and back

expand =: #^:_1
cslt =: &.(lfm :. mfl) NB. create slt from full matrix
slt =: >/~@i. NB. strictly lower triangular matrix
lfm =: slt@# #&, ]  NB. list from matrix
fsl =: <:@fms { los@>: NB. find string length needed
fms =: ndxle >:@i. 1: NB. find matrix size
los =: +/\@ i. NB. list of sizes
ndxle =: (<:los)~  NB. index <:  list of sizes
mfl =: (],])@fms@# $,@slt@fms@#expand fsl@#{.]  NB. slt from list

NB. routines for describing cluster homogeneity, etc.

indices =: ]each cslt@(CP&:(i.@#)/~)
spr =: (-`+)/@:(+/@ diag @SSCP&>)each cslt@(ap&:>each/~)
ns =:(,&#)each/~
ap =: <@,,;
diag =: (<0 1)&|:

NB. cluster distances

singlelinkage =: <./@,&>
completelinkage =: >./@,&>
averagelinkage =: avg@,&>
centroid =: ctr each@(> each)

NB. ways to deal with ties
FAR =: locmax@]
FAR =: locclose
FAR =: sas

min =: <.
max =: >.
locmins =: = min/
v_locmins =: ]<: >:@%&100@[ * min/@]
locmin =: i. min/
locmax =: i. max/
min_numb =: ({.,#)@(#~ locmins)
mins =: &v_locmins

sas =: dyad define
  t =. x( >@{:@[{<./&>@])clusptrs NB. min ID for each potential cluster record
  tt =. {.@:/:@:(/:~"1) t NB. index of smallest of sorted of each cluster's 2 IDs
)

locclose =: dyad define
  t =. x
  locmin y
)

incrQ =: 2&(</\)&(,&0)
frst0 =: i.&0

transpose =: |:
sel =:[: ; [ {.&:>each ([: - [) {each ]

NB. measuring changes in clustering coefficients

dif =: 2: -~/\]
pd =: dif % }:

NB. measuring sums of squares

ss =: sum @: *:
sum =: +/
sqrt =: %:
n =: #
avg =: sum%n
ctr  =: avg"2 NB. centroid
mnc =: ] -"1 ctr
mp =: sum . *
sscp =: transpose mp ]
transpose =: |:
SSCP =: sscp@mnc
df =: <: @: n
df =: +/@:(#&>)-1:  NB. added 2/4/2
dfw =: +/@:>@:(#each) - #
dfw =: +/@:(#&>) - #  NB. added 2/4/2
SSCPw =: sum@:>@:(SSCP each)
SSw =: SSCPw%dfw
SSt =: SSCPt%dfw   NB. added 2/4/2
SSb =: SSCPb%dfw   NB. added 2/4/2
SSCPt =: SSCP@;@,.
SSCPb =: SSCPt - SSCPw
Rs =: sum@:((<0 1)&|:@SSCPb)%sum@:((<0 1)&|:@SSCPt)
det =: -/ . *

CP =: {@(,&<)  NB. Catalog of dyadic values
cp =: CP&(i.@#)~  NB. Catalog of a array's indices
freqcount=: (\: {:"1)@(~. ,. #/.~)

combine =: (nit <@:# foc)`({:@[)`]}
foc =: {.@[{.@>@{] NB. first atom of object to be copied
nit =: ({:@[#@>@{]) NB. number of atoms in target object

stddev =: ss@mnc sqrt @ % df
variance =: ss@mnc % df
std =: mnc %"1 stddev
S =: SSCP % df
R =: SSCP@std % df

NB. load <path,'iris.data'
NB. load <path,'wine.data'
NB. load <path,'glass.data'
sharma =: transpose 2 6$5 6 15 16 25 30 5 6 14 15 20 19

NB. **************************************************
NB. **************************************************

coclass 'method'

create =: verb define
  0
)

destroy =: codestroy

cocurrent 'base'
WM =: conew 'method'
CL =: conew 'method'
CM =: conew 'method'
AL =: conew 'method'
SL =: conew 'method'

dists__WM =:  3 :'> -:@:-:@:Edstnc__  each/~ centroid__  y'  NB. **HALF** of distance squared

dist__WM =: dyad define
t=. x
Nj =. n__"2 &>@(> each) y
NkNl =.  t{Nj
Nkl =. +/|: NkNl
NjNkl =. NkNl +/ Nj
NjNm =. Nkl +/ Nj
DjDkl =. t{dists__
Dkl =. (<"1 t){dists__
((+/"2 NjNkl*DjDkl)-Nj*/~Dkl)%NjNm
)

dists__CL =:  3 :'completelinkage__ Edstnc__"1/&:> each/~y'

dist__CL =: dyad define
t=. x
>./"2 t{ dists__
)

dists__SL =:  3 :'singlelinkage__ Edstnc__"1/&:> each/~y'

dist__SL =: dyad define
t=. x
<./"2 t{ dists__
)

dists__CM =:  3 :'> Edstnc__ each/~ centroid__ y'

dist__CM =: dyad define
t=. x
Nj =. n__"2 &>@(> each) y
NkNl =.  t{Nj
Nkl =. +/|: NkNl
DjDkl =. t{dists__
Dkl =. (<"1 t){dists__
DklNkNl =. */Dkl ,|:NkNl
((+/"2 NkNl*DjDkl)% Nkl)- (DklNkNl%*:Nkl)
)

dists__AL =:  3 :'> Edstnc__"1 each/~ centroid__ y'

dist__AL =: dyad define
t=. x
Nj =. n__"2 &>@(> each) y
NkNl =.  t{Nj
Nkl =. +/|: NkNl
DjDkl =. t{dists__
Dkl =. (<"1 t){dists__
(+/"2 NkNl*DjDkl)% Nkl
)

Contributed by Brian Schott