NYCJUG/2014-08-05


Kaggle competition, Higgs boson, J conference 2014, k-NN, k-nearest-neighbors, machine learning


Meeting Agenda for NYCJUG 20140805

1. Beginner's regatta: starting the Kaggle competition for the Higgs Boson
challenge: see "Higgs Kaggle Basics".

2. Show-and-tell: discuss recent J conference.

3. Advanced topics: k-NN: k-Nearest Neighbors method - how to implement
in J?  See "Machine Learning Example - k-NN in F# and OCaml" and
"Comparing k-NN in Rust".

4. Learning, teaching and promoting J, et al.: lack of IDE as a virtue?
See "Getting Back to Coding" and "Just Let Me Code"; also, "Software
Fails the Test of Time".

Proceedings

We discussed a number of approaches to machine learning in the context of a Kaggle competition to make sense of Higgs boson data.

Beginner's regatta

Higgs Kaggle Basics

The "Higgs Boson Machine Learning Challenge" is a Kaggle competition described here. It states:

The goal of the Higgs Boson Machine Learning Challenge is to explore the potential of advanced machine learning methods to improve the discovery significance of the experiment. No knowledge of particle physics is required. Using simulated data with features characterizing events detected by ATLAS, your task is to classify events into "tau tau decay of a Higgs boson" versus "background."

The competition started at 8:23 pm, Monday 12 May 2014 UTC and ends 11:59 pm, Monday 15 September 2014 UTC.

The problem is outlined in detail in this PDF.

A discussion of this competition on the J-Forum led to this note from Scott Locklin:

from:	 Scott Locklin <scott@lugos.name>
to:	 devonmcc@gmail.com
date:	 Fri, Aug 1, 2014 at 4:02 AM
subject: Higgs Kaggle

Hey, have you made any progress on this? I haven't looked at the data at all, but if you think it will submit to KNN, the libFLANN hooks I added with Bill Lam's help should speed things up for you considerably.

I haven't written a nice kernel regression object system for flann yet, but it's in the works (I more or less have to copy some work I did in Lisp) if you need such a thing. I don't have time to make any real contribution…

KNN tricks: local regression on KNN subsets works well sometimes. One thing which really ramps up KNN is to weight the features, either via online pruning (tough to do right) or just mutual information. KNN by itself is cursed with the naive bayes assumption that all the features are equally important; with some sensible weighting, you can get magic.

Another thing I'm fiddling with is CUR decomposition, which would be a more SVD/PCA type approach. It's trivial in J, and I'll be sticking it in my githubs when it's done. It uses lapack SVD for now, so it won't work on really big data or sparse arrays, but it's another TBD for me. Like I said, I haven't even looked at the data, so I have no idea what could be of use, but I know flann is a good NN library.

best regards,

-SL

See the Materials section below for more information on this FLANN library.
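Scott's suggestion of weighting features by mutual information can be sketched in a few lines of J. The verb below is only a rough illustration of that idea, not part of the competition code or of any library he mentions; the name mutualInfo and the binning shown in the NB.EG lines are invented here. It scores a discretized feature column against the label vector; higher scores suggest more informative features.

NB.* mutualInfo: mutual information (in nats) between two equal-length
NB. categorical vectors, e.g. a binned feature column (x) and the 's'/'b'
NB. labels (y).  Brute-force sketch; fine for scoring single columns.
mutualInfo=: 4 : 0
   xi=. (~. x) i. x                    NB. Integer codes 0..nx-1 for x values
   yi=. (~. y) i. y                    NB. Integer codes 0..ny-1 for y values
   'nx ny'=. (# ~. x), # ~. y
   pxy=. (nx,ny) $ (+/"1 (i. nx*ny) =/ yi + ny * xi) % # x   NB. Joint probabilities
   'px py'=. (+/"1 pxy); +/ pxy        NB. Marginal probabilities
   nz=. , pxy > 0                      NB. Ignore empty cells
   +/ (nz # , pxy) * ^. (nz # , pxy) % nz # , px */ py
NB.EG col=. 0 {"1 trn                                                    NB. One feature column
NB.EG binned=. <. 9.9999 * (col - <./ col) % (>./ col) - <./ col        NB. 10 equal-width bins
NB.EG binned mutualInfo lbls
)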

Competition Basics

The competition supplies two major data sets: a training set and a test set. The training set is a .CSV file of 250,000 rows of 30 numbers representing measurements taken from the ATLAS experiment. This set comes with known "answers": a vector of 250,000 labels distinguishing measurements that represent signal from those representing background noise. The test set is 550,000 rows of 30 values with unknown answers. The point of the competition is to use the training set to train algorithms able to distinguish signal from noise in the test set.

As of this NYCJUG meeting, I had J code only for the most basic operations: reading in the data and building a proposed set of signal-versus-noise answers for the test set in a form suitable for submission to the Kaggle site for evaluation. Evaluation is reported as a single number, the "AMS statistic", which scores the correct answers against the false positives and false negatives.
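For reference, the AMS ("approximate median significance") metric defined by the competition is

   AMS = sqrt( 2 * ( (s + b + br) * ln(1 + s % (b + br)) - s ) )

where s is the total weight of the true positives (signal events classified as signal), b is the total weight of the false positives (background events classified as signal), and br = 10 is a constant regularization term that reduces the variance of the measure. The calcAMS verb below evaluates this formula.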

This code is shown below, followed by a sample session in which it is used to build an entry for submission to the contest.

Initial J Code for Kaggle "Higgs Boson" Competition

require 'tables/dsv'

NB.* getTrainingData: break apart .csv file of training data into useful
NB. arrays.
getTrainingData=: 3 : 0
   'trntit trn'=. split (',';'') readdsv y   NB. Split title row from data
   trn=. }:"1 trn [ lbls=. ;_1{"1 trn   NB. Separate character labels from numbers
   trn=. }."1 trn [ evid=. 0{"1 trn     NB. Pull off event IDs as vec
   trn=. ".&>trn                        NB. Data as simple numeric mat
   trn=. }:"1 trn [ wts=. _1{"1 trn     NB. Pull off weights column as vec
   trntit;lbls;wts;evid;<trn
NB.EG 'trntit lbls wts evid trn'=. getTrainingData 'Data/training.csv'
)

NB.* getTestData: read .csv file of test data into useful arrays.
getTestData=: 3 : 0
   'tsttit tst'=. split (',';'') readdsv y   NB. Split title row from data
   tst=. }."1 tst [ evidt=. 0{"1 tst         NB. Pull off event IDs as vec
   tst=. ".&>tst                             NB. Test data as simple numeric mat
   tsttit;evidt;<tst
NB.EG 'tsttit evidt tst'=. getTestData 'Data/test.csv'
)

NB.* calcAMS: calculate AMS metric based on R code in AMSscore.R.
calcAMS=: 3 : 0
   'wts actual guesses'=. y  NB. Weights vector, actual values, predicted values.
NB. Sum signal weights according to predicted label 's' or 'b'.
   's b'=. guesses +//. wts*actual='s'
   's b'=. ('s'~:{.guesses)|.s;b        NB. Correct order from guess order
   br=. 10  NB. Regularization term to reduce variance of measure.
   %:+:s-~(s+b+br)*^.>:s%b+br
)

Example Code Usage

Example J session showing how to use the above code to create a submission file.

NB.   1!:44 pp=. {path to code}
   'trntit lbls wts evid trn'=. getTrainingData 'Data/training.csv'
   'tsttit evidt tst'=. getTestData 'Data/test.csv'

NB. Initial attempt: simple regression.
   coeffs=. (lbls='s')%.trn        NB. Regress s=1 on factors
   ests=. trn+/ . * coeffs         NB. Estimates based on regression

   's'+/ . = lbls                  NB. Number of signals
85667
   lbls #/. lbls                   NB. # 's' vs. 'b'
85667 164333
   ('s'={.lbls)|.lbls #/. lbls     NB. Put 'b' # first
164333 85667

   threshold=. (/:~ests){~-'s'+/ . = lbls  NB. Guess threshold for correct # of each signal
   (ests>:threshold) #/. ests NB. Verify correct number 'b' vs. 's'
164333 85667

   guess1=. 'bs'{~ests>:threshold  NB. Estimates to labels: 's' signal, 'b' background.
   calcAMS wts;lbls;guess1         NB. Measure of goodness is 19.8468 (higher is better)
19.8468

NB. Build submission file: EventId, 1-based rank order of the estimate, and predicted class label.
   submission=. ('EventId';'RankOrder';'Class'),(":&.>350000+i.#ests),.(":&.>>:/:/:ests),.<"0 guess1
   submission writedsv 'Data/NEJedi0.csv';',';''

Show-and-tell

We discussed the J conference this year, including my own contribution here as well as some others hosted at the Journal of J.

Advanced topics

Elaborating on some of the techniques relevant to the Kaggle competition, we looked at some implementations of different nearest neighbor algorithms, specifically k-nearest neighbors. There is a good comparison of implementations in OCaml and F# here as well as one in the Rust language.
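For concreteness, here is a minimal brute-force k-NN classifier in J, a sketch rather than a tuned implementation; the verb name knnClassify and the k;trn;lbls argument convention are invented here. It builds the full test-by-training distance matrix, so it is only practical on modest subsets of the competition data; scaling up is what an approximate-neighbor library like FLANN (see Materials) is for.

NB.* knnClassify: classify each row of the test matrix y by the majority label
NB. among its k nearest rows of the training matrix, by Euclidean distance.
NB. Left argument: k ; training matrix ; training labels ('s'/'b' characters).
NB. Brute force: builds the full (#y) by (#trn) distance matrix.
knnClassify=: 4 : 0
   'k trn lbls'=. x
   sqdist=. +/"1 *: y -"1/ trn         NB. Squared distances, test rows by training rows
   nbrs=. k {."1 /:"1 sqdist           NB. Indices of the k nearest training rows per test row
   'bs' {~ (-: k) < +/"1 's' = nbrs { lbls   NB. Majority vote between 'b' and 's'
NB.EG guesses=. (15; trn; lbls) knnClassify 1000 {. tst   NB. Classify a subset of the test rows
)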

Learning and Teaching J

There seems to be increasing discontent with the sheer bulk of mainstream programming environments, judging by the responses to Andrew Binstock's essay in Dr. Dobb's, Getting Back to Coding. He starts by noting:

Last week's editorial (Just Let Me Code!) went viral and received literally hundreds of comments.... ...readers were very familiar with the frustration I expressed of not being able to code as much as I like because I'm spending time taming the tools that support my work: the IDE, the VCS, the defect tracker, test tools, and especially the build system. ... My conclusion was that this tool-level complexity (as opposed to the complexity of the code itself) had turned programming into an operational activity rather than a creative one.

This frustration is familiar to many of us in the J community, oriented as we are to a succinct notation that lets us get quickly to the heart of a problem. It is particularly noticeable in cases where, as in the Advanced Topics section above, we have supposedly working implementations of algorithms we would like to render in J but must first wade through the complexity and overhead imposed by other, less fortunate languages. The cost becomes obvious when one attempts to validate such code by running it: the time spent simply getting someone else's algorithm to compile and run is seldom trivial.

This shortcoming is particularly poignant when we consider the points raised in this comment from a Slashdot discussion on age discrimination in the tech industry - Software Fails the Test of Time - in which the author attests to how little progress software as such has made over the past 45 or so years. This rings especially true to those of us from the APL world, where we feel like we've been faint voices in the wilderness, our good ideas widely ignored for decades. It reminds me of a quote from Howard Aiken:

Don't worry about people stealing your ideas.  If your ideas are any good, you'll have to ram them down people's throats.

Materials