
weighted median, box cut example, simple shape, canonical projection, a-periodic tiles, Javascript as target language, bilingualism's effect on brain, language affects thought, R programming, examples of introduction to programming, game-based learning

Location:: Heartland


             Meeting Agenda for NYCJUG 20120612
1. Beginner's regatta: comparison and experiment with weighted median - see
"Elegant SQL - weighted median.pdf" & follow-up from last month - see
"Using box cut data.pdf".

Why do we have to continue to put up with new languages that do such a
crappy job of array-handling?  See "A Conversation About Shape.pdf".

2. Show-and-tell: More on JHS: see "Hello World demo for JHS.pdf".

Aperiodic tiles - see excerpt from "JournalOfJ_April2012.pdf" and

3. Advanced topics: See "Teaching code.pdf" and "Javascript Performance

Effect of language on thinking: see "Bilingual brain boost.pdf" and
"who-dunnit-Crosslinguistic Differences in Memory.pdf".

See "Why does one language succeed and another one fail.pdf".

4. Learning, teaching and promoting J, et al.: See "First Thoughts on R.pdf",
"Example Prelim Intros to Programming Languages.pdf", "The State of Games in
the Classroom.pdf", and "Programming language board game.pdf".

Beginner's regatta

Weighted Median - Calculation and Pitfalls

We looked at an example of elegant SQL code for computing a weighted median, then at some J code to compute the median. A follow-on to the original (SQL) request for a weighted median calculation closely echoed a request I was looking at for work, so we explored it in a little detail.

The problem was not simply to find the median of a series of values weighted by another series but to use the division created by the median value to partition another related dataset. For example, we might be finding the median of market-cap - which is a company's number of shares outstanding weighted by its market price - then comparing the companies with below-median market-cap to those with above-median market-cap on some other measure like an earnings-to-price ratio.

In the J example of working out this problem, we saw how we might simply modify the median calculation to return the index (or indexes, in the case of an even number of values) of the median item (or items). So, we revamped this definition

   median=: -:@(+/)@((<. , >.)@midpt { /:~)

to this one:

   medianPosn=: [:~.(<. , >.)@midpt { /:

This latter definition does not average the two (possibly distinct) midpoints -  -:@(+/)@ - and works on the grade vector -  /: - rather than the sorted list -  /:~ . However, some testing of this new definition revealed a conceptual shortcoming:

   (wts*vals);(wts=. >:10 ?@$ 9);vals=. >:10 ?@$ 5
|9 15 10 5 20 5 8 6 16 32|3 5 2 5 4 5 8 2 4 8|3 3 5 1 5 1 1 3 4 4|
   median wts * vals
9.5
   medianPosn wts * vals
0 2

Note that the two median positions are not adjacent. This exposes an implicit condition of applying a median computed on one set of values to another set of related values: the items must be ordered by the weighted values on which the median is calculated.
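The notes do not show the definition of midpt. As a minimal sketch, assuming midpt is the fractional middle index of a list (half of one less than the item count), the two definitions reproduce the results shown above:

```j
NB. Assumed helper, not given in the notes: fractional middle index of a list.
midpt=: -:@<:@#

median=: -:@(+/)@((<. , >.)@midpt { /:~)   NB. average of the middle sorted value(s)
medianPosn=: [: ~. (<. , >.)@midpt { /:    NB. original position(s) of the middle value(s)

v=: 9 15 10 5 20 5 8 6 16 32               NB. the wts*vals list from the session above
median v                                   NB. 9.5
medianPosn v                               NB. 0 2
/: v                                       NB. 3 5 7 6 0 2 1 8 4 9
```

Positions 0 and 2 hold the values 9 and 10, which sit side by side only after sorting; they are items 4 and 5 of the grade vector.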

So, to use this concept properly, in J we might do something like this:

   gv=. /: wts * vals
   medpt=. -:+/medianPosn gv { wts * vals

   belowMedian=. (/:gv){medpt>i.#gv  NB. Boolean to select items below weighted measure
1 0 0 1 0 1 1 1 0 0
   belowMedian#wts*vals              NB. Verify that we get the right weighted values
9 5 5 8 6
   (-.belowMedian) # wts*vals
15 10 20 16 32

An interesting twist in generating the boolean that selects the below-median values is that we "unsort" the simple boolean generated on the sorted set -  medpt>i.#gv - by indexing it with the grade of the grade vector -  /: gv - which is the inverse of the sorting permutation. This assumes that the other items of interest are in the same order as our weights and values, as they would be if they were different columns of the same table.
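As a sketch of the final step, the boolean can partition a related column. Here ep is a hypothetical list of earnings-to-price ratios invented for illustration, midpt is an assumed definition (the notes do not give one), and mean is a helper added for this example:

```j
midpt=: -:@<:@#                          NB. assumed definition; not given in the notes
medianPosn=: [: ~. (<. , >.)@midpt { /:

wts=: 3 5 2 5 4 5 8 2 4 8
vals=: 3 3 5 1 5 1 1 3 4 4
ep=: 0.05 0.08 0.04 0.11 0.06 0.09 0.07 0.10 0.03 0.02   NB. hypothetical ratios

gv=: /: wts * vals                       NB. grade (sort order) of the weighted values
medpt=: -: +/ medianPosn gv { wts * vals NB. 4.5: midpoint position in sorted order
belowMedian=: (/: gv) { medpt > i. # gv  NB. unsort the boolean back to original order

mean=: +/ % #
(mean belowMedian # ep) , mean (-. belowMedian) # ep   NB. 0.084 0.046
```

The same boolean selects rows from any column that shares the original ordering, so no column ever has to be physically sorted.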

Follow-up to Explanation of “Box Cut”

Last month we spent two pages explaining the first line of J here.

'ontit ontap'=: split <;._1&>TAB,&.><;._2 ] 0 : 0
Beer Name  Served In  ABV  Price
Allagash White  16oz. Draft     5.5  $7.00
Bear Republic Roggenbier   14oz. Draft     4.5  $7.00

This used the “cut” conjunction “ ;. ” with the box verb “ < ” and two of cut’s arbitrary numeric qualifiers, “ _1 ” and “ _2 ”, to format tab-delimited lines of text into a useful matrix. This combination of the “box” verb with the “cut” conjunction is what I call “box cut”.
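The two cuts can be seen separately on a small inline string. This is a sketch only: the text is abbreviated from the beer list, and split is defined locally on the assumption that it matches the standard-library verb of the same name (head ; tail).

```j
split=: {. ,&< }.     NB. assumed equivalent to the standard-library split

text=: 'Beer',TAB,'ABV',LF,'Allagash White',TAB,'5.5',LF,'Bear Republic Roggenbier',TAB,'4.5',LF

lines=: <;._2 text               NB. _2: the final character (LF) marks the end of each piece
$ lines                          NB. 3
table=: <;._1&> TAB ,&.> lines   NB. _1: the leading character (the prepended TAB) marks each start
$ table                          NB. 3 2

'hdr body'=: split table         NB. header row ; remaining rows
hdr                              NB. the two boxed titles: Beer, ABV
$ body                           NB. 2 2
```

Prepending TAB to every line is what lets the _1 form treat the first character as the delimiter, so each field ends up in its own box.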

Here's a follow-up to that explanation, showing one way this form is useful. In this exercise, we apply the same expression we saw last month to create two tables from tab- and LF-delimited text, then use the resulting variables to track down differences between two moderately large sets of data: lists of the members of the S&P 400 index.

   load 'c:/amisc/Clarifi/THB/sp400s.ijs'  NB. S&P400 members on different dates.

   $&.>on501;<on518         NB. Check the size of each: should be the same.
|400 7|400 7|

   on518 -: on501           NB. Is data the same?  No.
0

   'on518 on501'=. /:~&.>on518;<on501  NB. Sort them both to be sure...
   on518 -: on501           NB. Still different, but how?
0

   $on518 -. on501          NB. How many items on new date not on old one?
1 7
   on518 -. on501           NB. Which item is on 5/18 but not on 5/1?
|TPX|TEMPUR PEDIC INTL INC|15686101|88023U101|156861|01|Household Durables|
   on518 -.~ on501          NB. What is in 5/1 but not 5/18?
|TNB|THOMAS & BETTS CORP|01054001|884315102|010540|01|Electrical Equipment|
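The same steps can be tried on toy data. The two small tables below are hypothetical (invented tickers and names, not the real constituents); the point is that -. compares whole rows when both arguments are tables, and that sorting first makes the -: test independent of row order.

```j
NB. Hypothetical two-column tables (ticker ; name), listed in different row orders.
old=: 3 2 $ 'AAA';'Alpha Co';'BBB';'Beta Co';'CCC';'Gamma Co'
new=: 3 2 $ 'CCC';'Gamma Co';'AAA';'Alpha Co';'DDD';'Delta Co'

(/:~ new) -: /:~ old    NB. 0: still different after sorting
$ new -. old            NB. 1 2: one row is in new but not in old
new -. old              NB. the DDD row
new -.~ old             NB. the BBB row: in old but not in new
```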

Details of the Script

The script "sp400s.ijs" has two entries. The first begins like this:

NB.* sp400s.ijs: constituents of S&P400 Midcap index on different dates in 2012.

'tit on518'=: split <;._1&>TAB,&.><;._2 ] 0 : 0
Ticker     Company Name    Issue ID   CUSIP gvkey iid  Industry
AAN  AARON'S INC     00107601   002535300  001076     01   Specialty Retail
ALK  ALASKA AIR GROUP INC 00123001   011659109  001230     01   Airlines
ALEX ALEXANDER & BALDWIN INC    00125401   014482103  001254     01   Marine
Y    ALLEGHANY CORP  00127401   017175100  001274     01   Insurance
SWKS SKYWORKS SOLUTIONS INC     00132701   83088M102  001327     01   Semicond...

The other global is assigned very similarly but with a different name, starting like this:

'tit on501'=: split <;._1&>TAB,&.><;._2 ] 0 : 0

The value of the title vector "tit" is the same in both cases, which is why we re-used the name.

Now when we want to compare another date’s index composition, we have the boilerplate into which we can insert its data.

So, if we’re on a phone call with a client who claims the index did not change between 4/30 and 5/1, we can verify this while we’re talking to him by adding the data for 4/30 into our script and doing the following.

   load 'c:/amisc/Clarifi/THB/sp400s.ijs'

   'on518 on501 on430'=. /:~&.>on518;on501;<on430
   -: / /:~ &.> on430 ; <on501  NB. Sorted tables the same?  No.
0
   $on501 -. on430
1 7
   on501 -. on430               NB. What are the differences?
|SVU|SUPERVALU INC|01019001|868536103|010190|01|Food & Staples Retailing|
   on501 -.~ on430
|AM|AMERICAN GREETINGS  -CL A|00146801|026375105|001468|01|Household Durables|

Now we can be sure that the index did change and we can specify what the changes were. We see that "American Greetings" was in the index on 4/30 but was replaced by "Supervalu" on 5/1.
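If this check comes up often, the two set differences could be wrapped in one verb. The name indexChanges is hypothetical, and t430 and t501 are toy three-row subsets; that ALK and Y were members on both dates is an assumption made for the example.

```j
NB. Hypothetical helper: x is the earlier table, y the later one.
NB. Each result is boxed explicitly because ; would not box an already-boxed table.
indexChanges=: 4 : '(< y -. x) , < x -. y'

t430=: 3 2 $ 'ALK';'ALASKA AIR GROUP INC';'Y';'ALLEGHANY CORP';'AM';'AMERICAN GREETINGS  -CL A'
t501=: 3 2 $ 'ALK';'ALASKA AIR GROUP INC';'Y';'ALLEGHANY CORP';'SVU';'SUPERVALU INC'

'added dropped'=: t430 indexChanges t501
added      NB. the SVU row
dropped    NB. the AM row
```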


Advanced Topics

Learning, teaching, and problem-solving


-- Devon McCormick, 2012-06-13T12:59:21-0200