User:Brian Schott/Histogram

From J Wiki

The purpose of this essay is to expand on Roger Hui's Essays/Histogram, especially on how (relative) histograms of empirical data are used in statistical circles to capture the nature of theoretical probability distributions. The three main points made here are:

a. to contrast the interval counts collected by Hui's verb histogram, which uses the dyadic verb I. (Idot), with the interval counts in statistical histograms,

b. to align the interval labels properly with their frequency counts, and

c. to enable the construction of a statistical histogram with rescaled frequency counts when the intervals are not of uniform width.



I. collects frequency counts based on intervals that are open on the left and closed on the right: (x_{i-1}, x_i]. Statistical intervals reverse this pattern: [x_{i-1}, x_i). The reversal is dealt with later, by reversing the input to Idot, once the alignment problem has been handled; in this first section the reversal problem is still present and is simply ignored.
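
The interval convention of I. can be seen on a small case. This is only an illustrative sketch: the boundary list b is hypothetical and not part of the essay's data. Values that fall exactly on a boundary (10, 20, 30) receive the same index as values just below it, which is the closed-on-the-right behavior described above.

```j
   b=: 10 20 30                  NB. hypothetical ascending boundaries
   b I. 5 10 15 20 25 30 35      NB. index of the interval (x_{i-1}, x_i] holding each value
0 0 1 1 2 2 3
```

Index 0 collects everything at or below the first boundary, and index 3 (that is, #b) collects everything above the last one.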

histogram =: <: @ (#/.~) @ (i.   @#@[ , I.)
histogram1=: <: @ (#/.~) @ (i.@>:@#@[ , I.)

test1 =: dyad define
 assert. ({:x)>:>./y
 assert. ({.x)<<./y
)


Compare the only plot in Hui's essay with the first plot here in the vicinity of e=100. In Hui's plot, the flat peak of the histogram lies to the right of e=100; here the flat peak is centered around f=100. To accomplish this, histogram1 is based on realigned intervals: the outermost boundaries are dropped from its input (f=: }.}: e), and histogram1 itself supplies one additional count, for values beyond the last remaining boundary. One final adjustment is needed for a continuous variable like the one Hui uses as example data: adjacent interval boundaries are averaged before plotting (ff), so that the horizontal tick marks align with each interval's center.

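   d=: +/ 10 1e6 ?.@$ 21              NB. 1e6 sums of 10 fixed-seed rolls of i.21
   e=: 5 * i.40                       NB. boundaries 0 5 10 ... 195
   f=: }.}: e                         NB. drop first and last: 5 10 ... 190
   h =: e  histogram d
   h1=: f histogram1 d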
   e (histogram-:histogram1) d        NB. should NOT be equal
0
   h1 -: }.h                          NB. should be equal
1
   f test1 d                          NB. no output: both assertions hold
   ff=: 2+/\-:e                       NB. midpoints of adjacent boundaries: 2.5 7.5 ... 192.5

   load 'plot'
   plot ff;h1


Histogram1.jpg
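
As a rough check of the realignment on a case small enough to count by hand, the session below repeats the two definitions and applies them to hypothetical boundaries e0 and data y0 (neither comes from the essay). It is a sketch of the idea, not a replacement for the session above.

```j
   histogram =: <: @ (#/.~) @ (i.   @#@[ , I.)
   histogram1=: <: @ (#/.~) @ (i.@>:@#@[ , I.)
   e0=: 0 10 20 30               NB. hypothetical boundaries
   y0=: 3 7 12 18 19 25          NB. hypothetical data, all inside (0,30]
   e0 histogram y0               NB. one count per boundary; the first is for y0 <: 0
0 2 3 1
   (}.}: e0) histogram1 y0       NB. same interior counts without the empty leading bin
2 3 1
   2 +/\ -: e0                   NB. interval midpoints to plot those counts against
5 15 25
```

Plotting 2 3 1 against the midpoints 5 15 25 centers each count on its interval, whereas plotting 0 2 3 1 against e0 places each count at the right-hand end of its interval.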

A more traditional statistical histogram is produced by histogram2, which employs Idotr; unequal interval widths are then handled by the verbs relative and drawplot.

The verb Idotr defined here reverses the application of I., so that the intervals used by histogram2 are closed on the left and open on the right. However, histogram2 alone cannot cope with unequal interval widths. The final example creates unequal interval widths by removing the interval boundaries 45 and 50. The verbs relative and drawplot treat the Hui example as if it were to be modeled by a discrete probability mass model instead of the continuous Gaussian density function. In such cases the area of each interval, rather than its height, is proportional to the interval's relative frequency. The final plot shows the desired result.

histogram =: <: @ (#/.~) @ (i.   @#@[ , I.)
histogram1=: <: @ (#/.~) @ (i.@>:@#@[ , I.)
histogram2=: <: @ (#/.~) @ (i.@>:@#@[ , Idotr)
Idotr =: |.@[ (#@[-I.) ]

relative =: ((2 -~/\ [) %~ }.@}:@histogram2) % #@]
drawplot =: 2&#@[ ; _1&|.@(2&#)@(,&0)@]

test1 =: dyad define
 assert. ({:x)>:>./y
 assert. ({.x)<<./y
)
test2 =: dyad define
  assert. ({:x)>>./y
  assert. ({.x)<:<./y
)
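
The effect of Idotr can be compared with plain I. on a small case. The boundary list b below is hypothetical and chosen only for illustration; the definition of Idotr is the one from the listing above.

```j
   Idotr =: |.@[ (#@[-I.) ]
   b=: 10 20 30                     NB. hypothetical ascending boundaries
   b I.    5 10 15 20 25 30 35      NB. (x_{i-1}, x_i]: a boundary value stays in the lower interval
0 0 1 1 2 2 3
   b Idotr 5 10 15 20 25 30 35      NB. [x_{i-1}, x_i): a boundary value moves to the upper interval
0 1 1 2 2 3 3
```

The two results differ only at the values 10, 20 and 30, which lie exactly on a boundary.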


   d=: +/ 10 1e6 ?.@$ 21
   e=: 5 * i.40
   e2 =: e -. 45 50
   e2 test2 d
   'ycaption relative frequency'plot e2 ([ drawplot relative ) d


Histogram2.jpg
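
The rescaling done by relative, and the outline built by drawplot, can be traced on a small case with unequal widths. In this sketch the boundaries b and the sample s are hypothetical; each height is the interval count divided by the interval width times the sample size, so the areas of the bars sum to 1.

```j
   Idotr =: |.@[ (#@[-I.) ]
   histogram2=: <: @ (#/.~) @ (i.@>:@#@[ , Idotr)
   relative =: ((2 -~/\ [) %~ }.@}:@histogram2) % #@]
   drawplot =: 2&#@[ ; _1&|.@(2&#)@(,&0)@]
   b=: 0 10 30                      NB. hypothetical boundaries: widths 10 and 20
   s=: 1 3 5 7 10 12 15 20 25 29    NB. hypothetical sample: 4 values in [0,10), 6 in [10,30)
   b relative s                     NB. 4%10*10 and 6%20*10
0.04 0.03
   +/ (b relative s) * 2 -~/\ b     NB. total area of the bars
1
   'px py'=: b ([ drawplot relative) s
   px                               NB. each boundary twice
0 0 10 10 30 30
   py                               NB. heights stepped between boundaries, 0 at both ends
0 0.04 0.04 0.03 0.03 0
```

The pair px;py traces the outline of the bars from the baseline at the first boundary back down to the baseline at the last one, which is the form passed to plot in the session above.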