Scripts/Forum Time Series

From J Wiki
[Figure: Plotsmooth.png]

Purpose

This page explores trends in Jsoftware.com forum messages using statistical time series analysis, J's plot package, and J's data analysis features. Through a few examples, we suggest one possible approach that is simple and transparent.

Data acquisition

The JForum data are available from the following links: programming, general, beta and chat.

Each link gives the full URL of the first month of data available for that forum, organized in "thread" order. Similar links exist in "author" and "date" order.

Other data sources, such as the one at Gmane, have also been mentioned in the JForum.

As demonstrated in this forum post, it can be quite easy to collect the required web-based data using the verb httpget (Scripts/HTTP Get). The example session below uses httpget to confirm that 357 messages were posted to the Programming forum in October 2007.

   load '~user/httpget.ijs'
   require 'regex'

   A=: httpget'http://www.jsoftware.com/pipermail/programming/2007-October/thread.html'

   'Messages:[^0-9]*([0-9]+)' (,.@:{:@rxmatch ];.0 ]) A
357

First we create a verb yearmonth, which constructs the unique portion of each web data link; "2007-October" in the example above is one such string. Download script: timeseries.ijs

load '~user/httpget.ijs'
require 'regex'

months =: <;._2 ]0 : 0
January
February
March
April
May
June
July
August
September
October
November
December
)

NB.* year v
NB.  monad triple: start year after 2000
NB.                start month (eg. April = 4)
NB.                number of months
NB.  year 3 4 15 is start at 2003-April and contain 15 months
year  =: (+ 2000+12<.@%~])`(+ <:@i.)/

NB. month v
NB. monad triple: same inputs as year triple
month =: <:@(1&{) 12&|@+ i.@{:

yearmonth =: ,each/@(;/@('-',.~":)@,.@year,:months{~month)

NB.  yearmonth 3 4 15

«readdata»

«datasets»

«movingaverage»

«my3»

«pdutilities»
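
As a quick check of the arithmetic, the sketch below restates year and month from the script above and applies them to a short triple starting in October 2005. The triple 5 10 4 is chosen only for illustration; expected results are shown as comments.

```j
NB. restated from the script above so this block stands alone
year  =: (+ 2000+12<.@%~])`(+ <:@i.)/
month =: <:@(1&{) 12&|@+ i.@{:

year  5 10 4      NB. 2005 2005 2005 2006
month 5 10 4      NB. 9 10 11 0  (zero-based indices into months)
```

Indices 9 10 11 0 select October, November, December and January, so the run crosses the year boundary exactly where year steps to 2006.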

With the yearmonth string generator in hand, readdata calls httpget inside a for. loop to automate data collection for the Programming forum. The process is rather slow: it took about 3 minutes for the 34 pages accessed in the example. Download script: readdata

urlhead =: <'http://www.jsoftware.com/pipermail/programming/'
urltail =: <'/thread.html'

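readdata =: monad define
  NB. y: boxed year-month strings, e.g. the result of yearmonth
  urls =. urlhead ,each y ,each urltail   NB. one boxed URL per month
  result =. i. 0                          NB. local, so no global is left behind
  for_url. urls do.
    page =. httpget > url
    count =. 'Messages:[^0-9]*([0-9]+)' (,.@:{:@rxmatch ];.0 ]) page
    result =. result, ". count
  end.
  result
)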

$messages =: readdata yearmonth 5 10 34
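
Because the download loop is slow, it can help to test the extraction phrase offline first. The sketch below applies the same pattern to a made-up fragment; the sample string is an assumption about what the page contains near the count, not a copy of the actual page, and the names sample, pattern and count are introduced only for illustration.

```j
require 'regex'

NB. hypothetical fragment; the markup around the count is assumed
sample  =: '<b>Messages:</b> 357<p>'
pattern =: 'Messages:[^0-9]*([0-9]+)'

NB. rxmatch returns (start,length) rows; the last row is the subpattern
count =: ". pattern (,.@:{:@rxmatch ];.0 ]) sample
count             NB. 357
```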

By the time others access the messages data above, the most recent month may have changed. To preserve the data, we reproduce it below, along with the size data set used later, as two nouns. Download script: datasets

   messages =: ;<@".;._2  (0 : 0)
 37 261 335 381 295 576 273 270 172 291 226 285
450 380 372 374 380 456 479 570 357 366 415 234
357 293 371 323 279 477 336 244 290 113
)


   size =: ;<@".;._2  (0 : 0)  NB. in KBs
 14  86 142 136 113 266 100 109  70  94  90 112
174 129 157 137 164 201 219 230 148 149 150  83
143 114 165 130 117 181 149 100 124  27
)

Time series data analysis

To smooth out seasonal effects in time series data, a simple approach is to compute a centered moving average (CMA) whose length is a multiple of one year. For monthly data, a 12-month (or 24-month or 36-month) CMA is appropriate; the more months in the CMA, the more months are lost from the two ends of the data (6, 12, or 18 months from each end). A CMA is required instead of a plain moving average (MA) because 12, 24, and 36 are even numbers: without the centering, each MA value would fall midway between two months and would not align with the original months.
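
The alignment problem is easiest to see on a tiny example. The sketch below is an illustration only: it uses a 4-period window, made-up data on a straight line, and the arithmetic mean (defined locally) rather than the median used later on this page.

```j
mean =: +/ % #               NB. arithmetic mean, defined here for the sketch
d    =: 2 4 6 8 10 12        NB. made-up data on a straight line

4 mean\ d                    NB. 5 7 9  (centered at positions 1.5 2.5 3.5)

NB. average the MAs of the curtailed and beheaded data to center them
(4 mean\ }: d) -:@+ (4 mean\ }. d)   NB. 6 8  (positions 2 and 3)
```

The plain MA values 5 7 9 sit between data points, while the centered values 6 8 coincide with items 2 and 3 of d, at the cost of two points at each end.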

Of course, the MA produces the desired smoothing. Traditionally MAs are computed with the statistical mean, but the median, used here, is more robust to outlying months. The J script library stats contains verb definitions for both mean and median. Download script: movingaverage

   load'stats'

   MA  =: &(median\)                      NB. moving average (median-based)
   CMA =: 1 : 'm MA@}: -:@+ m MA@}.'      NB. centered moving average
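
The robustness of the median can be seen on a small made-up sample with one extreme value. So that the sketch stands alone, it defines its own mean and med rather than loading the stats script; these local definitions are assumptions for illustration and are not the library's code.

```j
mean =: +/ % #
NB. med: average of the one or two middle items of the sorted data
med  =: -:@(+/)@((<. , >.)@-:@<:@# { /:~)

d =: 10 12 11 500 13         NB. made-up counts with one outlier
mean d                       NB. 109.2
med  d                       NB. 12
med  10 12 11 13             NB. 11.5
```

A single unusual month drags the mean far from the typical level, while the median barely moves.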

While exercising the moving average verbs, we notice that CMA reduces the number of data points from 34 to 22 in this example.

   #messages
34
   #12 CMA messages                       NB. notice 12 are lost
22
   _6]\ 12 CMA messages                   NB. _6]\ is for display only
 283.5  290.5    293    293 312.25 331.5
352.25    375    377    377  378.5   380
 378.5    375 372.75  370.5    365 361.5
359.25 351.75    338 318.75      0     0

Plot data

To get a quick view of these results, try the next plots. They are all crude renderings, but we can see quite a bit from them. The first shows that the first and last months are low and unrepresentative, so the second drops them. The third plot is noticeably shorter and smoother, suggesting a growth period and perhaps a decline. The last plot shows the original and smoothed data together, letting us see them in perspective, but producing the plot this way loses even the 12-period tics on the x axis.

load 'plot'

'xtic 12' plot messages
'xtic 12' plot }.}:messages
'xtic 12' plot (6+i.20);12 CMA }.}: messages
pd bind 'show' @pd"1 ]((i.32);}.}:messages),:(6+i.20);12 CMA }.}: messages

Some utilities will permit more refined plots.

For example, here we prepare to label the plot's x axis better with the verb my3. fmt is defined to force the year number to 2 digits, padded in front with a 0 if necessary. If space is tight and there is room for only 1 digit, you may redefine it: fmt =: ": Download script: my3

   MTH3=: _3 ]\ 'JanFebMarAprMayJunJulAugSepOctNovDec'
   fmt=: [: , ('r<0>2.0')&(8!:2)
   monthyear =: 1 : '[: (;:^:_1) 0 12 <@((,~&fmt) {&m)/@#: ]'
   my3 =: MTH3 monthyear

Experiment by exercising the verb my3.

   my3 14 92 202
Mar01 Sep07 Nov16
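
The argument to my3 is a month count in which the year and a zero-based month index are packed as 12*year + month. A minimal sketch of the decoding step, using only the primitive #: and the values from the session above:

```j
0 12 #: 14 92 202     NB. rows are (year, month index): 1 2 ; 7 8 ; 16 10
10 + 12 * 5           NB. 70, i.e. November (index 10) of year 05
```

The second line shows how a start month such as 70, the left argument given to pdxlabel in the plots, is put together.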

Some utilities that bundle pd commands are shown next. Download script: pdutilities

pdsetup =: monad define
  pd 'reset'
  pd 'type line,marker'
  pd 'graphbackcolor mediumgray'
  pd 'gridcolor 230 230 230'
  pd 'axes 1 0;axiscolor slategray'
  pd 'color red,blue,green'
  pd 'pensize 2;markersize 1.5'
  pd 'xtic 12'
  'title subtitle' =. y
  pd 'title ',title
  pd 'subtitle ',subtitle
)

pdxlabel =: dyad define
  startmonth =. x
  data =. y     NB. original data before CMA
  xticpos=. ((#data)$0 0 1)#i. #data  NB. 0 0 1 is configurable
  pd 'xticpos ',":xticpos
  pd 'xlabel ',my3 startmonth + xticpos
)

pdshow =: pd bind 'show' @pd"1
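
The tick selection in pdxlabel copies every third index using a cyclic 0 0 1 mask. The sketch below runs that phrase on a hypothetical series of 10 points (the name n is introduced here) and shows the month numbers that would then be handed to my3 for a start month of 70.

```j
n =: 10                          NB. hypothetical series length
xticpos =: (n $ 0 0 1) # i. n    NB. 2 5 8
70 + xticpos                     NB. 72 75 78
0 12 #: 70 + xticpos             NB. rows (year, month index): 6 0 ; 6 3 ; 6 6
```

Changing the mask to 0 0 0 0 0 1 would label every sixth month instead.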

Now we show how the pd utilities can be used on two of the earlier plot examples.

   pdsetup 'Monthly message count';''
   70 pdxlabel }.}: messages
   pdshow  }.}: messages

   pdsetup 'Monthly message count';'and smoothed'
   70 pdxlabel }.}: messages
   pdshow((i.32);}.}:messages) ,: (6+i.20);12 CMA }.}: messages

An interesting trend in the average size of Programming forum messages is uncovered next.

   pdsetup 'Monthly KBs/message ';'and smoothed'
   70 pdxlabel }.}: messages
   pdshow((i.32);}.}:size%messages) ,: (6+i.20);12 CMA }.}: size%messages

This page was contributed by Brian Schott, but incredible contributions were made by Oleg Kobchenko and Ric Sherlock. Also, special thanks to Raul Miller.

