Originally presented on 5/13/2007 to the NYC Financial Engineering Meetup. It illustrates the use of J to manipulate data quickly and easily in order to explore topics in quantitative financial research. The previous work is here and the continuation of this is here.

# More Data Munging

Now we'll work on removing parts of the data that give inordinate trouble. For instance, when we look at all the daily (price) returns,

```
$rets=. ret1p&.>3{"1&.>CODAT
848
$;rets
1881993
```
we see that we have almost 1.9 million observations. However, we run into trouble as soon as we attempt to look at the return distribution:
```
load'mystats'
ss=: 'Returns Distribution' plotHistoMulti ;rets
|domain error: to
| x.+(1{y.)*(*x.-~0{y.)* i.>:<.0.5+(|x.-0{y.)%1{y.
usus ;rets
__ 0.87735849 __ _
```
This last set of numbers tells us our lowest return is negative infinity, our highest is 87.7%, the mean return is negative infinity, and the standard deviation is infinite. The infinities are due to zeros in the price data, so let's see what the stats are if we remove these (negative) infinities from the returns:
```
replInf0=: 3 : '(y.=__)}y.,:0'
rets=. replInf0&.>rets
usus ;rets
_101.6 0.87735849 _0.21019811 0.42054006
```
These numbers are more reasonable but the low extreme is still doubtful: the lowest return is about -10,000%, and the mean is about -21%, which is possible but seems somewhat unlikely.
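As a language-neutral illustration of these two steps, here is a hypothetical NumPy sketch. The function names are stand-ins for the J verbs, and it is an assumption that `ret1p` computes simple one-period returns p_t/p_{t-1} - 1:

```python
import numpy as np

def ret1p(prices):
    """Simple one-period returns r_t = p_t/p_{t-1} - 1
    (an assumption about what the J verb ret1p computes)."""
    p = np.asarray(prices, dtype=float)
    return p[1:] / p[:-1] - 1.0

def repl_inf0(rets):
    """Replace negative-infinity returns with 0,
    mirroring the J verb replInf0."""
    r = np.asarray(rets, dtype=float)
    return np.where(np.isneginf(r), 0.0, r)

print(ret1p([100.0, 110.0, 99.0]))        # approximately [0.1, -0.1]
print(repl_inf0([0.05, -np.inf, -0.02]))  # [0.05, 0.0, -0.02]
```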

Looking at the ten lowest and highest returns tells us that there are some outliers:

```
(;rets){~10{./:;rets
_101.6 _84.94 _28.25 _26 _4.3333333 _3.7777778 _1.8849558 _1.8333333 _1.8064516 _1.729927
(;rets){~10{.\:;rets
0.87735849 0.78333333 0.76470588 0.75324675 0.62962963 0.56045752 0.54189944 0.52631579 0.51775148 0.46969697
```
We see about four low returns in the negative thousands of percent, with a quick jump up to a flatter set of negative returns in the negative hundreds of percent. On the positive side, the change from the extreme is more gradual, so these outliers differ less from the body of the observations.

Some more work lets us know that a few of our data series are very short; the shortest are only one or two observations:

```
pxs=. _1-.~&.>3{"1&.>CODAT
<./#&>pxs
1
I. 1=#&>pxs
781 847
```
The companies at indexes 781 and 847 have only two and one observations, respectively:
```
$&.>781 847 { CODAT
+---+---+
|2 6|1 6|
+---+---+
```
These companies are:
```
781 847 { COINFO
+----+--------------------------------+--------+------+----------+
|MNST|Monster Worldwide Inc. |61174210|064156| 6/04/2001|
+----+--------------------------------+--------+------+----------+
|LLL |L-3 Communications Holdings Inc.|50242410|110685|12/01/2004|
+----+--------------------------------+--------+------+----------+
```
The sizes of the ten smallest series are:
```
(#&>CODAT){~10{./:#&>CODAT NB. 10 smallest data series
1 2 3 24 27 28 44 46 66 87
```
If we have to pick a cut-off below which we will discard a series, it looks like there's a natural break around 40. This would eliminate these companies:
```
I. 40>#&>pxs
220 781 844 845 846 847
COINFO{~I. 40>#&>pxs
+----+------------------------------------+--------+------+----------+
|HLT |Hilton Hotels Corp. |43284810|005643| 1/02/1990|
+----+------------------------------------+--------+------+----------+
|MNST|Monster Worldwide Inc. |61174210|064156| 6/04/2001|
+----+------------------------------------+--------+------+----------+
|COH |Coach Inc. |18975410|140541| 9/01/2004|
+----+------------------------------------+--------+------+----------+
|CIT |CIT Group Inc. |12558110|149738|10/27/2004|
+----+------------------------------------+--------+------+----------+
|LH |Laboratory Corp. of America Holdings|50540R40|014960|11/01/2004|
+----+------------------------------------+--------+------+----------+
|LLL |L-3 Communications Holdings Inc. |50242410|110685|12/01/2004|
+----+------------------------------------+--------+------+----------+
```
The latter four look like they joined the S&P near the end of our date range. The other two must have dropped out soon after the start of our date range (Hilton Hotels) or soon after being added to the index (Monster Worldwide).

So, let's remove these few companies on the basis of their small number of observations. We have to adjust three of our global variables:

```
exclsz=. 40<#&>CODAT NB. Eliminate short series
'COIX CODAT'=: (<exclsz)#&.>COIX;<CODAT
COINFO=: (1,exclsz)#COINFO
```
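The short-series filter above can be sketched in Python as well (a hypothetical stand-in for the J expression; the function name and list-of-lists representation are assumptions):

```python
def drop_short(series, cutoff=40):
    """Keep only series with more than `cutoff` observations,
    like the J boolean exclsz=. 40<#&>CODAT."""
    return [s for s in series if len(s) > cutoff]

data = [[1.0] * 100, [2.0] * 5, [3.0] * 41]
kept = drop_short(data)  # the 5-observation series is dropped
```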
Now, looking at the returns with these series removed, we see there are still some suspicious returns of -100%, so let's remove those as well and look at the statistics on the full set of observations:
```
rets=. ret1p&.>_1-.~&.>3{"1&.>CODAT
$;rets
1881564
usus ;rets
_1 0.87735849 _0.21003814 0.40836759
$_1-.~;rets
1485670
We still have nearly 1.5 million observations, so none of these eliminations has cost us too many observations.
```
rets=. _1-.~&.>ret1p&.>_1-.~&.>3{"1&.>CODAT
usus ;rets
_0.85158151 0.87735849 0.00046699481 0.024474664
```
(Compare these statistics (min, max, mean, std dev) to the winsorized stats at the end of this section.)

Now the lowest and highest returns are of similar magnitude and the mean is slightly positive.

However, our graph of the distribution of returns is still dominated by the extremes:

```
ss2=. 'Returns Distribution After Initial Eliminations' plotHistoMulti ;rets
```

Looking at the high and low extremes of the flattened returns (rr=. ;rets):

```
5 10$0.001 roundNums rr{~50{./:rr NB. 50 lowest returns
_0.852 _0.851 _0.766 _0.761 _0.636 _0.623 _0.614 _0.61 _0.597 _0.585
_0.573 _0.537 _0.528 _0.519 _0.496 _0.495 _0.495 _0.494 _0.493 _0.49
_0.48 _0.48 _0.475 _0.471 _0.467 _0.465 _0.46 _0.455 _0.446 _0.445
_0.445 _0.439 _0.438 _0.436 _0.425 _0.424 _0.424 _0.422 _0.422 _0.418
_0.415 _0.41 _0.409 _0.408 _0.405 _0.401 _0.4 _0.4 _0.4 _0.398

5 10$0.001 roundNums rr{~50{.\:rr NB. 50 highest returns
0.877 0.783 0.765 0.753 0.677 0.63 0.56 0.542 0.526 0.518
0.47 0.458 0.451 0.45 0.438 0.427 0.424 0.42 0.41 0.409
0.405 0.394 0.391 0.389 0.384 0.384 0.379 0.369 0.368 0.367
0.366 0.363 0.36 0.358 0.355 0.353 0.351 0.348 0.348 0.348
0.347 0.339 0.338 0.333 0.332 0.331 0.331 0.33 0.33 0.326
```
These don't look unusual but they make it hard to see the distribution of the main body of returns. To get a better look at this bulk of the observations, we'll use a technique called "winsorizing". This reins in the extreme values by capping them at some arbitrary value closer to the median of the observations.

So, let's winsorize the observations at both ends to concentrate on the bulk of the distribution for now. After trying a couple of different cut-off points for the outliers, I decided to winsorize at the 4000th observation on both ends.

```
NB.* winsorizeAt: winsorize top and bottom at xth highest/lowest observation.
winsorizeAt=: 4 : 0
grd=. /:y
y=. (y<:y{~x{grd) }y,:y{~x{grd
y=. (y>:y{~(-x){grd) }y,:y{~(-x){grd
)
ss2=. 'Returns Distribution - Winsorized at 4000th' plotHistoMulti 4000 winsorizeAt ;rets
```
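For readers following along in another language, here is a hypothetical NumPy sketch of the same idea; the exact end-point indexing of the J verb winsorizeAt is approximated here with np.clip:

```python
import numpy as np

def winsorize_at(k, y):
    """Cap values below the k-th smallest at the k-th smallest and
    above the k-th largest at the k-th largest (a sketch of the J
    verb winsorizeAt; the precise end-point choice is an assumption)."""
    y = np.asarray(y, dtype=float)
    srt = np.sort(y)
    return np.clip(y, srt[k], srt[-k])

w = winsorize_at(2, np.arange(10.0))  # caps at 2.0 and 8.0
```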
```
   (#rr)%~rr+/ . =4000 winsorizeAt rr
0.99461657
```
This gives a good look at the bulk of the data, over 99% of which is unaffected by winsorizing. You can see the winsorized data appearing in the highest and lowest bins of the histogram, making these a little taller than one would expect given the trend of the number of observations decreasing toward the tails of the distribution.

We can also look at our usual statistics on the winsorized data:

```
   usus           NB.  The definition of "usus"
3 : 0
if. 0=L. y do. (]`|: @.(1<#\$y))(<./,>./,mean (,`,: @.(1<#\$y)) stddev) y
else. (<./,>./,mean,stddev)&> y end.
)

usus 4000 winsorizeAt rr
_0.09881698 0.10666229 0.00048500529 0.022861136
```
Notice that the mean and standard deviation remain very similar to those of the unwinsorized data (the usus stats shown above after the initial eliminations).
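A hypothetical Python equivalent of usus for reference (whether J's stddev uses the n-1 divisor is an assumption here):

```python
import numpy as np

def usus(y):
    """(min, max, mean, std dev) summary, like the J verb usus.
    The sample (n-1) divisor for the std dev is an assumption."""
    y = np.asarray(y, dtype=float)
    return y.min(), y.max(), y.mean(), y.std(ddof=1)

mn, mx, mean, sd = usus([1.0, 2.0, 3.0, 4.0])
```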
