NYCJUG/2023-04-11

From J Wiki

Beginner's regatta

Our basic verb to round numbers has three flavors: up, down, and banker's. The last of these, the default, claims to somewhat randomize the direction of rounding by going up or down on a half depending on whether the prior (scaled) digit is even or odd, so that halves always round to an even digit.
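As a cross-check for readers more at home in Python: Python's built-in round happens to use this same round-half-to-even rule, so its behavior on exact halves matches the default described here (a quick sketch, not part of the J code):

```python
# Python's round() implements banker's rounding (round half to even),
# the same default behavior described for roundNums.
halves = [0.5, 1.5, 2.5, 3.5, 4.5]
print([round(h) for h in halves])  # -> [0, 2, 2, 4, 4]
```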

We present it here as an example of "readable" J: of its roughly 20 lines, the first seven either name the verb and make its ranks explicit or are comments that explain, and show by example, how the verb works.

We start with the first section:

roundNums=: 3 : 0"1 0
NB.* roundNums: round numbers y to precision x, e.g.
NB. 0.1 roundNums 1.23 3.14159 2.718 -> 1.2 3.1 2.7.
NB. Optional 2nd left argument is single letter specifying type of rounding:
NB. Up, Down, or Banker's, e.g. (1;'D') for down-rounding.  Default banker's
NB. rounding (round halves up or down depending on parity of next (scaled)
NB. digit) tends to randomize bias.
   1 roundNums y
:

These lines name the function and define its default left and right ranks: vectors on the left, scalars on the right. The left argument must be a vector (with scalar extension) because it may be a parameter package consisting of the plain precision argument and a flag indicating which type of rounding to use, the default being banker's.

   RT=. 'B' [ TO=. x     NB. Default Banker's rounding; precision to round to.
   if. (2=#x)*.1=L. x do. 'TO RT'=. x end.
   scaled=. y%TO         NB. Bankers round down if last digit even,
   RN=. 0.5*(0~:2|<.scaled)+.0.5~:1|scaled   NB. Default banker's
   select. RT            NB. Rounding Type
   case. 'D' do. RN=. (0.5=1|scaled){0.5 0   NB. Round halves down
   case. 'U' do. RN=. 0.5                    NB. Round halves up
   case.     do. ''                          NB. Everything else
   end.
   TO*<.scaled+RN
)

If we read the comments above, we see that the final claim is that the default rounding is banker's because it "tends to randomize bias". However, an evil (or QA) person could devise a reasonable dataset where this default would introduce a bias. How might we do this?

Example Use

Here we see the three supported rounding methods and how they differ on a simple sequence of numbers; we are rounding to the nearest unit (1) in each of the three methods as indicated by the explicit letter override.

   (1;'D') roundNums 0.5 _0.5 1.5 2.5 3.5 4.5
0 _1 1 2 3 4
   (1;'B') roundNums 0.5 _0.5 1.5 2.5 3.5 4.5
0 0 2 2 4 4
   (1;'U') roundNums 0.5 _0.5 1.5 2.5 3.5 4.5
1 0 2 3 4 5
   

We pay attention to good practice by providing a catch-all empty case statement at the end of the expected ones.

How would the above look rendered tacitly? How maintainable would the tacit version be?

Testing Bias

We will check whether a new, random way of rounding gives answers significantly different from the other methods, but first we have to add it. The way the code is written makes this easy to do.

Let's call this new version roundNums1 (remembering to make the internal changes to keep this consistent with the original version).

roundNums1=: 3 : 0"1 0
NB.* roundNums1: round numbers y to precision x, e.g.
NB. 0.1 roundNums1 1.23 3.14159 2.718 -> 1.2 3.1 2.7.
NB. Optional 2nd left argument is single letter specifying type of rounding:
NB. Up, Down, Random, or Banker's, e.g. (1;'D') for down-rounding.  Default
NB.  banker's rounding (round halves up or down depending on parity of next
NB.  (scaled) digit) tends to randomize bias.
   1 roundNums1 y
:
   RT=. 'B' [ TO=. x     NB. Default Banker's rounding; precision to round to.
   if. (2=#x)*.1=L. x do. 'TO RT'=. x end.
   scaled=. y%TO         NB. Bankers round down if last digit even,
   RN=. 0.5*(0~:2|<.scaled)+.0.5~:1|scaled   NB. Default banker's
   select. RT            NB. Rounding Type
   case. 'D' do. RN=. (0.5=1|scaled){0.5 0   NB. Round halves down
   case. 'R' do. RN=. 0.5 0{~?>:0.5=1|scaled NB. Round randomly.
   case. 'U' do. RN=. 0.5                    NB. Round halves up
   case.     do. ''                          NB. Everything else
   end.
   TO*<.scaled+RN
)

We test this by generating random numbers biased to end in 0.5, then rounding them to the nearest whole number.

   $tests=. (] + 0.5 * 0 = 1 | ])(]*_1 1{~ 2?@$~#) -:<.+:+/1e6?@$&>1e10 0
1000000
   usus tests
_9.99999e9 9.99999e9 _1.05782e6 5.7722e9
   0.5+/ . = 1|tests
1000000

This shows that we have a million random numbers between minus and plus ten billion, all of which end in a half.
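To see why such a dataset is "evil", a small Python sketch (round_half_up is our own hypothetical helper, not part of the J code) shows that on values ending in one half, round-half-up accumulates a bias of 0.5 per value while round-half-to-even cancels out:

```python
import math

def round_half_up(x):
    # always push exact halves upward
    return math.floor(x + 0.5)

vals = [k + 0.5 for k in range(1000)]          # every value ends in .5
bias_even = sum(round(v) - v for v in vals)    # banker's rounding error
bias_up = sum(round_half_up(v) - v for v in vals)
print(bias_even, bias_up)  # -> 0.0 500.0
```

By the same token, a dataset of halves whose integer parts are all even would bias even the banker's method downward.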

Now we run all the rounding methods on these biased random numbers twice, then check whether we get different results, which is what we expect when our new random rounding method is included.

   6!:2 'rr0=. (1;&.>''B'';''D'';''R'';''U'';''oops'') roundNums1 &> <tests'
7.37859
   6!:2 'rr1=. (1;&.>''B'';''D'';''R'';''U'';''oops'') roundNums1 &> <tests'
7.36313
   rr0-:rr1
0

Check that the difference is solely due to the inclusion of the random method:

   6!:2 'rr2=. (1;&.>''B'';''D'';''U'';''oops'') roundNums1 &> <tests' NB. No 'R'
5.75552
   6!:2 'rr3=. (1;&.>''B'';''D'';''U'';''oops'') roundNums1 &> <tests' NB. No 'R'
5.72286
   rr2-:rr3
1

Looking at some basic statistics (minimum, maximum, mean, and standard deviation) on the result of each method, we find very small differences; only the random ('R') row changes between the two runs.

   12 12 14j3 15j3":usus"1 rr0
 _9999998110  9999964026   5612284.892 5777665880.576
 _9999998110  9999964025   5612284.392 5777665880.575
 _9999998109  9999964025   5612284.891 5777665880.576
 _9999998109  9999964026   5612285.392 5777665880.575
 _9999998110  9999964026   5612284.892 5777665880.576
   12 12 14j3 15j3":usus"1 rr1
 _9999998110  9999964026   5612284.892 5777665880.576
 _9999998110  9999964025   5612284.392 5777665880.575
 _9999998110  9999964026   5612284.892 5777665880.576
 _9999998109  9999964026   5612285.392 5777665880.575
 _9999998110  9999964026   5612284.892 5777665880.576

Show-and-tell

We outline the development of some J code to retrieve stock price histories for arbitrary stocks.

Getting Stock Price History

We will use Yahoo Finance to get stock price series. To do this, we will set up an example retrieval, then capture the URL generated by a particular set of parameters. We start by manually picking a particular stock and date range.
YahooStockPrice-set date range0.JPG

After we select the "Apply" button to use these settings, we get this URL:
https://finance.yahoo.com/quote/F/history?period1=1672531200&period2=1681084800&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true.

Deciphering the URL

The meaning of most of this string is evident but parts of it are obscure. Clearly, everything from "https..." to "...history?" represents the fixed part of the URL and the "F" in this part is the ticker of the stock in question. The strings separated by ampersands following the question mark are parameters, so it's clear that "period1" and "period2" represent the start and end dates of this selection; much of our initial analysis will be in decoding how these map to our selected dates. We don't really need to worry about the other parameters since we can leave them untouched when we build our custom URL to retrieve data.

To decode the date parameters, we change the start and end period by one day to see how the generated URL changes. We change both dates because the first date we chose, 1/1/2023, is not a trading day and this may affect the date encoding. Changing the starting date a day later to 1/2/2023 and the ending date a day earlier to 4/9/2023, we get this URL:
https://finance.yahoo.com/quote/F/history?period1=1672617600&period2=1680998400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true.

Extracting the two pairs of numbers, we take their difference:

   1672531200 1681084800 - 1672617600 1680998400
_86400 86400

Having a suspicion of what these positive and negative single day differences signify, we check the number of seconds in a day to verify that the unit of these timestamps is seconds.

   */24 60 60
86400

Dividing each period pair by this number gives what should be some kind of day numbers. Taking the difference within each pair confirms that the two ranges differ in length by two days, as we would expect when the second range starts a day later and ends a day earlier.

   86400%~1672531200 1681084800 ,: 1672617600 1680998400
19358 19457
19359 19456
   -/"1 ] 86400%~1672531200 1681084800 ,: 1672617600 1680998400
_99 _97
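These figures suggest the periods are Unix timestamps (seconds since 1970-01-01 UTC). As an independent cross-check, a few lines of Python decode all four values to exactly the dates we selected:

```python
from datetime import datetime, timezone

# Decode Yahoo's period parameters as Unix timestamps (UTC seconds).
for ts in (1672531200, 1681084800, 1672617600, 1680998400):
    print(datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d'))
# -> 2023-01-01  2023-04-10  2023-01-02  2023-04-09
```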

Working with Dates in J

Searching the J wiki for "date arithmetic" yields this promising page titled "Addons/types/datetime". Looking at the usage section shows us functions for converting to and from day numbers: toDayNo and toDateTime.

Applying the latter of these to the day numbers derived from the periods in the URL gives us this:

   require 'types/datetime'
   toDateTime 86400%~1672531200 1681084800 ,: 1672617600 1680998400
1853 1  1 0 0 0
1853 4 10 0 0 0

1853 1  2 0 0 0
1853 4  9 0 0 0

These look good except that the J routines and the Yahoo Finance URL apparently assume different base dates, 170 years apart. We can adjust by adding the day-number offset corresponding to this difference to give us the expected dates.

   toDayNo 2023 1 1,:1853 1 1
81449 19358
   -/toDayNo 2023 1 1,:1853 1 1
62091

This difference is the same as the day number of the standard Unix base date, 1970-01-01:

   toDayNo 1970 1 1
62091

Adding this offset gives us the dates we expect.

   toDateTime 62091+86400%~1672531200 1681084800 ,: 1672617600 1680998400
2023 1  1 0 0 0
2023 4 10 0 0 0

2023 1  2 0 0 0
2023 4  9 0 0 0

We will assign the "magic number" of this base-date offset in days to a global (indicated by global assignment =: and an all-capital name):

   YROFFSET=: 62091

Checking our calculation based on the dates against the period numbers, we see that they match:

   YROFFSET-~toDayNo &> 2023 1 1;2023 4 10;2023 1 2;2023 4 9
19358 19457 19359 19456
   86400%~1672531200 1681084800 1672617600 1680998400
19358 19457 19359 19456

Building a URL

Now we combine what we have learned so far to write a function that takes a stock ticker and a date range to produce a URL to retrieve price data for the ticker over the date range. It's often a good idea to start with the function documentation to help us focus on what the arguments should look like, so this is our initial try:

NB.* buildURL: for date range x (start Y M D=0{x; end=1{x) and stock ticker y,
NB. build Yahoo Finance URL to retrieve the price history of the stock over
NB. the date range.
require 'types/datetime'

buildURL=: 4 : 0
   YROFFSET=. 62091 [ DAYSECS=. 86400
   baseURL=. 'https://finance.yahoo.com/quote/{ticker}/history?period1={startDate}&period2={endDate}&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true'
   baseURL=. baseURL rplc '{ticker}';y
   dts=. DAYSECS*YROFFSET-~toDayNo &> x
   baseURL=. ' '-.~baseURL rplc ,('{startDate}';'{endDate}'),.dts
)

Trying this out, we don't get exactly what we want:

   (2023 1 1;2023 4 10) buildURL 'F'
https://finance.yahoo.com/quote/F/history?period1=1.67253e9&period2=1.68108e9&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true

Looking at the original URL we are trying to duplicate, we see that we have a problem representing full precision for long integers.
https://finance.yahoo.com/quote/F/history?period1=1672531200&period2=1681084800&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true

This is because we are outputting 1.67253e9 instead of 1672531200 because of our default print precision settings.

We can fix this by formatting the line that produces the periods, specifying a field width of 10 (shorthand for the more general 10j0, where the number after the j gives the digits after the decimal point):

dts=. 10":&.>DAYSECS*YROFFSET-~toDayNo &> x

This change produces a good result.

   (2023 1 1;2023 4 10) buildURL 'F'
https://finance.yahoo.com/quote/F/history?period1=1672531200&period2=1681084800&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true

Using the URL

Now that we can build an arbitrary URL to send to our data provider, how exactly do we send it? Fortunately, we have two command-line tools, wget and curl, that can transmit the URL over the internet to get our data.

Our initial attempts failed because the format of the URL was incorrect. We can't simply mimic the URL we see in a browser. Instead, we have to use a different form so we change the initial assignment of baseURL in our routine to be this:

baseURL=. 'https://query1.finance.yahoo.com/v7/finance/download/{ticker}?period1={startDate}&period2={endDate}&interval=1d&events=history&includeAdjustedClose=true'
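For comparison, here is a rough Python sketch of the same construction (the helper name build_url is ours; we assume the periods are Unix seconds at midnight UTC, as decoded earlier):

```python
from datetime import datetime, timezone

BASE = ('https://query1.finance.yahoo.com/v7/finance/download/{t}'
        '?period1={p1}&period2={p2}&interval=1d&events=history'
        '&includeAdjustedClose=true')

def build_url(start, end, ticker):
    # start and end are (year, month, day) tuples; periods are UTC seconds
    to_secs = lambda ymd: int(datetime(*ymd, tzinfo=timezone.utc).timestamp())
    return BASE.format(t=ticker, p1=to_secs(start), p2=to_secs(end))

print(build_url((2023, 1, 1), (2023, 4, 10), 'F'))
```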

Once we've made this change, we generate a URL and send it using curl.

   ]url=. (2023 1 1;2023 4 10) buildURL 'F'
https://query1.finance.yahoo.com/v7/finance/download/F?period1=1672531200&period2=1681084800&interval=1d&events=history&includeAdjustedClose=true
   shell 'curl -o stockPx.dat "',url,'"'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  4666  100  4666    0     0  24502      0 --:--:-- --:--:-- --:--:-- 24557

Notice that we surround the URL with double quotes to avoid problems with any special characters in it.

Our result is in the local file stockPx.dat as specified by the -o parameter above. We check its size and the first 400 bytes of content to see if it looks good.

   fsize 'stockPx.dat'
4666
   fread 'stockPx.dat';0 400
Date,Open,High,Low,Close,Adj Close,Volume
2023-01-03,11.820000,11.920000,11.570000,11.680000,10.987339,45809000
2023-01-04,11.880000,12.080000,11.740000,12.010000,11.297770,53429700
2023-01-05,12.110000,12.380000,11.880000,12.250000,11.523537,50785600
2023-01-06,12.120000,12.590000,12.100000,12.580000,11.833966,53089100
2023-01-09,12.740000,12.930000,12.550000,12.690000,11.937443,50865500
2023-01-

Parsing the Data

Now that we have the data in a regular form, it's simple to parse it into a pair of arrays, separating the initial title row from the data rows that follow.

   'title data'=. split <;._1&>',',&.><;._2 ] LF (],[#~ [~:[: {:]) CR-.~fread 'stockPx.dat'
   $title
7
   title
+----+----+----+---+-----+---------+------+
|Date|Open|High|Low|Close|Adj Close|Volume|
+----+----+----+---+-----+---------+------+
   $data
66 7
   _3{.data
+----------+---------+---------+---------+---------+---------+--------+
|2023-04-04|12.770000|12.850000|12.510000|12.720000|12.720000|54655700|
+----------+---------+---------+---------+---------+---------+--------+
|2023-04-05|12.580000|12.650000|12.220000|12.430000|12.430000|53332900|
+----------+---------+---------+---------+---------+---------+--------+
|2023-04-06|12.420000|12.480000|12.290000|12.330000|12.330000|36355800|
+----------+---------+---------+---------+---------+---------+--------+

The expression LF (],[#~ [~:[: {:]) CR-.~ applied to the contents of the file removes any carriage returns and ensures that the last character is a line feed; we assume the lines of the file end in either CRLF or bare LF. Then we apply <;._2, which cuts the result into lines on the LF delimiter, discarding the delimiters. Next we prepend a comma to each line and cut on this initial character, discarding the commas, with <;._1&>',',&.>. Finally, we apply the stdlib verb split to break the resulting table into two pieces: its initial element and all the subsequent ones.

This gives us two items we assign to title for the first row and data for all the rest. Each column of data corresponds to an element of title so we can extract a particular column with a self-documenting line like

   dts=. data{"1~title i. <'Date'
   3{.dts
+----------+----------+----------+
|2023-01-03|2023-01-04|2023-01-05|
+----------+----------+----------+
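The same split into a title row and data rows, with column lookup by name, can be sketched in Python with the standard csv module (the sample text below abbreviates the file's first lines):

```python
import csv, io

raw = ('Date,Open,High,Low,Close,Adj Close,Volume\r\n'
       '2023-01-03,11.82,11.92,11.57,11.68,10.99,45809000\n'
       '2023-01-04,11.88,12.08,11.74,12.01,11.30,53429700\n')
rows = list(csv.reader(io.StringIO(raw)))   # tolerates CRLF or bare LF
title, data = rows[0], rows[1:]
dts = [row[title.index('Date')] for row in data]  # column lookup by name
print(dts)  # -> ['2023-01-03', '2023-01-04']
```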

Advanced topics

We look at how we might use stock price history to find interesting relations among data by running regressions between different sets of data.

Getting Data

First we will combine the steps above to put together a generalized stock-price data getter.

cvtDt=: [: ". [: ; [: <;._1 ,  NB. Convert e.g. '2023-04-11' to number 20230411.

getPxs=: 4 : 0
   url=. x buildURL y
   shell 'curl -s -o ',flnm,' "',url,'"' [ flnm=. 'stockPx.tmp'
   'title data'=. split <;._1&>',',&.><;._2 ] LF (],[#~ [~:[: {:]) CR-.~fread flnm
   dts=. '-' cvtDt&>data{"1~title i. <'Date'  NB. Dates as YYYYMMDD numbers.
   adjCls=. data{"1~title i. <'Adj Close'  NB. Closing price adjusted for splits and dividends.
   dts;<".&>adjCls
)

The getPxs routine takes a date range on the left and a stock ticker on the right and returns a list of dates with their associated adjusted closing prices. We convert the dates from their character YYYY-MM-DD form to numeric YYYYMMDD values.
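The conversion that cvtDt performs corresponds to a Python one-liner (the name cvt_dt is our own):

```python
def cvt_dt(s):
    # '2023-04-11' -> 20230411
    return int(s.replace('-', ''))

print(cvt_dt('2023-04-11'))  # -> 20230411
```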

So, to get data for three American car companies, we can do this:

   6!:2 'dd=. (<2000 1 1;2023 4 10) getPxs&.>''F'';''GM'';''TSLA'''
0.925565
   $&.>dd
+-+-+-+
|2|2|2|
+-+-+-+
   3{.&>{.&>dd
20000103 20000104 20000105
20101118 20101119 20101122
20100629 20100630 20100701

Each element of dd is a two-element vector of vectors for the dates and the prices. We see above that the dates are not the same for all three stocks.

Munging Data

In order to use a nice, simple J expression to regress the returns of one car company against another, we first need to ensure that the two arguments are compatible.

So, say we want to study how the returns of the American car-maker Ford (ticker "F") relate to the returns of other American car companies, foreign car companies, and a broad equity index. We need series of the same length to do this.

A Date Problem

The first order of business is to align our disparate time series on their dates. We will restrict our discussion to American companies (plus the S&P 500 index) because introducing foreign assets brings with it a host of complicated, data-intensive date problems; here we look only at the simplest case.

Reloading the data as above but including the S&P 500 looks like this:

   6!:2 'dd=. (<2000 1 1;2023 4 10) getPxs&.>''F'';''GM'';''TSLA'';''^GSPC'''
1.18601
   3{.&>{.&>dd
20000103 20000104 20000105
20101118 20101119 20101122
20100629 20100630 20100701
20000103 20000104 20000105
   _3{.&>{.&>dd
20230404 20230405 20230406
20230404 20230405 20230406
20230404 20230405 20230406
20230404 20230405 20230406

  #&>{.&>dd
5853 3116 3216 5853

The last expression here shows us that Ford and the S&P 500 have the same number of dates and their series have the same starting and ending dates.

Looking at the date ranges available for each series, we see this.

   (<./,>./)&>{.&>dd
20000103 20230406
20101118 20230406
20100629 20230406
20000103 20230406

Finding the common date range with ]dtRange=. >./(<./,>./)&>{.&>dd, we see that the minimal common range runs from 20101118 through 20230406. (Strictly, the end of the common range should be the minimum of the ending dates; taking the maximum works here only because all four series end on the same date.)

We can define our minimal set of common data using this:

fitRanges=. 4 : 0
NB.* fitRanges: given start, stop dates (as YYYYMMDD numbers), restrict the
NB. dates and associated data in y to fit within the inclusive range.
   'dt dat'=. y
   (<(dt>:{.x)*.dt<:{:x) #&.>y
)
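A Python sketch of the same restriction (fit_ranges here is our own name; YYYYMMDD integers order the same way as dates, so plain comparisons suffice):

```python
def fit_ranges(dt_range, dates, data):
    # keep only (date, value) pairs inside the inclusive date range
    lo, hi = dt_range
    keep = [(d, v) for d, v in zip(dates, data) if lo <= d <= hi]
    return [d for d, _ in keep], [v for _, v in keep]

dts, pxs = fit_ranges((20101118, 20230406),
                      [20000103, 20101118, 20230406, 20230410],
                      [11.0, 12.0, 13.0, 14.0])
print(dts, pxs)  # -> [20101118, 20230406] [12.0, 13.0]
```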

Looking at the minimal common set of data, we see that the dates and data are now all the same length.

   $&>&.>dd=. (<dtRange) fitRanges &.> dd
+----+----+----+----+
|3116|3116|3116|3116|
|3116|3116|3116|3116|
+----+----+----+----+

Using the Common Data

We can extract each set of prices, turn them into return series, and see how well each correlates with the returns of the S&P 500. First we look at the S&P 500 prices and convert them to returns.

   10{.>{:>{:dd
1196.69 1199.73 1197.84 1180.73 1198.35 1189.4 1187.76 1180.55 1206.07 1221.53
   2%~/\10{.>{:>{:dd
1.00254 0.998425 0.985716 1.01492 0.992531 0.998621 0.99393 1.02162 1.01282
   $retsSP500=. 2%~/\>{:>{:dd
3115

Next we do this for each of our car companies.

   $retsF=. 2%~/\>{:>{.dd   NB. Returns for Ford
3115
   $retsGM=. 2%~/\>{:>1{dd   NB. Returns for GM
3115
   $retsTSLA=. 2%~/\>{:>2{dd   NB. Returns for Tesla
3115
   (<retsSP500) %.&> retsF;retsGM;retsTSLA
0.999865 0.997267 0.997267

This last expression shows us that each set of company returns fits the S&P 500's returns with a coefficient very close to 1, i.e. all three are highly correlated with the index.
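For readers following along in Python, the return series and the no-intercept least-squares fit that dyadic %. computes here can be sketched as (helper names are ours):

```python
def returns(prices):
    # daily return as the ratio of consecutive prices
    return [b / a for a, b in zip(prices, prices[1:])]

def lsq_coef(y, x):
    # no-intercept least squares: the b minimizing sum((y - b*x)**2)
    return sum(yi * xi for yi, xi in zip(y, x)) / sum(xi * xi for xi in x)

px = [100.0, 110.0, 121.0]
r = returns(px)
print(r)               # -> [1.1, 1.1]
print(lsq_coef(r, r))  # -> 1.0
```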

We will continue this demonstration of data munging in a future NYCJUG meeting.

Learning and Teaching J

Changing the Way you Write Changes the Way You Think

In the vein of Sapir-Whorf, this video presents a talk on how language affects thinking. Xuanyi Chew, speaking at DataEngBytes 2022, presents his own version of APL for building neural networks, called INiGo, for "Iverson Notation in Go" (introduced at around 8:08 in the video).

He talks about how he fell into using APL by accident, literally. He had a problem with his arm which limited his ability to type so he thought of trying APL because it would minimize how much typing he would have to do. Interestingly, he came to notice how the notation affected the way he thinks about problems.

He mentions Gorgonia, "Deep Learning in Go", which is part of what inspired his version of APL. He wants to present neural networks "in a slightly different way", which leads him to mathematical notation. Although he admits such notation "kind of scares people away", he aims to present neural net concepts in "a form that is familiar and comfortable for software engineers".

INiGo Notational Differences from APL

Chew outlines how his notation differs from APL.

INiGo notation additions.JPG

He recognizes the beauty of APL which he illustrates by this simple statement of Bayes's Theorem:

Bayes's theorem in its simplest form.JPG

An APL Response

In parts 1 and 2 of The APL Show's response to this talk, Adám Brudzewsky and Richard Park offer a reaction video to points of Chew's talk. They play the talk and comment on it as it goes. This may be the more efficient way to watch Chew's talk, since you get more information per minute.

They mention Rodrigo Serrao's work on neural nets as an example of powerful notation applied to U-Net CNNs. Rodrigo has also produced about six hours of video detailing how to build a neural net in APL; though its focus is on learning APL, the tutorial illustrates how well suited the language is for this sort of work, which is where Chew started in his talk.

Are we in the Age of Average?

Based on this article, our contemporary world seems to be converging on a sameness in many areas: current ideals of feminine beauty, shapes of passenger cars, and many others. The article begins with a story about some researchers who asked 1,001 US citizens a series of survey questions about what they like in a work of art, in a painting in particular.

The first major conclusion of their study, after conducting it across different nations, is that "[d]espite soliciting the opinions of over 11,000 people, from 11 different countries, each of the paintings looked almost exactly the same." Moreover, the paintings they created based on each country's preferences were uniformly fairly bad. The article extrapolates from this to argue

that from film to fashion and architecture to advertising, creative fields have become dominated and defined by convention and cliché. Distinctiveness has died. In every field we look at, we find that everything looks the same.

Here we see some illustrations of this in architecture and housing.

First, in common American condominiums.

Sameness of American five-over-one architecture 50p.JPG

Then in the general look of different cities around the world.

Sameness of cities around the world 50p.JPG

Finally, in what everyone thinks a "Bed and Breakfast" should look like.

Sameness of international BnBs 50p.jpg

Do we see something similar to this with programming languages? Once upon a time, languages were supposed to look like Fortran, then like C, now like Python.

The Cost of Context-switching

This article tries to estimate the cost to a programmer of frequent interruptions. One good point it makes is that an IDE that saves your working state between sessions is invaluable: you don't have to remember which files to load or exactly what you were doing when you last stopped work on some code.

Materials

Code to build and launch URLs to get stock price data from Yahoo Finance: File:BuildURL.ijs