NYCJUG/2023-03-14

From J Wiki

Beginner's Regatta

We look at how to extract a sub-string from a larger string.

Extracting Sub-strings

There are a couple of common operations that are useful for extracting particular strings from text. In my own set of utilities, I call these two verbs takeUpTo and dropUpTo.

These are defined this way:

takeUpTo=: ]{.~]i.[     NB.* takeUpTo: take part of string y up to 1st occurrence of character x.
dropUpTo=: ]}.~]i.[     NB.* dropUpTo: drop part of string y up to 1st occurrence of character x.

This is how they work:

   ':' takeUpTo 'title: The Sun Also Rises'
title
   ':' dropUpTo 'title: The Sun Also Rises'
: The Sun Also Rises

How These Work

It should be apparent that these two definitions differ by only a single character: the left or right curly brace which distinguishes take from drop. In both cases, we look up the first occurrence of the character x in the string y with ]i.[. We then pass this index as the left argument of {. or }., using ~ to swap the left and right arguments.
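As a minimal sketch of the same logic, the two verbs can also be written explicitly, which makes the role of the index visible. The names takeUpToX and dropUpToX are illustrative and not part of the original utilities. The sketch also shows what happens when the character is absent: y i. x then returns the length of y, so take returns the whole string and drop returns an empty one.

```j
NB. The tacit verbs from the text.
takeUpTo=: ]{.~]i.[
dropUpTo=: ]}.~]i.[

NB. Explicit equivalents (hypothetical names): index first, then take or drop.
takeUpToX=: 4 : '(y i. x) {. y'
dropUpToX=: 4 : '(y i. x) }. y'

s=: 'title: The Sun Also Rises'
echo ':' takeUpToX s        NB. title
echo ':' dropUpToX s        NB. : The Sun Also Rises
echo '#' takeUpTo s         NB. whole string, because '#' does not occur in s
echo # '#' dropUpTo s       NB. 0, nothing is left after dropping #s characters
```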

Show and Tell

Docker Image of J

There is a Docker image of J available. Docker is an open platform for packaging an application together with its OS-level dependencies into a container that runs in isolation on a host system. As the comment below mentions, this image is set up for AMD64 architecture; it is also a Linux version of J.

Jlang docker image

[From https://www.reddit.com/r/apljk/comments/11r5ajr/jlang_docker_image/]

hub.docker.com/r/nesachirou/jlang/

Just found it online.

When running

docker run -it -v $(pwd):/data --rm nesachirou/jlang

after pulling the image, jconsole fires up.

Kudos to its creator.


KaiAnteGeia OP · 3 hr. ago

Note: assumes amd64 architecture. So you can't run it in a machine with different architecture.

Counting Duplicate Digits

This problem was inspired by this Reddit post:

r/explainlikeimfive

• Posted by u/howzitgoinowen 27 minutes ago

ELI5: When you get a text to verify your identity online, why are there usually always double numbers in the code?

This is random, and maybe it's just me, but when I get one of those 6-digit text codes when I need to verify my identity online, almost every time there is a double number, and often two sets of double numbers. I know some of this has to do with probability: The codes are usually 6 digits and with only 10 to choose from (0-9) and assuming they're generated randomly, the odds are fairly high that a number will be repeated. But for me, it seems like it's nearly 100% of the time. And very often more than one number is repeated. For example I just got one that was 868155. Two 8s and two 5s. Do I just not know enough about probability, or is my experience out of the norm? Why does this always seem to occur?

This is easy enough to solve in J. First get a list of all six digit numbers as character strings:

   $nn=. ":&.>1e5+i. 9e5
900000

Then count up how many digits are the same as their neighbor in each string:

   ~.cts=. ;([: +/ ] = [: }. 'x' ,~ ])&.>nn
4 3 2 1 0 5

We would prefer these counts in ascending order, so we do the trick of appending them in the order we want:

   ~.cts=. 0 1 2 3 4 5,;([: +/ ] = [: }. 'x' ,~ ])&.>nn
0 1 2 3 4 5
   #/.~cts
531442 295246 65611 7291 406 10

However, these counts include our spurious addition so we have to deduct one from each:

   <:#/.~cts
531441 295245 65610 7290 405 9

Finally, divide these counts by the total number of numeric strings:

   9e5%~531441 295245 65610 7290 405 9
0.59049 0.32805 0.0729 0.0081 0.00045 1e_5

So, there are no duplicate adjacent digits about 59% of the time which means there is at least one duplicate pair about 41% of the time. This is probably high enough for confirmation bias to find support.
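As a cross-check on the enumeration, the same proportions can be sketched in closed form. The assumption here is that each of the five adjacent positions matches its left neighbor with probability 0.1, independently of the others; restricting the leading digit to 1 through 9 does not change this. The names k and p are illustrative.

```j
k=: i. 6                                  NB. possible numbers of adjacent matches: 0 to 5
p=: (k ! 5) * (0.1 ^ k) * 0.9 ^ 5 - k     NB. binomial probabilities with n=5 and p=0.1
echo p                                    NB. should agree with 9e5 %~ counts above
echo 1 - {. p                             NB. chance of at least one adjacent pair
```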

Determining Some Poker Odds

Once again, we revisit the world of poker, specifically the game known as Omaha. This particular study was motivated by a frequently occurring situation where one has the hand two-pair after the flop. This is a problematic hand because, while it stands a good chance of winning in a game with only two to four players, the chance of it winning declines in a larger game.

The project is to figure out at what odds in this situation it is worth staying in versus folding. The TL;DR for a two-person game is that it is worth staying if the pot offers 2.7 to 1 or better. For example, let's say that in a two-person game there is 1000 in the pot, we have two-pair, and our opponent bets 500. In this case there is 1500 in the pot and the required bet is 500 so the pot offers 3 to 1 odds which is greater than 2.7, so we should call. However, if the opponent were to bet 1000, this gives us 2000 in the pot and requires a bet of 1000 so the pot offers only 2 to 1 odds, which is less than 2.7 to 1, so we should fold.
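The arithmetic in this example can be sketched as a pair of small verbs. The names potOdds and worthCalling are hypothetical; the left argument is the pot before the opponent's bet, the right argument is the bet we must call, and the 2.7 threshold is the two-person figure quoted above.

```j
potOdds=: + % ]                  NB. (pot + bet) % bet
worthCalling=: 2.7 <: potOdds    NB. 1 if the pot offers at least 2.7 to 1

echo 1000 potOdds 500            NB. 3
echo 1000 potOdds 1000           NB. 2
echo 1000 worthCalling 500       NB. 1, so call
echo 1000 worthCalling 1000      NB. 0, so fold
```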

Looking at the Initial Code

We started with this function for calculating the odds using our Omaha simulations database. It figures out the odds of the scenario where a player has two-pair after the flop. There are two ways to have two-pair in this situation: one where there is no pair in the flop and another where there is a pair or better in the three cards of the flop. The former of these is a better hand since the latter makes the same pair available to all players.

So, the cards after the flop might look like this:

Player   Hole Cards
   1     4♥  4♦  J♣  2♠
   2     10♣ 7♠  3♣  2♥
   3     10♠ 8♠  5♠  4♠   <- 10 and 5 here
   4     Q♦  9♣  8♣  7♣
   5     K♦  K♣  A♦  3♦
   6     J♦  9♦  6♥  5♣

Flop: A♣ 10♦ 5♦           <- 10 and 5 here

In this case we see that player 3 has two-pair: 10s and 5s.

We look at simulations in the database to identify cases like this and we compare this starting point with the eventual outcome to determine how often the player with the initial two-pair ends up winning the hand assuming no one folds.

Here is the code we first came up with to accomplish this.

tally2PFlopWins=: 3 : 0
   if. 1<#y do. 'y simMax simInc simSt'=. 4{.y
   else. simMax=. >1{,jd 'read max simnum from sim',(":y),'community' [ simInc=. 10000 [ simSt=. 0
   end.
   np=. ":y    NB. Number of players as character   
   ct=. 0 0 0  NB. Count # of winning 2-ps without and with pair in flop, total #2-ps
   simInc=. >(simInc~:0){10000;simInc [ c24=. 2 comb 4
   while. simInc~:0*.simMax><:simSt+simInc do.
       cond=. ' where simnum<',(":simSt+simInc),' and simnum>=',":simSt
       cc=. ,/>1{"1 jd 'read cc from sim',np,'community',cond
       'hc hh ht fhr'=. 1{"1 jd 'read holecard,highhand,handtype,fhrank from sim',np,cond
       'hc hh ht'=. y reshByNP&.>hc;hh;<ht      NB. Reshape to np-row mats
       wh2p=. +./"1 is2Pair"1 ] a. i. (1 2 0 3|:c24{"1/hc),"1 ] 3{."1 cc
       fr=. 1{"2 suitRank"1 a. i."1 ] 3{."1 cc  NB. Flop ranks
       NB. Mask out 2-pairs that depend on a pair on the board.
       ponf=. 2 +./ . <:"1 #/.~"1 ] fr          NB. Pair (at least) on flop
       NB. Which 2-pairs turned out winners (without and with pair in flop)?
       n2pw=. (0 1,ponf)+//. 0 0,+./"1 wh2p*.fhr=0{a.
       ct=. ct+n2pw,+/,wh2p                     NB. Numbers of winners, total # 2-ps
       simInc=. 0>.simInc<.simMax-simSt=. simSt+simInc
    end.
    ct
)

This code is uglier than it needs to be because of all the loop nonsense, which we will address in the next section. However, we will first go over the code to see how it works.

We start by assigning some parameters. The code was originally written to take only the single parameter of how many players are in the games we are analyzing. This proved to limit how we could break up instances of the code, say for multi-threading purposes, so we introduced the other three parameters to allow us to break a large analysis into arbitrary pieces. Here we see that we either take the parameters from the y argument or assign them to default values (as they were originally done).

   if. 1<#y do. 'y simMax simInc simSt'=. 4{.y
   else. simMax=. >1{,jd 'read max simnum from sim',(":y),'community' [ simInc=. 10000 [ simSt=. 0
   end.

There is no good reason to retain the original method like this other than the general consideration of backwards compatibility.

We then initialize a few more parameters, including c24 which is a table of the different ways to combine two items out of four.

   np=. ":y    NB. Number of players as character   
   ct=. 0 0 0  NB. Count # of winning 2-ps without and with pair in flop, total #2-ps
   simInc=. >(simInc~:0){10000;simInc [ c24=. 2 comb 4

The main processing takes place in the subsequent loop. The loop was necessary because the database has 300 million simulations which is far too many to process at once, so we process more reasonably sized pieces, starting with this line:

   while. simInc~:0*.simMax><:simSt+simInc do.

and ending the loop with this somewhat involved update of the loop counter:

       simInc=. 0>.simInc<.simMax-simSt=. simSt+simInc

This code was not originally written like this; the first version was incorrect. This is a common problem with this kind of loop, where we loop only in order to break a problem down into pieces small enough to handle at once.
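To see how the counter update steps through the data, here is a minimal sketch that keeps only the loop bookkeeping and records each block's bounds instead of querying the database. The verb name loopBlocks and the small numbers are illustrative assumptions. The clamp is applied only after the first block, so the sketch assumes the initial block size does not exceed simMax.

```j
loopBlocks=: 3 : 0
  'simMax simInc simSt'=. y
  r=. 0 2$0                        NB. one row per block: start, end (exclusive)
  while. simInc~:0 do.
    r=. r, simSt, simSt+simInc
    NB. Advance the start, then clamp the next block size to what remains.
    simInc=. 0>.simInc<.simMax-simSt=. simSt+simInc
  end.
  r
)

echo loopBlocks 25 10 0
NB.  0 10
NB. 10 20
NB. 20 25
```

The last block is shortened to 5 records and the block size then drops to 0, which ends the loop.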

The remainder of the code extracts some of the simulation columns and figures out which hands contain two-pair after the flop.

       cond=. ' where simnum<',(":simSt+simInc),' and simnum>=',":simSt
       cc=. ,/>1{"1 jd 'read cc from sim',np,'community',cond
       'hc hh ht fhr'=. 1{"1 jd 'read holecard,highhand,handtype,fhrank from sim',np,cond
       'hc hh ht'=. y reshByNP&.>hc;hh;<ht      NB. Reshape to np-row mats
       wh2p=. +./"1 is2Pair"1 ] a. i. (1 2 0 3|:c24{"1/hc),"1 ] 3{."1 cc

We build the conditional clause for the database query to extract a reasonably-sized piece of the simulation data, read in the columns, and locate which hands have two-pair after the flop. Each candidate hand after the flop combines the first three community cards, 3{."1 cc, with one of the possible pairs of a player's four hole cards, like this: (1 2 0 3|:c24{"1/hc),"1 .
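The combination step can be illustrated on a single toy hand. This sketch builds the table of index pairs directly rather than loading comb; it is assumed to give the same six rows as 2 comb 4. The letters stand in for cards, and the names hole, flop and hands are illustrative.

```j
c24=: (#~ </"1) ,/ ,"0/~ i. 4    NB. index pairs with i<j, assumed equal to 2 comb 4
hole=: 'ABCD'                     NB. stand-ins for one player's four hole cards
flop=: 'xyz'                      NB. stand-ins for the three flop cards
hands=: (c24 { hole) ,"1 flop     NB. six rows: two hole cards followed by the flop
echo hands
NB. ABxyz
NB. ACxyz
NB. ADxyz
NB. BCxyz
NB. BDxyz
NB. CDxyz
```

Each row is one five-card hand to test for two-pair, which is why the original code reduces the test results with +./"1 to ask whether any of a player's six combinations qualifies.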

We then distinguish between the two kinds of two-pair possible, so we can calculate statistics for each: a pair in a player's hole cards combined with a pair in the flop, or two cards in a player's hand matching two cards in the flop.

Finally, we separate the two kinds of two-pairs and look at which hands ended up winning using fhr which is the column fhrank extracted from the database, then add the winners to the tally for each kind of two-pair and the total number of hands considered.

       fr=. 1{"2 suitRank"1 a. i."1 ] 3{."1 cc  NB. Flop ranks
       NB. Mask out 2-pairs that depend on a pair on the board.
       ponf=. 2 +./ . <:"1 #/.~"1 ] fr          NB. Pair (at least) on flop
       NB. Which 2-pairs turned out winners (without and with pair in flop)?
       n2pw=. (0 1,ponf)+//. 0 0,+./"1 wh2p*.fhr=0{a.
       ct=. ct+n2pw,+/,wh2p                     NB. Numbers of winners, total # 2-ps

Getting Rid of the Loop

This code appeared to do the job but has that unnecessary loop. The non-looping version is neater and avoids the unnecessary complexity of ensuring that the loop counter combined with the block size does not exceed the maximum number of items specified.

NB.* tally2PFW: more functional version to tally 2-pairs-on-flop wins,
NB. broken down by type of 2-p, w/ totals
tally2PFW=: 3 : 0
   'np simMin simMax'=. 3{.,y
   assert. (np>:2) *. np<:11     NB. 2 to 11 players per game
   assert. simMin<simMax
   n=. ":np
   NB. Count number of winning 2-ps without and with pair in flop, total #2-ps
   ct=. 0 0 0 [ c24=. 2 comb 4
   cond=. ' where simnum>=',(":simMin),' and simnum<',":simMax
   cc=. ,/>1{"1 jd 'read cc from sim',n,'community',cond
   'hc hh ht fhr'=. 1{"1 jd 'read holecard,highhand,handtype,fhrank from sim',n,cond
   'hc hh ht'=. np reshByNP&.>hc;hh;<ht     NB. Reshape to np-row mats
   wh2p=. +./"1 is2Pair"1 ] a. i. (1 2 0 3|:c24{"1/hc),"1 ] 3{."1 cc
   fr=. 1{"2 suitRank"1 a. i."1 ] 3{."1 cc  NB. Flop ranks
   NB. Mask out 2-pairs that depend on a pair on the board.
   ponf=. 2 +./ . <:"1 #/.~"1 ] fr          NB. Pair (at least) on flop
   NB. Which 2-pairs turned out winners (without and with pair in flop)?
   n2pw=. (0 1,ponf)+//. 0 0,+./"1 wh2p*.fhr=0{a.
   ct=. ct+n2pw,+/,wh2p                     NB. Numbers of winners, total # 2-ps
NB.EG ct6=. tally2PFW 6 0 350e6             NB. Don't try this at home.    
)

The first change we see is that we no longer accommodate ambiguous arguments: we expect three numbers for the number of players and the starting and ending indexes (from the record counter in the database). Instead of explicitly specifying a block size for the amount of data on which we will work, it is implicit as the difference between the start and end indexes.

We have also added two assertions just to be thorough. Other than that, the code is the same as what is in the loop from the earlier function. Essentially we have removed the looping logic from within the function, pushing it up to the invocation level.

This allows us to use the function more easily like this:

   6!:2 'ct6=. tally2PFW"1]6,.2]\1e6*31+i.20'
2757.44

The argument 6,.2]\1e6*31+i.20 gives us blocks of data small enough to be processed, without the complications of a loop. Because each block starts where the previous one ends, the blocks completely cover the range we want, with no overlaps or gaps. It looks like this:

   6,.2]\1e6*31+i.20
6 31000000 32000000
6 32000000 33000000
6 33000000 34000000
...
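The same construction can be checked on a small scale. This sketch uses toy boundaries of 0, 10, 20 and 30; the names args and contiguous are illustrative.

```j
args=: 6 ,. 2 ]\ 10 * i. 4      NB. players, start, end for three toy blocks
echo args
NB. 6  0 10
NB. 6 10 20
NB. 6 20 30

NB. Every block after the first starts exactly where the previous one ended.
contiguous=: (}. 1 {"1 args) -: }: 2 {"1 args
echo contiguous                 NB. 1
```

Since 2]\ pairs up successive boundaries, n boundaries always yield n-1 blocks, so the 20 boundaries in the call above produce 19 blocks.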

The result gives us the counts and totals for each block:

   ct6
161696 95922 1077704
161502 95853 1077602
160959 95779 1079931
...

Sample Results

We can use results like this to calculate the odds ratio we are seeking:

   (}:%{:)"1 +/ct6    NB. Ratio of winning 2-pairs to total 2-pairs
0.149593 0.0890402    
   %(}:%{:)"1 +/ct6   NB. Inverting these ratios gives us odds for both kinds of 2-pair.
6.68483 11.2309

Also, running the simulation across contiguous blocks like this allows us to extract statistics about the variation of the results. The usus verb is my own statistical summary which returns the minimum, the maximum, the mean, and the standard deviation for each column of the matrix argument.

   (2&{%{:)"1 usus 2{."1 ct6   NB. Mean divided by standard deviation for each kind of 2-pair
440.763 511.526

The coefficient of variation is the ratio of the standard deviation to the mean; the values above are its reciprocal, the ratio of the mean to the standard deviation. Either way, it gives us an idea of how precise our results are. The above ratios of about 400 and 500 show that the deviation is a small fraction of the mean (roughly 0.2%), which suggests our results should be reliable to perhaps two or three digits.
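Since usus is a personal utility, here is a minimal stand-in sketch for the two statistics used, applied to just the three sample rows of ct6 shown earlier. The verbs mean, sd and cv are assumptions written for this illustration, and sd uses the sample (n-1) form, which may differ from what usus computes.

```j
mean=: +/ % #                                  NB. column means of a matrix
sd=: [: %: ([: +/ [: *: ] -"1 mean) % <:@#     NB. sample standard deviation per column
cv=: sd % mean                                 NB. conventional coefficient of variation

NB. The three sample rows of ct6 shown in the text.
ct3=: 3 3 $ 161696 95922 1077704  161502 95853 1077602  160959 95779 1079931
echo cv 2 {."1 ct3       NB. standard deviation as a fraction of the mean
echo % cv 2 {."1 ct3     NB. mean divided by standard deviation
```

With only three blocks the figures will not match those from the full run of 19, but they should be of the same order.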

Advanced Topics

We consider how to extend string extraction to use more than a single character for targeting a sub-string of interest.

Multi-character String Targeting

In the Beginner's Regatta section we looked at simple code to help extract sub-strings from a larger string based on a single character. These two verbs work well when we can distinguish our sub-strings with a single character. However, in any large amount of text a single character may appear many times other than in the section of interest, leading to spurious results. In this case we may be able to use a multi-character search string. How might we do that?

We know that J uses the find matches verb E. to find matching substrings much the way element of (e.) finds a single item in a larger array. Unfortunately the argument order is reversed between these two similar verbs. That is, the result of e. has the shape of the left argument whereas the result of E. has the shape of the right argument.

For example:

   'string' E. 'sub-string'
0 0 0 0 1 0 0 0 0 0
   'string' E.~ 'sub-string'
0 0 0 0 0 0
   1 e. i.5
1
   1 e.~ i.5
0 1 0 0 0

There is probably good reason for this but I get confused by it.

Let's distinguish these more general versions from their single-character analogs by naming them differently.

   takeUntil=: ] {.~ [: I. E.
   dropUntil=: ] }.~ [: I. E.

Let's test them:

   ':' takeUntil 'title: The Sun Also Rises'
title
   ':' dropUntil 'title: The Sun Also Rises'
: The Sun Also Rises
   'The' takeUntil 'title: The Sun Also Rises'
title: 
   'The' dropUntil 'title: The Sun Also Rises'
The Sun Also Rises

A Problem

However, we run into a problem if there are multiple occurrences of the target string since E. returns a 1 for each one.


   'Hey' takeUntil 'Na Na Hey Hey'
|length error in takeUntil, executing dyad {.~
|takeUntil[:0]

   'Hey' dropUntil 'Na Na Hey Hey'
|length error in dropUntil, executing dyad }.~
|dropUntil[:0]

This is easy enough to fix if we are willing to embrace the semantics of the original definition and key on only the first occurrence of a target string.

   takeUntil=: ] {.~ [: {. [: I. E.
   dropUntil=: ] }.~ [: {. [: I. E.
   'Hey' takeUntil 'Na Na Hey Hey'
Na Na 
   'Hey' dropUntil 'Na Na Hey Hey'
Hey Hey

The modification inserts the monadic use of {. (head), written [: {., to keep only the first index. The cap ([:) token is necessary in a tacit definition to signal the monadic use of a verb; it caps the optional left argument so the interpreter knows which ambivalent case to invoke.
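One difference from the single-character verbs is worth trying out: when the target does not occur at all, I. returns an empty list and its head is the fill value 0, so takeUntil returns an empty string where takeUpTo would return the whole string. The sketch below shows this and one possible alternative that looks up the first 1 in the E. result with i., which yields the length of the string when there is no match. The names takeUntil1 and dropUntil1 are hypothetical.

```j
takeUntil=: ] {.~ [: {. [: I. E.
dropUntil=: ] }.~ [: {. [: I. E.
s=: 'Na Na Hey Hey'
echo # 'xyz' takeUntil s        NB. 0, no match so nothing is taken
echo 'xyz' dropUntil s          NB. whole string, nothing is dropped

NB. Alternative (hypothetical names): index of the first 1, or #y if none.
takeUntil1=: ] {.~ (1 i.~ E.)
dropUntil1=: ] }.~ (1 i.~ E.)
echo 'Hey' takeUntil1 s         NB. Na Na  (with its trailing space)
echo 'Hey' dropUntil1 s         NB. Hey Hey
echo 'xyz' takeUntil1 s         NB. whole string, matching takeUpTo's behavior
echo # 'xyz' dropUntil1 s       NB. 0
```

Which behavior is preferable depends on whether a missing target should mean "take everything" or "take nothing".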

An Enhancement

These simple extractions are sometimes insufficient by themselves for common tasks we want to accomplish. Consider the following where we use both our previously defined verbs to extract a string in the middle of a longer string.

   '.' takeUntil 'content:' dropUntil 'some HTML stuff...content: content to get.  etc.'
content: content to get

In a case like this, we probably want to exclude the target string itself as it merely marks the beginning of what we want. This leads us to this variant of dropUntil:

   dropThru=: ] }.~ ([: # [) + [: {. [: I. E.
   '.' takeUntil 'content:' dropThru 'some HTML stuff...content: content to get.  etc.'
 content to get

Here we see we have inserted ([: # [) + to also drop the target string based on its length. We could make a similar modification to takeUntil if we want to include the target string at the end of what we are taking.
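A minimal sketch of that corresponding variant follows. The name takeThru is an assumption chosen to mirror dropThru; it adds the length of the target to the index of its first occurrence before taking.

```j
dropThru=: ] }.~ ([: # [) + [: {. [: I. E.
takeThru=: ] {.~ ([: # [) + [: {. [: I. E.    NB. hypothetical counterpart of dropThru

echo ':' takeThru 'title: The Sun Also Rises'      NB. title:
echo 'The' takeThru 'title: The Sun Also Rises'    NB. title: The
echo '.' takeThru 'content:' dropThru 'some HTML stuff...content: content to get.  etc.'
```

The last line should return the extracted text with its leading space and closing period. Like dropThru, this sketch assumes the target occurs at least once.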

Learning and Teaching J

Arthur's Assertion

Arthur Whitney has been active on the shakti mailing list and recently made this interesting assertion:

​From: Arthur Whitney <a@shakti.com>
Date: Tue, Mar 14, 2023 at 10:15 AM
Subject: [shakti] Re: python k apl
To: k <k@k.topicbox.com>

python and k(and apl) are very similar. not surprising. it's all data processing. perhaps 80%(?) overlap in expressions.

same tokenization. same punctuation. very similar operators. the basic grammar the same:

v:x*-y

in all languages assignment and unary(prefix) are left-of-right. (execute right to left. no choice.) long infix expressions may differ since sql/c/python/javascript/.. have [differing]precedence rules.

in k(1992) and apl(1958) map-reduce and other wonderful things are primitive.

  punc   infix
p .;()[] =+-*/%&|<> ~         in [map]       unary -~                          other << >> and or not
k .;()[] :+-*%!&|<>=~,#_^@?$. in '/\  +\ +/x unary -~% #!,|+<>_ &^=? @$. oxyel other /\_ g' [n]f/ [n]f\ c/ c\
a  ;()[] ←+-×÷|^∨<>=≠,↑↓  ⍳   ∊  ¨/\ ∘.+ +.× unary -~% ⍴⍳,⍉⌽⍋⍒⌊          ○○○*⍟ other ⊥⊤? x[⍴⍉⌽⌊≤≥]y

k-aggrs count first last min/all max/any sum avg var dev
p-aggrs len min all max any sum

k has more classes, more operator-families and many more primitive types and operators than python and apl. so k code can be short.