NYCJUG/FunWithMultibytes

From J Wiki

Fun with Chinese Characters in J and Emacs

Recently, I received an e-mail with embedded Chinese characters. I was pleased to see that they are handled correctly in the emacs editor, which is my environment for running jconsole. However, these characters are necessarily stored in a multi-byte format (UTF-8 here), which can be confusing if not accounted for properly.

   chns=. 0 : 0
在 2012年3月8日 下午1:00,Devon McCormick <devon_mcc…@spcapitaliq.com>写道:
)

   $chns
92
   3{.chns
在
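
The 92 above counts bytes, not characters, and `3{.chns` yields exactly one character only because 在 happens to occupy three bytes in UTF-8. Here is a minimal sketch of the difference, assuming the session hands UTF-8 text to J (as jconsole under emacs does here). It uses the foreign `7 u:` to convert UTF-8 bytes to wide characters; the name `zai` is only for illustration.

```j
zai=. '在'          NB. one Chinese character, entered as UTF-8
# zai               NB. 3: the number of bytes
# 7 u: zai          NB. 1: the number of characters once converted to wide chars
a. i. zai           NB. 229 156 168: the three UTF-8 bytes
```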

Attempt to get characters before the comma.

   (]{.~','i.~]) chns
在 2012年3月8日 下午1:00,Devon McCormick <devon_mcc…@spcapitaliq.com>写道:

Huh? Did I build the tacit expression correctly?

   13 : 'y{.~y i. '','''
] {.~ ',' i.~ ]

This looks correct, so maybe the comma isn't an ASCII comma. Let's take everything up to a character that certainly is plain ASCII instead: the “D” of “Devon”.

   'D' (] {.~ i.~) chns
在 2012年3月8日 下午1:00,
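
Why did the comma search return the whole line while the search for “D” worked? When `i.` does not find an item, it returns the length of the list searched, and `{.` with that count takes every item. A small illustration with a made-up string `s`:

```j
s=. 'abc'
s i. 'b'            NB. 1: found at index 1
s i. 'z'            NB. 3: not found, so the result is #s
(s i. 'z') {. s     NB. abc: taking #s items returns the whole list
```

In the session above, `chns i. ','` would therefore be 92, the same as `$chns`, which suggests that no ASCII comma occurs anywhere in the text.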

Here is the multi-byte funniness: if we look at only the last one or two bytes before the first "D", the display isn't too helpful, because emacs shows the incomplete sequence as octal escapes. However, when we take a complete multi-byte unit (in this case, three bytes), emacs displays the comma character we're expecting.

   _1{.'D' (] {.~ i.~) chns
\214

   _2{.'D' (] {.~ i.~) chns
\274\214

   _3{.'D' (] {.~ i.~) chns
,

The numeric values of the three bytes comprising this Chinese comma:

   a. i. _3{.'D' (] {.~ i.~) chns
239 188 140
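
These three bytes are the UTF-8 encoding of U+FF0C, the fullwidth comma used in Chinese text. Below is a hedged sketch of two ways to cut a string at that comma: search the bytes for the three-byte sequence with `E.`, or convert to wide characters with `7 u:` first so that each character is a single item. The string `t` is a short stand-in, not the e-mail line above, and the names `fw`, `t` and `w` are only for illustration.

```j
fw=. 239 188 140 { a.          NB. the fullwidth comma as three UTF-8 bytes
3 u: 7 u: fw                   NB. 65292, i.e. code point U+FF0C
t=. '1:00', fw, 'Devon'        NB. stand-in text containing the comma
t {.~ 1 i.~ fw E. t            NB. 1:00  (byte-level search for the sequence)
w=. 7 u: t                     NB. wide characters: one item per character
8 u: w {.~ w i. 7 u: fw        NB. 1:00  (character-level search, back to UTF-8)
```

The byte-level version leaves the data untouched, while the wide-character version lets ordinary verbs such as `i.` and `{.` work on whole characters, at the cost of converting back with `8 u:` afterwards.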