Guides/Unicode

From J Wiki

J has essentially complete support for unicode in code development and applications. The only minor limitation is that identifiers used in programming must be in 7-bit ascii, but this does not affect the use of unicode in applications. For example:

   a=. '沒有問題'        NB. assign unicode text to a

   沒有=. 1 2 3         NB. identifiers must be 7-bit ascii
|spelling error

Literal text is assumed to be in utf8 format. J also has a 2-byte unicode datatype, and the verb u: converts back and forth. Both representations can be useful when programming, so take care to ensure the right datatype is being used.
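
As a minimal sketch of that round trip, the session below uses the dyadic forms 7 u: (convert to unicode if necessary) and 8 u: (convert to utf8). The names t and w are only illustrative. Each of the three Cyrillic letters takes two bytes in utf8, so the literal has six items and the unicode version has three:

   t=. 'Без'              NB. literal text, stored as utf8 bytes
   #t                     NB. 3 letters, 2 bytes each
6
   w=. 7 u: t             NB. convert to 2-byte unicode
   #w                     NB. one item per character
3
   t -: 8 u: w            NB. converting back gives the original bytes
1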

utf8 is used in:

window driver interface
file names in the 1!:x family
plot package interface
regular expressions (pcre)
*c arguments of dll calls

2-byte unicode is used in:

manipulation of character arrays
*w arguments of dll calls

Standard utilities include:

utf8       convert to utf8
ucp        convert to unicode datatype, if necessary (cp = code point)
uucp       convert char or utf8 to wchar
ucpcount   count of code points (glyphs or characters)
datatype   data type of a noun

The name a defined above holds literal text, which is therefore assumed to be utf8. More examples:

   a
沒有問題

   datatype a     NB. a is type literal
literal

   #a             NB. the count of a is the count of its utf8 representation
12

   a. i. a        NB. bytes in the utf8 representation
230 178 146 230 156 137 229 149 143 233 161 140

   b=. ucp a      NB. b is a converted to 2-byte unicode

   b              NB. b displays the same as a
沒有問題

   #b             NB. the count of b is the number of characters
4

   datatype b     NB. b is type unicode
unicode

   a -: utf8 b    NB. utf8 converts b back to a
1
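
The remaining utilities can be tried in the same way. The following is a small sketch rather than a complete tour: it repeats the definition of a so that it can be run on its own, and uses 3 u: to show the integer code points. The last two lines illustrate the "if necessary" in the description of ucp: plain ascii text is left as literal by ucp, whereas uucp always returns the unicode datatype.

   a=. '沒有問題'
   ucpcount a             NB. number of characters, without converting a first
4
   3 u: ucp a             NB. integer code points of the four characters
27794 26377 21839 38988
   datatype ucp 'abc'     NB. pure ascii needs no conversion
literal
   datatype uucp 'abc'    NB. uucp always converts
unicode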

Scripts

Script cp2utf8 converts plain text files in codepages to utf8.
Script ufread reads unicode text files in various formats.

Renaming Unicode Files

We will define a verb that calls the win32 API function MoveFileW, converting both file names to 2-byte unicode first:

NB.*MoveFile v move file, e.g. from MoveFile to
MoveFile=: 'kernel32 MoveFileW > i *w *w' cd ;&uucp

For testing, we first create a file whose name is in one unicode range:

   load'files dir'
   'test' fwrite 'Test - 沒有問題'            NB. create a file
4
   fread 'Test - 沒有問題'
test
   0 0{:: 1!:0]'Test -*'                     NB. dir find
Test - 沒有問題

Then we rename it to a name in another range and check that it can be read by its new name:

   'Test - 沒有問題' MoveFile 'Test - Без проблем'    NB. rename
1
   fread 'Test - Без проблем'
test
   0 0{:: 1!:0]'Test -*'                     NB. dir find
Test - Без проблем

Links

Guides/UnicodeGettingStarted - notes on using unicode
Vocabulary entry for u: - definition of verb u:
Unicode Test Drive - Oleg's notes on unicode
UTF-8 and Unicode Standards - good background reading