Guides/UnicodeGettingStarted


Getting Started with Unicode in J

See also: Guides/Unicode and Voc(u:)

This is a quick page of notes, mainly as a reference for answering my own questions.

Maybe it will help a J beginner with the task of looking up, choosing and using an arbitrary Unicode glyph. Not an easy task for a complete beginner, using only the existing documentation.

So turn your IQ philco down to 60 and let's go...

TIP: on the first reading, skip over small stuff like this.


Domino: a sample Unicode letter to play with

Here's a sample Unicode letter: ⌹

...Yes, it's Domino from the APL character set. But in J602 it could be any item from a collection of 65536 code points.

We prefer to call it a letter, not a character, for reasons which will soon become clear.

Not every code point has an associated letter. And even if it has, the font you're using may not have it.

J602 supports only the plane-zero codespace (Range: 0–FFFF).
There are now 17 possible planes in the whole Unicode codespace. Plane-1 starts with the Linear B Syllabary (Range: 10000–1007F).

If you've got the font: APL385 Unicode installed, you'll see the letter -- a split square with two dots -- otherwise you may not.

If you don't see the letter properly, then refer to the last section of the APL to J Phrasebook and troubleshoot from there.
(For even greater detail, see: Typesetting/APL Fonts)

Copy-paste Domino (⌹) into the j602 session window (IJX) and embed it in the sentence:

   z=: 'abc⌹e'
   $z
7
   NB. ...not 5, as you might expect but 7
   z i. 'ce'
2 6
   z i. '⌹'
3 4 5

You may ask yourself: WTF is going on...?

Domino (⌹) happens to be encoded in the J session window in the UTF-8 standard.

UTF-8 is the official name of the encoding standard that embeds a non-ascii Unicode letter within a string of ascii (7-bit) chars as a sequence of 2 to 4 bytes (2 or 3 bytes for any plane-zero letter).

Inside the literal array: z, Domino actually consists of not 1 but 3 consecutive atoms: 3 4 5{z . The ascii letters either side of it (viz 'c' and 'e') are the atoms 2{z and 6{z .

   datatype z
literal
   datatype '⌹'
literal
   $ '⌹'
3

This shows that J thinks the noun '⌹' is simply a 3-byte string.
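
To see which 3 bytes these are, look each atom up in the noun a. (J's list of all 256 byte values). The sketch below assumes, as everywhere on this page, that the pasted letter reaches J as UTF-8; the names bytes, payload and codept are made up for illustration. Subtracting the UTF-8 marker bits (224 from the first byte, 128 from each following byte) and reading what is left as base-64 digits recovers the number of the letter:

```j
bytes=: a. i. '⌹'                  NB. 226 140 185 (hex E2 8C B9)
payload=: bytes - 224 128 128      NB. 2 12 57 : strip the UTF-8 marker bits
codept=: 64 #. payload             NB. 9017 : the same number as 16b2339
```

This is only a hand decoding of one 3-byte letter, not a general UTF-8 decoder.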

If you need to tabulate, or index, Unicode letters within an orderly array, you must convert the whole string z to a new datatype: 'unicode'. This is a list of wchars (read: "wide characters"), each of which is 16 bits wide:

   ]zwide=: 7 u: z
abc⌹e
   $zwide
5
   zwide i. 'ce'
2 4
   ]Domino=: 3{zwide
⌹
   zwide i. Domino
3
   $$Domino
0
   datatype Domino
unicode

This shows that noun: Domino is scalar, a single atom of datatype: 'unicode' .

You now have an orderly vector zwide consisting of 5 wchars, which behave themselves under Shape Of ($) and Index Of (i.).

Yes, the ascii letters: abce have become 16-bit wchars too!
As you'll recall, the whole of an unboxed array must have the same datatype -- in this case 'unicode' .


DEFINITIONS: glyph, grapheme, char, codespace, code point

Now it's time for a few definitions, to help avoid confusion in the terms we use.

An extensive glossary including these terms is available at [1].

But our definitions here will be more informal. They're intended to explain how a Unicode code point can have two different glyphs,

e.g. A (upright) and A (italic) are two different glyphs

But both have the same code point U+0041

and how a given glyph can be shared by two separate code points,

e.g. µ (U+00B5) and μ (U+03BC).

These glyphs look the same in many fonts. But not, for instance, in 'Courier'.

Glyph

The image you see on the screen, typically a letter of the Roman alphabet (or the alphabet / syllabary, of some other language).

  • Here's an ascii glyph: A
  • Here's another: @
  • Here's an APL glyph: ⌹
  • Here's a Chinese glyph: 有

Note that if you set the letter A in italics: A -- the screen displays a different glyph!

Grapheme

What most people think of as a "character".

[2] defines grapheme as "a minimally distinctive unit of writing in the context of a particular writing system."

The point is that a grapheme is defined with respect to a writing system. Now SI-units and Greek are two different writing systems. Therefore µ (U+00B5) and μ (U+03BC) are different graphemes.

Char

A term taken from other programming languages, notably C, to describe a particular data type.

Thus, C programmers will write char in analogy with int or float to specify that a variable has this data type.

For example:

/* Sample declarations in C */
    char *p1, *p2, *endwrd;
    char t;
    int swaps;

In C, a value of type char is (or was originally) encoded in a single byte of memory, giving 256 possible distinct variants, numbered 0 to 255. Programmers also conventionally number these variants in hexadecimal numerals: 00 to FF.

In J the hexadecimal (hex) numbers: 00 01 02 ... FF can be written: 16b00 16b01 16b02 ... 16bff

In J, there's a corresponding datatype called: literal. All 256 possible values of datatype literal are listed in the noun: (a.).

   $ a.
256
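
A quick check that ties the last two points together: a 16b numeral is just another way of writing an ordinary integer, and that integer indexes the matching byte in a. (a small sketch, nothing here beyond standard J primitives):

```j
16b00 16b41 16bff        NB. 0 65 255
16b41 { a.               NB. A : the literal numbered hex 41
a. i. 'A'                NB. 65 : and back again
```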

We shall use char or character solely in this limited sense: to mean the binary encoding in memory of char (C) or literal (J) data. We shall not use it for Unicode letters, in any of their many representations.

But we need a word to use in an informal sense, to refer to the "thing" that exists inside the computer as an encoded byte, or an encoded string of bytes, and which appears on the screen as a single glyph. We shall use the word letter for this "thing".

If forced to define what we mean by letter, we shall say it depends on context. Mostly it will mean a computerized grapheme.

WARNING: The unicode.org literature (which we shall quote below) appears to use the word "character" in an informal sense, to mean what we do by letter. That's why we prefer to avoid the word "character", because IT people use it to mean different things without realizing it -- a source of great confusion.

Codespace

The range of integers which number all possible encoded variants of a given datatype.

(or "character set" in the old parlance.)

 NAME                 RANGE (dec)    RANGE (hex)        RANGE (J's hex)
 ====                 ===========    ===========        ===============
 ascii standard       0 to 127       00 to 7F           16b00 to 16b7f
 char (C)             0 to 255       00 to FF           16b00 to 16bff
 literal (J)          0 to 255       00 to FF           16b00 to 16bff
 unicode (J)          0 to 65535     0000 to FFFF       16b0000 to 16bffff
 Unicode standard*    0 to 1114111   000000 to 10FFFF   16b000000 to 16b10ffff

* NOT supported by j602.

Here's how Unicode.org defines codespace [3]:

  • (1) A range of numerical values available for encoding characters.
  • (2) For the Unicode Standard, a range of integers from 0 to 10FFFF.
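
The upper limits in the table above can be checked in J itself. This is only a bit of arithmetic, with made-up names top and planesize: the Unicode codespace ends at 10FFFF, and cutting it into planes of 65536 code points gives the 17 planes mentioned earlier.

```j
top=: 16b10ffff                NB. 1114111 : last code point of the Unicode standard
planesize=: 16b10000           NB. 65536 : code points in one plane
(>: top) % planesize           NB. 17 : number of planes
<. 16b2339 % planesize         NB. 0 : Domino lives in plane zero
<. 16b10000 % planesize        NB. 1 : start of the Linear B Syllabary, plane 1
```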

Code Point

A number in a given codespace.

Here's how Unicode.org defines code point [4]:

  • (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF.
    Not all code points are assigned to encoded characters.
  • (2) A value, or position, for a character, in any coded character set.

NOTE: Unicode code points are conventionally written: U+2339 -- the numeral always being in hex.
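
If you want J to write that notation for you, here is a minimal sketch. The verb uplus is a hypothetical helper made up for this page, not part of J or its standard library, and it assumes a plane-zero code point (4 hex digits):

```j
NB. uplus: hypothetical helper, formats one plane-zero code point as U+XXXX
uplus=: ('U+' , '0123456789ABCDEF' {~ 16 16 16 16 #: ])"0
uplus 9017               NB. U+2339
uplus 65                 NB. U+0041 : padded to 4 hex digits
uplus 9017 9054          NB. a 2-row table: U+2339 and U+235E
```

The rank ("0) makes it work letter by letter, so a list of code points gives one row per code point.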


Choosing a sister letter of Domino

Now consider this task: you've pasted a given Unicode letter from some given Unicode-compliant software or document. You don't know that it's from APL, but you like the glyph and you want to use another glyph from the same collection. Perhaps you've heard of another letter called Quote Quad, looking like this: ⍞

You can look up the code point on the Unicode.org website and download its table of letters and glyphs in PDF-form.

A good place to start if you haven't the foggiest idea where to find your letter is here: http://www.unicode.org/standard/where/

The main code charts are here: http://www.unicode.org/charts/index.html

The page is titled: Unicode 6.0 Character Code Charts.

If you know it's an APL letter you want, then simply search the page for "APL". (You'll find it here: http://www.unicode.org/charts/PDF/U2300.pdf)

You don't know it is an APL letter? Then you must find its code point and look it up in the most general way.


Finding the code point of a pasted letter

   3 u: 7 u: '⌹'  NB. The code point (int), given the letter
9017
   require'convert'
   hfd 9017       NB. hex from dec: 9017 to look up in unicode.org
2339
   NB. Let's just confirm that hex numeral is correct...
   u: 16b2339
⌹

Looking up a letter by its code point

At the top of the Code Charts page, http://www.unicode.org/charts/index.html there's a search box labelled: Find chart by code:

You need to type a hex numeral in that box (...the code point), viz the one you've just computed: 2339.

This reports to you:

Search Results for U+2339

    The most current code chart containing U+2339 is:

        http://www.unicode.org/charts/PDF/U2300.pdf (0.3 MB)

...and the link downloads the relevant table (U2300.pdf)

From this document you can look up Quote Quad and find that its code point is 235E.

Now try out 235E (which in J becomes 16b235e)...

   u: 16b235e
⍞

Unicode support in J

Unicode support in J means that a string can be one of two datatypes:

  • 'literal' -- atoms are old-fashioned 1-byte letters: ascii (7-bit) and superascii (8-bit). J cannot recognise and distinguish single (unicode) letters, which may be 1, 2 or 3 atoms long. Non-ascii letters are coded in the UTF-8 standard.
  • 'unicode' -- all atoms are the new 2-byte wchars. A unicode letter of whatever kind is always a single atom. An array of this datatype is well-behaved under J structure operations, eg From ({), Amend (}), Take ({.), Drop (}.).
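
A short sketch of what "well-behaved" means here, using the structure operations just named on a 'unicode' list. The names w and qq are made up for illustration:

```j
w=: 7 u: 'abc⌹e'            NB. 5 wchars
qq=: u: 16b235e             NB. Quote Quad as a single 'unicode' atom
3 { w                       NB. ⌹      From: one atom, one letter
3 {. w                      NB. abc    Take
3 }. w                      NB. ⌹e     Drop
qq 3} w                     NB. abc⍞e  Amend: swap Domino for Quote Quad
$ qq 3} w                   NB. 5      still 5 atoms
```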

The _z_ locale contains 4 verbs to help you manage 'unicode' arrays. These verbs are defined in stdlib.ijs, hence they are always present:

  • ucp -- converts 'literal' to 'unicode' -- but not if it's ascii-only!
  • uucp -- converts 'literal' to 'unicode' whether or not it contains non-asciis.
  • ucpcount -- reliably counts the ucp or "unicode code point" letters in a string of either datatype.
  • utf8 -- converts 'unicode' to 'literal', turning non-ascii letters into multi-atom substrings.

REMINDER: utf8 is the name of the stdlib verb.
The name of the coding standard is UTF-8.
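
Here is how the first three verbs differ in practice, as a sketch that relies only on the descriptions in the list above:

```j
datatype ucp 'abc'          NB. literal : ascii-only, so ucp leaves it alone
datatype uucp 'abc'         NB. unicode : uucp converts regardless
datatype ucp 'abc⌹e'        NB. unicode : a non-ascii letter forces the conversion
# 'abc⌹e'                   NB. 7 : atoms (bytes) in the 'literal' string
ucpcount 'abc⌹e'            NB. 5 : letters
ucpcount ucp 'abc⌹e'        NB. 5 : same answer for the 'unicode' form
```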

These four verbs, plus Unicode (u:) itself, give you a toolkit for handling Unicode. You'll also find the following verb useful:

cp=: 3 u: 7 u: ]   NB. code-point (decimal) of letter: y

Use uucp to convert a string to 'unicode', do any text manipulation, then use utf8 to convert back to 'literal'.
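
For instance, reversing a string is one manipulation that goes wrong on the raw bytes. A minimal sketch of the round trip, with made-up names s and r:

```j
s=: 'abc⌹e'                 NB. 'literal', 7 atoms
r=: utf8 |. uucp s          NB. reverse the letters, not the bytes
r                           NB. e⌹cba
datatype r                  NB. literal
$ r                         NB. 7
r -: |. s                   NB. 0 : reversing the bytes directly scrambles Domino
```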

Given any code point, use Unicode (u:) to see its letter. This works even for a code point in the ascii range, where you might otherwise use eg (65{a.). The one proviso is that the result always has datatype 'unicode', even for an ascii code point ...

   u: 9017
⌹
   u: 65
A
   datatype u: 65
unicode
   65 { a.
A
   datatype 65 { a.
literal

Superasciis and UTF-8 encoding

Have you ever wondered about those little garbage strings of extended-Latin you sometimes see scattered on badly-designed web-pages? These are instances of UTF-8 encoded text being displayed as if it were ascii-only. You'll notice that the garbage glyphs are either extended-Latin or that collection of non-ascii glyphs like § and © which, right from the early days of the IBM PC, were encoded in single bytes using 8-bit codes above the ascii codespace ("superasciis").

We can force these artifacts to appear by using Unicode (u:) to decode a 'literal' string containing non-ascii letters such as smart-quotes and en-dashes. Unicode (u:), used in this way, will wade through the string byte-by-byte, treating each byte as a binary integer and trying to interpret it as a (unicode) letter, or part of one...

   u: '“Superasciis” – now used by J in ‘UTF-8 encoding’.'
“Superasciis” – now used by J in ‘UTF-8 encoding’.
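
What you actually see on that result line depends on how your session, font or browser renders it. A display-independent check, sketched here on the assumption that the pasted smart-quote reaches J as UTF-8, is to look at the code points instead. Monadic u: turns each byte into one wchar with the same number, so one letter becomes three stray ones, whereas 7 u: decodes the three bytes together:

```j
3 u: u: '“'                 NB. 226 128 156 : three wchars, one per byte
3 u: 7 u: '“'               NB. 8220 : one wchar, U+201C (left double quote)
```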

As we've seen, the old ascii codespace is embedded in the new Unicode codespace, in its lowermost range.

But what about the so-called "superascii" codespace (80 to FF, or 128 to 255)? Formerly this was used on some platforms to encode European language alphabets: Latin-1 Supplement (PDF). Does it still?

The answer is yes: up to a point. Verbs cp and u: will obligingly interpret the superascii codespace for you:

   cp 'Français'
70 114 97 110 231 97 105 115
   u: 70 114 97 110 231 97 105 115
Français
   cp 'ç'
231
   u: 231
ç

Noun (a.) is rather less obliging. The J session now reads superasciis as the bytes of a UTF-8 encoded letter (either the first byte or one of the following bytes), therefore a single isolated "superascii" byte shows: �

� is the glyph which serves as a placeholder for any unknown char

   a. i. 'Français'
70 114 97 110 195 167 97 105 115
   70 114 97 110 195 167 97 105 115 { a.
Français
   a. i. 'ç'
195 167
   195 167 { a.
ç
   195 { a.
�
   167 { a.
�
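
The byte values themselves tell you which role each superascii plays: in UTF-8 the bytes 128 to 191 only ever continue a letter, while a byte of 192 or more starts one. A sketch with made-up names b and cont, counting letters without converting the string:

```j
b=: a. i. 'Français'        NB. 70 114 97 110 195 167 97 105 115
cont=: b e. 128 + i. 64     NB. 0 0 0 0 0 1 0 0 0 : continuation bytes (128 to 191)
+/ -. cont                  NB. 8 : letters = bytes that are not continuation bytes
64 #. 195 167 - 192 128     NB. 231 : the two bytes of ç decoded by hand
```

The last line strips the marker bits of a 2-byte letter (192 and 128) and reads the rest in base 64, giving back the code point that cp reported above.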

Gotchas to watcha

WARNING: if you write '⌹' in the J session or an IJS window, it is always 'literal', ie a string of 3 atoms, not a scalar atom.

It follows that J cannot tell the difference between '⌹' and (,'⌹')

   '⌹' -: (,'⌹')
1

So if you want the code point of '⌹' as a scalar number, not a vector of length 1, be sure to use: {.

   cp '⌹'
9017
   $ cp '⌹'
1
   {. cp '⌹'
9017
   $$ {. cp '⌹'
0

Up-to-date versions of familiar fixed-width fonts like Courier New and APL385 Unicode tend to have most of the glyphs you'll ever need, even APL ones. (But maybe not the character sets of languages that are less well known in the West.) If you don't see the glyph you want, try installing a newer version of the font in question.

WARNING: even in a so-called fixed-width font, the width of the more unusual glyphs may vary.

Try pasting this block into an IJS window:

aaa
AAA
...
⌯⌯⌯
❤❤❤
⌹⌹⌹

In many fixed-width fonts (e.g. 'Courier') the hearts come out wider than the other letters and disrupt the table layout.

A glyph-design failing is often shared across many of the popular freeware fonts. The font foundries that own these fonts have often licensed entire codespaces from each other, rather than go to the trouble of designing, say, a Courier or a Menlo style of the many thousands of glyphs an up-to-date font is expected to show. The mind boggles, of course, at the idea that there must be a distinctive Baskerville or a Comic Sans style of the Arabic letter "ayin", or conversely a Naskh style of the letter A.


A script to explore Unicode letters and their glyphs

File:Cu.ijs

Verb cp displays the code-point(s) of a single Unicode (UTF-8) letter, or a string containing UTF-8 letters.

Verb cu displays the full details of any given letter in both hex and decimal. It also looks up the code table for the letter which can be downloaded from Unicode.org, provided the code table contains mathematical or APL letters.

The letter can be ascii or UTF-8, as displayed in the j602 session.

The hex form is shown in the classic Unicode.org designation, suitable for looking-up at Unicode.org, e.g. ⌹ would be: U+2339.

Download the above script and place it in your "user" folder (say). If the script loads without error, you'll see a list of sample sentences to execute...

   load '~user/cu.ijs'
	TEST: enter:
cpid '81'   NB. identify unicode.org code table of code pt: u+0081
cpid 129    NB. same as: cpid '81'
cp '⌹⍞'     NB. code points of 2 (APL) letters (9017 9054)
cp '⌹'      NB. (1-vect) code point of a single (APL) letter
{. cp '⌹'   NB. (scalar) code point of a single (APL) letter
	TEST: cu -- identify given (copied) letter
cu '⌹'      NB. std details of a single (APL) letter (U+2339)
cu '⌹⍞'     NB. std details of 2 (APL) letters
10 cu '⌹'   NB. std details of letters 10 either side of ⌹
cu 16b2339  NB. same as: cu '⌹'
cu 9017     NB. same as: cu '⌹'
cu 'π'      NB. the letter: π -- in either Greek or Mathematics
cu 16b03c0  NB. same as: cu 'π'
cu 960      NB. same as: cu 'π'
cu '⍺⍵∊⍳⍴'  NB. APL primitives
cu 'αωειρ'  NB. Greek letters
cu '~'      NB. the ascii letter: ~ (=U+007E)
cu 16b007e  NB. same as: cu '~'
cu 126      NB. same as: cu '~'

Executing the sample sentences, you see:

   cpid '81'   NB. identify unicode.org code table of code pt: u+0081
┌────┬───────────┬─────────┬──────────────────────────────────┐
│0081│[U0080.pfd]│0080-00FF│C1 Controls and Latin-1 Supplement│
└────┴───────────┴─────────┴──────────────────────────────────┘

   cpid 129    NB. same as: cpid '81'
┌────┬───────────┬─────────┬──────────────────────────────────┐
│0081│[U0080.pfd]│0080-00FF│C1 Controls and Latin-1 Supplement│
└────┴───────────┴─────────┴──────────────────────────────────┘

   cp '⌹⍞'     NB. code points of 2 (APL) letters
9017 9054

   cp '⌹'      NB. (1-vect) code point of a single (APL) letter
9017

   {. cp '⌹'   NB. (scalar) code point of a single (APL) letter
9017

   cu '⌹'      NB. std details of a single (APL) letter
⌹ U+2339 9017 [U2300.pfd] 2300-23FF Miscellaneous Technical

   cu '⌹⍞'     NB. std details of 2 (APL) letters
⌹ U+2339 9017 [U2300.pfd] 2300-23FF Miscellaneous Technical
⍞ U+235E 9054 [U2300.pfd] 2300-23FF Miscellaneous Technical

   10 cu '⌹'   NB. std details of letters 10 either side of ⌹
⌯ U+232F 9007 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌰ U+2330 9008 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌱ U+2331 9009 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌲ U+2332 9010 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌳ U+2333 9011 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌴ U+2334 9012 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌵ U+2335 9013 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌶ U+2336 9014 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌷ U+2337 9015 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌸ U+2338 9016 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌹ U+2339 9017 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌺ U+233A 9018 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌻ U+233B 9019 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌼ U+233C 9020 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌽ U+233D 9021 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌾ U+233E 9022 [U2300.pfd] 2300-23FF Miscellaneous Technical
⌿ U+233F 9023 [U2300.pfd] 2300-23FF Miscellaneous Technical
⍀ U+2340 9024 [U2300.pfd] 2300-23FF Miscellaneous Technical
⍁ U+2341 9025 [U2300.pfd] 2300-23FF Miscellaneous Technical
⍂ U+2342 9026 [U2300.pfd] 2300-23FF Miscellaneous Technical
⍃ U+2343 9027 [U2300.pfd] 2300-23FF Miscellaneous Technical

   cu 16b2339  NB. same as: cu '⌹'
⌹ U+2339 9017 [U2300.pfd] 2300-23FF Miscellaneous Technical

   cu 9017     NB. same as: cu '⌹'
⌹ U+2339 9017 [U2300.pfd] 2300-23FF Miscellaneous Technical

   cu 'π'      NB. the letter: π -- in either Greek or Mathematics
π U+03C0 960 [U0370.pfd] 0370-03FF Greek and Coptic

   cu 16b03c0  NB. same as: cu 'π'
π U+03C0 960 [U0370.pfd] 0370-03FF Greek and Coptic

   cu 960      NB. same as: cu 'π'
π U+03C0 960 [U0370.pfd] 0370-03FF Greek and Coptic

   cu '⍺⍵∊⍳⍴'  NB. APL primitives
⍺ U+237A 9082 [U2300.pfd] 2300-23FF Miscellaneous Technical
⍵ U+2375 9077 [U2300.pfd] 2300-23FF Miscellaneous Technical
∊ U+220A 8714 [U2200.pfd] 2200-22FF Mathematical Operators
⍳ U+2373 9075 [U2300.pfd] 2300-23FF Miscellaneous Technical
⍴ U+2374 9076 [U2300.pfd] 2300-23FF Miscellaneous Technical

   cu 'αωειρ'  NB. Greek letters
α U+03B1 945 [U0370.pfd] 0370-03FF Greek and Coptic
ω U+03C9 969 [U0370.pfd] 0370-03FF Greek and Coptic
ε U+03B5 949 [U0370.pfd] 0370-03FF Greek and Coptic
ι U+03B9 953 [U0370.pfd] 0370-03FF Greek and Coptic
ρ U+03C1 961 [U0370.pfd] 0370-03FF Greek and Coptic

   cu '~'      NB. the ascii letter: ~ (=U+007E)
~ U+007E 126 [U0000.pfd] 0000-007F ASCII

   cu 16b007e  NB. same as: cu '~'
~ U+007E 126 [U0000.pfd] 0000-007F ASCII

   cu 126      NB. same as: cu '~'
~ U+007E 126 [U0000.pfd] 0000-007F ASCII

Note that APL primitives and the corresponding Greek letters have different code points.

Thus, to identify an unknown Unicode letter from any source (e.g. Wikipedia, or a given PDF document), copy the letter into the Clipboard and paste it between the single-quotes of the expression: cu ''

For example, suppose you see µF (microfarads) in an engineering document. Is it the correct µ or the incorrect μ?

Unlike pi (π) which only occurs once (in the Greek alphabet), mu (µ) occurs twice in the plane-zero codespace (Range: 0000–FFFF).

   cu 'µ'
µ U+00B5 181 [U0080.pfd] 0080-00FF C1 Controls and Latin-1 Supplement
   cu 'μ'
μ U+03BC 956 [U0370.pfd] 0370-03FF Greek and Coptic
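
If you don't have the cu script to hand, the verb cp defined earlier on this page answers the same question. A minimal sketch (cp is repeated here so the lines run on their own):

```j
cp=: 3 u: 7 u: ]            NB. as defined earlier on this page
cp 'µμ'                     NB. 181 956 : micro sign, then Greek small mu
'µ' -: 'μ'                  NB. 0 : different bytes, so J never confuses them
a. i. 'µ'                   NB. 194 181
a. i. 'μ'                   NB. 206 188
```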

-- Ian Clark, 2013-04-14