Vocabulary/UnicodeCodePoint

Unicode Code Point (UCP)

A Unicode Code Point (UCP) is a number in the code space covered by the Unicode standard, which attempts to define a universal character set for computers. The home website for this standard is Unicode.org.

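Originally a UCP was a number from 0 to 65535, i.e. representable by a 16-bit code. Later, further ranges of the same size were added; the Unicode standard calls them planes. The original code space (0 to 65535), which is (i.2^16) in J, is now Plane 0, the Basic Multilingual Plane. This page calls it Frame 0.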

J has two datatypes for UCPs: unicode precision for Frame 0 unicode characters, and unicode4 precision which can store any UCP. Byte-precision characters are still available to store ASCII characters and general bytes for interacting with external hardware and software. We will call a noun with one of the extended precisions a unicode, just as we call a noun with the simple 8-bit precision a byte.

The Unicode.org convention for a UCP is to show it not as a decimal number, e.g. 960 but as a string based on its hexadecimal representation, viz. U+03C0.
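
As a small illustration of the two notations, the sketch below uses J's base-16 constant syntax (16b followed by lowercase hex digits) to turn the hex part of U+03C0 into its decimal value. The name ucp is chosen here purely for illustration.

```j
   ucp=: 16b03c0        NB. hexadecimal constant for the hex part of U+03C0
   ucp                  NB. 960
   ucp = 960            NB. 1 (true)
```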

U+03C0 is the UCP for the symbol pi (π) as you'll find it in most mathematical texts on the web. It also happens to be the Greek letter π -- but this double usage cannot be taken for granted. For instance, it is not true for the engineering symbol µ (U+00B5), which is not the Greek letter μ (U+03BC) but a special character inherited from a legacy 8-bit character set, now called Latin 1.
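
The difference between the two "mu" symbols can be checked in a J session. This is only a sketch: it assumes the two glyphs have been pasted in exactly as they appear above, it relies on 7 u: (described later on this page) to convert the pasted bytes to a unicode, and it uses 3 u: to return the code points as integers. The names micro and mu are illustrative.

```j
   micro=: 3 u: 7 u: 'µ'   NB. 181, i.e. U+00B5 (micro sign, Latin 1)
   mu=: 3 u: 7 u: 'μ'      NB. 956, i.e. U+03BC (Greek small letter mu)
   micro = mu              NB. 0: the two symbols have different UCPs
```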


To look up a given UCP at Unicode.org

Suppose you want to look up the UCP U+03C0 at Unicode.org, to find its glyph, plus the writing system or character set it belongs to.

Go to the webpage titled Unicode 6.3 Character Code Charts. Near the top you'll see a field labelled Find chart by hex code:

Enter the hex code 03C0 (upper- or lowercase will do) and click Go (or press Enter). This will display a choice of links. Click the first, which (currently) downloads a file: U0370.pdf titled Greek and Coptic / Range: 0370–03FF. There you will discover π (U+03C0) in column 03C, row 0.
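
The column and row used in the code chart follow directly from the hex code: the row is the last hex digit and the column is everything before it. A minimal sketch of that split in J, assuming nothing beyond the primitive #: (antibase):

```j
   0 16 #: 16b03c0      NB. 60 0   (960 = 0 + 16 * 60)
   16b03c               NB. 60, so the column is 03C and the row is 0
```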


Using a given UCP in a J noun

Once you've discovered your symbol, and have opened its code chart U0370.pdf, you will find that the glyphs as displayed can be copy/pasted into other documents, including the J session (IJX) or a J window (IJS).

For instance you can paste your copied symbol into a J string to represent the well-known mathematical formula for the circumference of a circle: C = 2 π r

   ] z=: 'C=2πr'
C=2πr
   datatype z  NB. shows the precision of z.  "literal" means the same as "byte"
literal
   $z
6

WARNING: As you see above, z does not automatically become a unicode simply because the symbol π has been pasted into it. Rather it stays as a byte, just as it would if you omitted π.

Notice also that z contains 6 atoms, not 5, as you'd expect from counting the glyphs in the formula. The reason is that π occupies two byte atoms, not one.

   3{.z
C=2
   4{.z
C=2�
   5{.z
C=2π
   6{.z
C=2πr

But how can a non-ASCII (or non-Latin 1) symbol such as π be stored as a list of bytes? The answer is, by encoding the symbol as a sequence of bytes according to the utf-8 standard. The J session, and the IJS window, always use this standard when displaying a unicode symbol.

A utf-8 string is always bytes, not unicode. Each ASCII character resides in just 1 byte, but a UCP outside the ASCII code space occupies from 2 to 4 successive bytes. The symbol π happens to occupy 2 bytes. You can see this clearly if you box each atom of z ...

   <"0 z
+-+-+-+-+-+-+
|C|=|2|�|�|r|
+-+-+-+-+-+-+

Bug: an invalid UTF-8 sequence (viz. � here) corrupts the box structure.
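
Since the individual bytes of π cannot be displayed on their own, it can help to look at them as numbers instead. The sketch below looks up each byte of z in the alphabet a. to get its byte code. The hex constants on the last line are only there to show the same two values in base 16.

```j
   z=: 'C=2πr'
   a. i. z              NB. 67 61 50 207 128 114
   16bcf 16b80          NB. 207 128, i.e. π is encoded in utf-8 as the bytes CF 80
```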

How can J distinguish a utf-8 encoded symbol (u-symbol) from an ordinary ASCII character?

The utf-8 standard ensures that every byte of a u-symbol, not just the first, lies outside the ASCII code space, viz. by holding a value greater than 127. Let us call such an 8-bit code a superascii.

When displaying bytes, whenever J encounters a superascii it tries to decode it as part of a u-symbol. A consequence of this is that J can no longer display superasciis as if they were characters from Latin 1. Instead, all superasciis which can't be decoded as utf-8 are shown with a placeholder, the replacement character �. Thus when individually boxed, or extracted using { or {. so that the u-symbol is split, the bytes of a u-symbol appear as �.
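
The superascii test can be sketched numerically. The example assumes the general utf-8 layout, in which byte codes 128 to 191 are continuation bytes and byte codes of 192 or more start a multi-byte symbol. The name codes is illustrative.

```j
   z=: 'C=2πr'
   codes=: a. i. z
   127 < codes                    NB. 0 0 0 1 1 0   marks the superasciis
   191 < codes                    NB. 0 0 0 1 0 0   marks bytes that start a u-symbol
   +/ -. codes e. 128 + i. 64     NB. 5   every byte except continuation bytes
```

The last line counts the glyphs in z without converting it to a unicode, by ignoring continuation bytes.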

If you want to make a unicode noun, not bytes, to hold the u-symbol π, then you must explicitly convert the (utf-8 encoded) bytes to a unicode using the J primitive (u:), together with the appropriate x-argument, in this case 7.

   ] zz=: 7 u: z
C=2πr
   $zz
5
   datatype zz
unicode

Notice that zz now consists of 5 atoms, as you'd originally hoped for, π being represented by a single atom. You can see this clearly if you box each atom of zz ...

   <"0 zz
+-+-+-+-+-+
|C|=|2|π|r|
+-+-+-+-+-+
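
To see which UCP each atom holds, and to get back to utf-8 bytes, something like the following should work. It is a sketch that uses two further x-arguments of (u:): 3, which returns the code points as integers, and 8, which converts to utf-8 encoded bytes. The name back is illustrative.

```j
   zz=: 7 u: 'C=2πr'
   3 u: zz              NB. 67 61 50 960 114
   back=: 8 u: zz       NB. utf-8 encoded bytes again
   # back               NB. 6
   back -: 'C=2πr'      NB. 1
```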

Just as numbers having different precisions can be combined under addition, etc., the result having the highest of the precisions, so unicode and bytes can be combined using (,). Thus:

   ] pi=: u: 960
π
   datatype pi
unicode
   datatype each 'C=2' ; 'r'  NB. "literal" means "byte"
+-------+-------+
|literal|literal|
+-------+-------+
   ] zzz=: 'C=2' , pi , 'r'
C=2πr
   datatype zzz
unicode
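
As a quick check, the combined noun should hold one atom per glyph and match the result of converting the pasted utf-8 string. A minimal sketch, using 3 u: to list the code points:

```j
   zzz=: 'C=2' , (u: 960) , 'r'
   # zzz                      NB. 5
   3 u: zzz                   NB. 67 61 50 960 114
   zzz -: 7 u: 'C=2πr'        NB. 1
```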