Vocabulary/uco


`u: y`	Unicode

Rank Infinity -- operates on y as a whole

The unicode character corresponding to y.

If y is a number, u: y returns the unicode character having y as its unicode code point (UCP).

   y=: 16b03c0   NB. The UCP: U+03C0 of the symbol: π
   u: y
π

If y is a character, u: y converts y to unicode precision.

If y is already unicode characters, u: y is the same as y. If y is bytes, each byte is extended with high-order zeros to unicode precision. This does not change the meaning of ASCII characters, because the ASCII standard is embedded in the Unicode standard: the character 'A' is ASCII character number 41 (hex), and the UCP for 'A' is U+0041.

   ] z=: u: y
π
   datatype z
unicode
   3!:0 z
131072
   NB. Compare with...
   3!:0 'A'
2
   3!:0 (65)
4

Common uses

1. Display the character with UCP: 960

   u: 960   NB. (mathematical) pi
π

2. Make a unicode atom of UCP: 960

   #$ pi=: u: 960
0
   datatype pi
unicode

   ] z=: 'C=2' , pi , 'r'
C=2πr
   datatype z
unicode

More Information

Character Precisions

The character type comprises 3 precisions: byte precision, unicode precision, and unicode4 precision.

An atom with byte precision has one of the 256 different byte values, all of which are listed in the primitive noun a. (Alphabet).

Byte indexes 0-127 are the ASCII characters (described by the ASCII standard).
Byte indexes 128-255 do not correspond to characters, but are used for representing data in byte form (as when interacting with external hardware and software).

In other words, byte precision has two different uses:

to represent ASCII characters
to hold general 8-bit data.

An atom with unicode precision, also known as a wide character or a 16-bit character, is a unicode character, i.e. a character described by the Unicode standard in the range U+0000 to U+FFFF.

An atom with unicode4 precision, also known as a 4-byte character, is a unicode character, i.e. a character described by the Unicode standard in the range U+0000 to U+10FFFF. Unicode4 characters can encode all the characters in Chinese/Japanese/Korean (CJK) character sets.
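
As a quick check of the three precisions, the type code reported by 3!:0 can be compared for the same character held in each one. This is only a sketch: it assumes a J release recent enough to support unicode4, and it borrows 10 u: from the table of x u: y functions further down this page.

   3!:0 'a'            NB. byte precision
2
   3!:0 u: 'a'         NB. unicode precision
131072
   3!:0 (10 u: 'a')    NB. unicode4 precision
262144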

UTF-8 encoding

UTF-8 is a widely used method of encoding Unicode characters in a list of bytes.

Taking advantage of the fact that there are 256 byte values, only 128 of which are used by ASCII, it assigns meaning to the other 128 byte values and encodes each non-ASCII Unicode character in a string of bytes.

UTF-8 is not a character precision in J. It is an encoding scheme for nouns having the precision: byte.

J's support for UTF-8 is limited to conversion between UTF-8 encoded bytes and J's character precisions. See the table of x u: y functions below.
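
The relationship between UTF-8 bytes and unicode precision can be illustrated with a short session. It assumes the session passes a pasted glyph to J as UTF-8 encoded bytes, and it uses the dyadic forms 7 u: (decode) and 8 u: (encode) described in the table of x u: y functions.

   a. i. 'π'           NB. the pasted literal is two UTF-8 bytes
207 128
   # 'π'
2
   # 7 u: 'π'          NB. decoded: a single unicode character
1
   a. i. 8 u: u: 960   NB. U+03C0 encoded back to UTF-8 bytes
207 128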

Examples

1. Display characters represented in the (obsolescent) Latin 1 standard.

Example: consider the French word: 'ça'. On pre-Unicode platforms (e.g. Windows XP) this would be stored with a single byte for each character, i.e. as two Latin 1 codes in byte precision: (231 97).

   u: 231 97  NB. unicode characters display correctly
ça
   231 97 { a.   NB. non-ASCII bytes do not display
�a
   231 97 { u:a.
ça
   u: 231 97 { a.
ça

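2. Monadic u: is equivalent to one of two cases of dyadic (u:), depending on the type of y. The dyadic forms are convenient in tacit verbs, where the left argument can be bonded (e.g. 2&u:).

If y is characters, u: y is equivalent to 2 u: y
If y is numbers, u: y is equivalent to 4 u: y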

3. If y is a number, it must be in the range _65536 to 65535, and the UCP will be (65536 | y).
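
The two equivalences can be checked with Match (-:), as in the sketch below. The name wide is hypothetical, chosen here only to illustrate a bonded tacit verb.

   (u: 'abc') -: 2 u: 'abc'
1
   (u: 960) -: 4 u: 960
1
   wide=: 2&u:         NB. hypothetical name: convert to unicode precision
   3!:0 wide 'abc'
131072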

Details

1. y may have any rank. The shape of u: y is the same as the shape of y.
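
For instance, a 2-by-3 table of code points gives a 2-by-3 table of unicode characters (a small illustration; the name m is arbitrary):

   ] m=: u: 2 3 $ 65 + i. 6
ABC
DEF
   $ m
2 3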

`x u: y`	Unicode

Rank Infinity -- operates on x and y as a whole

Converts between numbers, character precisions and encodings according to the Unicode and UTF-8 standards.

`x u: y` functions
Description	`x`	Type/ Precision of Result	Type/ Precision of `y`	Action
Truncate to byte precision (ouch!)	`1`	byte	byte	Leave unchanged
Truncate to byte precision (ouch!)	`1`	byte	unicode or unicode4	Discard upper bits
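Convert to unicode (2-byte) precision	`2`	unicode	byte	Extend with high-order `0` bits
Convert to unicode (2-byte) precision	`2`	unicode	unicode	Leave unchanged
Convert to unicode (2-byte) precision	`2`	unicode	unicode4	Discard 2 high bytes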
Convert to integer	`3`	integer	byte	Convert to byte number (index in `a.`)
Convert to integer	`3`	integer	unicode or unicode4	Convert to number of UCP
UCP to unicode	`4`	unicode	integer in (-65536,65535)	Create unicode character whose UCP is `y`
Shrink to byte precision	`5`	byte	byte	Leave unchanged
Shrink to byte precision	`5`	byte	unicode	Discard upper bits, but give error if any nonzeros would be discarded
Convert external 2-byte characters	`6`	unicode	byte	Convert each pair of bytes into a unicode character. The bytes are in little-endian order.
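Convert unicode/unicode4/UTF-8 to smallest precision needed to hold `y`	`7`	byte	byte, ASCII (all byte indexes < 128)	Leave unchanged
Convert unicode/unicode4/UTF-8 to smallest precision needed to hold `y`	`7`	byte	unicode or unicode4, all UCPs < 128	Discard upper bits
Convert unicode/unicode4/UTF-8 to smallest precision needed to hold `y`	`7`	unicode	byte, some non-ASCII (some byte indexes > 127)	Decode the bytes as UTF-8 and convert to unicode
Convert unicode/unicode4/UTF-8 to smallest precision needed to hold `y`	`7`	unicode	unicode, some UCPs > 127	Leave unchanged
Convert unicode/unicode4/UTF-8 to smallest precision needed to hold `y`	`7`	unicode	unicode4, some UCPs > 127	Convert to unicode. UCPs in the range (16b10000,16b10ffff) are represented in the result by a surrogate pair of 2 unicode characters
Convert unicode/unicode4/UTF-8 to smallest precision needed to hold `y`	`7`	unicode	integer in (0,16b10ffff)	Convert to unicode. UCPs in the range (16b10000,16b10ffff) are represented in the result by a surrogate pair of 2 unicode characters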
Convert to UTF-8	`8`	byte	byte	Leave unchanged
Convert to UTF-8	`8`	byte	unicode, unicode4 or integer in (0,16b10ffff)	Convert to UTF-8 encoded bytes
Convert to unicode4 unless all characters are ASCII	`9`	byte		Leave unchanged
Convert to unicode4 unless all characters are ASCII	`9`	unicode4	any character precision containing a UCP > 127; or integer in (0,16b10ffff)	Convert to unicode4. Any UTF-8 is converted to unicode4, and surrogate pairs in unicode are converted (at rank 1).
Convert to unicode4	`10`	unicode4	any character precision, or integer in (0,16b10ffff)	Convert to unicode4 (at rank 0)
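
A few rows of the table can be tried in a session, as sketched below. The verb shrink is a hypothetical helper, not part of J; it uses Adverse (::) to trap the error that 5 u: signals when nonzero bits would be discarded.

   a. i. 1 u: u: 960       NB. x=1: U+03C0 truncated to its low byte, 16bc0
192
   3 u: 'A'                NB. x=3: byte index in a.
65
   3 u: 6 u: 192 3 { a.    NB. x=6: little-endian byte pair c0 03 is U+03C0
960
   shrink=: 5&u: :: ('cannot shrink'"_)   NB. hypothetical helper
   shrink u: 65
A
   shrink u: 960
cannot shrink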

Common uses

1. Find the unicode code point (UCP) (as a decimal numeral) of (a pasted) glyph: y

   cp=: 3 u: 7 u: ]
   cp 'π'            NB. glyph pasted between apostrophes: ''
960

More Information

1. Use these factory verbs for conversions:

ucp — converts byte to unicode, but not if it's ascii-only!
uucp — converts byte to unicode, even if it's ascii-only
ucpcount — reliably counts the characters in a string of either precision
utf8 — converts unicode to byte, turning non-ascii characters into multi-byte substrings.

   ucp_z_
7&u:
   uucp_z_  NB. Convert UTF-8 to char/unicode, then convert char/unicode to unicode
u:@(7&u:)
   ucpcount_z_
#@(7&u:)
   utf8_z_  NB. Convert char/unicode to bytes, using UTF-8 if needed
8&u:

ucp, uucp, ucpcount and utf8 are Standard Library words residing in the 'z' locale. They are defined in the factory script ~system/main/stdlib.ijs

To view the definitions in a JQt session, enter: open '~system/main/stdlib.ijs'

Use uucp to convert a string to unicode precision, do any text manipulation, then use utf8 to convert back to byte precision (e.g. for output).
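
That round trip might look like the following sketch, which reverses a string by character rather than by byte. It assumes the literal reaches J as UTF-8 encoded bytes; the names s and w are arbitrary.

   s=: 'ça va'         NB. byte list holding UTF-8 (assumed session encoding)
   # s                 NB. 6 bytes, because ç takes two
6
   ucpcount s          NB. 5 characters
5
   w=: uucp s          NB. unicode precision, one atom per character
   utf8 |. w           NB. manipulate, then convert back to bytes
av aç
   # utf8 |. w
6

Reversing s directly would reverse the two bytes of ç and produce invalid UTF-8, which is why the manipulation is done in unicode precision.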

2. If y is empty in 7 u: y , the result is an empty list of byte precision.

3. In 7 u: y and 8 u: y , y must be an atom or a list. In 6 u: y , y must not be an atom. Otherwise, y may have any rank.

4. The result of x u: y for x e. 1 2 3 4 5 has the same shape as y.

For 6 u: y , each row of y must have an even number of bytes, and the rows of the result have half that length.
For 7 u: y , the result is a list except that it is an atom if y is a unicode atom.
For 8 u: y the result is a list if y is unicode, otherwise it has the same shape as y.
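
The shape rules above can be spot-checked in a session. This is a sketch based on the statements on this page, not an exhaustive test:

   $ 6 u: 2 4 $ a.     NB. rows of 4 bytes become rows of 2 unicode characters
2 2
   $ 8 u: u: 960       NB. a unicode atom becomes a list of UTF-8 bytes
2
   # 7 u: ''           NB. empty y gives an empty list
0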
