Vocabulary/uco

From J Wiki

u: y Unicode

Rank Infinity -- operates on y as a whole



The unicode character corresponding to y.

If y is a number,  u: y returns the unicode character having y as its unicode code point (UCP).

   y=: 16b03c0   NB. The UCP: U+03C0 of the symbol: π
   u: y
π

If y is a character,  u: y converts y to unicode precision.

If y is already unicode characters,  u: y is the same as y. If y is bytes, each byte is extended with high-order zeros to unicode precision. This does not change the meaning of ASCII characters, because the ASCII standard is embedded in the Unicode standard: the character 'A' is ASCII character number 41 (hex), and the UCP for 'A' is U+0041.

   ] z=: u: y
π
   datatype z
unicode
   3!:0 z
131072
   NB. Compare with...
   3!:0 'A'
2
   3!:0 (65)
4
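
As a small check of the byte-extension rule above, the following sketch converts the byte 'A' to unicode precision and confirms that only the type code changes, not the code point. The outputs follow from that rule and from the 3!:0 type codes already shown.

```j
   3!:0 u: 'A'     NB. 131072 is the type code for unicode precision
131072
   3 u: u: 'A'     NB. the code point is still 65 (hex 41)
65
   a. i. 'A'       NB. index of the byte 'A' in a.
65
```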

Common uses

1. Display the character with UCP: 960

   u: 960   NB. (mathematical) pi
π

2. Make a unicode atom of UCP: 960

   #$ pi=: u: 960
0
   datatype pi
unicode

   ] z=: 'C=2' , pi , 'r'
C=2πr
   datatype z
unicode

More Information


Character Precisions

The character type comprises two precisions: byte precision and unicode precision.

An atom with byte precision has one of the 256 different byte values, which are all listed in the primitive noun a. .

  • Byte indexes 0-127 are the ASCII characters (described by the ASCII standard).
  • Byte indexes 128-255 do not correspond to characters, but are used for representing data in byte form (as when interacting with external hardware and software).

In other words, byte precision has two different uses:

  • to represent ASCII characters
  • to hold general 8-bit data.

An atom with unicode precision, also known as a wide character or a 16-bit character, is a unicode character, i.e. a character described by the Unicode standard.


UTF-8 encoding

UTF-8 is a widely used method of encoding Unicode characters in a list of bytes.

UTF-8 takes advantage of the fact that ASCII uses only 128 of the 256 byte values. It leaves the ASCII bytes unchanged and uses the other 128 byte values to encode each non-ASCII Unicode character as a sequence of two or more bytes.

UTF-8 is not a character precision in J. It is an encoding scheme for nouns having the precision: byte.

J's only support for UTF-8 is conversion between UTF-8 bytes and the J precisions byte and unicode. See the table of x u: y functions below.
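
A minimal sketch of the round trip between a unicode character and its UTF-8 bytes, using the dyadic forms 8 u: and 7 u: from the table of x u: y functions. The name b is only illustrative. The byte values 207 128 are the standard UTF-8 encoding (hex CF 80) of U+03C0.

```j
   b=: 8 u: u: 960     NB. encode π (U+03C0) as UTF-8
   # b                 NB. two bytes for one character
2
   3 u: b              NB. the byte values
207 128
   3!:0 b              NB. still byte precision
2
   3 u: 7 u: b         NB. decode the UTF-8 bytes back to the code point
960
```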


Examples

1. Display characters represented in the (obsolescent) Latin 1 standard.

Example: consider the French word: 'ça'. On pre-Unicode platforms (e.g. Windows XP) this would be stored with a single byte for each character, i.e. as two Latin 1 codes in byte precision: (231 97).

   u: 231 97  NB. unicode characters display correctly
ça
   231 97 { a.   NB. non-ASCII bytes do not display
�a
   231 97 { u:a.
ça
   u: 231 97 { a.
ça

2. Relation to dyadic (u:), e.g. when choosing a form to use in tacit verbs

  • If y is characters, u:y is equivalent to 2 u: y
  • If y is numbers, u:y is equivalent to 4 u: y
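
The two equivalences above can be checked directly with Match (-:). This is only a sketch; the name toucp is hypothetical and is not part of the standard library.

```j
   (u: 'abc') -: 2 u: 'abc'      NB. characters: the monad behaves like 2&u:
1
   (u: 960 97) -: 4 u: 960 97    NB. numbers: the monad behaves like 4&u:
1
   toucp=: 4&u:                  NB. hypothetical name: tacit verb fixed to the numeric case
   3 u: toucp 960 97
960 97
```

Fixing x with & in this way makes the intended conversion explicit, whatever the type of the argument.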

3. If y is a number, it must be in the range _65536 to 65535, and the UCP will be (65536 | y).


Details

1. y may have any rank. The shape of u: y is the same as the shape of y.
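
For instance, a sketch with a 2-by-3 table of code points (the name t is illustrative):

```j
   ] t=: u: 2 3 $ 65 + i. 6    NB. code points 65..70 arranged as a table
ABC
DEF
   $ t                         NB. same shape as the argument
2 3
   3!:0 t                      NB. unicode precision
131072
```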


x u: y Unicode

Rank Infinity -- operates on x and y as a whole



Converts between numbers, character precisions and encodings according to the Unicode and UTF-8 standards.

x u: y functions

| Description | x | Type/Precision of Result | Type/Precision of y | Action |
|---|---|---|---|---|
| Truncate to byte precision (ouch!) | 1 | byte | byte | Leave unchanged |
| | 1 | byte | unicode | Discard upper bits |
| Expand to unicode | 2 | unicode | byte | Extend with high-order 0 bits |
| | 2 | unicode | unicode | Leave unchanged |
| Convert to integer | 3 | integer | byte | Convert to byte number (index in a.) |
| | 3 | integer | unicode | Convert to number of UCP |
| UCP to unicode | 4 | unicode | integer | Create unicode character whose UCP is y |
| Shrink to byte precision | 5 | byte | byte | Leave unchanged |
| | 5 | byte | unicode | Discard upper bits, but give error if any nonzeros would be discarded |
| Convert external 2-byte characters | 6 | unicode | byte | Convert each pair of bytes into a unicode character. The bytes are in little-endian order. |
| Convert unicode/UTF-8 to smallest precision needed to hold y | 7 | byte | byte, ASCII (all byte indexes < 128) | Leave unchanged |
| | 7 | byte | unicode, all UCPs < 128 | Discard upper bits |
| | 7 | unicode | unicode, some UCPs > 127 | Leave unchanged |
| | 7 | unicode | byte, some non-ASCII (some byte indexes > 127) | Decode as UTF-8 |
| Convert to UTF-8 | 8 | byte | byte | Leave unchanged |
| | 8 | byte | unicode | Convert to UTF-8 encoded bytes |
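
The actions for x = 1, 5 and 6 can be illustrated with π (U+03C0), whose 16-bit value has high byte 3 and low byte 192 (hex C0). This is a sketch based only on the actions listed for those values of x. The text of the error signalled by 5 u: is not stated here, so that line is left commented out.

```j
   3 u: 1 u: u: 960       NB. x=1 silently discards the upper bits, leaving byte 192
192
   3 u: 5 u: u: 65        NB. x=5 succeeds when no nonzero bits would be lost
65
   NB. 5 u: u: 960        would signal an error: nonzero upper bits would be discarded
   3 u: 6 u: 192 3 { a.   NB. x=6 reads each byte pair little-endian: low byte first
960
```

Preferring 5 u: over 1 u: turns silent data loss into an error, which is usually the safer choice when shrinking to byte precision.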

Common uses

1. Find the unicode code point (UCP), as a decimal number, of a pasted glyph y

   cp=: 3 u: 7 u: ]
   cp 'π'            NB. glyph pasted between apostrophes: ''
960

More Information

1. Use these factory verbs for conversions:

  • ucp — converts byte to unicode, but not if it's ascii-only!
  • uucp — converts byte to unicode, even if it's ascii-only
  • ucpcount — reliably counts the characters in a string of either precision
  • utf8 — converts unicode to byte, turning non-ascii characters into multi-byte substrings.
   ucp_z_
7&u:
   uucp_z_  NB. Convert UTF-8 to char/unicode, then convert char/unicode to unicode
u:@(7&u:)
   ucpcount_z_
#@(7&u:)
   utf8_z_  NB. Convert char/unicode to bytes, using UTF-8 if needed
8&u:

ucp, uucp, ucpcount and utf8 are

  • Standard Library words residing in the 'z' locale.
  • Defined in the factory script stdlib.ijs.
  • Viewable by entering in the J session:  open'stdlib'

Use uucp to convert a string to unicode precision, do any text manipulation, then use utf8 to convert back to byte precision (e.g. for output).
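
A sketch of that workflow, assuming the standard library is loaded so that uucp, utf8 and ucpcount are defined as shown above. The string s is illustrative. Reversing the raw bytes breaks the two-byte UTF-8 sequence for π, whereas reversing in unicode precision keeps it intact.

```j
   s=: 8 u: 'C=2' , (u: 960) , 'r'   NB. UTF-8 bytes for C=2πr
   # s                               NB. 6 bytes ...
6
   ucpcount s                        NB. ... but 5 characters
5
   3 u: |. s                         NB. reversing bytes splits the pair 207 128
114 128 207 50 61 67
   3 u: utf8 |. uucp s               NB. reversing in unicode precision keeps π whole
114 207 128 50 61 67
```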

2. If y is empty in  7 u: y , the result is an empty list of byte precision.

3. In  7 u: y and  8 u: y , y must be an atom or a list. In  6 u: y , y must not be an atom. Otherwise, y may have any rank.

4. The result of  x u: y for  x e. 1 2 3 4 5 has the same shape as y.

  • For  6 u: y , each row of y must have an even number of bytes, and the rows of the result have half that length.
  • For  7 u: y , the result is a list except that it is an atom if y is a unicode atom.
  • For 8 u: y the result is a list if y is unicode, otherwise it has the same shape as y.
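
The shape rules above can be checked with Shape Of ($) and Tally (#). This is a sketch whose outputs follow directly from those rules.

```j
   # $ 7 u: u: 960                         NB. unicode atom in, atom out (rank 0)
0
   # $ 8 u: u: 960                         NB. unicode atom in, list out (rank 1)
1
   $ 6 u: 2 4 $ 65 0 66 0 67 0 68 0 { a.   NB. rows of 4 bytes become rows of 2 characters
2 2
```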