Jd/Overview

From J Wiki
Jump to navigation Jump to search
Jd | Overview | General | Docs | Ops | Admin | Server | Replicate | Guide | Technical | Release | License | Support

Overview

Yet another database. Why might you want one? Read on.

Jd is a commercial database product from Jsoftware. It has similarities with all relational database management systems (RDBMS). Although similar in terminology and features to MySQL, Oracle, DB2, SQL Server, and others, it is closer in spirit and design to Kx's kdb+, Vectornova's Vstar, and all old APL based financial systems that ran on IBM mainframes in the 70s and 80s.

Similarity

Jd has one or more databases. A database has one or more tables. A table is rows (unnamed) and columns (named). Data in a column is of the same type (integer, float, datetime, fixed length characters, variable length characters, etc.). Tables can be joined with other tables to function as if they were that large table. Rows and columns can be retrieved by queries on the joined tables. Retrieved data from a query can be aggregated, grouped, and sorted. Rows can be deleted, updated, and inserted.

The Jd API is SQL like. Similar enough that if you are a guru in another RDBMS you'll pick up the Jd way of thinking without much trouble. And if you're new, then Jd with its tutorials and examples will give you a leg up on getting at your data.

J-integrated

The key difference between Jd and most other database systems is that Jd comes with a fully integrated and mature programming language. Jd is implemented in J and lives openly and dynamically in the J execution and development environment. Jd is a natural extension of J and the full power of J is available to the Jd database application developer. The integration is not just available to you, it is unabashedly pushed to you for exploitation.

J is a modern, array-oriented, high-performance, general-purpose programming language. In J the natural unit of information is a list (of booleans or integers or floats or datetimes or fixed length character strings or variable length character strings or etc). In J, the list of all the license numbers is a noun (named data object). And that noun can be mapped (memory mapped files) directly to a file in the file system by the host OS. So the Jd table column of license numbers is a single file. J can map that file to single J noun and work directly on all that data. Jd table column -> file -> J noun. It doesn't get simpler or more efficient than this.

Early adopters of Jd are assumed to have a J background and documentation and tutorials depend on that background.

End users of an application built on Jd, or a web service based on Jd, or reports generated by Jd, need no knowledge of J. Jd developers, maintainers, and administrators need some knowledge of J and an overview of how a J system is installed and managed.

Selling Jd to someone who has data problems and no knowledge of J is a tough sell. Or, on the positive side: if you know J and have data problems, boy do we have what you want!

Columnar

Jd is a columnar (column oriented) RDBMS.

Most RDBMS systems are row oriented. Ages ago they fell into the trap of thinking of tables as rows (records). You can see how this happened. The end user wants the record that has a first name, last name, license, make, model, color, and date. So a row was the unit of information and rows were stored sequentially on disk. Row orientation works for small amounts of data. But think about what happens when there are lots of rows and the user wants all rows where the license starts with 123 and the color is blue or black. In a naive system the application has to read every single byte of data from the disk. There are lots of bytes and reading from disk is, by orders of magnitude, the slowest part of the performance equation. To answer this simple question all the data had to be read from disk. This is a performance disaster and that is where decades of adding bandages and kludges started.

Jd is columnar so the data is 'fully inverted'. This means all of the license numbers are stored together and sequentially on disk. The same for all the other columns. Think about the earlier query for license and color. Jd gets the license numbers from disk (a tiny fraction of the database) and generates a boolean mask of rows that match. It then gets the color column from disk (another small fraction of the data) and generates a boolean mask of matches and ANDS that with the other mask. It can now directly read just the rows from just the columns that are required in the result. Only a small fraction of the data is read. In J, columns used in queries are likely already in memory and the query runs at ram speed, not the sad and slow disk speed.

Both scenarios above are simplified, but the point is strong and valid. The end user thinks in records, but the work to get those records is best organized by columns.

Row oriented is slavishly tied to the design ideas of filing cabinets and manila folders. Column oriented embraces computers.

A table column is a mapped file.

Analytics

Jd is particularly suited to analytics. It works well with large tables (millions of rows to billions), multiple tables connected by joins, structured data, numerical data, and complex queries and aggregations.

Jd performance is affected primarily by ram. Lots of ram allows lots of rows. Performance is relatively unaffected by the number of columns and many 100s of columns are fine. It is the number of rows and not the byte size of the database that counts.

Jd supports a reasonable rate of individual record updates/inserts, but is oriented towards fast, customized, bulk updates. That is, periodic (hourly/daily/monthly) bulk loads of lots of new records, perhaps from csv files dumped from a transactional database system, that are customized (ETL) are fast and easy.

Jd can serve dozens of simultaneous users of a database in their browsers with high-security and managed access out of the box. This works well for a team of analysts. To serve hundreds or thousands of users requires additional work to put the Jd server behind an apache/nginx/etc front end.

Jd has a small footprint, is easy to install, and multiple installations are encouraged. An analyst can run Jd on their own computer to work with local databases. A group of analysts can access Jd on a server that provides shared access to the same databases. A Jd instance can serve a single database or multiple databases. A machine can run one or more Jd instances. This allows easy distribution and scalability.

Jd is strong in analytics, but weaker in other areas. Jd is probably not the right choice for big data (terabytes of unstructured textual data) or for a high-rate transactional system with 1000s of users.

Admin/backup

Databases/tables/columns/metadata/etc are represented directly in the file structure. A database is a folder. A table is a folder in the database. A column is a folder in the table. The data is one or more files in the column.

You can backup a database or a table with standard host shell scripts, file copy commands, and tools such as tar/gzip/zip. Jd will provide patterns for standard requirements and you can easily roll your own customized version. If you understand backing up file folders, then you pretty much understand backing up Jd databases. You can also dump a database as csv files and restore a database from csv files. Metadata such as column names and types, user administration data (who can do what), and all other database specific information is automatically included.

Admin can control user access to a fine degree. Access requires a user/pswd and can be restricted to certain ops.

Fast CSV

Jd has a fast CSV loader. The process is managed in J, but the high-performance core is written in C. The loader can handle all reasonable CSV options, has good error reporting, and can get data from external sources into Jd quickly. Data from new csv files (for example, new hourly or daily data) can be quickly loaded and appended to your existing production Jd tables.

Jd also has a fast CSV unloader with the core written in C.

Open source

Jd source is largely J code and that code is open and available to licensed users. There is a binary shared library (to get high performance at bottlenecks).