VM or Process – a processing node and Java VM. Now typically called a Process or “node” depending on context.
Host – a machine that is “hosting” one or more VMs.
Host controller – an app that runs on a host and can stop/start/control a VM
VM controller – the component of a VM that starts up the VM, connects to the distributed message bus and registry, and triggers the service controller to start all the services for this VM.
Service – a subsystem running in a VM (often in many VMs) and providing a related set of functionality
Registry – a distributed service registry (one instance in each VM and host controller) that shares the status of current system (existence of VMs and services), and provides remote access to other services via RMI
Service Controller – a component within the VM that the registry uses to start/stop/control services
Repository database – the repository database stores the system configuration and other service-specific information (sessions, users, entitlements, vdbs, etc).
In addition to these main components, the service platform provides a core set of services available to applications built on top of the service platform. These services are:
Session – the Session service manages active session information. Active sessions are stored in a distributed cache and shared between Session services in each VM. Sessions are also persisted in the server repository database.
Membership – the Membership service manages authentication, users, and groups. This was redesigned in the 5.5 release to provide primary support for LDAP authentication and authorization. Custom membership modules can also be developed as needed.
Authorization – the Authorization service manages user entitlements. This service persists entitlement information in the repository database. Use of entitlements is optional (as specified in the configuration) and is off by default.
Teiid cursors all results, regardless of whether they are from one source or many sources, and regardless of what type of processing (joins, unions, etc.) has been performed on the results.
Teiid processes results in batches. A batch is simply a set of records. The number of rows in a batch is determined by the buffer system properties Processor Batch Size (for batches created within the query engine) and Connector Batch Size (for batches created at the connectors).
Client applications have no direct knowledge of batches or batch sizes, but rather specify a fetch size. However, the first batch, regardless of fetch size, is always proactively returned to synchronous clients. Subsequent batches are returned based on client demand for the data. Pre-fetching is utilized at both the client and connector levels.
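For example, a synchronous JDBC client might hint at its preferred fetch size as in the sketch below; the connection URL, credentials, and table name are placeholders rather than part of any particular deployment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and credentials -- substitute your own deployment details.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:teiid:myVDB@mm://localhost:31000", "user", "password");
             Statement stmt = conn.createStatement()) {
            // The fetch size is only a hint; the server still returns the first
            // batch proactively and streams subsequent batches on demand.
            stmt.setFetchSize(500);
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM someTable")) {
                while (rs.next()) {
                    // consume rows; batches are fetched transparently as needed
                }
            }
        }
    }
}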
The buffer manager manages memory for all result sets used in the query engine. That includes result sets read from a connector binding, result sets used temporarily during processing, and result sets prepared for a user. Each result set is referred to in the buffer manager as a tuple source.
When retrieving batches from the buffer manager, the size of a batch in bytes is estimated and then allocated against the max and session limits. If a limit is exceeded and memory space cannot be cleared for the batch, then processing can optionally give up its timeslice and try again.
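A hypothetical sketch of this estimate-and-reserve behavior is shown below. None of the names correspond to actual Teiid classes; they only illustrate checking the max and session limits and signaling the caller to yield its timeslice.

// Hypothetical sketch only -- invented names, not Teiid's actual buffer manager API.
class MemoryReservationSketch {

    long maxMemoryBytes;        // overall buffer memory limit
    long sessionMemoryBytes;    // per-session limit
    long usedTotal;
    long usedBySession;

    /**
     * Attempts to reserve space for a batch. Returns true if the estimated batch
     * size fits within both the max and session limits, false if the caller
     * should give up its timeslice and try again later.
     */
    synchronized boolean tryReserve(long estimatedBatchBytes) {
        if (usedTotal + estimatedBatchBytes > maxMemoryBytes
                || usedBySession + estimatedBatchBytes > sessionMemoryBytes) {
            return false; // caller may yield and retry
        }
        usedTotal += estimatedBatchBytes;
        usedBySession += estimatedBatchBytes;
        return true;
    }
}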
The buffer manager has two storage managers - a memory manager and a disk manager. The buffer manager maintains the state of all the batches, and determines when batches must be moved from memory to disk.
Each tuple source has a dedicated file (named by the ID) on disk. This file will be created only if at least one batch for the tuple source had to be swapped to disk. The file is random access. The connector batch size and processor batch size properties define how many rows can exist in a batch and thus define how granular the batches are when stored into the storage manager. Batches are NOT removed from the file when they are swapped back into memory, because that would require removing data from the middle of the file and updating all the indexes, which would be very expensive. Thus the disk storage manager never removes a particular batch. Batches are always read and written from the storage manager whole.
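The sketch below illustrates, with invented names, why an append-only batch file is cheap to maintain: batches are written whole at the end of the file and an in-memory index records their offsets, so nothing ever has to be removed from the middle of the file.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- not Teiid's actual disk storage manager.
class TupleSourceFileSketch {

    private final RandomAccessFile file;
    private final Map<Integer, long[]> batchIndex = new HashMap<>(); // batchId -> {offset, length}

    TupleSourceFileSketch(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    /** Writes a serialized batch whole at the end of the file; the file only grows. */
    synchronized void writeBatch(int batchId, byte[] serializedBatch) throws IOException {
        long offset = file.length();
        file.seek(offset);
        file.write(serializedBatch);
        batchIndex.put(batchId, new long[] { offset, serializedBatch.length });
    }

    /** Reads a whole batch back; its bytes remain in the file even after being read. */
    synchronized byte[] readBatch(int batchId) throws IOException {
        long[] entry = batchIndex.get(batchId);
        byte[] data = new byte[(int) entry[1]];
        file.seek(entry[0]);
        file.readFully(data);
        return data;
    }
}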
The disk storage manager has a cap on the maximum number of open files to prevent running out of file handles. In cases with heavy buffering, this can cause wait times while waiting for a file handle to become available - customers may want to increase the number of open files allowed (a configuration parameter that defaults to 10).
When a tuple source is no longer needed, it is removed from the buffer manager. The buffer manager will remove it from both the memory storage manager and the disk storage manager. The disk storage manager will delete the file. In addition, every tuple source is tagged with a "group name" which is typically the session ID of the client. When the client's session is terminated (by closing the connection, server detecting client shutdown, or administrative termination), a call is sent to the buffer manager to remove all tuple sources for the session. This is a final cleanup mechanism that removes all state associated with a session.
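The following is an illustrative sketch (with invented names) of that group-based cleanup: tuple sources are registered under a group name, and terminating the session removes everything registered under it in one call.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only -- not Teiid's actual buffer manager API.
class GroupCleanupSketch {

    private final Map<String, Set<String>> tupleSourcesByGroup = new ConcurrentHashMap<>();

    /** Tags a tuple source with its group name (typically the client's session ID). */
    void register(String groupName, String tupleSourceId) {
        tupleSourcesByGroup.computeIfAbsent(groupName, k -> new HashSet<>()).add(tupleSourceId);
    }

    /** Called on session termination; removes every tuple source tagged with the group. */
    void removeTupleSources(String groupName) {
        Set<String> ids = tupleSourcesByGroup.remove(groupName);
        if (ids != null) {
            for (String id : ids) {
                // remove the tuple source from the memory storage manager and the
                // disk storage manager, deleting its backing file if one was created
            }
        }
    }
}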
In addition, when the query engine is shut down, the buffer manager is shut down, which will remove all state from the disk storage manager and cause all files to be closed. In general, these mechanisms mean that the engine should always shut down with 0 open files. When the query engine is stopped, it is safe to delete any files in the buffer directory, as they are not used across query engine restarts; any remaining files must be left over from a system crash where buffer files were not cleaned up.
If the client issues a ‘cancel’ command, then no results from the batch currently being processed in the server will be returned to the client.
When a query is canceled, processing will be stopped in the query engine and in all connectors involved in the query. The semantics of what a connector does in response to a cancellation command is dependent on the connector implementation. For example, JDBC connectors will asynchronously call cancel on the underlying JDBC driver, which may or may not actually support this method.
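A simplified sketch of such an asynchronous cancel against an underlying JDBC driver might look like the following; the executor and error handling are assumptions for illustration only.

import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified sketch -- not the actual JDBC connector implementation.
class CancelSketch {

    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    void cancelAsync(final Statement activeStatement) {
        executor.submit(() -> {
            try {
                // Delegates to the driver; some drivers do not implement cancel
                // and may throw an exception or simply ignore the call.
                activeStatement.cancel();
            } catch (SQLException e) {
                // best effort -- the query engine stops its own processing regardless
            }
        });
    }
}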
Timeouts in Teiid are managed on the client side, in the JDBC API (which underlies both SOAP and ODBC access). Timeouts are only relevant for the first record returned. If the first record has not been received by the client within the specified timeout period, a ‘cancel’ command is issued to the server for the request and no results are returned to the client. The cancel command is issued by the JDBC API without the client’s intervention.
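From the client's point of view this is simply the standard JDBC query timeout, as in the sketch below (the query is a placeholder):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class TimeoutExample {

    static void runWithTimeout(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            // Standard JDBC timeout in seconds; per the behavior described above,
            // it applies to receiving the first record.
            stmt.setQueryTimeout(30);
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM someTable")) {
                while (rs.next()) {
                    // process rows
                }
            }
        }
    }
}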
Nested loop does the most obvious processing – for every row in the outer source, it compares with every row in the inner source. Nested loop is only used when the join criteria has no equi-join predicates.
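An illustrative sketch of the nested loop algorithm (not Teiid's actual implementation) is shown below; the cost is on the order of n*m comparisons.

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Illustrative sketch: every outer row is compared against every inner row.
class NestedLoopJoinSketch {

    static List<Object[]> join(List<Object[]> outer, List<Object[]> inner,
                               BiPredicate<Object[], Object[]> criteria) {
        List<Object[]> result = new ArrayList<>();
        for (Object[] outerRow : outer) {
            for (Object[] innerRow : inner) {
                if (criteria.test(outerRow, innerRow)) {
                    Object[] combined = new Object[outerRow.length + innerRow.length];
                    System.arraycopy(outerRow, 0, combined, 0, outerRow.length);
                    System.arraycopy(innerRow, 0, combined, outerRow.length, innerRow.length);
                    result.add(combined);
                }
            }
        }
        return result;
    }
}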
Merge join first sorts the input sources on the joined columns. You can then walk through each side in parallel (effectively one pass through each sorted source) and when you have a match, emit a row. Because the inputs are sorted, you can skip through large portions of the input without comparing if one side is less than the other. In general, merge join is on the order of n+m, rather than the n*m of nested loop. When n and m are large, this makes a huge difference. Merge join is the default algorithm. It cannot support full outer join or non-equality criteria, but otherwise handles almost all common cases well.
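The sketch below illustrates the merge phase over inputs already sorted on an integer join key (again, not Teiid's actual implementation); each side is walked once, so the cost after sorting is on the order of n+m. For simplicity, each row is assumed to be a {key, value} pair.

import java.util.ArrayList;
import java.util.List;

// Illustrative merge join sketch over inputs sorted on the key in column 0.
class MergeJoinSketch {

    static List<int[]> join(List<int[]> left, List<int[]> right) {
        List<int[]> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = Integer.compare(left.get(i)[0], right.get(j)[0]);
            if (cmp < 0) {
                i++;            // skip ahead on the smaller side without further comparison
            } else if (cmp > 0) {
                j++;
            } else {
                // emit the cross product of the matching key groups on both sides
                int key = left.get(i)[0];
                int jStart = j;
                while (i < left.size() && left.get(i)[0] == key) {
                    for (j = jStart; j < right.size() && right.get(j)[0] == key; j++) {
                        result.add(new int[] { key, left.get(i)[1], right.get(j)[1] });
                    }
                    i++;
                }
            }
        }
        return result;
    }
}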
Any of the Join Algorithms above can be made into a dependent join (however hash joins would need new logic). The decision to implement a dependent join is considered after the join algorithm is chosen, and does not currently influence the algorithm selection.
Sorting is used as the basis of the Sort (ORDER BY), Grouping (GROUP BY), and DupRemoval (SELECT DISTINCT) operations. The sort algorithm is a multi-pass merge-sort that does not require all of the result set to ever be in memory, yet uses the maximum amount of memory allowed by the buffer manager.
It consists of two phases. The first phase (“sort”) takes an unsorted input stream and produces one or more sorted streams. Each pass reads as much of the unsorted stream as will fit in memory, sorts it, and writes it back out as a new sorted stream. Since the input may be larger than can fit in memory, this may result in many sorted streams.
The second phase (“merge”) consists of a set of passes that grab the next batch from as many sorted streams as will fit in memory. Each pass then repeatedly grabs the next tuple in sorted order from each stream and outputs merged, sorted batches to a new sorted stream. At the completion of a pass, all of its input streams are dropped. In this way, each pass reduces the number of sorted streams. When only one stream remains, it is the final output.
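The sketch below models both phases in simplified form. Real disk buffering is omitted: plain lists stand in for streams, and runSize and fanIn stand in for the memory limits imposed by the buffer manager.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Simplified two-phase sort sketch -- not Teiid's actual implementation.
class MergeSortSketch {

    static List<Integer> sort(List<Integer> input, int runSize, int fanIn) {
        runSize = Math.max(runSize, 1); // at least one row per run
        fanIn = Math.max(fanIn, 2);     // merging fewer than two streams would never converge

        // Phase 1 ("sort"): read as much of the unsorted stream as fits,
        // sort it, and write it out as a new sorted stream.
        List<List<Integer>> runs = new ArrayList<>();
        for (int i = 0; i < input.size(); i += runSize) {
            List<Integer> run = new ArrayList<>(input.subList(i, Math.min(i + runSize, input.size())));
            Collections.sort(run);
            runs.add(run);
        }

        // Phase 2 ("merge"): repeatedly merge up to fanIn sorted streams into one,
        // dropping the inputs of each pass, until a single sorted stream remains.
        while (runs.size() > 1) {
            List<List<Integer>> next = new ArrayList<>();
            for (int i = 0; i < runs.size(); i += fanIn) {
                next.add(mergeRuns(runs.subList(i, Math.min(i + fanIn, runs.size()))));
            }
            runs = next;
        }
        return runs.isEmpty() ? new ArrayList<>() : runs.get(0);
    }

    private static List<Integer> mergeRuns(List<List<Integer>> runs) {
        // Heap of {value, runIndex, positionInRun} to pull the next tuple in sorted order.
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[] { runs.get(r).get(0), r, 0 });
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged.add(top[0]);
            int r = top[1], pos = top[2] + 1;
            if (pos < runs.get(r).size()) {
                heap.add(new int[] { runs.get(r).get(pos), r, pos });
            }
        }
        return merged;
    }
}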