JBoss.orgCommunity Documentation

Chapter 9. Federated Planning

9.1. Overview
9.2. Federated Optimizations
9.2.1. Access Patterns
9.2.2. Pushdown
9.2.3. Dependent Joins
9.2.4. Copy Criteria
9.2.5. Projection Minimization
9.2.6. Partial Aggregate Pushdown
9.2.7. Optional Join
9.2.8. Standard Relational Techniques
9.3. Federated Failure Modes
9.3.1. Partial Results
9.4. Query Plans
9.4.1. Getting a Query Plan
9.4.2. Analyzing a Query Plan
9.4.3. Relational Plans
9.5. Query Planner
9.5.1. Relational Planner
9.5.2. Procedure Planner
9.5.3. XML Planner
9.5.4. XQuery Planner

Teiid at its core is a federated relational query engine. This query engine allows you to treat all of your data sources as one virtual database and access them in a single SQL query. This allows you to focus on building your application, not on hand-coding joins, and other relational operations, between data sources.

When the query engine receives an incoming SQL query it performs the following operations:

The logical query plan is a tree of operations used to transform data in source tables to the expected result set. In the tree, data flows from the bottom (tables) to the top (output). The primary logical operations are select (select or filter rows based on a criteria), project (project or compute column values), join , source (retrieve data from a table), sort (ORDER BY), duplicate removal (SELECT DISTINCT), group (GROUP BY), and union (UNION).

For example, consider the following query that retrieves all engineering employees born since 1970.


Logically, the data from the Employees and Departments tables are retrieved, then joined, then filtered as specified, and finally the output columns are projected. The canonical query plan thus looks like this:

Data flows from the tables at the bottom upwards through the join, through the select, and finally through the project to produce the final results. The data passed between each node is logically a result set with columns and rows.

Of course, this is what happens logically , not how the plan is actually executed. Starting from this initial plan, the query planner performs transformations on the query plan tree to produce an equivalent plan that retrieves the same results faster. Both a federated query planner and a relational database planner deal with the same concepts and many of the same plan transformations. In this example, the criteria on the Departments and Employees tables will be pushed down the tree to filter the results as early as possible.

In both cases, the goal is to retrieve the query results in the fastest possible time. However, the relational database planner does this primarily by optimizing the access paths in pulling data from storage.

In contrast, a federated query planner is less concerned about storage access because it is typically pushing that burden to the data source. The most important consideration for a federated query planner is minimizing data transfer.

A special optimization called a dependent join is used to reduce the rows returned from one of the two relations involved in a multi-source join. In a dependent join, queries are issued to each source sequentially rather than in parallel, with the results obtained from the first source used to restrict the records returned from the second. Dependent joins can perform some joins much faster by drastically reducing the amount of data retrieved from the second source and the number of join comparisons that must be performed.

The conditions when a dependent join is used are determined by the query planner based on access patterns, hints, and costing information.

Teiid supports the MAKEDEP and MAKENOTDEP hints. Theses are can be placed in either the OPTION clause or directly in the FROM clause . As long as all can be met, the MAKEDEP and MAKENOTDEP hints override any use of costing information.

Tip

The MAKEDEP hint should only be used if the proper query plan is not chosen by default. You should ensure that your costing information is representative of the actual source cardinality. An inappropriate MAKEDEP hint can force an inefficient join structure and may result in many source queries.

Teiid provides the capability to obtain "partial results" in the event of data source unavailability. This is especially useful when unioning information from multiple sources, or when doing a left outer join, where you are 'appending' columns to a master record but still want the record if the extra info is not available.

If one or more data sources are unavailable to return results, then the result set obtained from the remaining available sources will be returned. In the case of joins, an unavailable data source essentially contributes zero tuples to the result set.

A source is considered to be 'unavailable' if the connector binding associated with the source issues an exception in response to a query. The exception will be propagated to the query processor, where it will become a warning in the result set.

For each source that is excluded from a query, a warning will be generated describing the source and the failure. These warnings can be obtained from the Statement.getWarnings() method. This method returns a SQLWarning object but in the case of "partial results" warnings, this will be an object of type com.metamatrix.jdbc.api.PartialResultsWarning. This class can be used to obtain a list of all the failed connectors by name and to obtain the specific exception thrown by each connector.


When integrating information using a federated query planner, it is useful to be able to view the query plans that are created, to better understand how information is being accessed and processed, and to troubleshoot problems.

A query plan is a set of instructions created by a query engine for executing a command submitted by a user or application. The purpose of the query plan is to execute the user's query in as efficient a way as possible.

Relational plans represent the actually processing plan that is composed of nodes that are the basic building blocks of logical relational operations. Physical relational plans differ from logical relational plans in that they will contain additional operations and execution specifics that were chosen by the optimizer.

The nodes for a relational query plan are:

Every node has a set of statistics that are output. These can be used to determine the amount of data flowing through the node.

Statistic

Description

Units

Node Output Rows

Number of records output from the node

count

Node Process Time

Time processing in this node only

millisec

Node Cumulative Process Time

Elapsed time from beginning of processing to end

millisec

Node Cumulative Next Batch Process Time

Time processing in this node + child nodes

millisec

Node Next Batch Calls

Number of times a node was called for processing

count

Node Blocks

Number of times a blocked exception was thrown by this node or a child

count

In addition to node statistics, some nodes display cost estimates computed at the node.

Cost Estimates

Description

Units

Estimated Node Cardinality

Estimated number of records that will be output from the node; -1 if unknown

count

For each sub-command in the user command an appropriate kind of sub-planner is used (relational, XML, XQuery, procedure, etc).

Each planner has three primary phases:

The GenerateCanonical class generates the initial (or “canonical” plan).  This plan is based on the typical logical order that a SQL query gets executed.  A SQL select query has the following possible clauses (all but SELECT are optional):  SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT.  These clauses are logically executed in the following order:

These clause translate into the following types of planning nodes:

There is also a Null Node that can be created as the result of rewrite or planning optimizations. It represents a node that produces no rows

Relational optimization is based upon rule execution that evolves the initial plan into the execution plan.  There are a set of pre-defined rules that are dynamically assembled into a rule stack for every query.  The rule stack is assembled based on the contents of the user’s query and its transformations.  For example, if there are no virtual layers, then RuleMergeVirtual, which merges virtual layers together, is not needed and will not be added to the stack.  This allows the rule stack to reflect the complexity of the query.

Logically the plan node data structure represents a tree of nodes where the source data comes up from the leaf nodes (typically Access nodes in the final plan), flows up through the tree and produces the user’s results out the top.  The nodes in the plan structure can have bidirectional links, dynamic properties, and allow any number of child nodes.  Processing plan nodes in contrast typical have fixed properties, and only allow for binary operations - due to algorithmic limitations.

Below are some of the rules included in the planner:

  • RuleRemoveSorts - removes sort nodes that do not have an effect on the result.  This most common when a view has an non-limited ORDER BY.

  • RulePlaceAccess - insert an Access node above every physical Source node.  The source node represents a table typically.  An access node represents the point at which everything below the access node gets pushed to the source.  Later rules focus on either pushing stuff under the access or pulling the access node up the tree to move more work down to the data sources.  This rule is also responsible for placing .

  • RulePushSelectCriteria - pushes select criteria down through unions, joins, and views into the source below the access node.  In most cases movement down the tree is good as this will filter rows earlier in the plan.  We currently do not undo the decisions made by PushSelectCriteria.  However in situations where criteria cannot be evaluated by the source, this can lead to sub optimal plans.

    One of the most important optimization related to pushing criteria, is how the criteria will be pushed trough join.  Consider the following plan tree that represents a subtree of the plan for the query "select ... from A inner join b on (A.x = B.x) where A.y = 3"

              SELECT (B.y = 3)
               |
              JOIN - Inner Join on (A.x = B.x
             /     \    
          SRC (A)   SRC (B)

    Note: SELECT nodes represent criteria, and SRC stands for SOURCE.

    It is always valid for inner join and cross joins to push (single source) criteria that are above the join, below the join.  This allows for criteria originating in the user query to eventually be present in source queries below the joins.  This result can be represented visually as:

              JOIN - Inner Join on (A.x = B.x)
              /    \
             /   SELECT (B.y = 3)
            |        |
          SRC (A)   SRC (B)

    The same optimization is valid for criteria specified against the outer side of an outer join.  For example:

              SELECT (B.y = 3) 
               |
              JOIN - Right Outer Join on (A.x = B.x)
             /     \    
          SRC (A)   SRC (B)

    Becomes

              JOIN - Right Outer Join on (A.x = B.x)
              /    \
             /   SELECT (B.y = 3)
            |        |
          SRC (A)   SRC (B)

    However criteria specified against the inner side of an outer join needs special consideration.  The above scenario with a left or full outer join is not the same.  For example:

              SELECT (B.y = 3)
               |
              JOIN - Left Outer Join on (A.x = B.x)
             /     \    
          SRC (A)   SRC (B)

    Can become (available only after 5.0.2):

              JOIN - Inner Join on (A.x = B.x)
              /    \
             /   SELECT (B.y = 3)
            |        |
          SRC (A)   SRC (B)

    Since the criterion is not dependent upon the null values that may be populated from the inner side of the join, the criterion is eligible to be pushed below the join – but only if the join type is also changed to an inner join.  

    On the other hand, criteria that are dependent upon the presence of null values CANNOT be moved.  For example:

              SELECT (B.y is null)
               |
              JOIN - Left Outer Join on (A.x = B.x)
             /     \   
          SRC (A)   SRC (B)

    This plan tree must have the criteria remain above the join, since the outer join may be introducing null values itself.  This will be true regardless of which version of Teiid is used.

  • RulePushNonJoinCriteria – this rule will push criteria out of an on clause if it is not necessary for the correctness of the join.

  • RuleRaiseNull – this rule will raise null nodes to their highest possible point.  Raising a null node removes the need to consider any part of the old plan that was below the null node.

  • RuleMergeVirtual - merges virtual layers together.  Virtual layers are connected by nesting canonical plans under source leaf nodes of the parent plan.  Each canonical plan is also sometimes referred to as a “query frame”.  Merge virtual attempts to merge child frames into the parent frame.   The merge involves renaming any symbols in the lower frame that overlap with symbols in the upper frame.  It also involves merging the join information together.

  • RuleRemoveOptionalJoins – removes optional join nodes form the plan tree as soon as possible so that planning will be more optimal.

  • RulePlanJoins – this rule attempts to find an optimal ordering of the joins performed in the plan, while ensuring that dependencies are met.  This rule has three main steps.  First it must determine an ordering of joins that satisfy the present.  Second it will heuristically create joins that can be pushed to the source (if a set of joins are pushed to the source, we will not attempt to create an optimal ordering within that set.  More than likely it will be sent to the source in the non-ANSI multi-join syntax and will be optimized by the database).  Third it will use costing information to determine the best left-linear ordering of joins performed in the processing engine.  This third step will do an exhaustive search for 6 or less join sources and is heuristically driven by join selectivity for 7 or more sources.

  • RuleCopyCriteria - this rule copies criteria over an equality criteria that is present in the criteria of a join.  Since the equality defines an equivalence, this is a valid way to create a new criteria that may limit results on the other side of the join (especially in the case of a multi-source join).  

  • RuleCleanCriteria - this rule cleans up criteria after all the other rules.

  • RuleMergeCriteria - looks for adjacent criteria nodes and merges them together.  It looks for adjacent identical conjuncts and removes duplicates.  

  • RuleRaiseAccess - this rule attempts to raise the Access nodes as far up the plan as possible.  This is mostly done by looking at the source’s capabilities and determining whether the operations can be achieved in the source or not.

  • RuleChooseDependent - this rule looks at each join node and determines whether the join should be made dependent and in which direction.  Cardinality, the number of distinct values, and primary key information are used in several formulas to determine whether a dependent join is likely to be worthwhile.  The dependent join differs in performance ideally because a fewer number of values will be returned from the dependent side.  Also, we must consider the number of values passed from independent to dependent side.  If that set is larger than the max number of values in an IN criteria on the dependent side, then we must break the query into a set of queries and combine their results.  Executing each query in the connector has some overhead and that is taken into account.  Without costing information a lot of common cases where the only criteria specified is on a non-unique (but strongly limiting) field are missed.  A join is eligible to be dependent if:

    1. there is at least one equi-join criterion, i.e. tablea.col = tableb.col

    2. the join is not a full outer join and the dependent side of the join is on the inner side of the join

    The join will be made dependent if one of the following conditions, listed in precedence order, holds:

    1. There is an unsatisfied access pattern that can be satisfied with the dependent join criteria

    2. The potential dependent side of the join is marked with an option makedep

    3. (4.3.2) if costing was enabled, the estimated cost for the dependent join (5.0+ possibly in each direction in the case of inner joins) is computed and compared to not performing the dependent join.  If the costs were all determined (which requires all relevant table cardinality, column ndv, and possibly nnv values to be populated) the lowest is chosen.

    4. If key metadata information indicates that the potential dependent side is not “small” and the other side is “not small” or (5.0.1) the potential dependent side is the inner side of a left outer join.

    Dependent join is the key optimization we use to efficiently process multi-source joins.

    Instead of reading all of source A and all of source B and joining them on A.x = B.x, we read all of A then build a set of A.x that are passed as a criteria when querying B.  In cases where A is small and B is large, this can drastically reduce the data retrieved from B, thus greatly speeding the overall query.

  • RuleChooseJoinStrategy – Determines the base join strategy.  Currently this is a decision as to whether to use a merge join rather than the default strategy, which is a nested loop join.  Ideally the choice of a hash join would also be evaluated here.  Also costing should be used to determine the strategy cost.  

  • - RuleCollapseSource - this rule removes all nodes below an Access node and collapses them into an equivalent query that is placed in the Access node.

  • RuleAssignOutputElements - this rule walks top down through every node and calculates the output columns for each node.  Columns that are not needed are dropped at every node.  This is done by keeping track of both the columns needed to feed the parent node and also keeping track of columns that are “created” at a certain node.

  • RuleValidateWhereAll - this rule validates a rarely used model option.

  • RuleAccessPatternValidation – validates that all access patterns have been satisfied.

The XML Planner creates an XML plan that is relatively close to the end result of the Procedure Planner – a program with instructions.  Many of the instructions are even similar (while loop, execute SQL, etc). Additional instructions deal with producing the output result document (adding elements and attributes).  

The XML planner does several types of planning (not necessarily in this order):

- Document selection - determine which tags of the virtual document should be excluded from the output document.  This is done based on a combination of the model (which marks parts of the document excluded) and the query (which may specify a subset of columns to include in the SELECT clause).  

- Criteria evaluation - breaks apart the user’s criteria, determine which result set the criteria should be applied to, and add that criteria to that result set query.

- Result set ordering - the query’s ORDER BY clause is broken up and the ORDER BY is applied to each result set as necessary

- Result set planning - ultimately, each result set is planned using the relational planner and taking into account all the impacts from the user’s query

- Program generation - a set of instructions to produce the desired output document is produced, taking into account the final result set queries and the excluded parts of the document.  Generally, this involves walking through the virtual document in document order, executing queries as necessary and emitting elements and attributes.

XML programs can also be recursive, which involves using the same document fragment for both the initial fragment and a set of repeated fragments (each a new query) until some termination criteria or limit is met.