Distributed Query Mode
(where the data are spread across multiple machines), the default execution plan is as follows:
- One machine will be selected as the execution hub, regardless of the number or distributed of starting point vertices.
- All the computation work for the query will take place at the execution hub. The vertex and edge data from other machines will be copied to the hub machine for processing.
TigerGraph v2.1 Enterprise offers a Distributed Query mode which provides a more optimized execution plan for queries which are likely to start at several machines and continue their traversal across several machines.
- A set of machines representing one full copy of the entire graph will participate in the query. If the cluster has a replication factor of 2 (so there are two copies of each piece of data), then half the machines will participate.
- The query executes in parallel across all the machines which have source vertex data for a given hop in the query. That is, each SELECT statement defines a 1-hop traversal from a set of source vertices to a set of target vertices. Unlike the default mode where all the needed data are brought to one machine, in Distributed Query mode, the computation moves across the cluster, following the traversal pattern of the query.
- The output results will be gathered at one machine.
To invoke Distributed Query mode, simply insert the keyword “DISTRIBUTED” before “QUERY” in a query definition:
Guidelines for Selecting Distributed Query Mode
The basic trade-off between distributed query mode and default mode is greater parallelism for the given query vs. using more system resources, which reduces the potential for concurrency with other operations. Each machine has a certain number of work ers
available for concurrent execution of queries. A query in default mode uses only one worker out of the whole system. (This one worker will have multiple threads for processing edge traversals in parallel.) However, a query in distributed mode uses one query worker
This means this query can run faster, but it leaves fewer workers for other queries running concurrently.
We suggest the following guidelines for deciding whether to use default mode or distributed mode.
- Queries with one or a few starting point vertices and which take only a few hops â†’ default mode is better.
Queries which start at a very large set of starting point vertices and which traverse many hops â†’ distributed mode is better.
For example, algorithms which either compute a value for every vertex or one value for the entire graph should use Distributed Mode. This includes PageRank, Centrality, and Connected Component algorithms.
- For applications where the same query (same logic but with different input parameters) will be run many times in production, the application designer can simply try both modes during development and chose the one which works better for their use case and data.
Supported and Unsupported Features
Currently, Distributed Query mode cannot be used for all queries. Please note the limitations carefully. In most cases, the GSQL parser and compiler will report an error if you try to write a Distributed Query using an unsupported feature.
The two major restrictions are
- You cannot update the graph. Distributed queries currently can only read the graph data. Accumulators can be used for computation.
- In a SELECT statement, you cannot access a target vertex’s values in the ACCUM clause. However, you can accumulate to a target vertex’s accumulator.
The following GSQL features are not supported in Distributed Query mode:
Data update to the graph
Access to target vertex’s values in ACCUM
Query calling a distributed query
UPDATE, INSERT, DELETE
WHILE inside an ACCUM clause.
|WHILE at the statement level|
exact count for LIMIT clause
LIST, SET, BAG
|Operations and Operators||
Any data update to the graph, including assignment statements to vertex attributes
|vertex and edge functions||
.outdegree() of target vertex
.neighbors(), .neighborAttribute(), .edgeAttribute()
.outdegree() of source vertex
|accumulator and collection functions||
contains(), containskey(), resize(), reallocate()
|size(), get(), top(), pop(), update(), remove(), removeAll(), clear()|
selectVertex(), to_vertex(), to_vertex_set(),
sum(), count(), min(), max(), avg()
(1) Items in the Supported column are listed only for clarity, so you can compare to the Unsupported column. If a feature which is supported in non-distributed queries is not mentioned in either column, then it is supported in
Distributed Query mode
(2) If the query contains “LIMIT N”, and if the number of GPEs working on this query is G, then the output size will be N +/- (G-1). In a conventional cluster configuation, there is one GPE per machine. For example, if N=10 and the graph is distributed across 4 machines, then the output size will be between 7 and 13, inclusive.