Distribution and Sort Keys in Amazon Redshift: A Guide to Faster Queries
With a global market share of 31% in Q3 2024, Amazon Web Services (AWS) continues to dominate the cloud market. Because so many companies already run on AWS infrastructure, Amazon Redshift has become a preferred choice for cloud-based data warehousing.
Compared to other cloud data warehouse solutions like Google BigQuery or Snowflake, Amazon Redshift takes a different approach, particularly in its infrastructure and the way compute and storage are handled. These architectural differences have a significant impact on the performance of your workloads.
In this article, we’ll take a look under the hood of Amazon Redshift and explore two of its key performance drivers: sort keys and distribution keys.
The Infrastructure
Amazon Redshift operates on a cluster-based, shared-nothing architecture consisting of a leader node and one or more compute nodes.
The leader node acts as the central coordinator of the cluster and is responsible for:
Parsing SQL queries and generating optimized query execution plans
Coordinating query execution by delegating tasks to compute nodes
Aggregating results from compute nodes and delivering the final output to the user
Importantly, the leader node does not store any actual data.
The compute nodes handle the heavy lifting, including storing and processing data. Each compute node is divided into slices, with each slice managing a portion of the node's data and query workload. Data is stored in a columnar format, which minimizes I/O and accelerates analytical queries by accessing only the required columns.
Historically, Redshift’s architecture tightly coupled compute and storage. Each compute node had a fixed amount of local disk storage, meaning that scaling storage also required scaling compute. This tight coupling often led to significant amounts of data being transferred across the network between compute nodes, which could heavily impact query performance.
The introduction of RA3 instances marked a shift toward decoupling storage from compute. With RA3, compute nodes can retrieve data from managed storage as needed, reducing the necessity to scale compute resources for storage-intensive workloads.
However, even with RA3 instances, compute and storage remain part of the same cluster. As a result, workloads within the cluster share the same resources, which can lead to resource contention when multiple workloads run simultaneously.
To mitigate the performance challenges of this architecture, it is important to minimize data shuffling between nodes during query execution. This can be achieved by carefully configuring the sort and distribution keys for Redshift tables.
Distribution and Sort Keys
The distribution key determines how data is distributed across the nodes in an Amazon Redshift cluster. Proper distribution is essential for query performance because it ensures load balancing and minimizes data movement between nodes.
Distribution Styles
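In addition to the default AUTO style, in which Redshift chooses and adjusts the distribution style itself based on table size, Amazon Redshift offers three distribution styles that you can set explicitly: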
ALL - A full copy of the table is stored on every node. Ideal for small tables that are frequently joined with larger tables.
EVEN - Rows are distributed across slices in a round-robin fashion, regardless of their values. Suitable for tables without a clear distribution column, but rarely optimal for tables that take part in joins.
KEY - Rows are distributed based on the value of a specified column (the distribution key). Ensures that rows with the same key value are stored on the same node, minimizing data movement during joins.
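The three styles above can be sketched in DDL as follows. This is a minimal illustration only: the table and column names (dim_region, event_log, orders) are hypothetical and not taken from any real schema.

```sql
-- DISTSTYLE ALL: a small lookup table, copied in full to every node.
CREATE TABLE dim_region (
    region_id   INTEGER NOT NULL,
    region_name VARCHAR(64)
)
DISTSTYLE ALL;

-- DISTSTYLE EVEN: rows are spread round-robin, independent of their values.
CREATE TABLE event_log (
    event_id   BIGINT NOT NULL,
    event_time TIMESTAMP,
    payload    VARCHAR(1024)
)
DISTSTYLE EVEN;

-- DISTSTYLE KEY: rows with the same customer_id land on the same slice.
CREATE TABLE orders (
    order_id    BIGINT NOT NULL,
    customer_id INTEGER NOT NULL,
    region_id   INTEGER,
    order_date  DATE
)
DISTSTYLE KEY
DISTKEY (customer_id);
```

In this sketch, a join between orders and dim_region needs no redistribution because dim_region is present on every node, while a join between orders and another table distributed on customer_id can stay local to each slice.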
Choosing the Right Distribution Style
When selecting a distribution style, there are two key considerations:
Minimize data movement: Avoid excessive network and disk I/O caused by shuffling data across nodes during queries
Ensure even data distribution: Prevent row skew by spreading data evenly across nodes to maximize query performance
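Row Skew: This occurs when a table uses KEY-based distribution, and the values in the distribution key column are not evenly distributed. As a result, some slices hold far more rows than others, and each query has to wait for the most heavily loaded slice to finish.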
For best results:
Use KEY distribution for columns frequently used in joins.
Use ALL distribution only for small tables that are joined frequently with large tables.
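One way to check these two considerations in practice is sketched below. The first query inspects a candidate distribution column before you commit to it, and the second reads the skew_rows column of the SVV_TABLE_INFO system view, which reports the ratio between the slice with the most rows and the slice with the fewest. The sales table and its customer_id column are hypothetical names used only for illustration.

```sql
-- How evenly are rows spread over the candidate distribution key?
-- A few values that hold most of the rows are a warning sign for skew.
SELECT customer_id,
       COUNT(*) AS row_count
FROM sales                      -- hypothetical table
GROUP BY customer_id
ORDER BY row_count DESC
LIMIT 10;

-- Which existing tables are skewed the most?
-- skew_rows close to 1.0 means the rows are spread evenly across slices.
SELECT "schema",
       "table",
       diststyle,
       tbl_rows,
       skew_rows
FROM svv_table_info
WHERE skew_rows IS NOT NULL
ORDER BY skew_rows DESC
LIMIT 20;
```

Note that SVV_TABLE_INFO only lists tables that contain data, so a freshly created empty table will not show up in the second query.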
Sort Keys
Another way to optimize query performance in Redshift is by using sort keys, which determine the order in which rows are physically stored on disk within each slice. Properly chosen sort keys can significantly enhance performance by reducing the amount of data scanned during queries.
Unlike distribution keys, you can specify multiple columns as sort keys. Redshift provides two types of sort keys:
Compound Sort Keys - Sorts data in the order specified by the columns in the key. Useful when queries filter or group by the first column or a prefix of the columns.
Interleaved Sort Keys - Assigns equal importance to all columns in the key. Useful for queries that filter on multiple columns in varying combinations.
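The difference between the two types can be sketched in DDL. The tables below are hypothetical and hold the same columns, so that only the sort key definition differs.

```sql
-- Compound sort key: rows are ordered by view_date first, then by country.
-- Helps queries that filter on view_date, or on view_date and country.
CREATE TABLE page_views_compound (
    view_date DATE NOT NULL,
    country   VARCHAR(2),
    user_id   BIGINT,
    url       VARCHAR(512)
)
COMPOUND SORTKEY (view_date, country);

-- Interleaved sort key: view_date, country and user_id get equal weight.
-- Helps queries that filter on any one of them, in varying combinations.
CREATE TABLE page_views_interleaved (
    view_date DATE NOT NULL,
    country   VARCHAR(2),
    user_id   BIGINT,
    url       VARCHAR(512)
)
INTERLEAVED SORTKEY (view_date, country, user_id);
```

With the compound variant, a query that filters only on country benefits much less, because country is not a prefix of the key.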
Choosing Sort Keys
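Select sort keys based on columns frequently used in WHERE, JOIN, or GROUP BY clauses. Well-chosen sort keys make zone maps (the minimum and maximum values Redshift keeps for each block of a column) effective, allowing Redshift to skip over blocks that cannot contain matching rows, thereby reducing I/O operations.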
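As a small sketch of the block-skipping idea, assume a hypothetical web_events table sorted on event_time. A filter on a narrow time range then only needs to read the blocks whose minimum and maximum event_time overlap that range.

```sql
-- Hypothetical table, sorted by the column most queries filter on.
CREATE TABLE web_events (
    event_time TIMESTAMP NOT NULL,
    user_id    BIGINT,
    event_type VARCHAR(32)
)
COMPOUND SORTKEY (event_time);

-- Range-restricted scan on the sort key column: blocks that lie
-- entirely outside January 2024 can be skipped without being read.
SELECT event_type,
       COUNT(*) AS events
FROM web_events
WHERE event_time >= '2024-01-01'
  AND event_time <  '2024-02-01'
GROUP BY event_type;
```

The same query filtered only on user_id would get little help from this sort key, since user_id values are scattered across all blocks.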
Combining Sort and Distribution Keys
For optimal query performance:
Choose a DISTKEY to minimize data movement across nodes
Use a SORTKEY to optimize data retrieval
You can set the distribution and sort key when creating the table or even afterwards with an ALTER TABLE statement.
Set keys with CREATE TABLE statement
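As a minimal sketch, assuming hypothetical customers and sales tables that are usually joined on customer_id and filtered by sale_date, the keys could be declared at creation time like this:

```sql
-- Both tables use customer_id as the distribution key, so matching
-- rows are stored on the same slice and the join can run locally.
CREATE TABLE customers (
    customer_id INTEGER NOT NULL,
    name        VARCHAR(128),
    country     VARCHAR(2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (customer_id);

CREATE TABLE sales (
    sale_id     BIGINT NOT NULL,
    customer_id INTEGER NOT NULL,
    sale_date   DATE NOT NULL,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (sale_date, customer_id);
```

Here sale_date leads the sort key on sales because the assumed workload filters by date range first; a different workload would call for a different column order.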
Set keys with ALTER TABLE statements
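For a table that already exists, the keys can be changed in place. The statements below are a sketch that assumes hypothetical sales and dim_region tables are already present; each ALTER TABLE statement changes one property and is run on its own.

```sql
-- Switch the table to KEY distribution on customer_id.
ALTER TABLE sales ALTER DISTSTYLE KEY DISTKEY customer_id;

-- Replace the sort key with a compound key on sale_date, customer_id.
ALTER TABLE sales ALTER COMPOUND SORTKEY (sale_date, customer_id);

-- Replicate a small lookup table to every node.
ALTER TABLE dim_region ALTER DISTSTYLE ALL;
```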
Note: Using ALTER TABLE may trigger data redistribution or resorting, which can impact performance during the operation. Schedule such changes during maintenance windows to minimize disruption.
Reading Execution Plans
How to Check If Your Keys Are Working for Your Workload?
To determine whether your distribution and sort keys are optimized for your workload, you need to analyze the execution plans of your queries. Execution plans provide insights into how queries are processed, highlighting potential inefficiencies.
Query Monitoring in the Amazon Redshift Console:
If you have access to the Amazon Redshift console, navigate to the Query Monitoring tab. Locate your query and review its execution details, including the execution plan. This approach is straightforward and provides a graphical overview of your query's performance.
Using the EXPLAIN Command:
Add the EXPLAIN prefix to your query to generate an execution plan:
Generate the execution plan using EXPLAIN
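A minimal sketch of such a statement, assuming hypothetical sales and customers tables that are joined on customer_id:

```sql
-- EXPLAIN only produces the plan; the query itself is not executed.
EXPLAIN
SELECT c.country,
       SUM(s.amount) AS total_amount
FROM sales s
JOIN customers c
  ON c.customer_id = s.customer_id
WHERE s.sale_date >= '2024-01-01'
GROUP BY c.country;
```

If both tables are distributed on customer_id, you would typically expect the join step of the plan to carry the DS_DIST_NONE label.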
The output will look like the following:
Redshift query execution plan
Understanding the Execution Plan
The execution plan helps you identify how data is processed, including:
Join Types: The label after the join operator, for example XN Hash Join DS_DIST_NONE, indicates whether data is being redistributed or remains colocated
Rows Scanned: The rows value shows the estimated number of rows processed at each step (based on table statistics), helping you verify if your sort key is effective
Width: Represents the estimated average size (in bytes) of a row being processed, indicating the overhead for data transfer
Focus on Join Types and Data Movement
One critical area to analyze in the execution plan is the join type, which reveals if and how data is being redistributed across nodes. Data shuffling between nodes is expensive and can indicate suboptimal distribution keys.
Here are the most common join types and their meanings:
DS_DIST_NONE: No redistribution needed because data is colocated on the same node
DS_DIST_ALL_NONE: No redistribution needed because the inner table uses DISTSTYLE ALL (replicated across all nodes)
DS_DIST_INNER: The inner table is redistributed to match the outer table's data
DS_DIST_OUTER: The outer table is redistributed to match the inner table's data
DS_BCAST_INNER: A copy of the entire inner table is broadcast to all compute nodes
DS_DIST_ALL_INNER: The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL
DS_DIST_BOTH: Both tables are redistributed across nodes, indicating the most expensive type of data movement
Key Points to Watch - Ensure the join type minimizes data redistribution by colocating related data. Check the rows scanned to validate that your sort key is reducing the amount of data read.
Conclusion
Amazon Redshift is a powerful data warehousing solution, but achieving optimal performance requires a deep understanding of its architecture and key features. By leveraging distribution and sort keys effectively, you can significantly reduce data movement, improve query efficiency, and minimize resource contention within your cluster.
Analyzing execution plans is a critical step in this optimization process, providing valuable insights into how your queries are processed and where potential bottlenecks lie. Whether you’re dealing with join operations, large data scans, or resource-intensive workloads, understanding and fine-tuning your distribution and sort keys makes a huge difference.
With careful planning and regular performance monitoring, you can ensure that your Redshift cluster runs efficiently, delivering fast and reliable analytics for your business. By doing so, you’ll be better positioned to harness the full potential of AWS's cloud ecosystem and stay competitive in an increasingly data-driven world.