Distribution and Sort Keys in Amazon Redshift: A Guide to Faster Queries

Infographic: Amazon Maintains Cloud Lead as Microsoft Edges Closer (source: Statista)

With a global market share of 31% in Q3 2024, Amazon Web Services (AWS) continues to dominate the cloud market. Because so many companies already run on AWS infrastructure, Amazon Redshift has become a preferred choice among them for cloud-based data warehousing.

Compared to other cloud data warehouse solutions like Google BigQuery or Snowflake, Amazon Redshift takes a different approach, particularly in its infrastructure and the way compute and storage are handled. These architectural differences have a significant impact on the performance of your workloads.

In this article, we’ll take a look under the hood of Amazon Redshift and explore two of its key performance drivers: sort keys and distribution keys.

The Infrastructure

Amazon Redshift operates on a cluster-based, shared-nothing architecture consisting of a leader node and one or more compute nodes.

The leader node acts as the central coordinator of the cluster and is responsible for:

  • Parsing SQL queries and generating optimized query execution plans

  • Coordinating query execution by delegating tasks to compute nodes

  • Aggregating results from compute nodes and delivering the final output to the user

Importantly, the leader node does not store any actual data.

The compute nodes handle the heavy lifting, including storing and processing data. Each compute node is divided into slices, with each slice managing a portion of the node's data and query workload. Data is stored in a columnar format, which minimizes I/O and accelerates analytical queries by accessing only the required columns.

Historically, Redshift’s architecture tightly coupled compute and storage. Each compute node had a fixed amount of local disk storage, meaning that scaling storage also required scaling compute. This tight coupling often led to significant amounts of data being transferred across the network between compute nodes, which could heavily impact query performance.

The introduction of RA3 instances marked a shift toward decoupling storage from compute. With RA3, compute nodes can retrieve data from managed storage as needed, reducing the necessity to scale compute resources for storage-intensive workloads.

However, even with RA3 instances, compute and storage remain part of the same cluster. As a result, workloads within the cluster share the same resources, which can lead to resource contention when multiple workloads run simultaneously.

To mitigate the performance challenges of this architecture, it is important to minimize data shuffling between nodes during query execution. This can be achieved by carefully configuring the sort and distribution keys for Redshift tables.

Distribution and Sort Keys

The distribution key determines how data is distributed across the nodes in an Amazon Redshift cluster. Proper distribution is essential for query performance because it ensures load balancing and minimizes data movement between nodes.

Distribution Styles

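Besides the default AUTO style, in which Redshift chooses the distribution based on the size of the table, Amazon Redshift offers three explicit distribution styles: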

  1. ALL - A full copy of the table is stored on every node. Ideal for small tables that are frequently joined with larger tables.

  2. EVEN - Rows are distributed across the slices in round-robin fashion, regardless of their values. Suitable for tables without a clear distribution column, but rarely optimal.

  3. KEY - Rows are distributed based on the value of a specified column (the distribution key). Ensures that rows with the same key value are stored on the same node, minimizing data movement during joins.
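The three styles listed above map directly onto CREATE TABLE clauses. The following is a minimal sketch; the table and column names (countries, click_log, orders, customer_id) are hypothetical and only serve as illustration.

```sql
-- Hypothetical small dimension table: every node gets a full copy.
CREATE TABLE countries (
    country_code CHAR(2)      NOT NULL,
    country_name VARCHAR(64)  NOT NULL
)
DISTSTYLE ALL;

-- Hypothetical table without an obvious join column: rows are spread evenly.
CREATE TABLE click_log (
    click_id   BIGINT    NOT NULL,
    clicked_at TIMESTAMP NOT NULL,
    url        VARCHAR(512)
)
DISTSTYLE EVEN;

-- Hypothetical fact table: rows with the same customer_id land on the same slice.
CREATE TABLE orders (
    order_id    BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    order_date  DATE   NOT NULL,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);
```

If no DISTSTYLE is given at all, the table is created with AUTO distribution.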

Choosing the Right Distribution Style

When selecting a distribution style, there are two key considerations:

  • Minimize data movement: Avoid excessive network and disk I/O caused by shuffling data across nodes during queries

  • Ensure even data distribution: Prevent row skew by spreading data evenly across nodes to maximize query performance

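Row Skew: This occurs when a table uses KEY-based distribution and the values in the distribution key column are not evenly distributed. Some slices then hold far more rows than others, and every query on the table has to wait for the most heavily loaded slice.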
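One way to look for row skew is the SVV_TABLE_INFO system view, which reports a skew_rows ratio per table (rows in the fullest slice divided by rows in the emptiest slice). The query below is a sketch; at which ratio a table counts as "too skewed" is a judgment call and not something this view decides for you.

```sql
-- List the tables with the highest row skew.
-- skew_rows close to 1 means rows are spread evenly across slices.
SELECT "schema",
       "table",
       diststyle,   -- e.g. EVEN, ALL, or KEY(column_name)
       tbl_rows,    -- total number of rows in the table
       skew_rows    -- ratio between the fullest and the emptiest slice
FROM svv_table_info
ORDER BY skew_rows DESC NULLS LAST
LIMIT 20;
```

Note that SVV_TABLE_INFO only lists tables that contain data and that are visible to the current user.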

For best results:

  • Use KEY distribution for columns frequently used in joins.

  • Use ALL distribution only for small tables that are joined frequently with large tables.

Sort Keys

Another way to optimize query performance in Redshift is by using sort keys, which determine the order in which rows are physically stored on disk. Properly chosen sort keys can significantly enhance performance by reducing the amount of data scanned during queries.

Unlike a distribution key, which is always a single column, a sort key can consist of multiple columns. Redshift provides two types of sort keys:

  1. Compound Sort Keys - Sorts data in the order specified by the columns in the key. Useful when queries filter or group by the first column or a prefix of the columns.

  2. Interleaved Sort Keys - Assigns equal importance to all columns in the key. Useful for queries that filter on multiple columns in varying combinations.
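As a sketch of how the two sort key types are declared, the statements below create the same hypothetical events table twice, once per type. All table and column names are assumptions made for illustration.

```sql
-- Compound: rows are ordered by event_date first, then by region.
-- Helps queries that filter on event_date, or on event_date and region.
CREATE TABLE events_compound (
    event_date DATE NOT NULL,
    region     VARCHAR(16),
    device     VARCHAR(16),
    payload    VARCHAR(256)
)
COMPOUND SORTKEY (event_date, region);

-- Interleaved: each column gets equal weight in the sort order.
-- Aimed at queries that filter on varying combinations of the columns.
CREATE TABLE events_interleaved (
    event_date DATE NOT NULL,
    region     VARCHAR(16),
    device     VARCHAR(16),
    payload    VARCHAR(256)
)
INTERLEAVED SORTKEY (event_date, region, device);
```

A plain SORTKEY (...) clause without a keyword creates a compound sort key.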

Choosing Sort Keys

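Select sort keys based on columns frequently used in WHERE, JOIN, or GROUP BY clauses. Redshift keeps zone maps, which record the minimum and maximum value of each data block. Good sort keys make these zone maps effective: when the data is sorted on the filtered column, Redshift can skip all blocks whose value range cannot match the filter, thereby reducing I/O operations.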

Combining Sort and Distribution Keys

For optimal query performance:

  • Choose a DISTKEY to minimize data movement across nodes

  • Use a SORTKEY to optimize data retrieval

You can set the distribution and sort key when creating the table or even afterwards with an ALTER TABLE statement.

Set keys with CREATE TABLE statement
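A minimal sketch of such a statement, using a hypothetical sales table. The choice of customer_id as distribution key and sale_date as leading sort column is an assumption about the workload (joins on customer, filters on date), not a general recommendation.

```sql
-- Hypothetical fact table with both keys defined at creation time.
CREATE TABLE sales (
    sale_id     BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    sale_date   DATE   NOT NULL,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)                       -- colocate rows per customer
COMPOUND SORTKEY (sale_date, customer_id);  -- date filters can skip blocks
```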

Set keys with ALTER TABLE statements
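The same keys can be applied to an existing table. The sketch below first creates a hypothetical shipments table without any keys and then changes its distribution and sort key; the table and column names are illustrative.

```sql
-- Hypothetical table created without explicit keys (DISTSTYLE AUTO).
CREATE TABLE shipments (
    shipment_id BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    shipped_on  DATE   NOT NULL
);

-- Switch to KEY distribution on customer_id.
ALTER TABLE shipments ALTER DISTSTYLE KEY DISTKEY customer_id;

-- Define a compound sort key.
ALTER TABLE shipments ALTER COMPOUND SORTKEY (shipped_on, customer_id);

-- A small lookup table could instead be switched to ALL:
-- ALTER TABLE shipments ALTER DISTSTYLE ALL;
```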

Note: Using ALTER TABLE may trigger data redistribution or resorting, which can impact performance during the operation. Schedule such changes during maintenance windows to minimize disruption.

Reading Execution Plans

How to Check Whether Your Keys Work for Your Workload

To determine whether your distribution and sort keys are optimized for your workload, you need to analyze the execution plans of your queries. Execution plans provide insights into how queries are processed, highlighting potential inefficiencies.

Query Monitoring in the AWS Redshift Console:

If you have access to the AWS Redshift console, navigate to the Query Monitoring tab. Locate your query and review its execution details, including the execution plan. This approach is straightforward and provides a graphical overview of your query's performance.

Using the EXPLAIN Command:

Add the EXPLAIN prefix to your query to generate an execution plan:

Generate the execution plan using EXPLAIN
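A sketch of what this can look like, with two hypothetical tables that share the same distribution key. EXPLAIN only produces the plan and does not run the query.

```sql
-- Hypothetical tables, both distributed on customer_id.
CREATE TABLE customers (
    customer_id BIGINT NOT NULL,
    country     CHAR(2)
)
DISTSTYLE KEY DISTKEY (customer_id);

CREATE TABLE purchases (
    purchase_id   BIGINT NOT NULL,
    customer_id   BIGINT NOT NULL,
    purchase_date DATE   NOT NULL,
    amount        DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id)
SORTKEY (purchase_date);

-- Show the plan for a join on the shared distribution key.
EXPLAIN
SELECT c.country,
       SUM(p.amount) AS revenue
FROM purchases p
JOIN customers c ON c.customer_id = p.customer_id
WHERE p.purchase_date >= '2024-01-01'
  AND p.purchase_date <  '2024-02-01'
GROUP BY c.country;
```

Because both tables are distributed on the join column, the join should not require redistribution. The plan the optimizer actually picks also depends on table statistics, so it is worth running ANALYZE after loading data.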

The output will look like the following:

Redshift query execution plan

Understanding the Execution Plan

The execution plan helps you identify how data is processed, including:

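  • Join Types: A step such as XN Hash Join DS_DIST_NONE names the join algorithm (here a hash join) and, in the DS_ label, shows whether data is redistributed or remains colocated

  • Rows: The estimated number of rows each step returns. These are planner estimates, not measured values, but they give a first indication of how much data a scan has to read and whether your sort key helps

  • Width: The estimated average row width in bytes; together with the row count, it indicates how much data has to be moved or processed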

Focus on Join Types and Data Movement

One critical area to analyze in the execution plan is the join type, which reveals if and how data is being redistributed across nodes. Data shuffling between nodes is expensive and can indicate suboptimal distribution keys.

Here are the redistribution labels that can appear on a join step and their meanings:

  • DS_DIST_NONE: No redistribution needed because data is colocated on the same node

  • DS_DIST_ALL_NONE: No redistribution needed because the inner table uses DISTSTYLE ALL (replicated across all nodes)

  • DS_DIST_INNER: The inner table is redistributed to match the outer table's data

  • DS_DIST_OUTER: The outer table is redistributed to match the inner table's data

  • DS_BCAST_INNER: A copy of the entire inner table is broadcast to all compute nodes

  • DS_DIST_ALL_INNER: The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL

  • DS_DIST_BOTH: Both tables are redistributed across nodes, indicating the most expensive type of data movement

Key Points to Watch - Ensure the join type minimizes data redistribution by colocating related data. Check the rows scanned to validate that your sort key is reducing the amount of data read.
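Since EXPLAIN reports estimates, the rows actually read can be checked after the query has run. One option on provisioned clusters is the SVL_QUERY_SUMMARY system view. The sketch below assumes it is executed in the same session, directly after the query under investigation; Redshift Serverless exposes this information through different monitoring views.

```sql
-- Per-step statistics for the most recently completed query in this session.
SELECT query,
       seg,
       step,
       label,            -- step type, e.g. a scan, hash, or dist step
       rows,             -- rows produced by the step
       rows_pre_filter,  -- rows read before filtering (scan steps)
       is_rrscan,        -- 't' if a range-restricted scan was used
       is_diskbased      -- 't' if the step spilled to disk
FROM svl_query_summary
WHERE query = pg_last_query_id()
ORDER BY seg, step;
```

A scan step with is_rrscan set to 't' and a small gap between rows_pre_filter and rows suggests that the sort key lets Redshift skip blocks; a large gap suggests that most of the table is still being read.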

Conclusion

Amazon Redshift is a powerful data warehousing solution, but achieving optimal performance requires a deep understanding of its architecture and key features. By leveraging distribution and sort keys effectively, you can significantly reduce data movement, improve query efficiency, and minimize resource contention within your cluster.

Analyzing execution plans is a critical step in this optimization process, providing valuable insights into how your queries are processed and where potential bottlenecks lie. Whether you’re dealing with join operations, large data scans, or resource-intensive workloads, understanding and fine-tuning your distribution and sort keys makes a huge difference.

With careful planning and regular performance monitoring, you can ensure that your Redshift cluster runs efficiently, delivering fast and reliable analytics for your business. By doing so, you’ll be better positioned to harness the full potential of AWS's cloud ecosystem and stay competitive in an increasingly data-driven world.
