How to add a partition key to an existing Timestream table
Motivation
A partition key (PK) in AWS Timestream is a column that groups related data together. A PK distributes data across multiple storage units, allowing Timestream to scan only the relevant partition(s) rather than the entire dataset.
For example, if you store user data and use user_id as the partition key, queries filtering by a specific user_id will run much faster than in a non-partitioned table, as Timestream can quickly identify which subset of data to scan.
However, choosing an effective partition key isn’t always straightforward. Many projects start without a partition key because data patterns are unclear or requirements continue to evolve.
AWS Timestream, while powerful, doesn’t provide a direct method for adding partition keys to existing tables. This limitation can pose challenges for growing datasets that didn’t initially implement partitioning.
This guide will present a strategy to overcome these challenges, helping you optimize your Timestream tables even when partition keys weren’t part of the initial design.
Step-by-Step Guide
Adding a partition key to an existing Timestream table involves several steps. Here’s the approach we’ll take:
- Creating a partitioned table:
  - Defining the new table with the chosen partition key and magnetic writes enabled
- Updating data ingestion:
  - Modifying the data ingestion process to write to both old and new tables simultaneously
- Performing data migration:
  - Using Timestream Scheduled Queries to migrate data from the old table to the new one
- Validating data integrity:
  - Comparing record counts and sample data between old and new tables
  - Running test queries to ensure expected performance improvements
- Switching application traffic:
  - Updating application code to read from the new table
- Decommissioning the old table:
  - Stopping writes to the old table
  - Scheduling the old table for deletion after a safe period
In the following sections, we’ll dive deeper into each step, with a particular focus on the data migration process using Timestream Scheduled Query.
Sample infrastructure
Before we dive into the migration process, let’s outline a typical infrastructure setup you might have when working with Timestream. This setup will serve as our base scenario for the migration.
We have a Timestream database, a Timestream table, and a Lambda function that ingests fitness data - metrics such as a user’s heart rate and similar measurements.
The schema of the table looks as follows:
- time (Timestamp)
- measure_name (String)
- value (Double)
- user_id (Dimension)
Without a partition key, queries on this table may become slower as the dataset grows, particularly when filtering or grouping by user_id. Implementing a user_id partition key can significantly enhance the performance of these queries.
With that in mind, let’s proceed to the migration process.
Creating a new table
As noted above, the goal is to partition the table by user_id. Therefore, let’s create a new Timestream table with the desired partition key:
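If your infrastructure is managed with AWS CDK, a minimal sketch of the new table definition might look like the following. The database name, table name, and retention values are placeholders based on our example; adapt them to your setup.

```typescript
import { CfnTable } from 'aws-cdk-lib/aws-timestream';

// Inside your Stack: a copy of the fitness table, partitioned by user_id.
new CfnTable(this, 'PartitionedFitnessTable', {
  databaseName: 'fitness_db',                 // placeholder
  tableName: 'fitness_metrics_partitioned',   // placeholder
  retentionProperties: {
    memoryStoreRetentionPeriodInHours: '24',
    magneticStoreRetentionPeriodInDays: '365',
  },
  // The customer-defined partition key: partition on the user_id dimension.
  schema: {
    compositePartitionKey: [
      { type: 'DIMENSION', name: 'user_id', enforcementInRecord: 'REQUIRED' },
    ],
  },
  // Magnetic writes must be enabled so historical data can be backfilled later.
  magneticStoreWriteProperties: {
    enableMagneticStoreWrites: true,
  },
});
```

With enforcementInRecord set to REQUIRED, writes that omit user_id are rejected; use OPTIONAL if that is too strict for your ingestion path.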
Updating data ingestion
After creating the table with the partition key, it’s crucial to update all data ingestion processes to write to both old and new tables simultaneously. This ensures data consistency during the migration process.
Key points to consider:
- Identify all ingestion points:
  - This may include Lambda functions, direct application writes, or any other processes writing to the original table.
- Implement parallel ingestion:
  - For each identified ingestion point, implement a corresponding process to write to the new table.
- Ensure data consistency:
  - The data written to both tables must be identical. Pay special attention to non-deterministic values like timestamps. If your current ingestion uses functions like NOW() or CURRENT_TIMESTAMP(), consider passing a fixed timestamp to both write operations. This helps avoid potential data duplication during migration that can occur because of slight timestamp differences.
- Test thoroughly:
  - Verify that data is being written correctly and consistently to both tables before proceeding to the migration step.
In our example, we have a Lambda function responsible for data ingestion. To start writing to the new table, we simply duplicate this function and pass different environment variables to direct the data to the new table. The original ingestion function remains unchanged.
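For illustration, here is a minimal sketch of what such an ingestion Lambda could look like in TypeScript. The event shape, environment variable names, and the heart_rate measure are assumptions from our example; the key detail is that the timestamp arrives in the event instead of being generated inside the function, so both copies write identical records.

```typescript
import { TimestreamWriteClient, WriteRecordsCommand } from '@aws-sdk/client-timestream-write';

const client = new TimestreamWriteClient({});

// Hypothetical event shape produced upstream; the timestamp is generated once by the
// producer so the old-table and new-table copies of this function write identical rows.
interface FitnessEvent {
  userId: string;
  heartRate: number;
  timestampMs: number;
}

export const handler = async (event: FitnessEvent): Promise<void> => {
  await client.send(new WriteRecordsCommand({
    DatabaseName: process.env.TIMESTREAM_DATABASE_NAME!,
    // The duplicated function receives the new table's name via this variable.
    TableName: process.env.TIMESTREAM_TABLE_NAME!,
    Records: [{
      Dimensions: [{ Name: 'user_id', Value: event.userId }],
      MeasureName: 'heart_rate',
      MeasureValue: event.heartRate.toString(),
      MeasureValueType: 'DOUBLE',
      Time: event.timestampMs.toString(),
      TimeUnit: 'MILLISECONDS',
    }],
  }));
};
```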
Performing data migration
While manual data migration is possible, automating the process can save time, reduce errors, and provide better monitoring and control. In this section, you’ll find out how to use AWS CDK to set up an automated migration pipeline that includes:
- A Timestream Scheduled Query for data migration
- An SNS topic to publish migration results
- An SQS queue subscribed to the SNS topic for further processing or monitoring
Let’s look at the CDK code to set up this infrastructure:
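Every setup differs, so treat the snippet below as an illustrative TypeScript sketch rather than a drop-in stack: the database and table names, the heart_rate measure mapping, and the 2-day window are assumptions carried over from our example, and the Timestream seeder (a small custom resource that replicates the table schema) is omitted for brevity.

```typescript
import { Duration } from 'aws-cdk-lib';
import { PolicyStatement, Role, ServicePrincipal } from 'aws-cdk-lib/aws-iam';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { Topic } from 'aws-cdk-lib/aws-sns';
import { SqsSubscription } from 'aws-cdk-lib/aws-sns-subscriptions';
import { Queue } from 'aws-cdk-lib/aws-sqs';
import { CfnScheduledQuery } from 'aws-cdk-lib/aws-timestream';

// Inside your Stack:

// Error reports from failed executions land in this bucket.
const errorBucket = new Bucket(this, 'MigrationErrorBucket');

// Timestream publishes the outcome of every execution to this topic...
const resultTopic = new Topic(this, 'MigrationResultTopic');

// ...and the queue buffers those notifications so a local script can poll them.
const resultQueue = new Queue(this, 'MigrationResultQueue', {
  retentionPeriod: Duration.days(4),
});
resultTopic.addSubscription(new SqsSubscription(resultQueue));

// Role assumed by Timestream to run the scheduled query.
const migrationRole = new Role(this, 'MigrationQueryRole', {
  assumedBy: new ServicePrincipal('timestream.amazonaws.com'),
});
migrationRole.addToPolicy(new PolicyStatement({
  actions: [
    'timestream:Select',
    'timestream:SelectValues',
    'timestream:WriteRecords',
    'timestream:DescribeEndpoints',
  ],
  resources: ['*'], // scope this down to the two tables in a real deployment
}));
resultTopic.grantPublish(migrationRole);
errorBucket.grantWrite(migrationRole);

// Copy a 2-day window ending at @scheduled_runtime from the old table.
// Single-measure tables expose the value as measure_value::double; adjust to your schema.
const migrationQuery = `
  SELECT time, user_id, measure_name, measure_value::double AS measure_value
  FROM "fitness_db"."fitness_metrics"
  WHERE time BETWEEN @scheduled_runtime - 2d AND @scheduled_runtime
`;

new CfnScheduledQuery(this, 'MigrationScheduledQuery', {
  queryString: migrationQuery,
  // The schedule is effectively a placeholder: we trigger executions manually
  // with ExecuteScheduledQuery, so the cron date is set far in the future.
  scheduleConfiguration: { scheduleExpression: 'cron(0 0 1 1 ? 2099)' },
  notificationConfiguration: {
    snsConfiguration: { topicArn: resultTopic.topicArn },
  },
  errorReportConfiguration: {
    s3Configuration: { bucketName: errorBucket.bucketName },
  },
  scheduledQueryExecutionRoleArn: migrationRole.roleArn,
  targetConfiguration: {
    timestreamConfiguration: {
      databaseName: 'fitness_db',
      tableName: 'fitness_metrics_partitioned',
      timeColumn: 'time',
      dimensionMappings: [{ name: 'user_id', dimensionValueType: 'VARCHAR' }],
      measureNameColumn: 'measure_name',
      // One mapping per distinct measure name in your data; heart_rate is assumed here.
      mixedMeasureMappings: [
        { measureName: 'heart_rate', sourceColumn: 'measure_value', measureValueType: 'DOUBLE' },
      ],
    },
  },
});
```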
The CDK code above contains the following resources:
- S3 Bucket: for logging any migration errors that may occur.
- SNS Topic: for publishing migration results, which enables you to know when the migration has completed.
- SQS Queue: for further processing or monitoring of migration results.
- IAM Role: for providing necessary permissions to execute the Scheduled Query, read from the old table, write to the new table, publish to the SNS topic, and log errors in the S3 bucket.
- Scheduled Query: for moving data from the old table to the new one.
- Timestream Seeder: for replicating the table schema.
The Scheduled Query uses a special parameter, @scheduled_runtime, which Timestream automatically provides during execution. This parameter is key to our migration. The example query migrates data in 2-day chunks. To complete the full migration, you need to:
- trigger the query multiple times
- for each run, manually adjust @scheduled_runtime, moving it back by 2 days
Adjust the 2-day interval based on your data volume, considering Timestream’s limitations (e.g. the 5 GB maximum data size for a query result and the 1-hour maximum query duration). For large datasets, you might need smaller intervals (e.g., 6 hours); for smaller datasets, larger intervals could work. This approach helps you avoid AWS service limits and provides checkpoints for validation.
To streamline the process, use a short script that triggers the query multiple times and retrieves the execution results from the SQS queue. You can run it from your local machine.
The script can be run as follows:
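A rough sketch of such a script in TypeScript might look like this; the scheduled query ARN, queue URL, and migration window are placeholders, and the SQS queue created above is used to learn when each execution has finished.

```typescript
import {
  TimestreamQueryClient,
  ExecuteScheduledQueryCommand,
} from '@aws-sdk/client-timestream-query';
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from '@aws-sdk/client-sqs';

// Placeholders: point these at your own resources and migration window.
const SCHEDULED_QUERY_ARN = 'arn:aws:timestream:...:scheduled-query/...';
const RESULT_QUEUE_URL = 'https://sqs....amazonaws.com/.../migration-results';
const MIGRATION_START = new Date('2024-01-01T00:00:00Z');
const MIGRATION_END = new Date('2024-07-01T00:00:00Z');
const STEP_DAYS = 2; // must match the window used in the Scheduled Query

const timestream = new TimestreamQueryClient({});
const sqs = new SQSClient({});

async function waitForResult(): Promise<void> {
  // Poll the queue until Timestream publishes the execution result.
  for (;;) {
    const { Messages } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: RESULT_QUEUE_URL,
      MaxNumberOfMessages: 1,
      WaitTimeSeconds: 20,
    }));
    if (!Messages?.length) continue;
    const message = Messages[0];
    console.log('Execution result:', message.Body);
    await sqs.send(new DeleteMessageCommand({
      QueueUrl: RESULT_QUEUE_URL,
      ReceiptHandle: message.ReceiptHandle!,
    }));
    return;
  }
}

async function main(): Promise<void> {
  // Walk backwards from the end of the migration window in STEP_DAYS chunks,
  // passing each boundary as the invocation time (this becomes @scheduled_runtime).
  for (
    let invocationTime = MIGRATION_END;
    invocationTime > MIGRATION_START;
    invocationTime = new Date(invocationTime.getTime() - STEP_DAYS * 24 * 60 * 60 * 1000)
  ) {
    console.log(`Triggering migration chunk ending at ${invocationTime.toISOString()}`);
    await timestream.send(new ExecuteScheduledQueryCommand({
      ScheduledQueryArn: SCHEDULED_QUERY_ARN,
      InvocationTime: invocationTime,
    }));
    await waitForResult();
  }
}

main().catch(console.error);
```

After filling in the placeholders, run it locally, for example with npx ts-node migrate-data.ts.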
First of all, before running this script, you’ll need to customize it to fit your needs. Specifically, you’ll need to set your own migration period (the time range for migrating data) and define the migration step in days, which should match the value specified in the Scheduled Query.
Second, the script logs failed Scheduled Query executions but doesn’t provide a built-in mechanism for retries. You may want to modify it to handle this. I highly recommend testing it in your pre-production environment before moving to production to ensure everything runs smoothly. If any issues arise, you can still trigger individual executions manually.
Lastly, the current implementation runs sequentially, which can significantly extend the migration time. You may want to enhance it to trigger multiple executions in parallel for greater efficiency.
Validating data integrity
After the data is migrated, you need to verify whether the migration was complete and if the data fully matches what’s in the old table.
Naturally, if the script from the previous section encountered any errors, it would be wise to investigate them to identify what went wrong.
If no issues occurred, you can run a few SELECT queries to ensure that the record counts match and that querying the same dimensions/time returns identical results.
For the table from our example, a query like the following might be a good start:
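As a sketch (again using the example table names and placeholder time bounds), the counts can be compared from a small TypeScript snippet, or you can paste the same query into the Timestream console:

```typescript
import { TimestreamQueryClient, QueryCommand } from '@aws-sdk/client-timestream-query';

const client = new TimestreamQueryClient({});

// Count records in a fixed, already-migrated time range; the bounds are placeholders.
const countQuery = (table: string) => `
  SELECT COUNT(*) AS record_count
  FROM "fitness_db"."${table}"
  WHERE time BETWEEN from_iso8601_timestamp('2024-01-01T00:00:00Z')
                 AND from_iso8601_timestamp('2024-06-30T00:00:00Z')
`;

async function compareCounts(): Promise<void> {
  for (const table of ['fitness_metrics', 'fitness_metrics_partitioned']) {
    const result = await client.send(new QueryCommand({ QueryString: countQuery(table) }));
    console.log(table, result.Rows?.[0]?.Data?.[0]?.ScalarValue);
  }
}

compareCounts().catch(console.error);
```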
This yields the total number of records in both tables.
You may wonder why the time conditions are necessary - this is because both ingestion lambdas continue ingesting records at a slightly different pace, which can lead to small discrepancies in record counts. Additionally, Timestream may evict older records at varying intervals, so you may need to adjust the condition to match your retention period.
One issue I encountered when running a similar query was that the counts differed because the new table had magnetic writes enabled, allowing the new ingester to store records that the original lambda function dismissed. If you notice discrepancies in the counts, it may be helpful to check if this is the cause.
Another query you could run identifies which specific records are present in the old table but missing from the new one. It can be a bit challenging to write if you have a large number of dimensions (since the JOIN condition requires specifying all of them), but it’s still worth running. You could even use AI to assist you in generating the query.
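For our single-dimension example, a sketch of such a query could look like the one below (table names and time bounds are placeholders); it can be executed with the same QueryCommand pattern shown earlier or in the Timestream console.

```typescript
// Rows that exist in the old table but have no exact match (same time, dimensions,
// and measure name) in the new table. Add every dimension to the join condition.
const missingRecordsQuery = `
  SELECT o.time, o.user_id, o.measure_name
  FROM "fitness_db"."fitness_metrics" o
  LEFT JOIN "fitness_db"."fitness_metrics_partitioned" n
    ON  o.time = n.time
    AND o.user_id = n.user_id
    AND o.measure_name = n.measure_name
  WHERE n.time IS NULL
    AND o.time BETWEEN from_iso8601_timestamp('2024-01-01T00:00:00Z')
                   AND from_iso8601_timestamp('2024-06-30T00:00:00Z')
`;
```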
Review the records returned by the query, if there are any, and determine why they are missing in the new table.
Switching application traffic
This step involves updating the source table for your existing queries, whether they come from applications (such as Lambdas or EC2 instances), scheduled queries, or other consumers. Simply redirect them to the new table and monitor the results. To minimize risk, switch them over one at a time and test thoroughly in a pre-production environment. Also, don’t forget to update permissions to ensure access to the new table.
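In CDK, for example, switching a single reading Lambda over can be as small as the sketch below; readerLambda and newTable are placeholders for your own constructs.

```typescript
import { Function as LambdaFunction } from 'aws-cdk-lib/aws-lambda';
import { PolicyStatement } from 'aws-cdk-lib/aws-iam';
import { CfnTable } from 'aws-cdk-lib/aws-timestream';

// Hypothetical helper: point one consumer Lambda at the new table and allow it to query it.
function switchToNewTable(readerLambda: LambdaFunction, newTable: CfnTable): void {
  readerLambda.addEnvironment('TIMESTREAM_TABLE_NAME', newTable.tableName!);
  readerLambda.addToRolePolicy(new PolicyStatement({
    actions: ['timestream:Select'],
    resources: [newTable.attrArn],
  }));
  readerLambda.addToRolePolicy(new PolicyStatement({
    actions: ['timestream:DescribeEndpoints'],
    resources: ['*'], // DescribeEndpoints is not resource-scoped
  }));
}
```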
Decommissioning the old table
The final step is optional - once all the data has been successfully migrated to the new table, you may choose to delete the old table to reduce storage costs. However, you might prefer to keep it for a while, especially since Timestream will automatically delete data over time based on its retention period.
Summary
Adding partition keys to existing Amazon Timestream tables is a valuable optimization that can significantly enhance query performance and reduce costs. This article has outlined a practical approach to implement this change.
Remember to tailor the migration strategy to your specific data volumes and query patterns. With careful planning and execution, you can successfully optimize your Timestream tables, setting a strong foundation for efficient time-series data management as your applications scale.