How to add a partition key to an existing Timestream table
Motivation
A partition key (PK) in AWS Timestream is a column that groups related data together. A PK distributes data across multiple storage units, allowing Timestream to scan only the relevant partition(s) rather than the entire dataset.
For example, if you store user data and use user_id as the partition key, queries filtering by a specific user_id will run much faster than in a non-partitioned table, as Timestream can quickly identify which subset of data to scan.
However, choosing an effective partition key isn’t always straightforward. Many projects start without a partition key because data patterns are unclear or requirements continue to evolve.
AWS Timestream, while powerful, doesn’t provide a direct method for adding partition keys to existing tables. This limitation can pose challenges for growing datasets that didn’t initially implement partitioning.
This guide will present a strategy to overcome these challenges, helping you optimize your Timestream tables even when partition keys weren’t part of the initial design.
Step-by-Step Guide
Adding a partition key to an existing Timestream table involves several steps. Here’s the approach we’ll take:
- Creating a partitioned table:
  - Defining the new table with the chosen partition key and magnetic writes enabled
- Updating data ingestion:
  - Modifying the data ingestion process to write to both old and new tables simultaneously
- Performing data migration:
  - Using Timestream Scheduled Queries to migrate data from the old table to the new one
- Validating data integrity:
  - Comparing record counts and sample data between old and new tables
  - Running test queries to ensure expected performance improvements
- Switching application traffic:
  - Updating application code to read from the new table
- Decommissioning the old table:
  - Stopping writes to the old table
  - Scheduling the old table for deletion after a safe period
In the following sections, we’ll dive deeper into each step, with a particular focus on the data migration process using Timestream Scheduled Query.
Sample infrastructure
Before we dive into the migration process, let’s outline a typical infrastructure setup you might have when working with Timestream. This setup will serve as our base scenario for the migration.
We have a Timestream database, a Timestream table, and a Lambda function that ingests fitness data - metrics such as a user’s heart rate and similar measurements.
The schema of the table looks as follows:
- time (Timestamp)
- measure_name (String)
- value (Double)
- user_id (Dimension)
Without a partition key, queries on this table may become slower as the dataset grows, particularly when filtering or grouping by user_id. Implementing a user_id partition key can significantly enhance the performance of these queries.
With that in mind, let’s proceed to the migration process.
Creating a new table
As noted above, the goal is to partition the table by user_id. Therefore, let’s create a new Timestream table with the desired partition key:
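If your infrastructure is managed with AWS CDK, a minimal sketch of the new table definition might look like the following. The database name, table name, and retention values are placeholders based on our example; adapt them to your setup.

```typescript
import { CfnTable } from 'aws-cdk-lib/aws-timestream';

// Inside your Stack: a copy of the fitness table, partitioned by user_id.
new CfnTable(this, 'PartitionedFitnessTable', {
  databaseName: 'fitness_db',                 // placeholder
  tableName: 'fitness_metrics_partitioned',   // placeholder
  retentionProperties: {
    memoryStoreRetentionPeriodInHours: '24',
    magneticStoreRetentionPeriodInDays: '365',
  },
  // The customer-defined partition key: partition on the user_id dimension.
  schema: {
    compositePartitionKey: [
      { type: 'DIMENSION', name: 'user_id', enforcementInRecord: 'REQUIRED' },
    ],
  },
  // Magnetic writes must be enabled so historical data can be backfilled later.
  magneticStoreWriteProperties: {
    enableMagneticStoreWrites: true,
  },
});
```

With enforcementInRecord set to REQUIRED, writes that omit user_id are rejected; use OPTIONAL if that is too strict for your ingestion path.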
Updating data ingestion
After creating the table with the partition key, it’s crucial to update all data ingestion processes to write to both old and new tables simultaneously. This ensures data consistency during the migration process.
Key points to consider:
- Identify all ingestion points:
  - This may include Lambda functions, direct application writes, or any other processes writing to the original table.
- Implement parallel ingestion:
  - For each identified ingestion point, implement a corresponding process to write to the new table.
- Ensure data consistency:
  - The data written to both tables must be identical. Pay special attention to non-deterministic values like timestamps. If your current ingestion uses functions like NOW() or CURRENT_TIMESTAMP(), consider passing a fixed timestamp to both write operations. This helps avoid potential data duplication during migration that can occur because of slight timestamp differences.
- Test thoroughly:
  - Verify that data is being written correctly and consistently to both tables before proceeding to the migration step.
In our example, we have a Lambda function responsible for data ingestion. To start writing to the new table, we simply duplicate this function and pass different environment variables to direct the data to the new table. The original ingestion function remains unchanged.
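For illustration, here is a minimal sketch of what such an ingestion Lambda could look like in TypeScript. The event shape, environment variable names, and the heart_rate measure are assumptions from our example; the key detail is that the timestamp arrives in the event instead of being generated inside the function, so both copies write identical records.

```typescript
import { TimestreamWriteClient, WriteRecordsCommand } from '@aws-sdk/client-timestream-write';

const client = new TimestreamWriteClient({});

// Hypothetical event shape produced upstream; the timestamp is generated once by the
// producer so the old-table and new-table copies of this function write identical rows.
interface FitnessEvent {
  userId: string;
  heartRate: number;
  timestampMs: number;
}

export const handler = async (event: FitnessEvent): Promise<void> => {
  await client.send(new WriteRecordsCommand({
    DatabaseName: process.env.TIMESTREAM_DATABASE_NAME!,
    // The duplicated function receives the new table's name via this variable.
    TableName: process.env.TIMESTREAM_TABLE_NAME!,
    Records: [{
      Dimensions: [{ Name: 'user_id', Value: event.userId }],
      MeasureName: 'heart_rate',
      MeasureValue: event.heartRate.toString(),
      MeasureValueType: 'DOUBLE',
      Time: event.timestampMs.toString(),
      TimeUnit: 'MILLISECONDS',
    }],
  }));
};
```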
Performing data migration
While manual data migration is possible, automating the process can save time, reduce errors, and provide better monitoring and control. In this section, you’ll find out how to use AWS CDK to set up an automated migration pipeline that includes:
- A Timestream Scheduled Query for data migration
- An SNS topic to publish migration results
- An SQS queue subscribed to the SNS topic for further processing or monitoring
Let’s look at the CDK code to set up this infrastructure:
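Every setup differs, so treat the snippet below as an illustrative TypeScript sketch rather than a drop-in stack: the database and table names, the heart_rate measure mapping, and the 2-day window are assumptions carried over from our example, and the Timestream seeder (a small custom resource that replicates the table schema) is omitted for brevity.

```typescript
import { Duration } from 'aws-cdk-lib';
import { PolicyStatement, Role, ServicePrincipal } from 'aws-cdk-lib/aws-iam';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { Topic } from 'aws-cdk-lib/aws-sns';
import { SqsSubscription } from 'aws-cdk-lib/aws-sns-subscriptions';
import { Queue } from 'aws-cdk-lib/aws-sqs';
import { CfnScheduledQuery } from 'aws-cdk-lib/aws-timestream';

// Inside your Stack:

// Error reports from failed executions land in this bucket.
const errorBucket = new Bucket(this, 'MigrationErrorBucket');

// Timestream publishes the outcome of every execution to this topic...
const resultTopic = new Topic(this, 'MigrationResultTopic');

// ...and the queue buffers those notifications so a local script can poll them.
const resultQueue = new Queue(this, 'MigrationResultQueue', {
  retentionPeriod: Duration.days(4),
});
resultTopic.addSubscription(new SqsSubscription(resultQueue));

// Role assumed by Timestream to run the scheduled query.
const migrationRole = new Role(this, 'MigrationQueryRole', {
  assumedBy: new ServicePrincipal('timestream.amazonaws.com'),
});
migrationRole.addToPolicy(new PolicyStatement({
  actions: [
    'timestream:Select',
    'timestream:SelectValues',
    'timestream:WriteRecords',
    'timestream:DescribeEndpoints',
  ],
  resources: ['*'], // scope this down to the two tables in a real deployment
}));
resultTopic.grantPublish(migrationRole);
errorBucket.grantWrite(migrationRole);

// Copy a 2-day window ending at @scheduled_runtime from the old table.
// Single-measure tables expose the value as measure_value::double; adjust to your schema.
const migrationQuery = `
  SELECT time, user_id, measure_name, measure_value::double AS measure_value
  FROM "fitness_db"."fitness_metrics"
  WHERE time BETWEEN @scheduled_runtime - 2d AND @scheduled_runtime
`;

new CfnScheduledQuery(this, 'MigrationScheduledQuery', {
  queryString: migrationQuery,
  // The schedule is effectively a placeholder: we trigger executions manually
  // with ExecuteScheduledQuery, so the cron date is set far in the future.
  scheduleConfiguration: { scheduleExpression: 'cron(0 0 1 1 ? 2099)' },
  notificationConfiguration: {
    snsConfiguration: { topicArn: resultTopic.topicArn },
  },
  errorReportConfiguration: {
    s3Configuration: { bucketName: errorBucket.bucketName },
  },
  scheduledQueryExecutionRoleArn: migrationRole.roleArn,
  targetConfiguration: {
    timestreamConfiguration: {
      databaseName: 'fitness_db',
      tableName: 'fitness_metrics_partitioned',
      timeColumn: 'time',
      dimensionMappings: [{ name: 'user_id', dimensionValueType: 'VARCHAR' }],
      measureNameColumn: 'measure_name',
      // One mapping per distinct measure name in your data; heart_rate is assumed here.
      mixedMeasureMappings: [
        { measureName: 'heart_rate', sourceColumn: 'measure_value', measureValueType: 'DOUBLE' },
      ],
    },
  },
});
```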
The CDK code above contains the following resources:
- S3 Bucket: for logging any migration errors that may occur.
- SNS Topic: for publishing migration results, which enables you to know when the migration has completed.
- SQS Queue: for further processing or monitoring of migration results.
- IAM Role: for providing necessary permissions to execute the Scheduled Query, read from the old table, write to the new table, publish to the SNS topic, and log errors in the S3 bucket.
- Scheduled Query: for moving data from the old table to the new one.
- Timestream Seeder: for replicating the table schema.
The Scheduled Query uses a special parameter, @scheduled_runtime, which Timestream automatically provides during execution. This parameter is key to our migration. The example query migrates data in 2-day chunks. To complete the full migration, you need to:
- trigger the query multiple times
- for each run, manually adjust @scheduled_runtime, moving it back by 2 days
Adjust the 2-day interval based on your data volume, considering Timestream’s limitations (e.g. the 5 GB maximum data size for a query result and the 1-hour maximum query duration). For large datasets, you might need smaller intervals (e.g., 6 hours); for smaller datasets, larger intervals could work. This approach helps you avoid AWS service limits and provides checkpoints for validation.
To streamline the process, use a short script that triggers the query multiple times and retrieves the execution results from the SQS queue. You can run it from your local machine.
The script can be run as follows:
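A rough sketch of such a script in TypeScript might look like this; the scheduled query ARN, queue URL, and migration window are placeholders, and the SQS queue created above is used to learn when each execution has finished.

```typescript
import {
  TimestreamQueryClient,
  ExecuteScheduledQueryCommand,
} from '@aws-sdk/client-timestream-query';
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from '@aws-sdk/client-sqs';

// Placeholders: point these at your own resources and migration window.
const SCHEDULED_QUERY_ARN = 'arn:aws:timestream:...:scheduled-query/...';
const RESULT_QUEUE_URL = 'https://sqs....amazonaws.com/.../migration-results';
const MIGRATION_START = new Date('2024-01-01T00:00:00Z');
const MIGRATION_END = new Date('2024-07-01T00:00:00Z');
const STEP_DAYS = 2; // must match the window used in the Scheduled Query

const timestream = new TimestreamQueryClient({});
const sqs = new SQSClient({});

async function waitForResult(): Promise<void> {
  // Poll the queue until Timestream publishes the execution result.
  for (;;) {
    const { Messages } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: RESULT_QUEUE_URL,
      MaxNumberOfMessages: 1,
      WaitTimeSeconds: 20,
    }));
    if (!Messages?.length) continue;
    const message = Messages[0];
    console.log('Execution result:', message.Body);
    await sqs.send(new DeleteMessageCommand({
      QueueUrl: RESULT_QUEUE_URL,
      ReceiptHandle: message.ReceiptHandle!,
    }));
    return;
  }
}

async function main(): Promise<void> {
  // Walk backwards from the end of the migration window in STEP_DAYS chunks,
  // passing each boundary as the invocation time (this becomes @scheduled_runtime).
  for (
    let invocationTime = MIGRATION_END;
    invocationTime > MIGRATION_START;
    invocationTime = new Date(invocationTime.getTime() - STEP_DAYS * 24 * 60 * 60 * 1000)
  ) {
    console.log(`Triggering migration chunk ending at ${invocationTime.toISOString()}`);
    await timestream.send(new ExecuteScheduledQueryCommand({
      ScheduledQueryArn: SCHEDULED_QUERY_ARN,
      InvocationTime: invocationTime,
    }));
    await waitForResult();
  }
}

main().catch(console.error);
```

After filling in the placeholders, run it locally, for example with npx ts-node migrate-data.ts.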
First of all, before running this script, you’ll need to customize it to fit your needs. Specifically, you’ll need to set your own migration period (the time range for migrating data) and define the migration step in days, which should match the value specified in the Scheduled Query.
Second, the script logs failed Scheduled Query executions but doesn’t provide a built-in mechanism for retries. You may want to modify it to handle this. I highly recommend testing it in your pre-production environment before moving to production to ensure everything runs smoothly. If any issues arise, you can still trigger individual executions manually.
Lastly, the current implementation runs sequentially, which can significantly extend the migration time. You may want to enhance it to trigger multiple executions in parallel for greater efficiency.
Validating data integrity
After the data is migrated, you need to verify whether the migration was complete and if the data fully matches what’s in the old table.
Naturally, if the script from the previous section encountered any errors, it would be wise to investigate them to identify what went wrong.
If no issues occurred, you can run a few SELECT queries to ensure that the record counts match and that querying the same dimensions/time returns identical results.
For the table from our example, a query like the following might be a good start:
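As a sketch (again using the example table names and placeholder time bounds), the counts can be compared from a small TypeScript snippet, or you can paste the same query into the Timestream console:

```typescript
import { TimestreamQueryClient, QueryCommand } from '@aws-sdk/client-timestream-query';

const client = new TimestreamQueryClient({});

// Count records in a fixed, already-migrated time range; the bounds are placeholders.
const countQuery = (table: string) => `
  SELECT COUNT(*) AS record_count
  FROM "fitness_db"."${table}"
  WHERE time BETWEEN from_iso8601_timestamp('2024-01-01T00:00:00Z')
                 AND from_iso8601_timestamp('2024-06-30T00:00:00Z')
`;

async function compareCounts(): Promise<void> {
  for (const table of ['fitness_metrics', 'fitness_metrics_partitioned']) {
    const result = await client.send(new QueryCommand({ QueryString: countQuery(table) }));
    console.log(table, result.Rows?.[0]?.Data?.[0]?.ScalarValue);
  }
}

compareCounts().catch(console.error);
```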
This yields the total number of records in both tables.
You may wonder why the time conditions are necessary - this is because both ingestion lambdas continue ingesting records at a slightly different pace, which can lead to small discrepancies in record counts. Additionally, Timestream may evict older records at varying intervals, so you may need to adjust the condition to match your retention period.
One issue I encountered when running a similar query was that the counts differed because the new table had magnetic writes enabled, allowing the new ingester to store records that the original lambda function dismissed. If you notice discrepancies in the counts, it may be helpful to check if this is the cause.
Another query you could run identifies which specific records are present in the old table but missing from the new one. It can be a bit challenging to write if you have a large number of dimensions (since the JOIN condition requires specifying all of them), but it’s still worth running. You could even use AI to assist you in generating the query.
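For our single-dimension example, a sketch of such a query could look like the one below (table names and time bounds are placeholders); it can be executed with the same QueryCommand pattern shown earlier or in the Timestream console.

```typescript
// Rows that exist in the old table but have no exact match (same time, dimensions,
// and measure name) in the new table. Add every dimension to the join condition.
const missingRecordsQuery = `
  SELECT o.time, o.user_id, o.measure_name
  FROM "fitness_db"."fitness_metrics" o
  LEFT JOIN "fitness_db"."fitness_metrics_partitioned" n
    ON  o.time = n.time
    AND o.user_id = n.user_id
    AND o.measure_name = n.measure_name
  WHERE n.time IS NULL
    AND o.time BETWEEN from_iso8601_timestamp('2024-01-01T00:00:00Z')
                   AND from_iso8601_timestamp('2024-06-30T00:00:00Z')
`;
```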
Review the records returned by the query, if there are any, and determine why they are missing in the new table.
Switching application traffic
This step involves updating the source table for your existing queries, whether they come from applications (such as Lambdas or EC2 instances), scheduled queries, or other consumers. Simply redirect them to the new table and monitor the results. To minimize risk, switch them over one at a time and test thoroughly in a pre-production environment. Also, don’t forget to update permissions to ensure access to the new table.
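In CDK, for example, switching a single reading Lambda over can be as small as the sketch below; readerLambda and newTable are placeholders for your own constructs.

```typescript
import { Function as LambdaFunction } from 'aws-cdk-lib/aws-lambda';
import { PolicyStatement } from 'aws-cdk-lib/aws-iam';
import { CfnTable } from 'aws-cdk-lib/aws-timestream';

// Hypothetical helper: point one consumer Lambda at the new table and allow it to query it.
function switchToNewTable(readerLambda: LambdaFunction, newTable: CfnTable): void {
  readerLambda.addEnvironment('TIMESTREAM_TABLE_NAME', newTable.tableName!);
  readerLambda.addToRolePolicy(new PolicyStatement({
    actions: ['timestream:Select'],
    resources: [newTable.attrArn],
  }));
  readerLambda.addToRolePolicy(new PolicyStatement({
    actions: ['timestream:DescribeEndpoints'],
    resources: ['*'], // DescribeEndpoints is not resource-scoped
  }));
}
```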
Decommissioning the old table
The final step is optional - once all the data has been successfully migrated to the new table, you may choose to delete the old table to reduce storage costs. However, you might prefer to keep it for a while, especially since Timestream will automatically delete data over time based on its retention period.
Summary
Adding partition keys to existing Amazon Timestream tables is a valuable optimization that can significantly enhance query performance and reduce costs. This article has outlined a practical approach to implement this change.
Remember to tailor the migration strategy to your specific data volumes and query patterns. With careful planning and execution, you can successfully optimize your Timestream tables, setting a strong foundation for efficient time-series data management as your applications scale.