I have one table in AWS DynamoDB with 1 million records. Is it possible to query an array of primary key values in a single query, with an additional sort key condition? I am using this in my server-side logic.

Here are the params:

var params = {
  TableName: "client_logs",
  // One partition key value (acc_token) plus a range on the sort key (ts)
  KeyConditionExpression: "#accToken = :value AND ts BETWEEN :val1 AND :val2",
  ExpressionAttributeNames: {
    "#accToken": "acc_token"
  },
  ExpressionAttributeValues: {
    ":value": clientAccessToken,
    ":val1": parseInt(fromDate, 10),
    ":val2": parseInt(toDate, 10),
    ":status": confirmStatus
  },
  // Applied after the key condition has selected the items
  FilterExpression: "apiAction = :status"
};

Here acc_token is the primary key, and I want to query an array of acc_token values in one single query.

2 Answers

No, it is not possible. A single query may search only one specific hash key value. (See DynamoDB – Query.)

You can, however, execute multiple queries in parallel, which will have the effect you desire.
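
A minimal sketch of the parallel approach, under some assumptions: `runQuery` is a hypothetical function you supply that sends one Query and resolves to its result (with the AWS SDK v2 DocumentClient it could be `(p) => docClient.query(p).promise()`), and the `concurrency` cap of 10 is an arbitrary choice so that 200+ tokens do not all hit the table at once.

```javascript
// Build the Query parameters for one acc_token (same shape as the question's params).
function buildTokenQuery(accToken, fromTs, toTs, status) {
  return {
    TableName: "client_logs",
    KeyConditionExpression: "#accToken = :value AND ts BETWEEN :val1 AND :val2",
    FilterExpression: "apiAction = :status",
    ExpressionAttributeNames: { "#accToken": "acc_token" },
    ExpressionAttributeValues: {
      ":value": accToken,
      ":val1": fromTs,
      ":val2": toTs,
      ":status": status,
    },
  };
}

// Run one Query to completion, following LastEvaluatedKey pagination.
async function queryAllPages(runQuery, params) {
  const items = [];
  let startKey;
  do {
    const page = await runQuery(
      startKey ? { ...params, ExclusiveStartKey: startKey } : params
    );
    items.push(...(page.Items || []));
    startKey = page.LastEvaluatedKey;
  } while (startKey);
  return items;
}

// Query many tokens, at most `concurrency` queries in flight at a time.
async function queryTokens(runQuery, tokens, fromTs, toTs, status, concurrency = 10) {
  const results = [];
  for (let i = 0; i < tokens.length; i += concurrency) {
    const batch = tokens.slice(i, i + concurrency);
    const perToken = await Promise.all(
      batch.map((token) =>
        queryAllPages(runQuery, buildTokenQuery(token, fromTs, toTs, status))
      )
    );
    for (const items of perToken) results.push(...items);
  }
  return results;
}
```

Each token still costs at least one request, so this only hides latency; it does not reduce the number of queries.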

Edit (2018-11-21)

Since you said there are 200+ hash keys that you are looking for, here are three possible solutions. These solutions do not require unbounded, parallel calls to DynamoDB, but they will cost you more RCUs. They may be faster or slower, depending on the distribution of data in your table.

I don't know the distribution of your data, so I can't say which one is best for you. In all cases, we can't use acc_token as the sort key of the GSI because you can't use the IN operator in a KeyConditionExpression. (See DynamoDB – Condition.)

Solution 1

This strategy is based on Global Secondary Index Write Sharding for Selective Table Queries.

Steps:

  1. Add a new attribute to items that you write to your table. This new attribute can be a number or string. Let's call it index_partition.
  2. When you write a new item to your table, give it a random value from 0 to N for index_partition. (Here, N is some arbitrary constant of your choice. 9 is probably an okay value to start with.)
  3. Create a GSI with hash key of index_partition and a sort key of ts. You will need to project apiAction and acc_token to the GSI.
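  4. Now, you only need to execute N + 1 queries, one for each value of index_partition from 0 to N. Use a key condition expression of index_partition = :n AND ts between :val1 and :val2 and a filter expression of apiAction = :status AND acc_token IN (:t0, :t1, ...), with one placeholder per token, because IN takes an explicit list of operands rather than a single list value.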
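
The steps above can be sketched roughly as follows. This is an illustration, not a tested implementation: the index name `index_partition-ts-index` and the helper names are made up, and `MAX_SHARD = 9` is just the starting value suggested in step 2. As far as I know, IN accepts at most 100 operands, so a longer token list would have to be split across several queries.

```javascript
const MAX_SHARD = 9; // N in the steps above; shard values are 0..N inclusive

// Step 2: attach a random index_partition to an item before writing it.
function withIndexPartition(item, random = Math.random) {
  return { ...item, index_partition: Math.floor(random() * (MAX_SHARD + 1)) };
}

// IN needs an explicit operand list, so generate one placeholder per value.
function buildInFilter(attrName, values, prefix) {
  const exprValues = {};
  const placeholders = values.map((value, i) => {
    const key = `:${prefix}${i}`;
    exprValues[key] = value;
    return key;
  });
  return {
    expression: `${attrName} IN (${placeholders.join(", ")})`,
    values: exprValues,
  };
}

// Step 4: one Query per shard value, all with the same range and filter.
function buildShardQueries(tokens, fromTs, toTs, status) {
  const inFilter = buildInFilter("acc_token", tokens, "tok");
  const queries = [];
  for (let n = 0; n <= MAX_SHARD; n++) {
    queries.push({
      TableName: "client_logs",
      IndexName: "index_partition-ts-index", // hypothetical GSI name
      KeyConditionExpression: "index_partition = :n AND ts BETWEEN :val1 AND :val2",
      FilterExpression: `apiAction = :status AND ${inFilter.expression}`,
      ExpressionAttributeValues: {
        ":n": n,
        ":val1": fromTs,
        ":val2": toTs,
        ":status": status,
        ...inFilter.values,
      },
    });
  }
  return queries;
}
```

The resulting parameter objects can be sent in parallel, and the results concatenated.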

Solution 2

This solution is similar to the last, but instead of using random GSI sharding, we'll use a date based partition for the GSI.

Steps:

  1. Add a new string attribute to items that you write to your table. Let's call it ts_ymd.
  2. When you write a new item to your table, use just the yyyy-mm-dd part of ts to set the value of ts_ymd. (You could use any granularity you like. It depends on your typical query range for ts. If :val1 and :val2 are typically only an hour apart from each other, then a suitable GSI partition key could be yyyy-mm-dd-hh.)
  3. Create a GSI with hash key of ts_ymd and a sort key of ts. You will need to project apiAction and acc_token to the GSI.
  4. Assuming you went with yyyy-mm-dd for your GSI partition key, you only need to execute one query for every day that falls within :val1 and :val2. Use a key condition expression of ts_ymd = :ymd AND ts between :val1 and :val2 and a filter expression of apiAction = :status AND acc_token IN (:t0, :t1, ...), with one placeholder per token, because IN takes an explicit list of operands rather than a single list value.
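
Here is a rough sketch of the date-partitioned variant. It assumes that ts is stored as epoch milliseconds and that days are cut at UTC midnight; neither is stated in the question, so adjust as needed. The index name `ts_ymd-ts-index` and the function names are hypothetical.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Step 2: derive ts_ymd (yyyy-mm-dd, UTC) from an epoch-millisecond ts.
function toYmd(tsMs) {
  return new Date(tsMs).toISOString().slice(0, 10);
}

// Every ts_ymd value touched by the inclusive range [fromMs, toMs].
function ymdKeysBetween(fromMs, toMs) {
  const keys = [];
  // Start at UTC midnight of the first day and step one day at a time.
  for (let day = Math.floor(fromMs / DAY_MS) * DAY_MS; day <= toMs; day += DAY_MS) {
    keys.push(toYmd(day));
  }
  return keys;
}

// Step 4: one Query per day in the range.
function buildDayQueries(tokens, fromMs, toMs, status) {
  const values = { ":val1": fromMs, ":val2": toMs, ":status": status };
  // One placeholder per token for the IN operand list.
  const placeholders = tokens.map((token, i) => {
    values[`:tok${i}`] = token;
    return `:tok${i}`;
  });
  return ymdKeysBetween(fromMs, toMs).map((ymd) => ({
    TableName: "client_logs",
    IndexName: "ts_ymd-ts-index", // hypothetical GSI name
    KeyConditionExpression: "ts_ymd = :ymd AND ts BETWEEN :val1 AND :val2",
    FilterExpression: `apiAction = :status AND acc_token IN (${placeholders.join(", ")})`,
    ExpressionAttributeValues: { ...values, ":ymd": ymd },
  }));
}
```

The number of queries grows with the width of the time range, not with the number of tokens, which is the main difference from querying per token.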

Solution 3

I don't know how many different values of apiAction there are and how those values are distributed, but if there are more than a few, and they have approximately equal distribution, you could partition a GSI based on that value. The more possible values you have for apiAction, the better this solution is for you. The limiting factor here is that you need to have enough values that you won't run into the 10GB partition limit for your GSI.

Steps:

  1. Create a GSI with hash key of apiAction and a sort key of ts. You will need to project acc_token to the GSI.
  2. You only need to execute one query. Use a key condition expression of apiAction = :status AND ts between :val1 and :val2 and a filter expression of acc_token IN (:t0, :t1, ...), with one placeholder per token.
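
A small sketch of this single-query variant. Because the token list here has 200+ entries, which I believe exceeds what one IN list accepts, this version swaps the filter expression for a client-side filter: it queries the GSI by key condition only and drops unwanted tokens in memory. A filter expression does not reduce the RCUs consumed anyway, since it is applied after the items are read, but filtering on the client does transfer more data. The index name `apiAction-ts-index` and both function names are hypothetical.

```javascript
// Step 2, without the acc_token filter: key condition on the GSI only.
function buildApiActionQuery(fromTs, toTs, status) {
  return {
    TableName: "client_logs",
    IndexName: "apiAction-ts-index", // hypothetical GSI name
    KeyConditionExpression: "apiAction = :status AND ts BETWEEN :val1 AND :val2",
    ExpressionAttributeValues: {
      ":status": status,
      ":val1": fromTs,
      ":val2": toTs,
    },
  };
}

// Keep only the items whose acc_token is in the wanted list.
function keepWantedTokens(items, tokens) {
  const wanted = new Set(tokens);
  return items.filter((item) => wanted.has(item.acc_token));
}
```

The query may still return several pages, so follow LastEvaluatedKey until it is absent before filtering.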

For all of these solutions, you should consider how evenly the GSI partition key will be distributed, and the size of the typical range for ts in your query. You must use a filter expression on acc_token, so you should try to pick a solution that minimizes the total number of items that will match your key condition expression, but at the same time, you need to be aware that you can't have more than 10GB of data for one partition key (for the table or for a GSI). You also need to remember that a GSI can only be queried as an eventually consistent read.

  • But I have close to 200 items in my array, and the number may increase in the future. I don't think querying 200+ times is the right approach. Please suggest any other way I can do this.
    – Test Mail
    Commented Nov 21, 2018 at 21:20
  • Are these 200 keys always the same? Commented Nov 21, 2018 at 21:23
  • Or would it be acceptable to query without the between function? If that’s okay, or if you’re okay with using filter expressions, then a solution is possible using a Global Secondary Index. Commented Nov 21, 2018 at 21:32
  • Thanks for your reply. Yes, those 200 keys are always the same, but new keys get added over time. Unfortunately, the developers who did the initial development did not create any indexes, and the table has now grown to close to one million records. Is there any way I can alter the table now by creating indexes, or does DynamoDB have a feature to copy a table to another table within the region, so that I can run these experiments on a new table instead of on production data?
    – Test Mail
    Commented Nov 22, 2018 at 4:11
  • I know we can use Data Pipeline and S3, but can we do the copy within DynamoDB itself, without using any other service, to save cost? Please advise.
    – Test Mail
    Commented Nov 22, 2018 at 4:28

With a PartiQL SELECT statement, you can efficiently query a set of partition keys and apply an additional condition on the sort key at the same time. The official DynamoDB documentation says:

To ensure that a SELECT statement does not result in a full table scan, the WHERE clause condition must specify a partition key. Use the equality or IN operator.

The documentation doesn't specifically mention the sort key, but it says that additional filtering on a non-key attribute still does NOT cause a full scan. So I am almost sure that a condition on the sort key with one of the supported operators won't cause a table scan either, and that the statement will execute fast and consume as few capacity units as possible.

So your query may look like this:

SELECT * FROM client_logs WHERE acc_token IN (token1, token2, ...) AND ts BETWEEN from_ts AND to_ts

Node.js examples of PartiQL API usage can be found here.
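
As a rough sketch, the statement and its parameters could be built like this. Several things here are assumptions: that acc_token is a string and ts a number, that the low-level AttributeValue format (`{ S: ... }`, `{ N: ... }`) is what your client expects, and the `chunkSize` default of 50, which I picked because I believe PartiQL caps the number of partition key values in one IN list (check the current limit). The IN list uses the square-bracket form that appears in the DynamoDB PartiQL documentation examples. `buildPartiqlStatements` is a made-up helper name.

```javascript
// Build one ExecuteStatement input per chunk of tokens, using ? placeholders.
function buildPartiqlStatements(tokens, fromTs, toTs, chunkSize = 50) {
  const statements = [];
  for (let i = 0; i < tokens.length; i += chunkSize) {
    const chunk = tokens.slice(i, i + chunkSize);
    const marks = chunk.map(() => "?").join(", ");
    statements.push({
      Statement:
        `SELECT * FROM "client_logs" WHERE "acc_token" IN [${marks}] ` +
        `AND "ts" BETWEEN ? AND ?`,
      // Parameters are positional: tokens first, then the two ts bounds.
      Parameters: [
        ...chunk.map((token) => ({ S: token })),
        { N: String(fromTs) },
        { N: String(toTs) },
      ],
    });
  }
  return statements;
}
```

With the AWS SDK v2, each element could then be passed to `dynamodb.executeStatement(stmt).promise()`, following NextToken for further pages.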

  • Worth mentioning that the cost of this operation is proportional to the size of the table. As quoted from the page linked here: "Using the SELECT statement can result in a full table scan if an equality condition with a partition key is not provided in the WHERE clause. A scan operation examines every item for the requested values and can use up the provisioned throughput for a large table or index in a single operation." So someone familiar with SQL may be tempted by this approach, but must be aware that the partition key won't be used at all, the way an indexed column would be in SQL.
    – vincent
    Commented Oct 15, 2022 at 6:32
