At Gamesight, we use Amazon Redshift to perform the in-database transformations and aggregations for the ELT pipeline that powers our real-time analytics. Redshift’s massively parallel architecture is well suited for this workload and gives us great performance when transforming large amounts of data. However, once we have our data summarized in Redshift, we need to be give our end users interactive access to selective row level queries to power our application. This is where Redshift can struggle.
While our data pipeline and analytics workloads are doing operations on large datasets, many of the queries that power our interface pull only a handful of records at a time. While Redshift excels at scanning terabytes of data, the tradeoffs made in Redshift to enable that performance make it a poor choice as a database for large volumes of highly selective queries with minimal aggregation. In our environment, we saw that some simple operations such as looking up records by ID from our Redshift cluster could take upwards of 5 seconds under moderate load, which leads to a poor end user experience.
We needed to find a solution that could handle both aggregations for analytical queries and be able to handle selective lookups from our application. Luckily for us, there is a detailed article on the AWS blog covering how to use Postgres and the dblink extension together with Redshift to support this type of workload. This solution creates a direct connection between Postgres and Redshift allowing you to move data directly between the two systems, without any external ETL processes.
I was fortunate to be able to present on this topic at re:Invent 2021 through the AWS Community Builder program. You can find the slides from the talk which dive into more detail about how this pattern works below!
We have been running this solution now in production for around 2 years running over 20M upserts/day. Here are some stats and takeaways:
- We've seen a 90% reduction in response time in our non-analytical queries
- Postgres database is currently holding 3B rows / 4TB of data
- Total cost for the workload is $600/mo, around 70% cheaper than using DynamoDB for this workload which was the other datastore we considered for this system.
- Since Redshift originated from Postgres, the existing application we had was able to be run directly against Postgres with only a minor refactor.
Overall we have been pleased with the reliability of this solution for our application. If you have experience working with dblink or similar workloads I'd love to hear about it - feel free to reach out on Twitter!