
Replacing Hadoop with a data warehouse built on Trino, running on an autoscaling microservice architecture that handles the finance data of millions of German customers.
Case Study: Revolutionizing Data Management for Atruvia with Open Source Solutions
Client: Atruvia (IT Provider for Volksbank and Sparkasse)
From Hadoop to Trino: Modernizing Data Warehousing for Cost-Effective, Scalable Solutions
At Atruvia, we undertook a transformative journey to replace an expensive Hadoop infrastructure with a cutting-edge data warehouse built on Trino and S3 autoscaling clusters. The goal was to create a scalable, cost-efficient, and user-friendly solution, ensuring compliance with BaFin regulations while optimizing data performance for end-users.
Objective
To build a scalable, efficient, and cost-effective data warehouse solution that meets regulatory compliance standards and enhances data performance for analytics teams.
Solution Design Process
- Requirement Analysis
  - Engaged with Atruvia’s IT and analytics teams to identify needs such as cost reduction, data scalability, and ease of use.
  - Highlighted regulatory requirements, ensuring all solutions complied with BaFin guidelines.
- Technology Evaluation
  - Selected Trino for its advanced SQL query capabilities and seamless integration with S3 autoscaling clusters for efficient, distributed data storage (see the sketch after this list).
  - Prioritized open-source technologies to cut licensing costs while maintaining high performance.
- Architecture Design
  - Built a microservice-based architecture using OpenShift, hosting the entire data warehouse and analytics environment.
  - Implemented S3 autoscaling clusters for cost-effective, scalable storage, replacing traditional databases.
  - Developed a BaFin-compliant framework to manage security and regulatory needs.
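To make the Trino-plus-S3 idea concrete, here is a minimal sketch of how an S3-backed Parquet table can be defined through Trino's Hive connector. The coordinator host, catalog, schema, bucket, and column names are illustrative placeholders, not the configuration used at Atruvia.

```python
# Minimal sketch: defining an S3-backed Parquet table through Trino.
# Assumes a Trino coordinator and a Hive-compatible catalog named "lake"
# pointing at the S3/MinIO storage (all names here are illustrative).
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # hypothetical coordinator address
    port=443,
    http_scheme="https",
    user="dwh_admin",
    catalog="lake",
    schema="finance",
)
cur = conn.cursor()

# Data lands as Parquet objects in an S3 bucket instead of a classic RDBMS.
cur.execute("""
    CREATE TABLE IF NOT EXISTS lake.finance.transactions (
        transaction_id  VARCHAR,
        customer_id     VARCHAR,
        region          VARCHAR,
        amount          DECIMAL(18, 2),
        booked_at       TIMESTAMP
    )
    WITH (
        format = 'PARQUET',
        external_location = 's3a://dwh-finance/transactions/'
    )
""")
```

From the analyst's point of view this behaves like any other SQL table, while the actual bytes live as cheap, replicated objects in the autoscaling storage cluster.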
Implementation
- Infrastructure Setup
  - Deployed Trino and S3 clusters in the OpenShift environment for high availability and performance.
  - Configured a robust microservice architecture to handle data ingestion, processing, and querying at scale.
- Data Migration
  - Migrated data from the Hadoop infrastructure to the new Trino-S3 setup, ensuring data integrity and regulatory compliance (a rough sketch follows after this list).
- User Tools & Training
  - Delivered pre-configured Jupyter Notebook environments for analytics teams, enabling them to analyze and visualize data with minimal setup (see the notebook sketch after this list).
  - Integrated interactive dashboards for real-time insights, reducing reliance on complex configurations such as PySpark.
  - Provided training and ongoing support to ensure users maximized the benefits of the new environment.
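As a rough illustration of the migration step: because Trino can federate catalogs, one way to move a table is to read from the legacy Hadoop/Hive catalog and write Parquet into the new S3-backed catalog with a single CTAS statement. The catalog and table names below are placeholders, and the real migration also involved validation and compliance checks beyond this sketch.

```python
# Sketch of a table-by-table migration using Trino federation:
# read from the legacy Hadoop/Hive catalog ("legacy_hadoop"), write Parquet
# into the new S3-backed catalog ("lake"). All names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=443,
    http_scheme="https",
    user="migration_job",
)
cur = conn.cursor()

tables = ["accounts", "transactions", "customers"]  # illustrative list
for table in tables:
    cur.execute(f"""
        CREATE TABLE IF NOT EXISTS lake.finance.{table}
        WITH (format = 'PARQUET')
        AS SELECT * FROM legacy_hadoop.finance.{table}
    """)
    # Basic integrity check: row counts must match before the old table is retired.
    cur.execute(f"SELECT count(*) FROM lake.finance.{table}")
    new_count = cur.fetchone()[0]
    cur.execute(f"SELECT count(*) FROM legacy_hadoop.finance.{table}")
    old_count = cur.fetchone()[0]
    assert new_count == old_count, f"row count mismatch for {table}"
```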
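And this is roughly what the pre-configured notebook experience looks like for an analyst: a plain SQL query sent through the Trino Python client and pulled straight into pandas, with no Spark session or cluster configuration involved. Host, catalog, and column names are again illustrative.

```python
# Sketch of an analyst notebook cell: query Trino, load results into pandas.
# Plotting assumes matplotlib is available in the notebook image.
import pandas as pd
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # hypothetical coordinator address
    port=443,
    http_scheme="https",
    user="analyst",
    catalog="lake",
    schema="finance",
)
cur = conn.cursor()
cur.execute("""
    SELECT region, sum(amount) AS revenue
    FROM transactions
    WHERE booked_at >= DATE '2024-01-01'
    GROUP BY region
    ORDER BY revenue DESC
""")
df = pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])
df.plot.bar(x="region", y="revenue")  # quick visualization in the notebook
```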
Results
✅ Cost Reduction
Replaced expensive Hadoop infrastructure with an open-source solution, significantly cutting data management costs.
✅ Improved Scalability & Performance
Enabled seamless handling of massive datasets using Trino and S3, with real-time processing across distributed systems.
✅ Regulatory Compliance
Built a secure, BaFin-compliant framework for reliable data management.
✅ Enhanced User Experience
Empowered analytics teams with intuitive tools, eliminating the need for complex workflows and enabling faster, more efficient insights.
This project revolutionized Atruvia’s data management capabilities by combining the power of Trino, S3 autoscaling clusters, and a user-centric approach. The result? A cost-effective, scalable, and compliant data warehouse that empowers teams to derive actionable insights without technical bottlenecks.
Ready to Transform Your Data Infrastructure?
Let us help you build a modern, scalable, and compliant data management solution tailored to your needs. Visit DataProfitExpert.com or contact us today!
Full Transcript
In my project with two of Germany’s largest banking groups, Sparkasse and Volksbank, I’ve been working on a data warehouse. They were actually doing quite well already, because they were using Hadoop as a distributed big data storage system, which is, as you might know, different from traditional databases like Oracle or Microsoft SQL Server. Those are not meant to store huge amounts of data, and especially not for analytical workloads where you aggregate over columns, like how much revenue we generated in regions X and Y. That usually tends to be really, really slow in a classic SQL database, and this is why you would use something like Hadoop to speed up analytical processing. Another nice thing about a distributed system like Hadoop, or the new system we implemented, is that it does not run on one huge vertically scaled server, meaning you don’t need these expensive machines with 200 gigabytes of RAM.
Instead, you take a lot of smaller consumer-grade servers that are all linked together, and the workload is distributed across all these tiny, cheap servers. Something they were facing is that the support prices for their Hadoop setup had increased a lot recently, a little bit similar to what you might know from VMware. So they wanted to switch to an open source system. What we did is use Kubernetes, or in their case OpenShift as managed Kubernetes, as the basis, with Trino as a distributed query engine and the data saved as Parquet files in different S3 buckets. As you might know from AWS, S3 is an object storage, and the smart thing is that objects are really easy to replicate across different servers. That means we keep three copies of the data on three different servers, all mixed across the cluster, so one server burning down does not cause a huge data loss; the data is split up across the distributed system. Therefore we are safe from data-loss accidents, and, like I said, we can run the calculations not on one server but on many, many servers simultaneously.
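To show what "data saved as Parquet files in S3 buckets" means in practice, here is a small sketch using plain S3 API calls against an S3-compatible endpoint. The endpoint, credentials, bucket, and object keys are made up for this example and are not the real setup.

```python
# Sketch: the storage layer is just S3-compatible object storage, so ordinary
# S3 API calls work against it. Endpoint, credentials, and bucket names are
# illustrative placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.internal",  # hypothetical S3/MinIO endpoint
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

# Parquet files are uploaded as ordinary objects; the storage cluster takes
# care of spreading copies across nodes, so losing one server does not lose data.
s3.upload_file(
    Filename="transactions-2024-05.parquet",
    Bucket="dwh-finance",
    Key="transactions/year=2024/month=05/part-000.parquet",
)

# Trino later reads these objects directly; there is no database server in between.
for obj in s3.list_objects_v2(Bucket="dwh-finance", Prefix="transactions/")["Contents"]:
    print(obj["Key"], obj["Size"])
```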
And as you might know, a classic SQL database is not really meant for very large amounts of data, but an object storage is. Using tools like MinIO, we can replicate the AWS S3 experience and have a distributed storage system with the Trino query engine on top. That means we can not only connect the S3 buckets for big data, but also connect traditional databases. It’s kind of like an aggregator for different SQL sources. The nice thing in the end is that we have one backend to connect to all our different data sources, which can be anything: it could even be emails saved into a bucket, PDF files, images, and so on. We can query information from all these different sources via this one Trino connection. That is what I did in this project with Sparkasse and Volksbank, and it is actually used a lot nowadays, especially for analyzing bank account statements and transactions, for example detecting errors in transactions, preventing credit card fraud, and so on.
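As a small illustration of this federation idea, the same Trino connection can join Parquet data in an S3 bucket with a table sitting in a traditional relational database. The catalog names lake (S3/Hive) and crm (e.g. a PostgreSQL connector) are invented for this sketch.

```python
# Sketch of a federated query: one SQL statement joins big data on S3
# ("lake" catalog) with a legacy relational database ("crm" catalog,
# e.g. a PostgreSQL connector). All names are illustrative.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=443,
    http_scheme="https",
    user="analyst",
)
cur = conn.cursor()
cur.execute("""
    SELECT c.segment,
           count(*)      AS transactions,
           sum(t.amount) AS total_amount
    FROM lake.finance.transactions AS t
    JOIN crm.public.customers      AS c
      ON t.customer_id = c.customer_id
    GROUP BY c.segment
""")
for row in cur.fetchall():
    print(row)
```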
And this can all be done with one system, which is a huge game changer, especially in large enterprises where you usually have millions of different data sources and no one knows what’s happening. That is exactly where a data catalog comes into play, which is kind of a registry of your Trino data sources, not only with the technical links like the API calls required to connect to them, but also with meta information about the company. For example, you can tag different storage locations or tables as marketing data or department data. Let’s say you have a specialized department for plants; you can tag the relevant data sources accordingly.
You can add descriptions, so you basically have interactive documentation about your data. And using RBAC (role-based access control) together with such a data catalog and data warehouse, you can trim permissions down to the minimum, meaning your plant department will only be able to access plant data sources. You can even add small refinements, like saying the plant department can only see pseudonymized or anonymized versions of these tables, which is all done on the fly by, for example, the Trino engine. That opens a huge range of possibilities. For example, if you’re working with sensitive data, like a clinic I’ve been working with, you can limit access to the real source data to, say, the research department, and publish the data for everyone else in an anonymized way. Meaning, for example, marketing can see all the data, but only anonymized, which is done on the fly automatically, while your research department is able to see everything.
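One simple way to get this "anonymized on the fly" behaviour is sketched below: a view that pseudonymizes the sensitive columns, with the raw table granted only to the research role and the view to everyone else. Trino also offers access-control rules with row filters and column masks for the same purpose; the catalog, role, and column names here are purely illustrative, and GRANT support depends on the connector and access-control setup in use.

```python
# Sketch: publish a pseudonymized view of sensitive data while restricting
# the raw table. Catalog/role names are placeholders, and GRANT support
# depends on the connector and access-control configuration.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=443,
    http_scheme="https",
    user="dwh_admin",
    catalog="lake",
    schema="finance",
)
cur = conn.cursor()

# The view hashes the customer identifier and drops direct identifiers,
# so consumers only ever see pseudonymized rows, computed at query time.
cur.execute("""
    CREATE OR REPLACE VIEW lake.finance.transactions_anon AS
    SELECT to_hex(sha256(to_utf8(customer_id))) AS customer_pseudonym,
           region,
           amount,
           booked_at
    FROM lake.finance.transactions
""")

# Marketing only gets the anonymized view; research keeps the base table.
cur.execute("GRANT SELECT ON lake.finance.transactions_anon TO ROLE marketing")
cur.execute("GRANT SELECT ON lake.finance.transactions TO ROLE research")
```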
And that comes down to what a good data warehouse is: like I said, it’s an aggregation of different data sources, it uses modern tooling to save license costs instead of relying on those huge, expensive Oracle servers, and it acts a bit like a data catalog, so everyone knows what data we have and what we can do with it. If you’re interested in something similar, you can either look around on my blog, because I have a lot of tutorial articles, or you can book a consultation session. You can find everything on my website. OK, I wish you good luck and see you soon.