
The medallion architecture is a framework introduced by Databricks to structure data flows in Data Lakes and better separate data quality cycles. This structure consists of three successive layers of transformations: bronze (raw data as ingested), silver (cleaned and validated data), and gold (business-ready data, aggregated for consumption).
This simple concept significantly improves the quality of data transformation pipelines, whether within a data platform or even in a large Excel file.
By ensuring a clear separation between raw data (bronze) and consumed data (gold), it allows for high scalability of data flows. The medallion architecture encourages limiting the responsibilities of each table, making it easier to understand and modify calculation rules or even migrate data sources.
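As a minimal sketch of these layers, here is what a bronze-to-gold flow could look like with Polars (covered later in this document); the file name, columns, and cleaning rules are hypothetical.

```python
import polars as pl

# Bronze: raw data, kept exactly as ingested (hypothetical file and columns)
bronze = pl.read_csv("orders_raw.csv")

# Silver: cleaned and validated data (deduplicated, typed, invalid rows removed)
silver = (
    bronze
    .unique(subset=["order_id"])
    .filter(pl.col("amount") > 0)
    .with_columns(pl.col("order_date").str.to_date())
)

# Gold: business-ready aggregates consumed by analysts and dashboards
gold = silver.group_by("customer_id").agg(
    pl.col("amount").sum().alias("total_revenue")
)
```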
However, in a medallion architecture, data is often duplicated (raw, cleaned, filtered…), which can lead to significant costs for organizations already storing large volumes of data. Some pipelines may be less optimized compared to a single-step process, and the increase in tables and dependencies can lengthen workflow execution times. That said, these costs are generally offset by the savings in manual labor time.
Ultimately, this framework has become an industry standard, much like the staging/intermediate/datamart model promoted by dbt.
Theodo’s point of view
We strongly recommend using the medallion architecture in your data projects to ensure scalability and facilitate collaboration. At Theodo, we also adapt this framework by further segmenting each layer into multiple quality levels to maximize its benefits.

AWS Lambda and Google Cloud Run Functions are serverless computing services that allow code execution in response to events without provisioning underlying infrastructure. These solutions enable developers to focus on business logic rather than server management. They are ideal for data transformation pipelines, particularly when workloads are intermittent or unpredictable. These functions are automatically triggered by events such as HTTP requests, database modifications, or file uploads to cloud storage.
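As an illustration, here is a minimal sketch of an AWS Lambda handler in Python triggered by a file upload to S3; the bucket layout, output prefix, and transformation are hypothetical.

```python
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # The S3 trigger passes the uploaded object's location in the event payload
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Read the uploaded file and apply a (hypothetical) transformation
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = [line.upper() for line in body.splitlines()]

    # Write the result under a separate prefix in the same bucket
    s3.put_object(Bucket=bucket, Key=f"processed/{key}", Body="\n".join(rows))
    return {"statusCode": 200, "body": json.dumps({"processed_rows": len(rows)})}
```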
One of the main advantages of Lambda and Cloud Run Functions is their pay-as-you-go pricing, which charges only for execution time, reducing infrastructure costs. Additionally, these services offer automatic scalability, adjusting resources dynamically based on demand without manual intervention. They also simplify maintenance with integrated monitoring and logging tools to track performance in real-time and troubleshoot proactively.
However, these functions have some limitations. Execution time is often restricted to a few minutes, making it difficult to process large data volumes in a single run. Additionally, the temporary container used to execute functions can introduce latency due to cold start times.
Theodo’s point of view
At Theodo, we use AWS Lambda and Google Cloud Run Functions to efficiently and scalably execute data transformation pipelines. We recommend these technologies for short, autonomous, and reactive tasks that require on-demand execution and optimized cost management.



dbt (data build tool) is an open-source framework for developing, testing, and documenting SQL transformations in the data warehouse. Its main strength lies in managing table dependencies (through the declaration of references and sources), refactoring via macros, and integrated documentation. dbt also allows the definition of unit tests that validate the behavior of standard SQL queries. With this feature, it is possible to simulate data using CSV files, known as seeds, and compare transformation results against expected outcomes. This helps establish and maintain good development practices directly within SQL queries over time. It is a key differentiator compared to other solutions like Google Cloud Dataflow or AWS Data Pipeline.
However, creating these unit tests comes with a few challenges to keep in mind.
Theodo’s point of view
We recommend dbt for building robust and maintainable pipelines, thanks to its unit testing capabilities that support continuous development and prevent legacy code buildup. However, for massive data processing or highly specific use cases, tools like Apache Spark or Dataflow may be more suitable, even though dbt remains ahead in terms of development best practices.

As companies increasingly rely on fast and efficient data processing to drive their decisions, they seek to optimize performance. Managing large volumes of data from various sources has become a major challenge. Dataflow is a fully managed GCP service that addresses these challenges by providing a scalable and reliable platform for batch and streaming data processing. Dataflow is built on the open-source Apache Beam programming model, allowing developers to define data processing pipelines that are infrastructure-agnostic and can be deployed across different execution environments.
The key strengths of Dataflow include its ability to handle large datasets and process streaming data with low latency. As a managed service, it removes the need for server configuration, while its auto-scaling capabilities help optimize costs without compromising performance. Dataflow is particularly well suited to scenarios that require robust data integration and real-time analytics capabilities.
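To give an idea of the programming model, here is a minimal batch pipeline written with the Apache Beam Python SDK; the bucket paths, event schema, and filter are hypothetical, and switching the runner option to DataflowRunner (with project and region options) is what sends the same code to Dataflow.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes locally; use runner="DataflowRunner" to run on Dataflow
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/events.jsonl")
        | "Parse" >> beam.Map(json.loads)
        | "KeepPurchases" >> beam.Filter(lambda e: e.get("type") == "purchase")
        | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], e["amount"]))
        | "SumPerUser" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/purchases")
    )
```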
Despite its advantages, Dataflow can be complex to configure and optimize, especially for users unfamiliar with Apache Beam. Additionally, it can generate significant costs at scale, particularly for high-throughput streaming applications.
Theodo’s point of view
At Theodo, we see Dataflow as a powerful option for companies looking for a scalable, robust, and managed solution for complex batch and streaming data processing tasks. However, a steep learning curve is required for those unfamiliar with Apache Beam.
MDN’s point of view
Dataflow requires Apache Beam to implement workflows, using a programming model less SQL-oriented than Spark, and offers fewer memory management options compared to Spark or Flink. However, it remains easier to use and provides good machine learning capabilities, thanks to GPU-powered instances, making it a strong distributed computing tool.

Managing data workflows is essential for data scientists and involves processes such as data preparation and model building pipelines. The complexity of such management has highlighted the inadequacies of traditional orchestration tools like CRON. To address these challenges, Airbnb developed Airflow in 2014. Airflow is an open-source Python library designed for task orchestration, enabling the creation, deployment, and monitoring of complex workflows.
Airflow represents complex data workflows as directed acyclic graphs (DAGs) of tasks. It acts as an orchestrator, scheduling tasks based on their interdependencies while offering a user-friendly web interface for workflow visualization. The library's flexibility in handling various task types simplifies the automation of data processing tasks, contributing to Airflow's popularity in contemporary data management.
For data scientists, setting up workflows with steps like data preprocessing, model training, and performance evaluation can become cumbersome with intricate Bash scripts, which are hard to maintain. Airflow provides a more maintainable solution with its built-in monitoring and error handling capabilities.
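For example, such a pipeline can be expressed as a small Airflow DAG; this is a minimal sketch assuming Airflow 2.4+, and the task functions are hypothetical placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def preprocess_data():
    ...  # load raw data, clean it, write features

def train_model():
    ...  # fit the model on the prepared features

def evaluate_model():
    ...  # compute metrics and fail the task if they regress

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # replaces the legacy schedule_interval argument
    catchup=False,
) as dag:
    preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess_data)
    train = PythonOperator(task_id="train", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate_model)

    # Dependencies define the DAG: preprocess -> train -> evaluate
    preprocess >> train >> evaluate
```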
While Airflow is a popular choice, there are alternatives suited to specific needs. For instance, Dagster facilitates direct data communication between tasks without the need for an external storage service. Kubeflow Pipelines offers specialized ML operators and is geared towards Kubernetes deployment but has a narrower community due to its ML focus. Meanwhile, DVC caters to the experimental phase, providing pipeline definitions and integration with experiment tracking, though it may not be ideal for production environments.
Our Perspective
We recommend Airflow for the robust orchestration of diverse tasks, including production-level Machine Learning pipelines. For developmental stages and model iteration, tools like DVC are preferable due to their superior experiment tracking features.


Popsink is a next-generation ELT tool that adopts a CDC-first (Change Data Capture) approach, optimized for real-time processing. This solution aligns with major industry players by offering flexible data source management (transactional databases, SaaS tools, data warehouses...) and simplified integration into various destinations beyond traditional data warehouses like BigQuery and Snowflake.
Among Popsink’s key advantages are incremental capture capabilities, schema evolution management without breaking changes, and highly competitive costs. Real-time processing consumes fewer resources than batch processing or periodic full refreshes.
Theodo’s point of view
Real-time processing remains a challenge for data teams and platforms. Popsink is a plug-and-play alternative to high-maintenance technologies or expensive SaaS solutions. If your needs align with the available connectors, don’t hesitate to test it. Plus, it’s 100% French. For us, Popsink is already an essential tool, especially for legacy-to-cloud migrations.


A data architecture defines how data is collected, transformed, stored, and exposed to meet various business needs, both short- and long-term, while adhering to governance requirements. Choosing the right architecture is essential to ensure high availability, persistence, optimized storage, and efficient processing of diverse data volumes.
The Lambda architecture is a hybrid model, divided into two flows to handle both real-time and batch data processing: a Batch Layer that periodically recomputes complete, reliable views over the full dataset, and a Speed Layer that processes incoming events in near real time.
The Serving Layer, or exposure layer, plays a key role in making data accessible to systems and end users. It aggregates results from both processing layers, delivering data that is both fresh and reliable while enabling fast and optimized queries. While this architecture is powerful, it can be complex to implement due to the simultaneous management of streaming and batch processing.
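As a simplified illustration of the Serving Layer's role, the sketch below merges a batch view recomputed daily with a speed view of events received since the last batch run; the view names and the metric are hypothetical.

```python
def serve_user_total(user_id: str, batch_view: dict, speed_view: dict) -> float:
    """Merge the Batch Layer's daily recomputation with the Speed Layer's
    real-time increments to answer a query with fresh and reliable data."""
    return batch_view.get(user_id, 0.0) + speed_view.get(user_id, 0.0)

# Hypothetical precomputed views
batch_view = {"user_42": 120.0}   # totals as of last night's batch run
speed_view = {"user_42": 7.5}     # events streamed in since that run
assert serve_user_total("user_42", batch_view, speed_view) == 127.5
```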
Among our clients, the Lambda architecture has provided real-time access to information while ensuring daily data updates. It also meets governance, deduplication, and data cleansing requirements.
Theodo’s point of view
We recommend the Lambda data architecture when high availability and daily processing requirements need to be combined. However, its implementation requires significant effort and incurs maintenance costs that must be carefully considered—ensuring that the investment aligns with the business need is crucial!

Pandas has long been the preferred toolkit for data manipulation and analysis in Python. Its intuitive interface and comprehensive set of features have made it indispensable for data practitioners. However, Pandas encounters performance bottlenecks when handling very large datasets, primarily because its operations are not designed to be parallelized, and it requires data to be loaded entirely into memory.
Enter Polars, a modern DataFrame library written in Rust and first released to the public in 2021. Polars is engineered to overcome the scalability issues of Pandas by supporting multi-threaded computation. This capability, combined with lazy evaluation, enables Polars to efficiently manage and transform datasets that exceed available memory, significantly improving performance.
Polars is designed with an intuitive syntax that mirrors Pandas, making the transition between the two libraries smooth for users. This design choice ensures that data professionals can apply their existing knowledge of Pandas to Polars with minimal learning curve, facilitating adoption.
Despite its advantages, Polars is comparatively newer and thus may not offer the same breadth of functionality as Pandas. However, Polars integrates seamlessly with the Arrow data format, which simplifies the process of converting data between Polars and Pandas. This compatibility allows users to leverage Polars for performance-critical tasks while still accessing Pandas' extensive feature set for specific operations.
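A minimal sketch of this workflow, assuming a recent Polars version and a hypothetical events.csv file:

```python
import polars as pl

# Lazy query: scan_csv reads nothing yet; Polars builds and optimizes a query plan
lazy = (
    pl.scan_csv("events.csv")
      .filter(pl.col("amount") > 0)
      .group_by("user_id")
      .agg(pl.col("amount").sum().alias("total_spent"))
)

# collect() triggers the optimized, multi-threaded execution
totals = lazy.collect()

# Arrow-backed conversion for operations still done in Pandas
totals_pd = totals.to_pandas()
```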
Our Perspective
Given the performance benefits and ease of use, we advocate for adopting Polars in new projects that involve DataFrame manipulation, reserving Pandas primarily for maintaining existing codebases. This strategy allows for leveraging the strengths of both libraries—utilizing Polars for its efficiency and scalability, and Pandas for its established ecosystem and rich functionality.

Dataform was created in 2018 as an open-source framework to simplify the creation, execution, and orchestration of SQL workflows on BigQuery, Snowflake, Redshift, and Synapse. Since its acquisition by Google Cloud in 2020, it has been optimized for and integrated into BigQuery. Like dbt, Dataform lets users declare sources, define transformations, set up data quality tests, and document everything in SQLX files with JSON-like configuration blocks. Additionally, Dataform provides an integrated IDE for lineage visualization, compilation, testing, and live deployment.
Dataform stands out due to its native integration with Google Cloud, making interactions with other services seamless. The tool is included for free with BigQuery, with only compute costs being charged. Its setup and environment management are extremely simple. Its templating system in JavaScript reduces code duplication with great flexibility, although JavaScript is less commonly used by data engineers.
These features make Dataform a serious competitor to dbt, but the tool still lacks maturity. The developer experience is less refined due to the inability to test locally and the limited Git integration in the IDE. Although documentation is available, it remains difficult to navigate, and community support is limited compared to dbt. Furthermore, while dbt integrates well with many open-source technologies like Elementary, Airflow, or Airbyte, the Google Cloud ecosystem offers no equivalent integrations for Dataform.
Theodo’s point of view
Although Dataform offers strong integration within the Google Cloud ecosystem, it has limitations compared to more mature alternatives like dbt. The weaker developer experience and smaller ecosystem make Dataform less appealing for large-scale or scalable projects. Apart from small teams working exclusively with BigQuery, we recommend favoring dbt for its completeness and scalability.

Over the past few years, Snowflake has introduced several key features to orchestrate data transformation workflows directly within the platform. With the introduction of Tasks in 2019 and DAGs in 2022, it is now possible to create pipelines of orchestrated SQL commands directly in Snowflake. This eliminates the need for an external orchestrator, which can be costly and require additional configuration to access data within Snowflake.
In 2022, Snowflake also launched Snowpark, an API that allows large-scale data processing within Snowflake using Python, Java, and Scala. With an interface similar to Spark, Snowpark overcomes SQL’s limitations while benefiting from distributed computing on Snowflake’s infrastructure. Thanks to Stored Procedures in Python, Java, or Scala, these languages can also be integrated into Snowflake DAGs.
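A minimal sketch with the Snowpark Python API; the connection parameters and table names are hypothetical.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

# Hypothetical connection parameters
session = Session.builder.configs({
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "COMPUTE_WH",
    "database": "ANALYTICS",
    "schema": "RAW",
}).create()

# Transformations are translated to SQL and executed on Snowflake's compute
orders = session.table("ORDERS")
daily_revenue = (
    orders.filter(col("STATUS") == "PAID")
          .group_by(col("ORDER_DATE"))
          .agg(sum_(col("AMOUNT")).alias("REVENUE"))
)
daily_revenue.write.save_as_table("ANALYTICS.MARTS.DAILY_REVENUE", mode="overwrite")
```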
In 2023, the introduction of Logging and Tracing for Stored Procedures enabled pipeline monitoring, a crucial feature for ensuring the stability of a production environment.
However, developing an ETL pipeline natively on Snowflake presents some limitations, particularly when compared to more mature ETL tools like dbt or Airflow.
Despite these challenges, the Snowflake ecosystem is evolving rapidly and could become a robust Data Engineering service in the coming years.
Theodo’s point of view
Today, we recommend using Snowflake primarily as a powerful SQL engine while orchestrating transformations with more mature external services (such as a dbt/Airflow stack). However, in cases where industrialization requirements are low or if maintaining a single, unified platform is a priority, Snowflake’s ETL tools can be sufficient.

Spark is an open-source solution for processing large volumes of data by parallelizing computations. Before 2009, the standard method for data transformation in the Hadoop ecosystem was MapReduce. Spark was introduced in 2009 and quickly established itself as a faster alternative, eventually evolving into an independent distributed computing system beyond Hadoop.
The main strength of Spark lies in its execution model. Users define the transformations they want to apply to their data using a domain-specific language (DSL), and at runtime Spark automatically constructs an optimized execution plan. For example, Spark can push a filter down towards the data source (predicate pushdown) so that only the necessary data is loaded, improving efficiency.
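For instance, here is a minimal PySpark sketch; the file paths and columns are hypothetical, and the filter is a candidate for pushdown into the Parquet scan so that only matching data is read.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("purchases").getOrCreate()

# Transformations are lazy: Spark only builds a logical plan here
events = spark.read.parquet("s3://my-bucket/events/")
totals = (
    events.filter(F.col("event_type") == "purchase")   # eligible for predicate pushdown
          .groupBy("user_id")
          .agg(F.sum("amount").alias("total_spent"))
)

totals.explain()   # inspect the optimized plan, including PushedFilters on the scan
totals.write.mode("overwrite").parquet("s3://my-bucket/marts/total_spent/")
```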
However, this strength is also a limitation, as it requires learning a specific DSL. For deeper optimizations, a solid understanding of how Spark builds its execution plan is necessary.
One of the key challenges is the shuffle mechanism, which moves data between cluster nodes before processing it. Additionally, Spark’s distributed nature makes it relevant only for large-scale datasets, where parallel execution across multiple machines is beneficial. For instance, processing a CSV file with a few thousand rows can take several minutes with Spark, whereas pandas would handle it in just a few seconds.
Theodo’s point of view
We recommend choosing Spark only if you require high performance for complex transformations involving large data volumes (several terabytes). If this is not the case, it is better to use the SQL query engine of your data warehouse, avoiding the need to develop expertise in a complex technology.

