
Developing machine learning solutions requires provisioning resources such as databases and compute clusters. Traditionally, these resources were set up manually, which increased the risk of human error and made it harder to redeploy infrastructure quickly.
Infrastructure as Code (IaC) offers a method to create and manage a project's infrastructure resources. With infrastructure defined in files, its setup is automated and version-controlled. This approach minimizes errors and enables environments to be replicated quickly and infrastructure to evolve seamlessly.
Although widely adopted in web and data engineering, IaC is less prevalent in machine learning projects. Applying IaC to define different data storage services, model training environments, and scalable infrastructure for deploying models ensures control over components, costs, and importantly, their scalability and adaptability.
Using IaC effectively requires proficiency with tools like Terraform and adherence to their best practices. Infrastructure as code should be maintained with the same attention to detail and quality as application code.
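Terraform and its HCL configuration language are the most common way to express these definitions. As a Python-flavored sketch of the same idea, the snippet below declares a storage bucket for ML artifacts and a BigQuery dataset using Pulumi's GCP provider (used here as a stand-in for Terraform; resource names and locations are illustrative):

```python
import pulumi
import pulumi_gcp as gcp

# A bucket for ML artifacts (models, datasets), declared as code.
artifacts = gcp.storage.Bucket(
    "ml-artifacts",
    location="EU",
    uniform_bucket_level_access=True,
)

# A BigQuery dataset for feature tables.
features = gcp.bigquery.Dataset(
    "ml_features",
    dataset_id="ml_features",
    location="EU",
)

pulumi.export("artifacts_bucket", artifacts.name)
```

Because the environment is described in code, it can be reviewed, versioned, and reapplied to recreate an identical setup in minutes.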
Theodo’s point of view
We advocate for the use of Infrastructure as Code in machine learning projects. This method offers a more agile, scalable, and efficient way to manage infrastructure, facilitating quicker deployments and enhanced consistency. IaC also improves security and maintenance, which are crucial in ML projects.

Modern data platforms now enable the creation of highly scalable infrastructures using serverless components that operate on a pay-as-you-go model. Services like BigQuery on Google Cloud Platform (GCP) and Synapse Analytics on Microsoft Azure allow data processing and analysis on demand, eliminating the need to manage complex underlying infrastructure.
Serverless platforms provide enhanced scalability, automatically adjusting to fluctuating demand. They also reduce upfront costs by eliminating the need for hardware or software license investments, making them particularly beneficial for Small and Medium Enterprises (SMEs) looking to quickly launch their data platform. Additionally, they simplify operational management through tools like PyAirbyte or DLT (open-source data extraction tools) that run on Cloud Run (GCP’s managed container deployment service), automating data integration processes.
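As an illustration of this kind of automated ingestion, the sketch below loads records into BigQuery with dlt; the pipeline, dataset, and table names are hypothetical, and BigQuery credentials are assumed to be configured in the environment:

```python
import dlt

# In practice the rows would come from an API, a database, or a PyAirbyte source.
rows = [
    {"user_id": 1, "event": "signup"},
    {"user_id": 2, "event": "purchase"},
]

pipeline = dlt.pipeline(
    pipeline_name="events_pipeline",  # hypothetical name
    destination="bigquery",           # serverless warehouse, billed on demand
    dataset_name="raw_events",        # hypothetical dataset
)

# Loads the records, creating the table and schema if they do not exist yet.
load_info = pipeline.run(rows, table_name="events")
print(load_info)
```

Packaged in a container, a script like this can run on Cloud Run on a schedule, with no infrastructure to manage between runs.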
However, these solutions come with challenges: pay-as-you-go pricing can make costs hard to predict without active monitoring, and data security and quality still require rigorous governance.
Theodo’s point of view
We recommend that SMEs looking to develop their data platform adopt serverless cloud solutions for their flexibility and fast deployment. However, it is crucial to complement this adoption with a FinOps strategy and rigorous data governance to control costs and ensure data security and quality.

Setting up data platforms is often complex, costly, and time-consuming. Despite these investments, the ROI of such platforms is frequently low. Dashboards created for business teams often lack relevance, leading users to gradually abandon them due to a lack of trust in the data.
To build effective and sustainable tools, the Data Platform as a Product (DPaaP) approach treats the data platform as a product, placing user experience at the core of its design: dashboards and tables are developed around business teams’ needs, concerns, and preferences.
The DPaaP approach helps create data platforms tailored to business needs, increasing adoption and long-term value. Close collaboration with end users ensures relevance, reliability, and usability of the data provided.
However, this user-centric approach requires time and multiple iterations to meet expectations that tech teams might find unnecessary, such as UI improvements. Additionally, ongoing monitoring and adjustments to accommodate business changes can be costly.
Theodo’s point of view
At Theodo, we use the DPaaP approach to build data platforms that offer a smooth and intuitive user experience. This approach engages users, makes them more accountable for data quality, and helps transition organizations toward a data mesh model. We highly recommend this method for companies struggling with "shadow data" or facing challenges in engaging their business teams.

As climate change becomes a major concern, measuring and reducing the carbon footprint of our activities is essential. Cloud Carbon Footprint is an open-source library that enables carbon tracking across the three major cloud providers (Google Cloud, AWS, Azure), using billing and consumption data.
In Data Engineering and Cloud Computing, accurately measuring the carbon impact of infrastructure and energy usage is challenging.
A comprehensive estimation of a cloud provider’s impact covers the three scopes defined by the GHG Protocol: Scope 1 (direct emissions from the provider’s own operations), Scope 2 (indirect emissions from purchased electricity), and Scope 3 (all other indirect emissions along the value chain, such as server manufacturing).
Cloud providers offer their own carbon impact calculators, but these come with limitations, notably a lack of transparency in their methodologies.
Cloud Carbon Footprint stands out by using a public methodology and emission coefficients, making it an independent standard that is continuously updated by the open-source community. However, its coefficients are four years old, originally measured by Etsy under the name "Cloud Jewels", making it difficult to assess the accuracy of its estimations.
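To make the methodology concrete, here is a simplified sketch of the "Cloud Jewels" style estimate that Cloud Carbon Footprint builds on: energy is derived from vCPU hours and average utilization, then converted to CO2e using the datacenter’s PUE and the regional grid’s carbon intensity. The coefficients below are illustrative placeholders, not the library’s actual values:

```python
MIN_WATTS_PER_VCPU = 0.7      # idle power draw per vCPU (illustrative)
MAX_WATTS_PER_VCPU = 4.3      # full-load power draw per vCPU (illustrative)
PUE = 1.1                     # datacenter power usage effectiveness (illustrative)
GRID_KG_CO2E_PER_KWH = 0.05   # regional grid carbon intensity (illustrative)


def estimate_co2e_kg(vcpu_hours: float, avg_cpu_utilization: float) -> float:
    """Estimate compute emissions (kg CO2e) for a given amount of vCPU usage."""
    avg_watts = MIN_WATTS_PER_VCPU + avg_cpu_utilization * (
        MAX_WATTS_PER_VCPU - MIN_WATTS_PER_VCPU
    )
    energy_kwh = avg_watts * vcpu_hours / 1000
    return energy_kwh * PUE * GRID_KG_CO2E_PER_KWH


print(estimate_co2e_kg(vcpu_hours=10_000, avg_cpu_utilization=0.5))
```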
Theodo’s point of view
Despite outdated coefficients, Cloud Carbon Footprint remains the best available solution for tracking cloud emissions, as cloud provider reports lack transparency. Its setup is quick, thanks to clear documentation. We recommend it for monitoring the carbon footprint of your cloud infrastructure.

When launching a Machine Learning project, there are two main options for the technical stack. The first is using an end-to-end ML platform, where pre-built, tested components save time.
However, this approach comes with the typical drawbacks of managed solutions: higher costs, black-box functionalities, limited customization, restricted integration with other tools, and vendor lock-in. The second option is to use open-source tools and custom code to build a tailor-made stack, avoiding the pitfalls of managed solutions but requiring an initial investment in selecting and setting up the necessary components.
To simplify this second approach, we developed Sicarator, a project generator that allows users to quickly set up a high-quality ML project with the latest open-source technologies.
Initially created in 2022 for internal use, Sicarator became open-source a year later after proving its efficiency across more than twenty projects.
Through a guided command-line interface, users can generate a project structure that follows current best practices.
The generated code includes documentation, ensuring a smooth user experience. The tool is designed with a code-centric approach, maximizing control for data scientists and ML engineers. It evolves to reflect best practices in the ecosystem—for example, Ruff has recently replaced PyLint and Black as the linter/formatter.
However, Sicarator does not provide the full-fledged automation of advanced platforms, requiring additional manual setup. For instance, at this stage, it does not include automated training instance deployment.
Theodo’s point of view
We recommend Sicarator to teams that want a code-centric, open-source ML stack with full control over their components, while avoiding the costs and vendor lock-in of end-to-end platforms. Teams should, however, be prepared for some additional manual setup compared with fully managed solutions.

Data has long been a critical strategic asset for businesses. However, the growing complexity of information systems and business processes often makes collaboration between business and technical teams challenging. Event Storming, a collaborative and visual method derived from Domain-Driven Design, proves particularly useful for designing a data model that aligns with both business needs and analytical requirements.
The Event Storming workshop is structured to foster a shared understanding of the system. It begins with the collection of key business events, where each participant contributes by placing sticky notes on a wall to represent these events chronologically. This phase is followed by identifying actors, commands, and aggregates, helping to build an overview of the domain. This interactive approach enhances communication and highlights areas of complexity or uncertainty.
Although rarely used in data analysis, Event Storming helps map business events and identify the data needed for analysis and decision-making. For data modeling, this clarifies which entities and events need to be represented and surfaces areas of complexity early.
However, this method may be less suitable for simpler environments, where an approach like Design Thinking might be more appropriate. It also requires careful preparation and an experienced facilitator to maximize its effectiveness.
Theodo’s point of view
We recommend Event Storming for effectively modeling complex data systems and enhancing collaboration between business and technical teams. As an alternative, for more detailed process modeling, we suggest using BPMN, which provides a more precise representation of workflows.

Elementary is an innovative open-source solution addressing a critical need in data engineering: ensuring data reliability. With the growing adoption of dbt in recent years, a gap has emerged in data monitoring and quality management. Elementary fills this void by providing a system for tracking and improving data quality, helping eliminate the risks of inconsistent or incomplete data.
Before Elementary, data quality management relied on custom scripts or dedicated tools like Great Expectations, which were often complex and costly. Elementary simplifies this process through its native integration with dbt, making it easier to implement quality checks within ETL (Extract, Transform, Load) workflows. While dbt lacks a built-in interface for tracking data trends over time, Elementary stands out with its history tracking and visualization layer, offering an intuitive dashboard that is easy to install and deploy. A single scheduled command can generate automated reports, making the process transparent and accessible to data analysts, analytics engineers, and business teams.
Additionally, Elementary includes advanced features such as anomaly detection, adding another layer to data monitoring. However, these features may require adjustments depending on specific use cases.
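In practice, the workflow can stay very light: run dbt, then generate Elementary’s report from a scheduler or CI job. A minimal sketch, assuming the elementary dbt package is installed in the project and the edr CLI is configured:

```python
import subprocess

# Build the dbt project; Elementary's package collects test results and
# run artifacts along the way.
subprocess.run(["dbt", "build"], check=True)

# Generate the observability report (an HTML dashboard) from those results.
subprocess.run(["edr", "report"], check=True)
```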
Theodo’s point of view
Theodo recommends Elementary for its ease of integration, especially for companies already using dbt and looking to strengthen data quality. Its active community ensures quick access to support and continuously evolving features. However, as an emerging technology, it is evolving rapidly, which may present an adaptation challenge for some organizations.


SQLFluff is an open-source SQL linter and formatter designed to improve code quality by enforcing consistent coding standards and detecting potential errors, regardless of the SQL dialect used. Before SQLFluff, static code analysis for SQL was often overlooked, with developers relying on manual reviews to ensure consistency and quality—an inefficient and error-prone process. SQLFluff addresses this gap by providing an automated tool for linting and formatting, making SQL code easier to maintain and fostering better collaboration.
The key advantages of SQLFluff include its high flexibility to adapt to project-specific coding standards and its support for multiple SQL dialects. Its auto-correction feature can speed up development by automatically fixing style and syntax issues.
However, SQLFluff has some limitations. Auto-correction can sometimes introduce changes that break SQL code, especially with complex or non-standard constructs. Additionally, when linting large files or many files at once, the tool can become slow, impacting productivity. The initial configuration may also be challenging, with potential conflicts between rules requiring careful customization.
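Beyond its command line, SQLFluff exposes a simple Python API, which gives a feel for how linting and auto-correction work; the dialect below is chosen for illustration:

```python
import sqlfluff

query = "SELECT  a,b FROM my_table"

# Lint: returns a list of rule violations (code, description, position).
for violation in sqlfluff.lint(query, dialect="ansi"):
    print(violation)

# Fix: returns the query reformatted according to the configured rules.
print(sqlfluff.fix(query, dialect="ansi"))
```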
Theodo’s point of view
We recommend using SQLFluff, as having a linter is essential for maintaining SQL code quality. However, it is important to be aware of its limitations. Use auto-correction carefully, optimize linting for large files, and ensure a well-configured setup to maximize its benefits.

Sifflet is a data observability platform launched in 2021. It helps organizations monitor and improve the quality, reliability, and traceability of their data across pipelines. By providing key indicators such as the number of null values, table volume, and the uniqueness of identifiers, Sifflet ensures robust and reliable data governance.
What sets Sifflet apart is its no-code interface, making it easy to use. Creating data quality rules is intuitive, and real-time notifications are sent via email, Teams, or Slack in case of non-compliance. The platform also enhances data accessibility through its catalog and data lineage visualization. It integrates with a wide range of solutions, including data warehouses, data lakes, BI tools, ELT/ETL solutions, and data orchestrators.
However, as a relatively new tool, Sifflet has some limitations. Certain integrations, such as Dagster or Google Chat, are not yet supported. Using "Monitors As Code" can make debugging more complex due to its YAML format. When dealing with large numbers of tables, creating rules can become time-consuming. Additionally, monitoring options are limited, restricting visualization customization. Lastly, pricing is only available upon request, which may be a barrier for some organizations. Alternatives like Monte Carlo or Elementary may also be worth considering.
Theodo’s point of view
We recommend Sifflet for its ease of use, particularly for organizations with low technical expertise, a relatively simple data platform, but strict data quality requirements. However, its high cost and limited functionalities may be obstacles to adoption.

Using production data in pre-production or development environments is a common practice in data projects. This approach helps simulate real conditions and improve the quality of testing and development. Teams can assess application performance, making it easier to adjust and optimize before deployment. The main advantage is that it enhances development reliability by quickly identifying bugs and enabling data-driven decisions.
However, this technique raises serious security and isolation concerns. If non-production environments are not properly isolated from the production network, there is a risk of compromising production data. Additionally, direct access to sensitive data without proper protection violates regulations such as GDPR and standards like ISO 27001 or SOC 2. To ensure compliance, it is crucial to protect data confidentiality through anonymization techniques.
These practices are particularly complex and require balancing risks with an appropriate level of security, depending on the project’s specific needs. A common alternative is using synthetic data, which eliminates security and isolation concerns by avoiding access to sensitive production data. However, this approach has limitations in terms of representing real-world scenarios and can be time-consuming to implement, reducing overall effectiveness.
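As a lightweight illustration of the synthetic route, the sketch below generates stand-in customer records with the Faker library, a rule-based alternative to the generative models mentioned in our point of view; the schema is hypothetical:

```python
from faker import Faker

Faker.seed(42)  # reproducible fake data
fake = Faker()

# Stand-in customer records for a development database; no production data involved.
customers = [
    {
        "customer_id": i,
        "name": fake.name(),
        "email": fake.email(),
        "signup_date": fake.date_between(start_date="-2y", end_date="today").isoformat(),
    }
    for i in range(1_000)
]
```

Records like these can seed a development environment without ever exposing production data.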
Theodo’s point of view
At Theodo, we firmly believe that sensitive production data should not be used in development environments. Given the lack of ideal anonymization tools, we are successfully experimenting with synthetic data generation models (GenAI), allowing us to iterate quickly while maintaining a high level of security and data isolation.

"Data platform on-premise" refers to solutions where physical servers are installed either within a company’s premises or in a third-party data center. Unlike cloud-based systems provided by vendors, an on-premise platform requires full control over the physical infrastructure, comprehensive security management, and direct responsibility for software deployment.
Before the rise of cloud computing, all businesses relied on on-premise servers, but this approach involved heavy management overhead. Cloud providers later introduced managed services with a pay-as-you-go model, reducing upfront investment costs and offering greater agility.
Today, while cloud solutions dominate, some industries such as healthcare, finance, and government continue to prefer on-premise platforms to ensure data confidentiality, sovereignty, and full customization of software and hardware.
However, on-premise platforms come with several disadvantages: significant upfront hardware investment, ongoing maintenance and security responsibilities, and slower, more costly scaling than the cloud.
Companies like Cloudera, Oracle, and IBM simplify on-premise deployments by handling infrastructure, maintenance, and software compatibility. However, these solutions remain more complex to implement than their cloud counterparts and involve additional costs in the form of subscriptions, licenses, and installation fees.
Theodo’s point of view
At Theodo, we work with clients in the banking and healthcare sectors on on-premise platforms. These projects tend to be long-term and require experienced professionals. We only recommend on-premise solutions for companies with strict regulatory or technical requirements that justify this choice.

Kibana is a monitoring tool primarily used for analyzing and visualizing logs and metrics. It is an integral part of the Elastic Stack (formerly ELK Stack), alongside Elasticsearch and Logstash. Kibana allows users to create charts, interactive dashboards, and reports based on data indexed in Elasticsearch, providing a powerful interface for exploring and analyzing logs in real-time.
While Kibana is effective for log management, using it for data quality monitoring comes with several limitations: it was not designed for this purpose, quality metrics must first be computed and indexed in Elasticsearch, and the corresponding dashboards and alerts have to be built and maintained by hand.
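Concretely, tracking a quality metric in Kibana means computing it yourself and pushing it to Elasticsearch; a minimal sketch with the official Python client, where the index name and document fields are hypothetical:

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

# Connect to the cluster backing Kibana (the URL is an assumption for this sketch).
es = Elasticsearch("http://localhost:9200")

# A quality metric computed elsewhere, e.g. the null rate of a column.
es.index(
    index="data-quality-metrics",
    document={
        "table": "orders",
        "metric": "null_rate_customer_id",
        "value": 0.02,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    },
)
```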
For more efficient data quality monitoring, alternatives like Elementary, which automatically visualizes dbt tests, or Sifflet, with its ready-to-use dashboards, may be more suitable. For customized tracking, a BI dashboard can also be a viable option.
Theodo’s point of view
A few years ago, we used Kibana for data quality monitoring. While it was effective for simple metrics, it proved too limited for more complex needs. We therefore recommend Kibana only for application log analysis and visualization.

Great Expectations is an open-source Python framework designed for evaluating and monitoring data quality. Its integration with most market tools is straightforward and fast, thanks to its compatibility with various data sources and orchestration tools. Additionally, the community actively contributes to the development of packages that provide predefined quality checks.
Great Expectations can connect to major database technologies (such as PostgreSQL, BigQuery) and storage solutions (like AWS S3, Google Cloud Storage). It also offers a logging and alerting system for quality check results, providing better visibility into data quality status.
However, Great Expectations has several significant limitations. Its learning curve is steep due to complex concepts that are not easy to grasp. Creating custom quality checks is often challenging and unintuitive, especially given that its documentation is insufficient for easing adoption. Additionally, the tool lacks robust features for cross-table validation, limiting its applicability. Finally, the user experience of its quality reports is suboptimal, making it difficult to track quality trends over time.
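For a sense of the developer experience, here is a minimal check using the legacy pandas interface; newer releases expose a different, context-based API, and the sample data is illustrative:

```python
import great_expectations as ge
import pandas as pd

df = pd.DataFrame(
    {"user_id": [1, 2, 2, None], "email": ["a@x.io", "b@x.io", None, "d@x.io"]}
)

# Wrap the DataFrame with Great Expectations' legacy pandas interface.
gdf = ge.from_pandas(df)

# Each expectation returns a result object with a `success` flag and details.
print(gdf.expect_column_values_to_not_be_null("user_id"))
print(gdf.expect_column_values_to_be_unique("user_id"))
```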
Theodo’s point of view
We do not recommend using Great Expectations due to its limitations, which make it difficult to adopt and scale effectively. For data quality monitoring, we suggest considering alternatives like Elementary, particularly if your pipeline relies on dbt.
