
Developing machine learning solutions requires provisioning resources such as databases and compute clusters. Traditionally, these resources were set up manually, which increased the risk of human error and made it harder to redeploy infrastructure quickly.
Infrastructure as Code (IaC) offers a method to create and manage a project's infrastructure resources. With infrastructure defined in files, its setup is automated and version-controlled. This approach minimizes errors and enables environments to be replicated quickly and infrastructure to evolve seamlessly.
Although widely adopted in web and data engineering, IaC is less prevalent in machine learning projects. Applying IaC to define different data storage services, model training environments, and scalable infrastructure for deploying models ensures control over components, costs, and importantly, their scalability and adaptability.
Using IaC effectively requires proficiency with tools like Terraform and adherence to their best practices. Infrastructure as code should be maintained with the same attention to detail and quality as application code.
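Terraform and its HCL configuration language are the most common way to express these definitions. As a Python-flavored sketch of the same idea, the snippet below declares a storage bucket for ML artifacts and a BigQuery dataset using Pulumi's GCP provider (used here as a stand-in for Terraform; resource names and locations are illustrative):

```python
import pulumi
import pulumi_gcp as gcp

# A bucket for ML artifacts (models, datasets), declared as code.
artifacts = gcp.storage.Bucket(
    "ml-artifacts",
    location="EU",
    uniform_bucket_level_access=True,
)

# A BigQuery dataset for feature tables.
features = gcp.bigquery.Dataset(
    "ml_features",
    dataset_id="ml_features",
    location="EU",
)

pulumi.export("artifacts_bucket", artifacts.name)
```

Because the environment is described in code, it can be reviewed, versioned, and reapplied to recreate an identical setup in minutes.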
Theodo’s point of view
We advocate for the use of Infrastructure as Code in machine learning projects. This method offers a more agile, scalable, and efficient way to manage infrastructure, facilitating quicker deployments and enhanced consistency. IaC also improves security and maintenance, which are crucial in ML projects.

Modern data platforms now enable the creation of highly scalable infrastructures using serverless components that operate on a pay-as-you-go model. Services like BigQuery on Google Cloud Platform (GCP) and Synapse Analytics on Microsoft Azure allow data processing and analysis on demand, eliminating the need to manage complex underlying infrastructure.
Serverless platforms provide enhanced scalability, automatically adjusting to fluctuating demand. They also reduce upfront costs by eliminating the need for hardware or software license investments, making them particularly beneficial for Small and Medium Enterprises (SMEs) looking to quickly launch their data platform. Additionally, they simplify operational management through tools like PyAirbyte or DLT (open-source data extraction tools) that run on Cloud Run (GCP’s managed container deployment service), automating data integration processes.
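As an illustration of this kind of automated ingestion, the sketch below loads records into BigQuery with dlt; the pipeline, dataset, and table names are hypothetical, and BigQuery credentials are assumed to be configured in the environment:

```python
import dlt

# In practice the rows would come from an API, a database, or a PyAirbyte source.
rows = [
    {"user_id": 1, "event": "signup"},
    {"user_id": 2, "event": "purchase"},
]

pipeline = dlt.pipeline(
    pipeline_name="events_pipeline",  # hypothetical name
    destination="bigquery",           # serverless warehouse, billed on demand
    dataset_name="raw_events",        # hypothetical dataset
)

# Loads the records, creating the table and schema if they do not exist yet.
load_info = pipeline.run(rows, table_name="events")
print(load_info)
```

Packaged in a container, a script like this can run on Cloud Run on a schedule, with no infrastructure to manage between runs.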
However, these solutions come with challenges: pay-as-you-go pricing can make costs hard to predict without active monitoring, and data security and quality still require rigorous governance.
Theodo’s point of view
We recommend that SMEs looking to develop their data platform adopt serverless cloud solutions for their flexibility and fast deployment. However, it is crucial to complement this adoption with a FinOps strategy and rigorous data governance to control costs and ensure data security and quality.

Setting up data platforms is often complex, costly, and time-consuming. Despite these investments, the ROI of such platforms is frequently low. Dashboards created for business teams often lack relevance, leading users to gradually abandon them due to a lack of trust in the data.
To build effective and sustainable tools, the Data Platform as a Product (DPaaP) approach treats the data platform as a product, placing user experience at the core of its design: dashboards and tables are developed around business teams’ needs, concerns, and preferences.
The DPaaP approach helps create data platforms tailored to business needs, increasing adoption and long-term value. Close collaboration with end users ensures relevance, reliability, and usability of the data provided.
However, this user-centric approach requires time and multiple iterations to meet expectations that tech teams might find unnecessary, such as UI improvements. Additionally, ongoing monitoring and adjustments to accommodate business changes can be costly.
Theodo’s point of view
At Theodo, we use the DPaaP approach to build data platforms that offer a smooth and intuitive user experience. This approach engages users, makes them more accountable for data quality, and helps transition organizations toward a data mesh model. We highly recommend this method for companies struggling with "shadow data" or facing challenges in engaging their business teams.

As climate change becomes a major concern, measuring and reducing the carbon footprint of our activities is essential. Cloud Carbon Footprint is an open-source library that enables carbon tracking across the three major cloud providers (Google Cloud, AWS, Azure), using billing and consumption data.
In Data Engineering and Cloud Computing, accurately measuring the carbon impact of infrastructure and energy usage is challenging.
A comprehensive estimation of a cloud provider’s impact covers the three scopes defined by the GHG Protocol: Scope 1 (direct emissions from the provider’s own operations), Scope 2 (indirect emissions from purchased electricity), and Scope 3 (all other indirect emissions along the value chain, such as server manufacturing).
Cloud providers offer their own carbon impact calculators, but these come with limitations, notably a lack of transparency in their methodologies.
Cloud Carbon Footprint stands out by using a public methodology and emission coefficients, making it an independent standard that is continuously updated by the open-source community. However, its coefficients are four years old, originally measured by Etsy under the name "Cloud Jewels", making it difficult to assess the accuracy of its estimations.
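To make the methodology concrete, here is a simplified sketch of the "Cloud Jewels" style estimate that Cloud Carbon Footprint builds on: energy is derived from vCPU hours and average utilization, then converted to CO2e using the datacenter’s PUE and the regional grid’s carbon intensity. The coefficients below are illustrative placeholders, not the library’s actual values:

```python
MIN_WATTS_PER_VCPU = 0.7      # idle power draw per vCPU (illustrative)
MAX_WATTS_PER_VCPU = 4.3      # full-load power draw per vCPU (illustrative)
PUE = 1.1                     # datacenter power usage effectiveness (illustrative)
GRID_KG_CO2E_PER_KWH = 0.05   # regional grid carbon intensity (illustrative)


def estimate_co2e_kg(vcpu_hours: float, avg_cpu_utilization: float) -> float:
    """Estimate compute emissions (kg CO2e) for a given amount of vCPU usage."""
    avg_watts = MIN_WATTS_PER_VCPU + avg_cpu_utilization * (
        MAX_WATTS_PER_VCPU - MIN_WATTS_PER_VCPU
    )
    energy_kwh = avg_watts * vcpu_hours / 1000
    return energy_kwh * PUE * GRID_KG_CO2E_PER_KWH


print(estimate_co2e_kg(vcpu_hours=10_000, avg_cpu_utilization=0.5))
```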
Theodo’s point of view
Despite outdated coefficients, Cloud Carbon Footprint remains the best available solution for tracking cloud emissions, as cloud provider reports lack transparency. Its setup is quick, thanks to clear documentation. We recommend it for monitoring the carbon footprint of your cloud infrastructure.

When launching a Machine Learning project, there are two main options for the technical stack. The first is using an end-to-end ML platform, where pre-built, tested components save time.
However, this approach comes with the typical drawbacks of managed solutions: higher costs, black-box functionalities, limited customization, restricted integration with other tools, and vendor lock-in. The second option is to use open-source tools and custom code to build a tailor-made stack, avoiding the pitfalls of managed solutions but requiring an initial investment in selecting and setting up the necessary components.
To simplify this second approach, we developed Sicarator, a project generator that allows users to quickly set up a high-quality ML project with the latest open-source technologies.
Initially created in 2022 for internal use, Sicarator became open-source a year later after proving its efficiency across more than twenty projects.
Through a guided command-line interface, users can generate a project structure that follows current best practices.
The generated code includes documentation, ensuring a smooth user experience. The tool is designed with a code-centric approach, maximizing control for data scientists and ML engineers. It evolves to reflect best practices in the ecosystem—for example, Ruff has recently replaced PyLint and Black as the linter/formatter.
However, Sicarator does not provide the full-fledged automation of advanced platforms, requiring additional manual setup. For instance, at this stage, it does not include automated training instance deployment.
Theodo’s point of view
We recommend Sicarator to teams that want a code-centric, open-source ML stack with full control over their components, while avoiding the costs and vendor lock-in of end-to-end platforms. Teams should, however, be prepared for some additional manual setup compared with fully managed solutions.

Data has long been a critical strategic asset for businesses. However, the growing complexity of information systems and business processes often makes collaboration between business and technical teams challenging. Event Storming, a collaborative and visual method derived from Domain-Driven Design, proves particularly useful for designing a data model that aligns with both business needs and analytical requirements.
The Event Storming workshop is structured to foster a shared understanding of the system. It begins with the collection of key business events, where each participant contributes by placing sticky notes on a wall to represent these events chronologically. This phase is followed by identifying actors, commands, and aggregates, helping to build an overview of the domain. This interactive approach enhances communication and highlights areas of complexity or uncertainty.
Although rarely used in data analysis, Event Storming helps map business events and identify the data needed for analysis and decision-making. For data modeling, this clarifies which entities and events need to be represented and surfaces areas of complexity early.
However, this method may be less suitable for simpler environments, where an approach like Design Thinking might be more appropriate. It also requires careful preparation and an experienced facilitator to maximize its effectiveness.
Theodo’s point of view
We recommend Event Storming for effectively modeling complex data systems and enhancing collaboration between business and technical teams. As an alternative, for more detailed process modeling, we suggest using BPMN, which provides a more precise representation of workflows.

Elementary is an innovative open-source solution addressing a critical need in data engineering: ensuring data reliability. With the growing adoption of dbt in recent years, a gap has emerged in data monitoring and quality management. Elementary fills this void by providing a system for tracking and improving data quality, helping eliminate the risks of inconsistent or incomplete data.
Before Elementary, data quality management relied on custom scripts or dedicated tools like Great Expectations, which were often complex and costly. Elementary simplifies this process through its native integration with dbt, making it easier to implement quality checks within ETL (Extract, Transform, Load) workflows. While dbt lacks a built-in interface for tracking data trends over time, Elementary stands out with its history tracking and visualization layer, offering an intuitive dashboard that is easy to install and deploy. A single scheduled command can generate automated reports, making the process transparent and accessible to data analysts, analytics engineers, and business teams.
Additionally, Elementary includes advanced features such as anomaly detection, adding another layer to data monitoring. However, these features may require adjustments depending on specific use cases.
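In practice, the workflow can stay very light: run dbt, then generate Elementary’s report from a scheduler or CI job. A minimal sketch, assuming the elementary dbt package is installed in the project and the edr CLI is configured:

```python
import subprocess

# Build the dbt project; Elementary's package collects test results and
# run artifacts along the way.
subprocess.run(["dbt", "build"], check=True)

# Generate the observability report (an HTML dashboard) from those results.
subprocess.run(["edr", "report"], check=True)
```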
Theodo’s point of view
Theodo recommends Elementary for its ease of integration, especially for companies already using dbt and looking to strengthen data quality. Its active community ensures quick access to support and continuously evolving features. However, as an emerging technology, it is evolving rapidly, which may present an adaptation challenge for some organizations.


SQLFluff is an open-source SQL linter and formatter designed to improve code quality by enforcing consistent coding standards and detecting potential errors, regardless of the SQL dialect used. Before SQLFluff, static code analysis for SQL was often overlooked, with developers relying on manual reviews to ensure consistency and quality—an inefficient and error-prone process. SQLFluff addresses this gap by providing an automated tool for linting and formatting, making SQL code easier to maintain and fostering better collaboration.
The key advantages of SQLFluff include its high flexibility to adapt to project-specific coding standards and its support for multiple SQL dialects. Its auto-correction feature can speed up development by automatically fixing style and syntax issues.
However, SQLFluff has some limitations. Auto-correction can sometimes introduce changes that break SQL code, especially with complex or non-standard constructs. Additionally, when linting large files or many files at once, the tool can become slow, impacting productivity. The initial configuration may also be challenging, with potential conflicts between rules requiring careful customization.
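Beyond its command line, SQLFluff exposes a simple Python API, which gives a feel for how linting and auto-correction work; the dialect below is chosen for illustration:

```python
import sqlfluff

query = "SELECT  a,b FROM my_table"

# Lint: returns a list of rule violations (code, description, position).
for violation in sqlfluff.lint(query, dialect="ansi"):
    print(violation)

# Fix: returns the query reformatted according to the configured rules.
print(sqlfluff.fix(query, dialect="ansi"))
```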
Theodo’s point of view
We recommend using SQLFluff, as having a linter is essential for maintaining SQL code quality. However, it is important to be aware of its limitations. Use auto-correction carefully, optimize linting for large files, and ensure a well-configured setup to maximize its benefits.

Sifflet is a data observability platform launched in 2021. It helps organizations monitor and improve the quality, reliability, and traceability of their data across pipelines. By providing key indicators such as the number of null values, table volume, and the uniqueness of identifiers, Sifflet ensures robust and reliable data governance.
What sets Sifflet apart is its no-code interface, making it easy to use. Creating data quality rules is intuitive, and real-time notifications are sent via email, Teams, or Slack in case of non-compliance. The platform also enhances data accessibility through its catalog and data lineage visualization. It integrates with a wide range of solutions, including data warehouses, data lakes, BI tools, ELT/ETL solutions, and data orchestrators.
However, as a relatively new tool, Sifflet has some limitations. Certain integrations, such as Dagster or Google Chat, are not yet supported. Using "Monitors As Code" can make debugging more complex due to its YAML format. When dealing with large numbers of tables, creating rules can become time-consuming. Additionally, monitoring options are limited, restricting visualization customization. Lastly, pricing is only available upon request, which may be a barrier for some organizations. Alternatives like Monte Carlo or Elementary may also be worth considering.
Theodo’s point of view
We recommend Sifflet for its ease of use, particularly for organizations with low technical expertise, a relatively simple data platform, but strict data quality requirements. However, its high cost and limited functionalities may be obstacles to adoption.

Using production data in pre-production or development environments is a common practice in data projects. This approach helps simulate real conditions and improve the quality of testing and development. Teams can assess application performance, making it easier to adjust and optimize before deployment. The main advantage is that it enhances development reliability by quickly identifying bugs and enabling data-driven decisions.
However, this technique raises serious security and isolation concerns. If non-production environments are not properly isolated from the production network, there is a risk of compromising production data. Additionally, direct access to sensitive data without proper protection violates regulations such as GDPR and standards like ISO 27001 or SOC 2. To ensure compliance, it is crucial to protect data confidentiality through anonymization techniques.
These practices are particularly complex and require balancing risks with an appropriate level of security, depending on the project’s specific needs. A common alternative is using synthetic data, which eliminates security and isolation concerns by avoiding access to sensitive production data. However, this approach has limitations in terms of representing real-world scenarios and can be time-consuming to implement, reducing overall effectiveness.
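As a lightweight illustration of the synthetic route, the sketch below generates stand-in customer records with the Faker library, a rule-based alternative to the generative models mentioned in our point of view; the schema is hypothetical:

```python
from faker import Faker

Faker.seed(42)  # reproducible fake data
fake = Faker()

# Stand-in customer records for a development database; no production data involved.
customers = [
    {
        "customer_id": i,
        "name": fake.name(),
        "email": fake.email(),
        "signup_date": fake.date_between(start_date="-2y", end_date="today").isoformat(),
    }
    for i in range(1_000)
]
```

Records like these can seed a development environment without ever exposing production data.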
Theodo’s point of view
At Theodo, we firmly believe that sensitive production data should not be used in development environments. Given the lack of ideal anonymization tools, we are successfully experimenting with synthetic data generation models (GenAI), allowing us to iterate quickly while maintaining a high level of security and data isolation.

"Data platform on-premise" refers to solutions where physical servers are installed either within a company’s premises or in a third-party data center. Unlike cloud-based systems provided by vendors, an on-premise platform requires full control over the physical infrastructure, comprehensive security management, and direct responsibility for software deployment.
Before the rise of cloud computing, all businesses relied on on-premise servers, but this approach involved heavy management overhead. Cloud providers later introduced managed services with a pay-as-you-go model, reducing upfront investment costs and offering greater agility.
Today, while cloud solutions dominate, some industries such as healthcare, finance, and government continue to prefer on-premise platforms to ensure data confidentiality, sovereignty, and full customization of software and hardware.
However, on-premise platforms come with several disadvantages: significant upfront hardware investment, ongoing maintenance and security responsibilities, and slower, more costly scaling than the cloud.
Companies like Cloudera, Oracle, and IBM simplify on-premise deployments by handling infrastructure, maintenance, and software compatibility. However, these solutions remain more complex to implement than their cloud counterparts and involve additional costs in the form of subscriptions, licenses, and installation fees.
Theodo’s point of view
At Theodo, we work with clients in the banking and healthcare sectors on on-premise platforms. These projects tend to be long-term and require experienced professionals. We only recommend on-premise solutions for companies with strict regulatory or technical requirements that justify this choice.

Kibana is a monitoring tool primarily used for analyzing and visualizing logs and metrics. It is an integral part of the Elastic Stack (formerly ELK Stack), alongside Elasticsearch and Logstash. Kibana allows users to create charts, interactive dashboards, and reports based on data indexed in Elasticsearch, providing a powerful interface for exploring and analyzing logs in real-time.
While Kibana is effective for log management, using it for data quality monitoring comes with several limitations: it was not designed for this purpose, quality metrics must first be computed and indexed in Elasticsearch, and the corresponding dashboards and alerts have to be built and maintained by hand.
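Concretely, tracking a quality metric in Kibana means computing it yourself and pushing it to Elasticsearch; a minimal sketch with the official Python client, where the index name and document fields are hypothetical:

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

# Connect to the cluster backing Kibana (the URL is an assumption for this sketch).
es = Elasticsearch("http://localhost:9200")

# A quality metric computed elsewhere, e.g. the null rate of a column.
es.index(
    index="data-quality-metrics",
    document={
        "table": "orders",
        "metric": "null_rate_customer_id",
        "value": 0.02,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    },
)
```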
For more efficient data quality monitoring, alternatives like Elementary, which automatically visualizes dbt tests, or Sifflet, with its ready-to-use dashboards, may be more suitable. For customized tracking, a BI dashboard can also be a viable option.
Theodo’s point of view
A few years ago, we used Kibana for data quality monitoring. While it was effective for simple metrics, it proved too limited for more complex needs. We therefore recommend Kibana only for application log analysis and visualization.

Great Expectations is an open-source Python framework designed for evaluating and monitoring data quality. Its integration with most market tools is straightforward and fast, thanks to its compatibility with various data sources and orchestration tools. Additionally, the community actively contributes to the development of packages that provide predefined quality checks.
Great Expectations can connect to major database technologies (such as PostgreSQL, BigQuery) and storage solutions (like AWS S3, Google Cloud Storage). It also offers a logging and alerting system for quality check results, providing better visibility into data quality status.
However, Great Expectations has several significant limitations. Its learning curve is steep due to complex concepts that are not easy to grasp. Creating custom quality checks is often challenging and unintuitive, especially given that its documentation is insufficient for easing adoption. Additionally, the tool lacks robust features for cross-table validation, limiting its applicability. Finally, the user experience of its quality reports is suboptimal, making it difficult to track quality trends over time.
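For a sense of the developer experience, here is a minimal check using the legacy pandas interface; newer releases expose a different, context-based API, and the sample data is illustrative:

```python
import great_expectations as ge
import pandas as pd

df = pd.DataFrame(
    {"user_id": [1, 2, 2, None], "email": ["a@x.io", "b@x.io", None, "d@x.io"]}
)

# Wrap the DataFrame with Great Expectations' legacy pandas interface.
gdf = ge.from_pandas(df)

# Each expectation returns a result object with a `success` flag and details.
print(gdf.expect_column_values_to_not_be_null("user_id"))
print(gdf.expect_column_values_to_be_unique("user_id"))
```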
Theodo’s point of view
We do not recommend using Great Expectations due to its limitations, which make it difficult to adopt and scale effectively. For data quality monitoring, we suggest considering alternatives like Elementary, particularly if your pipeline relies on dbt.
