Created in 2014, Snowflake is a proprietary Data Warehousing solution designed for storing, managing, and analyzing large data volumes. Along with BigQuery, it was among the first to decouple compute and storage resources, offering the flexibility to scale them independently, which improves both performance and cost control.
The main strengths of Snowflake are its query performance on structured data and its ability to manage multiple users. With its unique micro-partitioning system and parallelization capabilities, Snowflake can refresh a dashboard in seconds, even with datasets of hundreds of gigabytes.
It also provides advanced data access management tools, allowing user permissions to be controlled down to the column and row levels, making it easy to handle many users with different access levels. Other notable features include an intuitive user interface, the ability to explore data using Python notebooks, and its cloud-agnostic nature, as it can run on all three major cloud providers.
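As an illustration, here is a minimal sketch of column- and row-level access control using the snowflake-connector-python client; the connection parameters, table, role, and mapping-table names are placeholders, and the policy logic is purely illustrative.

```python
import snowflake.connector

# Placeholder credentials and objects, for illustration only
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="ANALYTICS_WH", database="ANALYTICS", schema="PUBLIC",
)
cur = conn.cursor()

# Column-level control: mask email addresses for every role except ADMIN
cur.execute("""
    CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() = 'ADMIN' THEN val ELSE '***MASKED***' END
""")
cur.execute("ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask")

# Row-level control: each role only sees the regions listed in a (hypothetical) mapping table
cur.execute("""
    CREATE ROW ACCESS POLICY region_policy AS (region STRING) RETURNS BOOLEAN ->
      EXISTS (SELECT 1 FROM security.role_region_map m
              WHERE m.role_name = CURRENT_ROLE() AND m.region = region)
""")
cur.execute("ALTER TABLE customers ADD ROW ACCESS POLICY region_policy ON (region)")
```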
However, Snowflake has some limitations. First, the platform is not well-suited for streaming, as its architecture is optimized for analytical processing rather than real-time transactional workloads. Additionally, despite a transparent pricing model, costs can escalate quickly compared to its competitors.
THEODO'S POINT OF VIEW
Today, we recommend using Snowflake as an easy-to-use Data Warehouse when dealing with large data volumes and multiple analytical users. It is also the right choice for organizations looking to avoid vendor lock-in with a specific cloud provider.
MDN’S POINT OF VIEW
Snowflake is a highly efficient data warehouse. Thanks to features like micro-partitioning and zero-copy clones, working with and sharing large datasets is simplified. It can query data from flat files or databases like PostgreSQL and MySQL. I recommend Snowflake as a central access point for data in an analytical context.

Google Cloud Pub/Sub is an asynchronous messaging service that enables reliable and scalable message exchange between different applications.
It is designed to connect heterogeneous components, such as microservices, web applications, and IoT devices, facilitating real-time message transfer. It operates on a publisher-subscriber model: publishers send messages to a topic, while subscribers subscribe to that topic to receive messages.
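A minimal sketch of this publisher-subscriber model with the google-cloud-pubsub Python client; the project, topic, and subscription names are placeholders, and the topic and subscription are assumed to already exist.

```python
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

project_id, topic_id, subscription_id = "my-project", "orders", "orders-worker"

# Publisher side: send a message (bytes payload, optional attributes) to the topic
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
future = publisher.publish(topic_path, b"order created", order_id="42")
print("published message id:", future.result())

# Subscriber side: receive messages delivered to the subscription and acknowledge them
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print("received:", message.data, dict(message.attributes))
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result(timeout=30)  # listen for 30 seconds, then stop
    except TimeoutError:
        streaming_pull_future.cancel()
```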
The key advantages of Pub/Sub include its scalability, thanks to automatic scaling, and its seamless integration with other GCP services (such as Google Cloud IAM for security or Cloud Functions and Dataflow for data processing). Fully managed, Pub/Sub is also easier to set up than Kafka, its main competitor, which is often self-hosted. However, Pub/Sub provides less control and customization, particularly in data flow management: it does not support JSON schemas and limits schema revisions to 20 versions.
Additionally, for very high data volumes, a self-hosted Kafka solution may be more cost-effective.
THEODO'S POINT OF VIEW
We recommend adopting Pub/Sub if your company is already integrated into the Google Cloud ecosystem or if you are looking for a managed solution to reduce the operational workload associated with messaging infrastructure management. Google Cloud Pub/Sub is a robust and high-performance option for organizations seeking to implement a reliable and scalable messaging infrastructure.
MDN’S POINT OF VIEW
This fully managed queuing technology from Google is well integrated with the entire GCP ecosystem. Simpler than Kafka, but with fewer configurable options (such as the number of acknowledgments at the replica level), it is geo-distributed across multiple GCP regions to prevent message loss. However, additional costs should be anticipated for egress/ingress traffic and message replication across regions.

dlt (for data load tool) is an open-source data ingestion tool. As a Python library, dlt is composable and does not require a heavy architecture: a simple pip install dlt is enough. To load data, you initialize the source, provide credentials, and configure the necessary endpoints. The code can then be executed directly within the orchestrator of your choice. dlt integrates seamlessly into both analytics and AI projects, supporting data ingestion for agents or more traditional models.
By default, dlt can load data into DuckDB, but it also works with all standard destinations. Its lightweight nature makes it a cost-effective tool for building EL processes in data lakes or data warehouses. Ingestion can easily be launched in Cloud Run containers or even within a CI/CD pipeline. With the standardization of Iceberg and DuckDB support, dlt also simplifies the transition between local and production environments, streamlining what is often a complex process.
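To give an idea of how lightweight this is, here is a minimal sketch of a dlt pipeline loading into a local DuckDB database; the pipeline, dataset, and table names are invented, and any iterable of records (or a built-in dlt source) could take the place of the sample data.

```python
import dlt

# Any iterable of dicts (or a dlt source such as a REST API) can be loaded
rows = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "closed"},
]

pipeline = dlt.pipeline(
    pipeline_name="orders_ingestion",
    destination="duckdb",      # default local destination; swap for "bigquery", "snowflake", ...
    dataset_name="raw_orders",
)

# Schema inference, normalization, and loading happen in this single call
load_info = pipeline.run(rows, table_name="orders", write_disposition="append")
print(load_info)
```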
dlt also provides schema contracts (data contracts) that overlay the different sources, allowing everything downstream of ingestion to be generated programmatically.
MDN’S POINT OF VIEW
Data ingestion has always been a complex topic, whether using custom tools or off-the-shelf solutions. dlt brings structure to defining ingestions by combining the advantages of pre-built solutions while allowing for easy customization for specific use cases. Thanks to the dlt + DuckDB combination, it is now possible to set up ELT processes with very few lines of code.

When it was created in 2013, Databricks set out to democratize access to big data processing. To achieve this, Databricks developed the Databricks Data Platform, a proprietary SaaS service deployed on top of a cloud provider, which became available in 2015. The platform was later enhanced with machine learning capabilities via MLflow, a data catalog with Unity Catalog, and an orchestrator with Delta Live Tables.
Before Databricks, a data engineer had to manually configure their cluster, whether Hadoop or another system, to use Spark. They could not explore their data directly with Spark and had to rely on additional tools such as Zeppelin. Thanks to Databricks, data engineers can now implement Spark transformations without worrying about infrastructure management. Once the platform is installed and pre-configured clusters are set up, data engineers can operate autonomously to execute their transformations. Databricks Notebooks also allow them to explore their data directly within the platform.
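As a simple illustration, a transformation in a Databricks notebook boils down to standard PySpark code; the `spark` session is provided by the platform, and the table names below are placeholders.

```python
# Inside a Databricks notebook, `spark` is already provisioned by the platform;
# table names are placeholders used for illustration.
from pyspark.sql import functions as F

orders = spark.read.table("raw.orders")

# A typical aggregation: completed orders rolled up into daily revenue
daily_revenue = (
    orders
    .where(F.col("status") == "completed")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# Persist the result as a managed table, ready for BI or data science use
daily_revenue.write.mode("overwrite").saveAsTable("analytics.daily_revenue")
```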
Additionally, Unity Catalog enables data documentation and access control. Databricks is ideal for companies handling large data volumes and for facilitating collaboration between data engineers and data scientists. It is also a turnkey data platform that simplifies maintenance, making it a smart choice when integrating new tools into an existing IT system is complex.
However, Databricks is heavily tied to Spark and only fully reveals its potential when the data volumes processed are large enough to benefit from parallel computing.
THEODO’S POINT OF VIEW
At Theodo, we recommend Databricks Data Platform, especially for processing large-scale data: it is a mature, high-performance, and comprehensive technology.

Apache Iceberg is an open-source table format created at Netflix. Its primary goal is to address the challenges of managing large datasets stored on distributed file systems like S3 or HDFS. Iceberg was designed to overcome the limitations of traditional table formats such as Hive, facilitating complex data modification and access operations while ensuring better transaction isolation due to:
• its native compatibility with SQL for reading and writing
• its ability to support full schema evolution
• its capability to handle massive datasets at the petabyte scale
• its fine-grained versioning with time travel and rollback features
• its guarantee of ACID transactions in a multi-user environment
• its scalable and efficient partitioning and compaction system for optimized read performance
Iceberg truly enables the data lakehouse paradigm, in which a data lake is structured so that its data can be queried directly.
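To make this concrete, here is a hedged PySpark sketch of an Iceberg table with hidden partitioning, schema evolution, and time travel; it assumes the Iceberg Spark runtime and a catalog named `lake` are already configured on the cluster, and the table names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# ACID table creation through plain SQL, with a hidden partition on the event timestamp
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        country  STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Schema evolution without rewriting existing data files
spark.sql("ALTER TABLE lake.sales.events ADD COLUMNS (device STRING)")

# Time travel: read the table as it was at an earlier point in time
spark.sql("SELECT * FROM lake.sales.events TIMESTAMP AS OF '2024-06-01 00:00:00'").show()
```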
However, it can make data pipelines more complex, particularly in terms of configuration, partition management, and maintenance—especially for teams unfamiliar with this format. Adopting Iceberg may require a steep learning curve for organizations that lack maturity in data lake management.
MDN’S POINT OF VIEW
Iceberg is at the heart of data trends in 2024. Initially developed at Netflix, this open-source table format is rapidly establishing itself as the interoperable standard for managing tables in data lake architectures. If you are building your data lake today, you should consider it without hesitation.
THEODO’S POINT OF VIEW
At Theodo, we believe that Iceberg is an excellent solution for optimizing performance and storage in cases of high data volume. It appears to outperform alternatives like Delta Lake or Apache Hudi thanks to its flexibility in schema and partition evolution and its better integration with modern platforms such as BigQuery or Snowflake.

BigQuery is Google Cloud’s fully managed data warehouse. It enables interactive storage and analysis of massive datasets and serves as the central analytics hub of Google Cloud’s data offering. The main alternatives we use are:
• Amazon Redshift, the pioneer
• Snowflake
• Azure Synapse
• Databricks
BigQuery is a comprehensive and efficient solution, widely adopted, but its offering is also very similar to that of its main competitors.
Performance rankings depend heavily on the conditions under which a benchmark is conducted; in terms of the number of features, however, the advantage goes to BigQuery and Snowflake. The choice will depend on a set of criteria related to constraints and usage: it will not be limited to the data warehouse aspect but will encompass all infrastructure needs. BigQuery’s strength lies in its ease of use and flexibility. The default pricing model is on-demand: the compute bill depends on the volume of data scanned in the input tables of each query. The allocated power and the cost adjust according to the queries.
This also presents the risk of an uncontrolled bill, but Google provides tools for setting quotas, monitoring dashboards, and alerting. To maintain budget control and a good level of performance, these measures must be coupled with best practices and optimizations. For better predictability, it is also possible to opt for capacity pricing, provided that the need has stabilized and a team can manage slot reservations. In terms of machine learning, BigQuery ML allows models to be created, trained, and run directly in the warehouse, while BigQuery DataFrames provides a pandas-compatible API for analytics and a scikit-learn-like API for machine learning.
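The on-demand model can be kept under control directly from the client: a dry run returns the bytes a query would scan before anything is billed, and a per-query byte cap acts as a guardrail. A small sketch with the google-cloud-bigquery client, where the project and table names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
sql = """
    SELECT order_date, SUM(amount) AS revenue
    FROM `my-project.sales.orders`
    GROUP BY order_date
"""

# Dry run: no slots used, nothing billed, only an estimate of the data that would be scanned
dry_run_job = client.query(
    sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
)
print(f"This query would scan {dry_run_job.total_bytes_processed / 1e9:.2f} GB")

# Actual execution, with a hard cap on billed bytes to avoid an uncontrolled bill
job = client.query(sql, job_config=bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3))
for row in job.result():
    print(row.order_date, row.revenue)
```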
On the BI side, the workspace provides direct access to Looker Studio visualizations. Finally, BigQuery, like its competitors, is constantly evolving and following market trends while becoming increasingly open, notably with BigLake tables that support Delta Lake, Iceberg, and Hudi formats. With the Omni feature, it is also possible to run queries on external sources such as Amazon S3 or Azure Blob Storage.
MDN’S POINT OF VIEW
A very good data warehouse technology, similar to Snowflake in its approach. For now, it is limited to the GCP ecosystem. It lets you process data without managing indexes on tables, while still offering more advanced configuration options in certain cases. Billing combines per-query compute (where budget monitoring is crucial), table storage, and ingestion.
THEODO’S POINT OF VIEW
We recommend BigQuery for its flexibility, its rich set of features, and its capabilities in BI and ML. It is suitable for handling both small and large data volumes and integrates perfectly with other Google Cloud services. It can be a strong argument for choosing this cloud provider.

The One Big Table (OBT) data modeling approach consists of storing all relevant data in a single large denormalized table rather than distributing it across multiple tables, which simplifies data models and makes them easier to use. We use OBT to simplify data access, which is beneficial for fast analysis or for less technical teams, as they do not have to manage the complexity of joins or relationships between tables.
With the rise of LLMs (Large Language Models), we use the OBT approach to efficiently process large datasets, particularly facilitating Text-to-SQL (see page 68) by simplifying the queries to be created.
The redundancies caused by denormalization are not a major issue when using column-oriented databases such as Snowflake and BigQuery, and since storage is cheap, optimization efforts can focus on reducing compute costs.
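In practice, building an OBT is a single denormalization step: fact and dimension tables are joined once into a wide table that analysts (or an LLM) can query without further joins. An illustrative sketch with DuckDB, using invented table and column names that are assumed to already exist:

```python
import duckdb

con = duckdb.connect("analytics.duckdb")

# Join the fact table to its dimensions once, and persist the result as the One Big Table
con.sql("""
    CREATE OR REPLACE TABLE obt_orders AS
    SELECT
        f.order_id,
        f.order_date,
        f.amount,
        c.customer_name,
        c.customer_segment,
        p.product_name,
        p.product_category
    FROM fact_orders       AS f
    LEFT JOIN dim_customer AS c USING (customer_id)
    LEFT JOIN dim_product  AS p USING (product_id)
""")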
However, the maintenance drawback lies in managing a single massive table, which can become complex, especially when data changes frequently or when new sources are added. Denormalization can also lead to inconsistencies if not managed carefully, requiring the implementation of robust processes to ensure data consistency.
THEODO’S POINT OF VIEW
We recommend carefully considering the long-term impact: maintenance and data quality challenges can become complex. One Big Table can be combined with normalized structures depending on the use case. This allows benefiting from ease of access while maintaining the flexibility and manageability offered by a traditional relational database.
MDN’S POINT OF VIEW
One Big Table simplifies data access to make business teams autonomous. It is an essential solution to improve the adoption and use of the semantic layer (gold layer). By choosing OBTs, joins are avoided, and indicators are computed at the finest level of granularity necessary for their consumption.

Created in 2020, Airbyte is a solution for ingesting and loading data into a data platform, simplifying data movement by standardizing sources and destinations. Instead of providing end-to-end connectors (e.g., GCS → BigQuery), it offers a list of source connectors and a list of destination connectors, allowing any source to be paired with any destination. This flexibility is enabled by the Airbyte protocol, which strictly defines the contract that sources and destinations must comply with.
The main difference between Airbyte and Fivetran, a historical player in modular connector-based ingestion, is Airbyte’s open-source approach: it provides over 350 connectors, most of which are developed by the community. Creating custom connectors is greatly simplified thanks to this protocol and the availability of SDKs, making it easy to develop a source connector for a specific API, which then becomes compatible with all existing destination connectors.
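One way to run these connectors programmatically is the PyAirbyte package, which exposes the connector catalog as Python code; the sketch below uses the public source-faker connector purely for illustration, and the local default cache handles storage of the synced records.

```python
import airbyte as ab

# Pull a source connector from the catalog and configure it (source-faker generates sample data)
source = ab.get_source("source-faker", config={"count": 1_000}, install_if_missing=True)
source.check()               # validate the configuration against the source
source.select_all_streams()  # sync every stream the source exposes

# Read into PyAirbyte's local default cache, then inspect what was loaded
result = source.read()
for stream_name, records in result.streams.items():
    print(f"stream {stream_name}: {len(records)} records")
```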
Airbyte is available in different forms, suited to various use cases: Airbyte Cloud, fully managed by the vendor; the open-source version, which can be self-hosted; and a self-managed enterprise edition.
Despite its strengths, Airbyte has some limitations: while its generalized protocol is valuable, it enforces sequential data transfer, which can be inefficient for large-scale ingestions. Additionally, while Airbyte can be deployed and configured automatically, this feature remains experimental, particularly for custom connectors.
THEODO’S POINT OF VIEW
We recommend trying Airbyte, whether in its cloud or self-managed form, if you are not ingesting large data volumes. While the technology has some weaknesses, it is robust, well-designed, and benefits from a dynamic community that ensures rapid product evolution and valuable support in case of difficulties.

The star schema is a data modeling method commonly used in data warehouses and business intelligence (BI) systems. This simple model organizes data around a central fact table, which is linked to dimension tables. This structure facilitates the creation of efficient SQL queries and enables fast analysis through BI tools optimized for querying and refreshing data that follows the star schema, such as Power BI. The star schema is ideal for decision support systems, where query simplicity and processing efficiency are essential. Queries are easier to write and execute, reducing complexity and minimizing errors in reporting.
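A typical query against a star schema joins the fact table to a few dimensions and aggregates a measure; the sketch below runs it with DuckDB and uses invented table names that are assumed to already exist in the session.

```python
import duckdb

# Fact table holds the measures; dimensions hold the attributes used to filter and group
query = """
    SELECT
        d.year,
        d.month,
        s.store_region,
        SUM(f.sales_amount) AS total_sales
    FROM fact_sales AS f
    JOIN dim_date   AS d ON f.date_key  = d.date_key
    JOIN dim_store  AS s ON f.store_key = s.store_key
    WHERE d.year = 2024
    GROUP BY d.year, d.month, s.store_region
    ORDER BY total_sales DESC
"""
print(duckdb.sql(query))
```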
However, the star schema has some limitations. The lack of systematic normalization can lead to data redundancy in dimension tables, increasing storage volume. Additionally, business rule changes can be difficult to implement, requiring complex adjustments to fact and dimension tables.
Alternatives to the star schema include:
• the snowflake schema, a more normalized variant of dimensional modeling
• Data Vault, suited to auditability and frequently changing source systems
• One Big Table (OBT), a fully denormalized approach
THEODO’S POINT OF VIEW
We recommend the star schema and dimensional modeling in general for structuring complex data at scale. They are flexible and proven approaches. However, other models may be preferable depending on the use case, such as Data Vault or One Big Table for analyzing large IoT datasets that lack clear dimensions.

Created in 2019 in Amsterdam, DuckDB can be used as a single-node analytical engine, capable of replacing Spark, as well as a mutable column-oriented database. Compatible with major languages (Python, Java, R, Node, ODBC) and usable both in backend and frontend via WASM, it is an open-source technology that serves not only as a database but also as a computational tool.
DuckDB has matured with its version 1.0, introducing innovations that are reshaping the data ecosystem. The technology aligns with the "big data is dead" movement: most datasets are not large enough to justify distributed computing technologies. DuckDB removes the need for client-server communication, which, according to its creators, is one of the main causes of latency in traditional databases. Its performance is impressive, and when combined with the power of modern personal computers, it allows many data processing tasks to be performed locally rather than in the cloud.
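A minimal example of this local-first approach, where the Parquet path is a placeholder: DuckDB queries the files in place, with no server process and no ingestion step.

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")  # or duckdb.connect() for an in-memory database

# Query Parquet files directly from disk and persist the result locally
con.sql("""
    CREATE OR REPLACE TABLE top_customers AS
    SELECT customer_id, SUM(amount) AS total_spent
    FROM 'data/orders_*.parquet'
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
""")
con.table("top_customers").show()
```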
It is important to note DuckDB’s main limitation: it is a single-node technology (it runs on a single machine), and a database file can only be opened for writing by one process at a time, which rules out concurrent multi-user write access.
MDN’S POINT OF VIEW
DuckDB is my favorite technology of the past two years: while the "modern data stack" philosophy pushes for executing SQL on a cloud connection, DuckDB runs everything locally in a simple and efficient manner. DuckDB has its limitations, but it can help reduce your processing costs significantly.

Introduced in 2013, Apache Parquet is an open-source columnar file format that has become the standard for large-scale data storage and management. Designed to optimize read and write performance on large datasets while reducing storage space requirements, it has largely replaced flat file formats like CSV in data engineering.
Parquet offers many advantages:
• columnar storage, so queries read only the columns they need
• efficient compression and encoding schemes that reduce the storage footprint
• an embedded schema and rich metadata, enabling predicate pushdown
• broad compatibility across the data ecosystem (Spark, DuckDB, pandas, data warehouses, etc.)
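A quick illustration of the columnar benefit with pandas (which writes Parquet through pyarrow); the file name, column names, and sizes are invented for the example.

```python
import pandas as pd

# A million-row toy dataset; scalar values are broadcast across the column
df = pd.DataFrame({
    "order_id": range(1_000_000),
    "amount": 42.0,
    "country": "FR",
})

# Columnar and compressed on disk: typically far smaller than the CSV equivalent
df.to_parquet("orders.parquet", compression="snappy")

# Column pruning: only the columns a query needs are read back from disk
amounts = pd.read_parquet("orders.parquet", columns=["amount"])
print(amounts.shape)
```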
However, Parquet is not ideal for frequent writes or real-time data streaming, where a row-based format like Avro is more suitable.
Table formats such as Delta Lake or Apache Iceberg (which themselves store data in Parquet files) are preferable for ensuring better data governance, handling structural changes in tables, and maintaining data integrity in cases of concurrent writes.
THEODO’S POINT OF VIEW
Parquet remains a solid technology with advantages for analytical workloads due to its performance and storage optimization. Our hold position reflects our recommendation to adopt additional layers like Iceberg to benefit from transactional capabilities and data scalability.
MDN’S POINT OF VIEW
Parquet has become the reference format for analytics. Compressed, columnar, and widely compatible, there are many reasons to use Apache Parquet in 2024. It is preferable to flat formats like CSV or JSON. Essential for saving costs and improving performance. The only drawback is that it is less convenient to open in a graphical interface (unless using DuckDB).
