Unlocking data potential: Optimising data pipelines with dbt

Dominykas Sventickas

Teylor Greff

In today’s data-driven world, businesses rely on data platforms to fuel their decision-making processes. As organisations scale, ensuring definition consistency, data quality, and pipeline efficiency becomes challenging and demands an increasing time commitment from the team. dbt (data build tool) can be leveraged to address these very problems. In this article we explore key dbt features that make data transformation processes simpler, more reliable, and more accessible.

What is dbt?

dbt is a data transformation tool that allows teams to enhance the speed, reliability, and transparency of transformations across projects.

At its core, dbt enables analysts to iterate quickly on data transformations, promoting code reuse and ensuring consistency across projects. Transformation logic is expressed through SQL select statements, so it is no longer siloed within the confines of data engineering teams; it becomes accessible to analysts, who are often familiar with the language and closest to their business domains.

dbt’s modular approach, built-in orchestration, and the key functionalities we explore below allow it to act as the central transformation layer, empowering teams to develop, test, and deploy data transformation logic and, in turn, fostering faster insights and more agile decision-making.

Three common pipeline challenges 

Complexity

Traditional ETL, data quality management, and orchestration tools can become overly complex and rigid, making it challenging to adapt to changing business needs.

Fragmentation

Siloed data sources hinder collaboration and integration efforts, leading to fragmented insights. 

Manual processes

Manual maintenance introduces errors and inefficiencies, impacting data reliability.

Overcoming these challenges requires a modern approach like dbt: with its single-tool solution, analysts proficient in SQL can quickly upskill and tackle these common challenges efficiently.

How can dbt help you improve your data pipelines? 

Modularity and DRY code 

Modularity in data pipelines involves breaking down complex processes into reusable components, promoting maintainability and scalability. A key principle is DRY (Don’t Repeat Yourself) code, which emphasises eliminating duplication for better readability and maintenance. dbt excels in promoting modularity and DRY code through: 

Reusable models

Transformations in dbt are defined as reusable, dynamically referenceable models, encouraging modular code that can be easily shared and referenced in subsequent models and use cases.
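As a brief, hedged sketch, a downstream model references an upstream model with dbt’s ref() function; the model and column names here (fct_orders, stg_orders) are hypothetical:

    -- models/fct_orders.sql (hypothetical model name)
    -- ref('stg_orders') resolves to the staging model's relation in the
    -- current environment and registers the dependency with dbt.
    select
        order_id,
        customer_id,
        order_date,
        amount
    from {{ ref('stg_orders') }}

Because the reference is made through ref() rather than a hard-coded table name, the snippet works unchanged across environments, and dbt knows to build stg_orders before fct_orders.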


Dependency management

dbt allows defining dependencies between models, ensuring the correct execution order and facilitating modular workflows. dbt also generates a directed acyclic graph (DAG) of your models as a visual representation of these dependencies.


Picture 1: A directed acyclic graph (DAG) illustrating the path of source tables through various data ingestion steps, highlighting the modularity that optimises data management.

Output materialisation

dbt offers several options for how model outputs are materialised in the data warehouse, giving the developer far more control over the speed and cost of pipeline runs. Traditionally, data can be stored as views, tables, and materialised views. With dbt, SQL models can also be materialised as ephemeral or incremental models, or through custom, user-built materialisation strategies.
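A minimal sketch of how a materialisation is selected, using dbt’s config() macro at the top of a model file; the model and source names are illustrative:

    -- models/dim_customers.sql (hypothetical model name)
    -- Other accepted values include 'view', 'ephemeral', and 'incremental';
    -- the same setting can also be applied per folder in dbt_project.yml.
    {{ config(materialized='table') }}

    select
        customer_id,
        first_name,
        last_name
    from {{ ref('stg_customers') }}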

Ephemeral models build on the idea of modularity and drive reusability by allowing business logic to be abstracted into standalone models. By their nature, these are often too small in volume, or offer too little transformation, to warrant being stored in the data warehouse. Instead, when compiled, an ephemeral model’s logic is inserted as a CTE into any downstream models and run as part of those, leading to consistent application of logic without adding clutter to the data warehouse.

Incremental models are especially useful for larger, slowly changing data sets, as they allow improved scaling and cost-effective resource utilisation. Whereas table models refresh the data in full, incremental models provide mechanisms for processing only subsets of data, reducing processing time and resource consumption. The developer defines how that subset is selected, gaining much more control over the model’s behaviour.

Late-arriving facts, typically caused by delayed data updates or processing, necessitate a larger lookback window for accurate analysis. Defining a unique key to deduplicate these late-arriving facts is therefore imperative: it provides the mechanism for identifying and resolving discrepancies arising from their delayed arrival, ensuring data integrity and reliability in downstream analytics.
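Bringing the last two paragraphs together, here is a hedged sketch of an incremental model with a unique key for deduplication and a lookback window for late-arriving facts; the table and column names and the three-day window are assumptions, and the dateadd syntax assumes a Snowflake-style warehouse:

    -- models/fct_events.sql (hypothetical model name)
    -- unique_key lets dbt update rows that arrive again with the same id,
    -- deduplicating late-arriving facts instead of inserting duplicates.
    {{ config(
        materialized='incremental',
        unique_key='event_id'
    ) }}

    select
        event_id,
        user_id,
        event_type,
        event_timestamp
    from {{ ref('stg_events') }}

    {% if is_incremental() %}
    -- Applied only on incremental runs: re-scan a three-day window past
    -- the latest loaded timestamp so late-arriving facts are picked up.
    where event_timestamp > (
        select dateadd('day', -3, max(event_timestamp)) from {{ this }}
    )
    {% endif %}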

Automated data testing

dbt integrates testing capabilities seamlessly into the data transformation workflow, fostering wider trust among business users in the data products they consume. Developers can choose from four pre-built generic tests (unique, not_null, accepted_values, and relationships) or expand further with reusable, user-built tests; tests of either type are defined at the column level by adding them to a YAML file. dbt recognises these definitions, compiles them into the relevant SQL snippets, and performs the tests automatically throughout the development and release process. The developer is informed before the production release if the defined data quality standards are not met. The same tests are also built into production data refresh runs to flag any changes in the underlying data, allowing the data team to be the first to react to any such inaccuracies.

Picture 2: In the orders table, the order_id column is tested with two out-of-the-box dbt tests for uniqueness and no nulls configured in the YAML file.
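A minimal sketch of the YAML configuration Picture 2 describes; the file name is illustrative:

    # models/schema.yml
    version: 2

    models:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null

Running dbt test compiles each entry into a SQL query and fails the run if any rows violate the expectation.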

As a team, define the must-have tests that should be added to every model; as a minimum, we often suggest testing that primary keys are unique and not null. Incorporating tools such as dbt-checkpoint into your pull request process will help ensure these standards are followed.
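dbt-checkpoint is distributed as a set of pre-commit hooks; as a sketch, a .pre-commit-config.yaml along these lines could enforce the minimum standard above (the pinned revision is an assumption, and the hook ids should be verified against the project’s documentation):

    # .pre-commit-config.yaml
    repos:
      - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
        rev: v1.2.0  # assumed revision; pin to the latest release
        hooks:
          # Fail the check if a model has no tests defined.
          - id: check-model-has-tests
          # Fail the check if a model lacks a description.
          - id: check-model-has-description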

Documentation

dbt offers a comprehensive documentation feature, empowering users to capture and manage descriptions for data models effectively. This functionality serves as a cornerstone for robust data governance, providing stakeholders with clear insights into the data’s structure and lineage. With an intuitive interface facilitating easy navigation through projects, including model and column descriptions, dbt enables users to grasp the overall project at a glance.  

Moreover, its automatic tracking of data lineage establishes a transparent view of dependencies within the data ecosystem, allowing for better impact assessment of changes. This enhanced documentation not only fosters collaboration and understanding but also reinforces dbt’s role as a central tool for driving informed business decisions and ensuring data integrity across the organisation.

Picture 3: A snapshot of the comprehensive documentation provided by dbt, featuring detailed descriptions, column information, and an interactive lineage graph.
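Descriptions live in the same YAML files as tests; a brief sketch with hypothetical names:

    # models/schema.yml
    version: 2

    models:
      - name: orders
        description: "One row per customer order, cleaned and deduplicated."
        columns:
          - name: order_id
            description: "Primary key; unique identifier of an order."

dbt docs generate then builds the documentation site, and dbt docs serve hosts it locally, lineage graph included.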

Descriptions you add in dbt can be configured to be written to your data warehouse and, depending on your BI tool, to flow through to it as well. This allows consistent, up-to-date descriptions across your data platform, further cementing dbt’s role as a central store of business logic.
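This behaviour is driven by dbt’s persist_docs config; a minimal sketch in dbt_project.yml, with a hypothetical project name:

    # dbt_project.yml
    models:
      my_project:
        # Write model and column descriptions into the warehouse as
        # relation and column comments, where the adapter supports it.
        +persist_docs:
          relation: true
          columns: true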

Conclusion

In the ever-evolving landscape of data management, dbt emerges as an engine of efficiency and innovation for businesses grappling with the complexities of data pipelines. By addressing common challenges such as responsibility bottlenecks, inconsistent data definitions, and difficulties with collaboration and reproducibility, dbt empowers teams to streamline their data transformation processes, improve data quality, and accelerate time-to-insight.

Ultimately, the adoption of dbt translates into tangible benefits for businesses, including cost savings, improved decision-making, and more time to focus on strategic initiatives. As organisations continue to navigate the evolving landscape of data-driven decision-making, dbt stands out as a foundational tool for unlocking the full potential of their data assets and driving sustainable growth.  

If you are considering integrating dbt into your data strategy, do not hesitate to reach out to the Modasta team. With our expertise in implementing dbt across businesses of all sizes, we are here to help you optimise your data pipeline and unlock new opportunities for success. Contact us today to embark on your journey towards data-driven excellence! 

Contributors

Dominykas Sventickas

Dom, a lead consultant with expertise in business intelligence and analytics engineering, is dedicated to empowering organisations to turn raw data into actionable insights. He excels in guiding teams through the transformative journey of adopting and mastering the Modern Data Stack. His past clients span various industries, showcasing his proficiency in harnessing data and developing data platforms to enhance decision-making, streamline operations, and drive organisational growth.

Teylor Greff

Teylor is a senior data consultant specialising in business intelligence. She has a breadth of experience leading technology implementations for our clients across many common tools in the Modern Data Stack, including Looker, dbt, Snowflake, and Fivetran. Teylor enjoys empowering others to leverage the value of data by upskilling technical and business users on data best practices and the solutions developed.