What is Data Engineering: Part 2

👋 Hi, this is Gergely with a free issue of the Pragmatic Engineer Newsletter. If you’re not a full subscriber yet, you missed today’s subscriber-only issue on Consolidating technologies and a few other issues. To get a similarly in-depth article every week, subscribe to The Pragmatic Engineer Newsletter.

Q: I’m hearing more about data engineering. As a software engineer, why is it important, what’s worth knowing about this field, and could it be worth transitioning into this area?

To answer this question, I pulled in Benjamin Rogojan, who goes by Seattle Data Guy on his popular data engineering blog and YouTube channel.

In Part 1 of the series, Ben covered:

What do data engineers do?
Data engineering terms.
Why data engineering is becoming more important.

In the second and final part of the series, we cover:

Data engineering tools: an overview
Where is data engineering headed?
Getting into data engineering as a software engineer

With this, over to Ben:

1. Data engineering tools: an overview

When it comes to data engineering, there is no shortage of tools. It goes without saying that data engineers need general tools and skills such as GitHub, databases, baseline cloud services and coding. That said, there are some specific tools you will need to learn if you start looking into data engineering.

A few of the common data engineering tools

Data Storage

Snowflake was the first widely adopted cloud data platform which separated storage and computation. This gave users the ability to quickly switch between small, medium, and large data warehouses. It also provided a very familiar standard data warehouse “feel.”

The ability to separate computation and storage allows database software to:

Increase availability.
Scale compute and storage independently.
Reduce costs.

You don’t need to scale an entire data warehouse up or down, and your team can easily pick how much compute is required.

This change, coupled with the fact that Snowflake felt more like a traditional data warehouse, made it very popular. Currently, depending on who you ask, Snowflake has 15% to 18% of the market. Learn more about Snowflake in this video I made: Why everyone cares about Snowflake.

Databricks (Delta Lake). Databricks is very tightly coupled with Spark, which we cover in more depth later. That's because it was developed by the same people. The company was started back in 2013 by the original creators of Spark, a group from UC Berkeley that included Ali Ghodsi, Ion Stoica, and Matei Zaharia. Databricks provides a data platform that combines several managed services, including Spark, Delta Lake, and MLflow.

Delta Lake acts as the storage component for Databricks. It is an open-source storage framework that provides support for ACID transactions, schema enforcement, time travel (meaning rollbacks and historical audit trails), and a steadily growing set of other features.
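To make "schema enforcement" and "time travel" more concrete, here is a toy sketch in plain Python. It is not Delta Lake's API, and the class and method names (VersionedTable, commit, read) are made up for illustration. The idea it shows, under those assumptions, is that every write is appended to a log as a new version, so earlier versions can still be read back.

```python
class VersionedTable:
    """Toy append-only table. Illustrative only, not Delta Lake's API."""

    def __init__(self, columns):
        self.columns = set(columns)
        self._commits = []  # one list of rows per committed version

    def commit(self, rows):
        """Validate rows against the schema, then append them as a new version."""
        rows = list(rows)
        for row in rows:
            # Schema enforcement: reject the whole commit if any row is malformed.
            if set(row) != self.columns:
                raise ValueError(
                    f"row {row!r} does not match schema {sorted(self.columns)}"
                )
        self._commits.append(rows)
        return len(self._commits) - 1  # version number of this commit

    def read(self, version=None):
        """Return all rows as of `version` (the latest version by default)."""
        if version is not None and not 0 <= version < len(self._commits):
            raise IndexError(f"unknown version {version}")
        if version is None:
            version = len(self._commits) - 1
        # Time travel: replay only the commits up to the requested version.
        return [row for commit in self._commits[: version + 1] for row in commit]


table = VersionedTable(columns=["id", "amount"])
v0 = table.commit([{"id": 1, "amount": 10}])
v1 = table.commit([{"id": 2, "amount": 20}])

latest_rows = table.read()             # both rows
rows_at_v0 = table.read(version=v0)    # only the first commit
```

Real systems store such a log alongside data files rather than in memory, but the versioned-log idea is the part worth taking away.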

The truth is that tools like Snowflake and Databricks are both far more than storage. But they also tend to be where we store data.

Apache Iceberg. I have to include Apache Iceberg because solutions like it are purely about storage. Unlike Databricks and Snowflake, which manage both compute and storage, Iceberg is a high-performance table format for huge analytic tables. It brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables at the same time.

Data Processing Engines

Presto/Trino. Presto was developed at Facebook as an open-source distributed query engine which supports much of the SQL analytics workload at Facebook. Presto’s Connector API allows plugins to provide a high performance I/O interface to dozens of data sources, including Hadoop data warehouses, RDBMSs, NoSQL systems, and stream processing systems.

What is interesting about Presto is that, due to some legal complications, its original creators were forced to fork Presto’s initial open-source project and develop what is now called Trino. They reveal more about this split in the recently published article, Why leaving Facebook/Meta was the best thing we could do for the Trino Community.

Spark. Apache Spark started in 2009 as a research project at UC Berkeley's AMPLab. You can read more about it in the paper, “Spark: Cluster Computing with Working Sets.” Spark's goal was to keep the fault tolerance and scalability of Hadoop MapReduce, while providing the ability to reuse a working set of data across multiple parallel operations. What I personally enjoy about Spark is that it can be used from so many different languages: SQL (through Spark SQL), Python, Scala, Java and R can all be used to process data.

Orchestration

Airflow is currently one of the more popular orchestrators. It was developed by Maxime Beauchemin at Airbnb to address a recurring problem: workflow orchestration tools were always one generation behind the needs of their users. Airflow was written to be “configuration as code,” using Python as its language of choice. Thanks to what I consider a relatively straightforward approach to defining DAGs (directed acyclic graphs) and the tasks within them, as well as the fact that it is open source, Airflow was quickly adopted.
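If DAGs are new to you, the sketch below shows the core idea using only the Python standard library (graphlib, available from Python 3.9). It is not Airflow's API. The task names and the run_dag helper are hypothetical, and a real orchestrator adds scheduling, retries, logging and much more on top.

```python
from graphlib import TopologicalSorter


# Hypothetical tasks. Each one receives the results of the tasks that ran before it.
def extract(results):
    return [{"user": "a", "amount": 10}, {"user": "b", "amount": 5}]


def transform(results):
    return [row for row in results["extract"] if row["amount"] > 5]


def load(results):
    return len(results["transform"])  # pretend we wrote this many rows


TASKS = {"extract": extract, "transform": transform, "load": load}

# The DAG itself: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_dag(tasks, dependencies):
    """Run every task after all of its upstream tasks have finished."""
    results = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        results[name] = tasks[name](results)
    return order, results


order, results = run_dag(TASKS, DEPENDENCIES)
```

Because the dependencies are plain data, the same definition can be version-controlled, reviewed and tested like any other code, which is the appeal of “configuration as code.”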

Prefect is also a Python-based orchestration tool. Unlike Airflow, which involves a little more of a learning curve in terms of understanding how DAGs are created, Prefect is much more Pythonic, meaning that defining a workflow feels like writing ordinary Python code. It also has built-in workflow version control and is far easier to test compared to Airflow. It’s often hard to discuss Prefect without referring to Airflow, as part of the reason it was developed was to improve on some of Airflow’s limitations.

Dagster was developed by Nick Schrock, who co-created GraphQL. Dagster takes what its creators call a declarative approach to data management, which starts with using code to define the data assets you want to exist. These asset definitions, version-controlled through git and inspectable via tooling, allow anyone in your organization to understand your canonical set of data assets, allow you to reproduce them at any time, and offer a foundation for asset-based orchestration. Similar to both Airflow and Prefect, Dagster is Python-based.
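As a rough illustration of what "defining the data assets you want to exist" can look like, here is a toy sketch in plain Python. It is not Dagster's API. The asset decorator, the ASSETS registry and the materialize helper are all hypothetical names, and the sketch assumes that an asset's dependencies are simply the names of its function parameters.

```python
import inspect
from graphlib import TopologicalSorter

ASSETS = {}  # hypothetical registry: asset name -> function that builds it


def asset(fn):
    """Register a function as the definition of the asset named after it."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    return [
        {"customer": "alice", "amount": 20},
        {"customer": "bob", "amount": 35},
        {"customer": "alice", "amount": 15},
    ]


@asset
def revenue_by_customer(raw_orders):
    # The parameter name declares a dependency on the raw_orders asset.
    totals = {}
    for order in raw_orders:
        totals[order["customer"]] = totals.get(order["customer"], 0) + order["amount"]
    return totals


def materialize(assets):
    """Build every asset after the assets it depends on."""
    deps = {
        name: set(inspect.signature(fn).parameters) for name, fn in assets.items()
    }
    values = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = {dep: values[dep] for dep in deps[name]}
        values[name] = assets[name](**inputs)
    return values


values = materialize(ASSETS)
```

The difference from a task-based DAG is one of emphasis: you describe the tables or datasets that should exist and how each is derived, and the execution order falls out of those definitions.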

Other tools

The above list barely scratches the surface in terms of tools and tool categories. There are streaming solutions like Kafka, and Debezium for streaming database changes. There are also tools like dbt, which have grown in popularity, have arguably created their own new roles, and now have a fair share of competing products such as Coalesce. In the end, there are just far too many tools to truly keep track of.

2. Where is Data Engineering headed?

Data engineering has changed a lot during the past few decades. Prior to 2010, you probably would not have come across the term. Perhaps you had heard of DBAs or ETL developers, but somewhere around the turn of the decade the data scientist title was popularized, and so was data engineering.

Specialization of Analytics Engineering - Even in the past decade a lot has changed. The title of analytics engineer was developed and popularized as a role which can take over a lot of the analytics-side work that data engineers used to do.

Real-Time Is Becoming Easier - Besides the increasing specialization of data engineering and analytics engineering, we are also seeing problems such as streaming become approachable with much more robust and easier-to-manage tools.

As a result, real-time data analytics is becoming feasible for a broader range of companies. This could have an impact on other layers of data engineering work, such as reducing the amount of batch processing required. It would also require a lot of buy-in from those who are used to the batch processing paradigm, which will likely be a much larger hurdle.

Treating data as a product vs as a pure IT function - Another trend is treating data more like a product. This aligns with the data mesh paradigm which has been gaining traction in the past 2-3 years. With this mentality, rather than having a strong centralized data team, data teams often work on specific nodes which are generally attached to a domain. Essentially, these are “microservices” for data. With this approach, the goal is to take data from being merely an operational byproduct to being a purposely developed feature which can be used to build further applications and value.

Renaissance of Best Practices - Another trend is a rekindled interest in best practices. For example, the head of data at Convoy, Chad Sanderson, has written several posts regarding the importance of data modeling and how much it has been neglected in the past 3-5 years. He, along with several other data experts, is leading the charge to once again make data modeling important.

In the end, data engineering is going through a lot of changes, many of which have been driven by increasing demands for data, but also changing application design paradigms.

3. Getting into Data Engineering as a Software Engineer

It’s not uncommon for me to be asked by software engineers or IT specialists,

“How do I break into data engineering?”

Software engineers often have a solid foundation to transition into data engineering. If you’ve been programming and developing software for the past few years, you should already be familiar with languages such as Python and have a baseline in SQL. There are a few key areas I have noticed software engineers usually either need to learn about, or polish up on. Most of these come down to the fact that you’re now dealing with billions of rows, not a single transaction’s worth of data. It’s the OLTP vs OLAP paradigm switch.

In turn, most software engineers need to improve or dig into:

Data warehouses / data lakes / data modeling.
Advanced SQL - Beyond just CRUD.
Specialized frameworks such as Spark and Flink.
Understanding data pipelines and all the steps from source to target.
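To illustrate the OLTP vs OLAP switch and what "beyond just CRUD" can mean, here is a small sketch using Python's built-in sqlite3 module. The orders table and its columns are made up for the example, and the window function assumes the bundled SQLite is version 3.25 or newer. A real warehouse would run the second kind of query over far more rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, order_day TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, order_day, amount) VALUES (?, ?, ?)",
    [
        ("alice", "2023-01-01", 20.0),
        ("bob", "2023-01-01", 35.0),
        ("alice", "2023-01-02", 15.0),
        ("bob", "2023-01-03", 10.0),
    ],
)

# OLTP-style access: look up a single row by its key.
single_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (1,)
).fetchone()

# OLAP-style access: scan the whole table, aggregate per day,
# then use a window function to compute a running total.
daily_revenue = conn.execute(
    """
    WITH daily AS (
        SELECT order_day, SUM(amount) AS revenue
        FROM orders
        GROUP BY order_day
    )
    SELECT order_day,
           revenue,
           SUM(revenue) OVER (ORDER BY order_day) AS running_revenue
    FROM daily
    ORDER BY order_day
    """
).fetchall()

for row in daily_revenue:
    print(row)
```

The first query touches one row and is what application code does all day. The second reads every row to answer a question about the business, which is the kind of query data models and warehouses are designed around.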

I’d also like to add that some software engineers move into data engineering roles assuming they will be working on distributed systems or developing infrastructure for data teams. This isn’t always the case.

Roles focused on that kind of work are generally titled “software engineering, data” or “data infra” at many Big Tech companies. That being said, there are plenty of large tech companies whose data engineer role is more like a software role. It’s just not always apparent from the job description.

Conclusion

Data engineers play an important role in creating data models and assets which analysts and product managers can easily use. But where is all this headed?

Data engineering has, for a long time, struggled with a case of “middle child syndrome.” Even in data driven companies, data engineering has often felt like a blocker to software engineers, analysts and data scientists who want to get their work out quickly.

Software engineers produce the application that end users actually use, while data scientists and analysts provide insights and models which can increase revenue and decrease costs. Tying their work to the company’s bottom line is relatively easy. Doing the same for data engineers is far from trivial.

Data engineering is still often viewed as legacy IT and as a cost center. However, I do believe this view is slowly changing.

Airbnb wrote about how they placed a massive bet on data quality. In this article, they share how they ran into the problem of not having a strong enough data engineering team to manage the initiative in pursuit of data quality. They realized data quality and ownership were key to improving their data strategy, which led them to the conclusion that they needed an expanded data engineering team. Hopefully, this came with a realization of the value of data engineering and how hard it really is to replace this function.

The idea that data engineering is becoming more valued is supported by the further specialization taking place in the field. Specialized roles, such as analytics engineers, are now more common in the industry.

Splitting the heavily technical work from the more business-focused logic implementation and delivery layers is proving a necessity in a world which wants to operationalize data and possibly turn it into a new core for data applications. At least, that's what Snowflake and Databricks are betting on. This North Star is still far away from where we currently are. However, at companies which value data and essentially rely on it to stand out, the ability to hire, train and manage strong teams of data engineers will be a differentiator.

Data engineering faces a lot of challenges driven by the increasing size, complexity and speed of data. However, one of the most important upcoming challenges I foresee is actually quite an old one.

Translating business data into approachable data sets for analysts and data scientists is still a major challenge. Business applications and internal systems are messy and hard to truly approach. There isn’t anything unknown about what’s required in terms of data modeling; it all comes down to execution and experience.

There have been a lot of changes in the data engineering world, even in the last few years, so I will avoid any bombastic predictions. But I do feel there is currently a return to the mean, as we figure out how all the new tools developed in the last decade, as well as many of the best practices that have existed for 30 to 40 years or more, find their places.

This is Gergely again.

Thanks very much to Ben for this comprehensive and broad overview of data engineering. I personally find it refreshing to have a system for thinking about data engineering terms and the tools which we hear lots about just from being in the tech industry, such as Snowflake or Databricks.

Several software engineer friends of mine have made the leap into data engineering in the past few years, and most of those I’ve talked with have been happy with their decision. Some mentioned the same downsides which Ben summed up: data engineering is a new field, businesses often don’t fully understand it, and career paths can feel limited in some ways.

What is certain is that every business has more data and wants to put that data to good use. As well-rounded software engineers, understanding how to work with these large data sets is the least we should do. And, given the opportunity, it could be a great challenge to get your hands dirty by building data engineering solutions.

If you’d like to learn more about the data engineering field, please get in touch with Ben, the Seattle Data Guy:

Subscribe to the Seattle Data Guy Newsletter
Subscribe to the Seattle Data Guy YouTube channel
Browse the Seattle Data Guy blog

I hope you found this detailed overview just as useful as I did!
