Open source Apache Airflow 2.9 advances data orchestration as AI usage grows

Credit: Apache Airflow

The open-source Apache Airflow 2.9 release is out today, providing users of the widely deployed data orchestration platform with a series of advanced capabilities.

The Apache Airflow technology was created by Airbnb as a way to help better organize and manage the flow of information in data pipelines. Airflow integrates with major data platforms and cloud provider systems including Snowflake, Databricks, AWS, Microsoft and Google Cloud. Airflow also benefits from commercial support and an expanded enterprise platform that is developed by Astronomer, which is one of the leading contributors to the project.

The Airflow project has a concept for data pipelines known as Directed Acyclic Graphs (DAGs) to organize and execute tasks on data. With the new Airflow 2.9 update, the open-source technology has expanded how DAGs can be used to more effectively create and schedule datasets. There is also a series of enhancements to dynamic task mapping for more parallel processing capabilities and better visibility into task status.
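In code, a DAG is simply a Python file. What follows is a minimal sketch using the TaskFlow API available in Airflow 2.x; the task names and payload are illustrative rather than taken from the release.

```python
# A minimal, illustrative Airflow DAG using the 2.x TaskFlow API.
# Task names and data are placeholders, not part of the 2.9 release notes.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract():
        # Pretend to pull rows from a source system.
        return {"rows": 100}

    @task
    def load(payload: dict):
        # Downstream task runs only after extract() succeeds.
        print(f"Loaded {payload['rows']} rows")

    load(extract())


example_pipeline()
```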

The Airflow 2.9 update comes as usage of the open-source technology, particularly to help enable data for AI use cases, is growing. According to the recent 2024 State of Airflow report, there are now approximately 30 million downloads of the open-source technology every month.


“Airflow is almost 10 years old at this point and the fact that it’s continuing to grow at this rate is always incredible to see,” Julian LaNeve, CTO at Astronomer, told VentureBeat. 

What’s new in Airflow 2.9

A core focus of the Airflow 2.9 update is enhancements to dataset objects to make Airflow easier and more intuitive to use.

With a dataset, Airflow has an awareness of the underlying data that it is orchestrating. Among the new dataset enhancements is conditional scheduling, which allows pipelines to run based on conditional expressions involving datasets, rather than running everything whenever any dataset updates. This gives users more flexibility in how they define dependencies between pipelines and data updates.

“So you can say – I want this pipeline to run when either this Snowflake table or this other Snowflake table updates,” LaNeve explained. “Before it was everything all at once or nothing at all, so now you get to go do more advanced use cases with conditional operators.”
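In code, that maps to combining Dataset objects with logical operators in a DAG’s schedule. The sketch below assumes Airflow 2.9’s dataset-expression syntax; the Snowflake-style URIs are placeholders.

```python
# A sketch of Airflow 2.9's conditional dataset scheduling.
# The Snowflake-style URIs are illustrative placeholders.
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders = Dataset("snowflake://prod/analytics/orders")
customers = Dataset("snowflake://prod/analytics/customers")


# `orders | customers` runs the DAG when either table's dataset updates;
# `orders & customers` would wait until both have updated.
@dag(schedule=(orders | customers), start_date=datetime(2024, 1, 1), catchup=False)
def downstream_reports():
    @task
    def refresh_reports():
        print("Refreshing reports after an upstream dataset update")

    refresh_reports()


downstream_reports()
```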

Airflow opens up with a new API

Airflow 2.9 also introduces a new API capability that significantly widens how the technology can be used. Previously, datasets were understood and controlled only within the context of Airflow itself. With the new API, other tools and systems that deliver data to or receive data from Airflow can now connect to Airflow’s understanding of datasets.

For example, a common use case is data vendors dropping files into Amazon S3 cloud storage buckets. Now with the API, Airflow can be made aware when new data arrives in S3, and it can automatically trigger an update to the corresponding dataset object in Airflow. This then allows Airflow to kick off any dependent pipelines.
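A rough sketch of that flow: an external process, such as a notification handler watching the bucket, posts a dataset event to Airflow’s REST API so that dependent pipelines can fire. The endpoint below is the dataset-events API added in 2.9; the host, credentials and S3 URI are placeholders.

```python
# Sketch: an external system marks an Airflow dataset as updated via the
# dataset-events REST endpoint added in Airflow 2.9. The host, credentials
# and S3 URI below are placeholders for a real deployment.
import requests

AIRFLOW_HOST = "http://localhost:8080"  # assumed Airflow webserver URL

resp = requests.post(
    f"{AIRFLOW_HOST}/api/v1/datasets/events",
    auth=("admin", "admin"),  # placeholder basic-auth credentials
    json={
        # Must match the URI of a Dataset that DAGs are scheduled against.
        "dataset_uri": "s3://vendor-bucket/daily-drop/",
        "extra": {"source": "s3-notification-handler"},  # optional metadata
    },
)
resp.raise_for_status()
print(resp.json())
```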

Another useful update in Airflow 2.9 concerns dynamic task mapping, a feature that lets users process a list of items, such as a set of files, in parallel instead of serially. Airflow 2.9 provides more visibility into the status of each individual mapped task: users can now more easily see which specific tasks have succeeded, which are still running and which have failed, making it simpler to monitor and troubleshoot work processed in parallel.
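As an illustration, dynamic task mapping fans a task out over a runtime list with .expand(), and each mapped instance carries its own status in the UI. A minimal sketch, with placeholder file names:

```python
# A minimal sketch of dynamic task mapping: process.expand() creates one
# mapped task instance per file, run in parallel, each with its own status.
# File names are placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def mapped_pipeline():
    @task
    def list_files():
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def process(file: str):
        print(f"Processing {file}")

    process.expand(file=list_files())


mapped_pipeline()
```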

The state of Airflow sees more AI usage than ever before

Astronomer recently released the 2024 State of Apache Airflow report, revealing key trends in how the open-source data orchestration technology is being used. Increasingly, that usage is for AI use cases.

LaNeve noted that with modern generative AI-based applications, organizations are commonly supplementing large language models (LLMs) with their own data.

“That means that these Gen AI platforms are in essence an extension of your data platform,” he said. “Airflow is already orchestrating your entire data platform and so it’s naturally now starting to extend into these Gen AI platforms.”

In the 2024 State of Airflow report, nearly a quarter of users said they are using Airflow for gen AI and machine learning workloads. By contrast, LaNeve said that a year ago adoption was near zero. The spike in Airflow adoption to support AI workloads is all about the need for reliable data.

“When we deploy these things, we need data to be up to date, you need the right level of observability and you need your data to be reliable and this is exactly what Airflow is already delivering for your data platform,” LaNeve said. “Today the numbers are 24 or 25% adoption for AI, I think next year it’s gonna be much higher as we start to see even more adoption of LLMs especially in production.”