Background

Description of a workflow and telescope

The Observatory Platform collects data from many different sources and also aggregates those sources. This is done with a variety of workflows: some are aggregation or analytical workflows, while others collect the original data from a single data source. The latter type of workflow is referred to as a telescope. A telescope collects data from a single data source and should capture the data in its original state as much as possible.
A telescope can generally be described with these tasks (a minimal sketch in code follows the list):

  • Extract the raw data from an external source

  • Store the raw data in a bucket

  • Transform the data, so it is ready to be loaded into the data warehouse

  • Store the transformed data in a bucket

  • Load the data into the data warehouse
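The tasks above can be thought of as plain Python functions that are later wired together into a workflow. The sketch below illustrates only the extract and transform steps; the function names, URL handling and file formats are illustrative assumptions, not the actual Observatory Platform API.

    import json
    import urllib.request


    def extract(url: str, download_path: str) -> None:
        """Extract the raw data from an external source and save it locally."""
        urllib.request.urlretrieve(url, download_path)


    def transform(download_path: str, transform_path: str) -> None:
        """Transform the raw data into newline-delimited JSON, a format that
        can be loaded directly into the data warehouse."""
        with open(download_path) as f_in, open(transform_path, "w") as f_out:
            for record in json.load(f_in):
                f_out.write(json.dumps(record) + "\n")


    # Storing the raw and transformed files in a bucket and loading them into
    # the data warehouse would follow as separate tasks, using the cloud
    # provider's client libraries (see the Google Cloud section below).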

Managing workflows with Airflow

The workflows are all managed using Airflow. This workflow management system helps to schedule and monitor the many different workflows. Airflow works with DAG (Directed Acyclic Graph) objects that are defined in a Python script.
The definition of a DAG according to Airflow is as follows:

A dag (directed acyclic graph) is a collection of tasks with directional dependencies. A dag also has a schedule, a start date and an end date (optional). For each schedule, (say daily or hourly), the DAG needs to run each individual tasks as their dependencies are met.

Generally speaking, a workflow is expressed as a single DAG, so there is a one-to-one mapping between DAGs and workflows.
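As an illustration, a minimal DAG defined in a Python script might look like the sketch below (assuming Airflow 2.x; the DAG id, schedule and task callables are placeholder values):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def download():
        print("downloading data")


    def upload():
        print("uploading data")


    with DAG(
        dag_id="example_telescope",
        schedule_interval="@weekly",
        start_date=datetime(2021, 1, 1),
        catchup=False,
    ) as dag:
        download_task = PythonOperator(task_id="download", python_callable=download)
        upload_task = PythonOperator(task_id="upload", python_callable=upload)

        # upload only runs once download has succeeded
        download_task >> upload_task

Each task becomes a node in the graph, and the >> operator declares the directional dependency between them.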

Workflows and Google Cloud Platform

The Observatory Platform currently uses Google Cloud Platform for data storage and as a data warehouse. This means that the data is stored in Google Cloud Storage buckets and loaded into Google Cloud BigQuery, which functions as the data warehouse. To be able to access the different Google Cloud resources (such as Storage buckets and BigQuery), the GOOGLE_APPLICATION_CREDENTIALS environment variable is set on the Compute Engine instance that hosts Airflow.
This is all done when installing either the Observatory Platform Development Environment or the Terraform Environment. Many workflows make use of Google Cloud utility functions, and these functions assume that the default credentials are already set.
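To illustrate, with GOOGLE_APPLICATION_CREDENTIALS set, the Google Cloud client libraries pick up the credentials automatically, so workflow code does not need to handle keys explicitly. The sketch below uses the google-cloud-storage and google-cloud-bigquery packages; the bucket, blob and table names are placeholder values:

    from google.cloud import bigquery, storage

    # Both clients use the default credentials from GOOGLE_APPLICATION_CREDENTIALS
    storage_client = storage.Client()
    bucket = storage_client.bucket("my-transform-bucket")
    bucket.blob("telescope/2021-01-01/data.jsonl").upload_from_filename("data.jsonl")

    bq_client = bigquery.Client()
    load_job = bq_client.load_table_from_uri(
        "gs://my-transform-bucket/telescope/2021-01-01/data.jsonl",
        "my-project.my_dataset.my_table",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
            autodetect=True,
        ),
    )
    load_job.result()  # wait for the load job to finish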

The workflow templates

Initially, the workflows in the Observatory Platform were each developed individually: each workflow had its own unique workflow and release class. After developing a few workflows, it became clear that there were many similarities between the workflows and the classes that had been developed.
For example, many tasks, such as uploading data to a bucket or loading data into BigQuery, were the same across workflows; only variables like filenames and schemas differed. The same properties were also often implemented, for example a download folder, a release date and the many Airflow-related properties such as the DAG id, schedule interval and start date.

These similarities prompted the development of a workflow template that can be used as the basis for a new workflow. Additionally, the template abstracts away the code that creates an Airflow DAG object, making it possible to develop workflows without prior Airflow knowledge.
Having said that, some basic Airflow knowledge can come in handy, as it helps to understand the possibilities and limitations of the template. The template also implements commonly used properties and common tasks, such as cleaning up local files at the end of the workflow.
The initial template is referred to as the ‘base’ template and serves as the base for three other templates that implement more specific tasks for loading data into BigQuery and set some properties to specific values (such as whether previous DAG runs should be run, using the Airflow ‘catchup’ setting). The base template and the other three templates (snapshot, stream and organisation) are all explained in more detail below. Each template also has its own corresponding release class, which contains properties and methods related to a specific release of a data source; these release classes are also explained in more detail below.
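To give a feel for the idea behind the templates, the sketch below shows how a workflow might be built on top of a base template: the base class owns the Airflow settings and boilerplate, while the concrete workflow only registers its tasks. The class and method names here are hypothetical stand-ins, not the actual Observatory Platform API; the real templates and release classes are described in the following sections.

    from datetime import datetime
    from typing import Callable, List


    class BaseTemplateSketch:
        """Hypothetical stand-in for the 'base' template: it stores the Airflow
        settings and an ordered list of tasks, from which the real template
        would build the Airflow DAG object."""

        def __init__(self, dag_id: str, schedule_interval: str,
                     start_date: datetime, catchup: bool = False):
            self.dag_id = dag_id
            self.schedule_interval = schedule_interval
            self.start_date = start_date
            self.catchup = catchup
            self.tasks: List[Callable] = []

        def add_task(self, task: Callable) -> None:
            # Tasks run in the order they are added; the real template would
            # turn them into Airflow operators and chain their dependencies.
            self.tasks.append(task)


    class MyTelescope(BaseTemplateSketch):
        """A concrete workflow only supplies its own tasks."""

        def __init__(self):
            super().__init__(
                dag_id="my_telescope",
                schedule_interval="@weekly",
                start_date=datetime(2021, 1, 1),
            )
            self.add_task(self.download)
            self.add_task(self.transform)
            self.add_task(self.cleanup)

        def download(self, release):
            ...  # extract the raw data and upload it to a bucket

        def transform(self, release):
            ...  # transform the data and upload it to a bucket

        def cleanup(self, release):
            ...  # remove local files at the end of the workflow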