Skip to content

Pipeline-as-code in Yaml

In Serenytics, you can create pipeline to manipulate data using the user interface. You can also define a list of data transformation in a YAML file, and execute it.

Yaml pipelines handle three types of operations:

  • sync a list of datasources into another server
  • run a transformation step (select, groupby, where)
  • run a raw SQL query and store the result in a new table

The advantages of using the Yaml version rather than the UX version are:

  • a single file can contain many operations.
  • you can manage it in Git to manage new versions
  • it's a single operation to deploy it on a Serenytics instance
  • it's a single operation to run it

The more you need to industrialize your data pipelines, the better it is to use pipeline-as-code.

The next sections give details about the Yaml structure to use.

Here is an example with 3 steps:

configuration:
    notes: Here is a simple dataflow that sync two tables from a first datawarehouse to a second one and run an ETL step

steps:
    - name: Run a simple ETL step in a table in clickHouse
      type: TransformationStep
      inputDataSource: postgredemo <> orders
      outputDataSource: ch_users_by_category
      outputFolder: orders_analysis
      transformation:
          select:
            - name: customer_id
              agg: dcount
              alias: nb of users
            - name: category
          groupby:
            - name: category
          where:
            - name: category
              op: contains
              value: Bikes
    - name: Run a simple SQL Query
      type: SQLStep
      inputDw: postgredemo
      outputDw: postgredemo
      outputDatasource: pg_output_sql_yaml_step
      outputFolder: output_etl_yaml
      query: select category from orders
    - name: Sync postGreSQL to ClickHouse
      type: SyncListOfDatasources
      outputDw: Managed ClickHouse DW
      inputDataSources:
          - postgredemo <> customers
          - postgredemo <> orders
      outputDataSourcesPrefix: ch
      outputFolder: synced tables

TransformationStep

Get data from an input datasource, apply a transformation on it, and store the result in the output datasource.

If both input and output are on the same server (except for ClickHouse), the transformation will be executed as a single SQL Query. Otherwise, the result of the transformation will be stored into a file which will be loaded into the destination datasource.

Parameters:

  • inputDataSource: Name of the input DataSource
  • outputDataSource : Name of the output DataSource
  • outputDw [Optional]: If the outputDataSource does not already exist and if this param is defined, the outputDataSource will be created in this dw, otherwise, it will be created in the dw of the inputDataSource
  • outputFolder: Name of the output folder where to create output DataSource if it does not exist
  • transformation: Description of the transformation to apply. Must have 3 children: select, groupby and where.

Describe the transformation to apply

Select node

Parameters are:

  • name: name of the column
  • agg [Optional]: aggregation function to use if a groupby is defined. Values are: sum, avg, min, max, dcount
  • alias [Optional]: alias for the column in the output

Groupby node

Parameters

  • name: name of the column

Where node

List of where conditions. Each entry must have the following parameters:

  • name: name of the column
  • op: operator in the condition. Values are: '==', '<>', ">", ">=", "<", "<=", "in","!in","is_defined","is_not_defined","is_empty","is_not_empty","contains","!contains","is_true","is_false"
  • value: value in the condition

SQLStep

Run a SQL query in a datawarehouse and store the result in a table of a datawarehouse. Input and output can be two different datawarehouses of the same one.

Parameters:

  • inputDw : Name of the input dw
  • outputDw : Name of the output dw
  • outputDatasource: Name of the output datasource (created if not found)
  • outputFolder: Name of the output folder where to create datasources that does not exist
  • query: The SQL query

Full sync step

Synchronize a list of datasources from a dw to another one.

Destination datasources will have the same names as input datasources, but you must define a postfix or prefix to differenciate input and output.

Parameters:

  • outputDw: Name of the output dw
  • inputDataSources: List of datasources to synchronize
  • outputDataSourcesPrefix: Prefix to append to the names of datasources in the output
  • outputDataSourcesPostfix: Postfix to append to the names of datasources in the output
  • outputFolder: Folder where output datasource (not found) will be created.