Pipeline-as-code in YAML
In Serenytics, you can create pipelines to manipulate data using the user interface. You can also define a list of data transformations in a YAML file and execute it.
YAML pipelines handle three types of operations:
- sync a list of datasources to another server
- run a transformation step (select, groupby, where)
- run a raw SQL query and store the result in a new table
The advantages of using the YAML version rather than the UI version are:
- a single file can contain many operations
- you can manage it in Git to track versions
- deploying it on a Serenytics instance is a single operation
- running it is a single operation
The more you need to industrialize your data pipelines, the more valuable pipeline-as-code becomes.
The next sections give details about the YAML structure to use.
Here is an example with 3 steps:
```yaml
configuration:
  notes: Here is a simple dataflow that syncs two tables from a first datawarehouse to a second one and runs an ETL step

steps:
  - name: Run a simple ETL step in a table in ClickHouse
    type: TransformationStep
    inputDataSource: postgredemo <> orders
    outputDataSource: ch_users_by_category
    outputFolder: orders_analysis
    transformation:
      select:
        - name: customer_id
          agg: dcount
          alias: nb of users
        - name: category
      groupby:
        - name: category
      where:
        - name: category
          op: contains
          value: Bikes

  - name: Run a simple SQL Query
    type: SQLStep
    inputDw: postgredemo
    outputDw: postgredemo
    outputDatasource: pg_output_sql_yaml_step
    outputFolder: output_etl_yaml
    query: select category from orders

  - name: Sync PostgreSQL to ClickHouse
    type: SyncListOfDatasources
    outputDw: Managed ClickHouse DW
    inputDataSources:
      - postgredemo <> customers
      - postgredemo <> orders
    outputDataSourcesPrefix: ch
    outputFolder: synced tables
```
TransformationStep
Get data from an input datasource, apply a transformation to it, and store the result in the output datasource.
If both input and output are on the same server (except for ClickHouse), the transformation is executed as a single SQL query. Otherwise, the result of the transformation is stored in a file which is then loaded into the destination datasource.
Parameters:
inputDataSource
: Name of the input DataSource

outputDataSource
: Name of the output DataSource

outputDw [Optional]
: If the outputDataSource does not already exist and this param is defined, the outputDataSource will be created in this dw; otherwise, it will be created in the dw of the inputDataSource

outputFolder
: Name of the output folder where the output DataSource will be created if it does not exist

transformation
: Description of the transformation to apply. Must have 3 children: select, groupby and where.
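As an illustration, here is a sketch of a step writing its output to a dw different from the input's (the datasource, dw and folder names are made up; the transformation node is detailed in the next sections):

```yaml
- name: Aggregate orders by category
  type: TransformationStep
  inputDataSource: postgredemo <> orders
  outputDataSource: orders_by_category
  # outputDw is only used if orders_by_category does not exist yet
  outputDw: Managed ClickHouse DW
  outputFolder: orders_analysis
  transformation:
    select:
      - name: category
    groupby:
      - name: category
    where:
      - name: category
        op: contains
        value: Bikes
```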
Describe the transformation to apply

Select node
Parameters are:
name
: name of the column

agg [Optional]
: aggregation function to use if a groupby is defined. Values are: sum, avg, min, max, dcount

alias [Optional]
: alias for the column in the output
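For example, a select node combining an aggregated column (with an alias) and a plain column could look like this (the column names are illustrative):

```yaml
select:
  - name: amount        # illustrative column, aggregated with sum
    agg: sum
    alias: total amount
  - name: category      # plain column, typically also listed in groupby
```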
Groupby node

Parameters:
name
: name of the column
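Since groupby is a list, grouping on several columns should just be a matter of listing several entries (assuming multi-column grouping is supported; the column names are illustrative):

```yaml
groupby:
  - name: category
  - name: country
```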
Where node
List of where conditions. Each entry must have the following parameters:
name
: name of the column

op
: operator in the condition. Values are: ==, <>, >, >=, <, <=, in, !in, is_defined, is_not_defined, is_empty, is_not_empty, contains, !contains, is_true, is_false

value
: value in the condition
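For example, here is a sketch of a where node with two conditions (presumably combined with AND, as in SQL; the column names and values are made up). Note that operators such as >= or !contains must be quoted to be valid YAML:

```yaml
where:
  - name: amount        # illustrative numeric column
    op: '>='            # quoted: a bare >= is not valid YAML here
    value: 100
  - name: category
    op: '!contains'     # quoted: a bare ! starts a YAML tag
    value: Bikes
```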
SQLStep
Run a SQL query in a datawarehouse and store the result in a table of a datawarehouse. Input and output can be two different datawarehouses or the same one.
Parameters:
inputDw
: Name of the input dw

outputDw
: Name of the output dw

outputDatasource
: Name of the output datasource (created if not found)

outputFolder
: Name of the output folder where datasources that do not exist will be created

query
: The SQL query
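For instance, here is a sketch of a step reading from one dw and writing the result into another (the dw names, table and column are made up):

```yaml
- name: Copy recent orders into ClickHouse
  type: SQLStep
  inputDw: postgredemo
  outputDw: Managed ClickHouse DW
  outputDatasource: ch_recent_orders
  outputFolder: output_etl_yaml
  query: select * from orders where order_date >= '2024-01-01'
```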
Full sync step (SyncListOfDatasources)
Synchronize a list of datasources from one dw to another.
Destination datasources have the same names as the input datasources, so you must define a prefix or postfix to differentiate input and output.
Parameters:
outputDw
: Name of the output dw

inputDataSources
: List of datasources to synchronize

outputDataSourcesPrefix
: Prefix to prepend to the names of the output datasources

outputDataSourcesPostfix
: Postfix to append to the names of the output datasources

outputFolder
: Folder where output datasources that do not already exist will be created
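For example, here is a sketch of the same sync as in the introduction, using a postfix instead of a prefix (the postfix value is illustrative):

```yaml
- name: Sync PostgreSQL tables to ClickHouse
  type: SyncListOfDatasources
  outputDw: Managed ClickHouse DW
  inputDataSources:
    - postgredemo <> customers
    - postgredemo <> orders
  outputDataSourcesPostfix: _synced   # outputs: customers_synced, orders_synced
  outputFolder: synced tables
```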