Pipeline-as-code in YAML
In Serenytics, you can create pipelines to manipulate data using the user interface. You can also define a list of data transformations in a YAML file and execute it.
YAML pipelines handle three types of operations:
- sync a list of datasources to another server
- run a transformation step (select, groupby, where)
- run a raw SQL query and store the result in a new table
The advantages of using the YAML version rather than the UI version are:
- a single file can contain many operations
- you can manage it in Git to track versions
- deploying it on a Serenytics instance is a single operation
- running it is a single operation
The more you need to industrialize your data pipelines, the more valuable pipeline-as-code becomes.
The next sections give details about the YAML structure to use.
Here is an example with 3 steps:
```yaml
configuration:
  notes: Here is a simple dataflow that syncs two tables from a first datawarehouse to a second one and runs an ETL step

steps:
  - name: Run a simple ETL step in a table in ClickHouse
    type: TransformationStep
    inputDataSource: postgredemo <> orders
    outputDataSource: ch_users_by_category
    outputFolder: orders_analysis
    transformation:
      select:
        - name: customer_id
          agg: dcount
          alias: nb of users
        - name: category
      groupby:
        - name: category
      where:
        - name: category
          op: contains
          value: Bikes

  - name: Run a simple SQL Query
    type: SQLStep
    inputDw: postgredemo
    outputDw: postgredemo
    outputDatasource: pg_output_sql_yaml_step
    outputFolder: output_etl_yaml
    query: select category from orders

  - name: Sync PostgreSQL to ClickHouse
    type: SyncListOfDatasources
    outputDw: Managed ClickHouse DW
    inputDataSources:
      - postgredemo <> customers
      - postgredemo <> orders
    outputDataSourcesPrefix: ch
    outputFolder: synced tables
```
TransformationStep
Get data from an input datasource, apply a transformation to it, and store the result in the output datasource.
If both input and output are on the same server (except for ClickHouse), the transformation is executed as a single SQL query. Otherwise, the result of the transformation is stored in a file which is then loaded into the destination datasource.
Parameters:
inputDataSource
: Name of the input DataSource

outputDataSource
: Name of the output DataSource

outputDw [Optional]
: If the outputDataSource does not already exist and this param is defined, the outputDataSource will be created in this dw; otherwise, it will be created in the dw of the inputDataSource

outputFolder
: Name of the output folder where the output DataSource will be created if it does not exist

transformation
: Description of the transformation to apply. Must have 3 children: select, groupby and where.
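As an illustration, here is a sketch of a step writing its output to a dw different from the input's (the datasource, dw and folder names are made up; the transformation node is detailed in the next sections):

```yaml
- name: Aggregate orders by category
  type: TransformationStep
  inputDataSource: postgredemo <> orders
  outputDataSource: orders_by_category
  # outputDw is only used if orders_by_category does not exist yet
  outputDw: Managed ClickHouse DW
  outputFolder: orders_analysis
  transformation:
    select:
      - name: category
    groupby:
      - name: category
    where:
      - name: category
        op: contains
        value: Bikes
```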
Describe the transformation to apply

Select node
Parameters are:
name
: name of the column

agg [Optional]
: aggregation function to use if a groupby is defined. Values are: sum, avg, min, max, dcount

alias [Optional]
: alias for the column in the output
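For example, a select node combining an aggregated column (with an alias) and a plain column could look like this (the column names are illustrative):

```yaml
select:
  - name: amount        # illustrative column, aggregated with sum
    agg: sum
    alias: total amount
  - name: category      # plain column, typically also listed in groupby
```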
Groupby node

Parameters:
name
: name of the column
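Since groupby is a list, grouping on several columns should just be a matter of listing several entries (assuming multi-column grouping is supported; the column names are illustrative):

```yaml
groupby:
  - name: category
  - name: country
```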
Where node
List of where conditions. Each entry must have the following parameters:
name
: name of the column

op
: operator in the condition. Values are: ==, <>, >, >=, <, <=, in, !in, is_defined, is_not_defined, is_empty, is_not_empty, contains, !contains, is_true, is_false

value
: value in the condition
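For example, here is a sketch of a where node with two conditions (presumably combined with AND, as in SQL; the column names and values are made up). Note that operators such as >= or !contains must be quoted to be valid YAML:

```yaml
where:
  - name: amount        # illustrative numeric column
    op: '>='            # quoted: a bare >= is not valid YAML here
    value: 100
  - name: category
    op: '!contains'     # quoted: a bare ! starts a YAML tag
    value: Bikes
```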
SQLStep
Run a SQL query in a datawarehouse and store the result in a table of a datawarehouse. Input and output can be two different datawarehouses or the same one.
Parameters:
inputDw
: Name of the input dw

outputDw
: Name of the output dw

outputDatasource
: Name of the output datasource (created if not found)

outputFolder
: Name of the output folder where datasources that do not exist will be created

query
: The SQL query
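For instance, here is a sketch of a step reading from one dw and writing the result into another (the dw names, table and column are made up):

```yaml
- name: Copy recent orders into ClickHouse
  type: SQLStep
  inputDw: postgredemo
  outputDw: Managed ClickHouse DW
  outputDatasource: ch_recent_orders
  outputFolder: output_etl_yaml
  query: select * from orders where order_date >= '2024-01-01'
```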
Full sync step (SyncListOfDatasources)
Synchronize a list of datasources from one dw to another.
Destination datasources have the same names as the input datasources, so you must define a prefix or postfix to differentiate input and output.
Parameters:
outputDw
: Name of the output dw

inputDataSources
: List of datasources to synchronize

outputDataSourcesPrefix
: Prefix to prepend to the names of the output datasources

outputDataSourcesPostfix
: Postfix to append to the names of the output datasources

outputFolder
: Folder where output datasources that do not already exist will be created
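For example, here is a sketch of the same sync as in the introduction, using a postfix instead of a prefix (the postfix value is illustrative):

```yaml
- name: Sync PostgreSQL tables to ClickHouse
  type: SyncListOfDatasources
  outputDw: Managed ClickHouse DW
  inputDataSources:
    - postgredemo <> customers
    - postgredemo <> orders
  outputDataSourcesPostfix: _synced   # outputs: customers_synced, orders_synced
  outputFolder: synced tables
```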