Skip to content

ETL Step

An ETL step takes data from an input data-source, applies transformations and stores the result in a table of the internal datawarehouse. You can run several ETL steps to process your data. This is usually called a Data-Flow. You can use a Chain-Script to trigger several ETL steps one after each other.

Before using an ETL step, you must have created the output storage table. To achieve that, use the Data menu and create a new datasource of type Storage (in the Serenytics's Datawarehouse tab).

Possible transformations are:

  • Select columns or formulas from the input data source
  • Apply filters
  • Aggregate data according to dimensions
  • Rename columns (by clicking on the columns header in the preview)

In the configuration tab, the displayed table is just a preview. The full computation is not computed. To actually execute store the result in the output table, you must use the Execution tab (either by clicking on Run Now, or by scheduling the automation).

Warning

If your input data source is a Serenytics storage, the computation is optimized to handle large datasets (e.g. hundreds of millions of rows). Because of this optimization, if you aggregate by time dimensions, the preview format of the time dimensions might be different between the preview and the actual result.

Warning

If the input data source is a Serenytics storage, there is no limit on the number of rows you can manipulate with an ETL step. For any other type of data source, the limit is 1.000.000 rows.

Append mode

In default mode, the execution of an ETL step overwrites all the existing rows contained in the output source (and replaces them by the result of its query). When the append mode is activated, the result of the ETL query is appended to the existing rows of the output data source.

Warning

The output source must already have its columns set (e.g. created by a first ETL step without append mode). The output columns (names and types) of the ETL step in append mode must be identical to the existing columns of the current output source. Otherwise, the ETL step will fail.

Another way to run an ETL step in append mode is to call it using a Python script. If you run an ETL step script and specify the append parameter in the script, it will overwrite the append mode defined in the ETL step parameters.

import serenytics
client = serenytics.Client()
MY_SCRIPT_UUID = 'XXXXX'

script = client.get_script_by_uuid(uuid=MY_SCRIPT_UUID)

params = {
  'append': True,
}
script.run(params=params, async_=False)

Remove duplicates

If you enable the Remove duplicates option, an extra step will be added to the processing to delete duplicate rows from the output. This step is applied after the transformation.

The first parameter to choose is the Unique column. This is the column that must contain only unique values in the output table. Very often, this is an ID such as customer_id or product_id.

If you run the ETL step, the output will not contain any duplicate of this value. For example, if you have 5 rows having the value customer_id=128 in the input, only one of these rows will exist in the output. And 4 rows will be dropped.

The next parameters allow you to choose which rows must be dropped in case of duplicates.

If you keep the Default mode value in the Duplicate to keep's mode option, you have no control on this choice.

If you choose Max of a column, the algorithm will sort the duplicates rows and keep the row having the maximum value. The sort column can be selected in the sort column parameter.

And selecting the Min of a column value will do the opposite.

Duplicate are removed after the transformation step

An ETL step will first store the result of your transformation (groupby, select, filter) in the output table. And then, it will remove the duplicates. If your transformation requires the input to have no duplicate, you need to add an extra ETL step (and an extra table) to be run before your transformation step. This is particularly important if you have groupby and $$count$$ in your transformation as the $$count$$ result will include all the duplicates.