Sync datasource to storage¶
This automation can be used to synchronize rows from an input datasource into a storage (i.e. a table in a datawarehouse). A classic use case is when you have a SQL server with sales and customers and you want to synchronize it into a datawarehouse dedicated to BI. There are options to enable a simple or incremental synchronization.
Full Mode (default)¶
In this mode, the input datasource is fully exported and then rows are imported in the destination storage. If the destination storage had rows before the operation, these rows are deleted.
This is the simplest mode. It is the mode to choose if your datasets are not too large.
Incremental Append¶
This mode is used to synchronize a datasource into a storage by batch. This is useful when you need
to synchronize a large SQL table into a storage and the Full Mode
takes too much time (or does not work
because of the quantity of data to transfer).
In this mode, you need to define:
- a
Key
column: this key must be unique for each row. And it must be only increasing (i.e. new rows created in your input table must have key values above previous values). This is often a tech id (e.g. ticket_id for a table with sales). - A
max number of rows per iteration
: this is the number of rows that each execution will retrieve (at maximum).
At each iteration, the automation will extract the max value of the Key
column existing in the destination
storage. Then, it will query for the rows in the input datasource where the Key
value is above this max value.
And all these rows will be copied from the input datasource into the storage. If the maximum number of rows is
reached, only the rows having the smallest values of Key
are synchronized.
For example, let's say you have a table with 10 millions rows. You can set the limit to 1 millions rows. If you run the automation 10 times, all the rows will be synchronized in the destination storage.
Of course, you can schedule this automation to add new rows on a regular basis.
Modified rows in input datasource
Keep in mind that this mode does not update the modified rows. Once a row is synchronized from the input into the destination, even if it is modified in the input datasource, it won't be updated in the destination storage. If you need this behavior, please use the Key-based Update mode.
Use this Incremental Append mode if you need to synchronize rows which are only created and never modified in the input datasource.
Both datasets (destination and update) must have the exact same data model (i.e. columns and types).
Key-based update mode¶
This mode allows you to update a destination storage with modified data from an input datasource.
In this mode, you need to define:
- a
Key
column: this key must be unique for each row. In this mode, it does not have to be increasing. This is often a tech id (e.g. ticket_id for a table with sales) Last modified date
column: the name of thelast modified date
column in the input datasource. This column is used to retrieve the data that has been modified recently.Nb days backward in source to use as update
: the number of days backward that the algorithm will look to search for new data. Default value is 30 days.
For simplicity, let's say you used 5 for Nb days backward in source to use as update
. With this value,
when the automation is executed, it will query the input dataset for all the rows where the column
Last modified date
is after Now() - 5 days
.
Then, for each row of this dataset, its content is copied from the input datasource into the destination storage:
- for a given row, if its
Key
value does not exist in destination storage, it is copied as a new row (this is an "INSERT"). - for a given row, if its
Key
value already exists in the destination storage, the row is updated with its new content from the input datasource (this is an "UPDATE").
Because this algorithm enable UPDATE and INSERT, it can be called an "UPSERT" algorithm.
Both datasets (destination and update) must have the exact same data model (i.e. columns and types).
Deletion
Keep in mind that this algorithm does not handle deletion. If a row is deleted from the input datasource,
it won't be deleted in the destination storage. Usually, deletion is achieved with a column is_deleted
.
In this case, marking a row as deleted will be synchronized into the destination storage.
How to setup a synchronization for a large table¶
To synchronize a large table, we usually create two automations:
- the first one is the "Init". It uses the Incremental Update mode to copy all the rows from the input datasource into the destination storage. We usually schedule it every five minutes until the copy is entirely finished (and then, it is unscheduled).
- the second one is the "Upsert". It uses the Key-based update mode. It gets only the latest new rows/modifications in the input datasource and update the storage with it.
For the second automation, one advantage of using an update window of a few days (ie. with the parameter
Nb days backward in source to use as update
) is that if the automation fails for any reason (e.g.
the input system is not reachable/down), everything will be fixed when the new execution is done. If you set
this window to one day, and the input system is not reachable for a few days. You will have to increase the
window size before you run it again to fill the gap of missing data.