(S)FTP Import task¶
This task lets you import a CSV formatted file or an XLSX file from an FTP or SFTP server.
For a FTP server, you need to provide a valid login/password. For a (S)FTP server, you can connect either with a login/password or with a keyfile.
Date formats for CSV files¶
When your CSV files has date or datetime columns, you must check in the output storage that these columns are detected as datetime. If not, you won't be able to filter on date ranges, to group the rows (e.g. by month or by quarter).
If the output columns are not datetime, the reason is that our engine didn't detect the content of your columns as datetime. The most common reason is that your data format is not in our list of detected formats, which is:
- "%Y-%m-%dT%H:%M:%S"
- "%Y-%m-%dT%H:%M:%S.%fZ"
- "%Y-%m-%d %H:%M:%S"
- "%Y-%m-%d %H:%M:%S.%f"
- "%d-%m-%Y"
- "%Y-%m-%d"
- "%d/%m/%Y"
- "%Y/%m/%d"
- "%m-%d-%Y"
- "%Y-%d-%m"
- "%m/%d/%Y"
- "%Y/%d/%m"
If your format is not in this list, you need to manually tell our engine that a given column is a datetime. To
do this, use the force types
field (see documentation below).
Use ISO format as much as you can
If you can, use the ISO format "%Y-%m-%dT%H:%M:%S" (e.g. 2019-10-21T14:45:23
).
It will be compatible with most of the tools you need to interact with.
Keyfile password¶
In case your keyfile is encrypted, you need to provide here its password.
Folder¶
Specify the folder on the (S)FTP server where the files will be requested.
File pattern¶
If you know the exact name of the file you want to import, enter it here. If you want to use a wildcard, use the
*
character. For example, if your files are my_data-2019-11-01.csv
, my_data-2019-11-02.csv
, ... you can use the
pattern my_data-*.csv
.
Wildcard pattern matching mode¶
In case the pattern matches several files on the server, the filenames are sorted in alphanumerical order. For example,
files ["my_data-2019-11-02.csv", "my_data-2019-11-03.csv", "my_data-2019-11-01.csv"]
will be sorted to
["my_data-2019-11-01.csv", "my_data-2019-11-02.csv", "my_data-2019-11-03.csv"]
. Then, by default, the last file of
the sorted filenames will be loaded (e.g. my_data-2019-11-03.csv
in this example).
In case you want to load the first file of the sorted filenames, use this option Wildcard pattern matching mode
to
specify to load the first file (e.g. to load my_data-2019-11-01.csv
in the above example).
Note that only one file is loaded per execution. If you need to load multiples files, you need to trigger several times the task.
Force types¶
In some specific cases, you may want to force the data types for some columns. You need to type them in a JSON format (note that you must use double quotes and not single quotes).
Example:
{"mycolumn1": "int32", "mycolumn2": "float64", "mycolumn_that_should_be_a_date": "datetime"}
The provided types must be one of int32
, int64
, float32
, float64
, datetime
, str
, str(X)
.
int64
is an integer on 64 bits (8 bytes), also referred asBigInt
in some databases.float64
is a double precision float.str(X)
is a varchar column with lengthX.
Which column name to use
You may have noticed that the output storage column names may be slightly different from your file column names.
When you specify the forced type of a column, you need to use the column name in the output storage, and not the
name from your file. For example, if you have a date of creation
column in your file, the output column name will
be date_of_creation
. You need to use this date_of_creation
name in the forced type field.
Flow mode¶
Full (default)¶
Replace all rows in storage with rows from the input file. All the data from the existing storage is deleted, and then the rows from the input file are loaded. The storage table is (re)created with columns from the imported file. So you can add/remove columns in the CSV file between imports in this mode.
Append¶
Append all rows from the input file to the storage. The columns in the loaded file must be exactly the same as in the storage table.
Key-based Update¶
If you choose this option, you must provide the name of the key column in the Flow key
field.
In this mode, for a given row in the input file:
- if its key already exists in the storage, the row in the storage is replaced by the row from the input file.
- else it is append to the storage
Note that the columns in the loaded file must be exactly the same as in the storage table.
Tip: use the filename as update key
You can use the special value $$SRN_FILENAME$$
for the Flow key
field. When using this value, the import will
update rows
coming from the imported filename which are already in the output storage. For example, if you have previously
imported a file name 'data_2019_06_04.csv' and this file has been modified on your server and you need to reimport
it: use the special $$SRN_FILENAME$$
update key and rows previously imported from this file will be updated and
new rows will be added.
This option requires the option Add filename as column
to be enabled.
Incremental Update¶
If you choose this option, you must provide the name of the key column in the Flow key
field.
The algorithm gets the max of the flow key column in the output storage and append rows from update dataset where key is strictly above this value. This is useful to add new Ids or new data according to a datetime column.
Note that the columns in the loaded file must be exactly the same as in the storage table.
Incremental Forced Update¶
If you choose this option, you must provide the name of the key column in the Flow key
field.
The algorithm gets the min of the flow key column in the update dataset (key column), remove all rows in output storage where key is above (or equal) this value, and then append the full update dataset to the output storage.
Note that the columns in the loaded file must be exactly the same as in the storage table.
Delete file on (S)FTP Server¶
When this option is enabled, the imported file will be deleted from server right after the import.
Add filename as column¶
When this option is enabled, a column __srn__filename__
is added to the output table. For each row, it contains the
filename it comes from. This is especially useful when using an update mode.