Data source formulas

Data source formulas are new data columns computed on-demand. There exists three types of such formulas:

  • simple column
  • conditional column
  • value

Examples

Let's say your data contains a column [Price], a column [Cost] and a column [Date]. Here are some examples of what new columns you can create.

Simple formulas

  • Simple calculus
1
[Price] * 10
  • Using two columns:
1
[Price] / [Cost] * 100
  • With parentheses:
1
([Price] - [Cost]) / [Price] * 100
  • Operation on Date
1
extract_date_part([Date], "year")

Conditional formulas

  • Simple IF condition:
1
[Price] == 100
  • Using and operator:
1
[Price] > 10 and [Country] != "USA"
  • Using or operator:
1
[Price] > 10 or [Cost] < 20
  • Using in operator
1
[Country] in ("USA", "Germany", "France")
  • Using full dates:
1
[Date] > "2010-10-24" and [Date] < "2010-10-30"
  • Using date parts:
1
extract_date_part([Date], "year") >= 2014
  • Using string comparison:
1
like([postal_code], "_____")

Value formulas

  • Simple total:
1
sum([Price])
  • Conditional total:
1
sum_if([Price], [Cost] > 10)
  • Ratio:
1
sum_if([Price], [Cost] > 10) / sum([Price])

Formulas of formulas

You can use a formula within another formula. As an example, you can create a simple formula named "GOLD Product":

1
[Category] in ["Jewelry", "Watch", "Leather goods"]

and then use it in a second formula to compute the ratio of the revenue of these GOLD products within the total revenue:

1
sum_if([Price], [GOLD Product]) / sum[Price])

Date-time functions

extract_date_part()

Extract date part of a datetime column.

Parameters:

  • column: column, formula or function
  • date_part: can be "year", "month", "quarter", "week", "day", "weekday", "hour" or "minute".
  • language[optional]: can be "en", "fr", "num-fr", "num-en". Use this option to get a literal value instead of a numerical value (valid only for months). Using the num prefix makes the data sortable (e.g. 01-January, 02-February, ...).

Example:

1
extract_date_part([MyColumn], "week")

Availability: all data source types.


utc_to_timezone()

Convert a column in UTC to the specified timezone.

Parameters:

  • column: column, formula or function
  • timezone: a valid timezone such as "US/Eastern" or "Europe/Paris". See full list.

Example:

1
utc_to_timezone([MyColumn], "Europe/Paris")

Availability: all data source types.


date_diff()

Return difference between 2 date-times in specified time unit. The result is an integer.

More precisely, date_diff() determines the number of unit of time boundaries that are crossed between the two datetimes. The result is an integer (e.g.: 1 for the difference in days between "2016-10-15 23:00:00" and "2016-10-16 03:00:00").

Parameters:

  • first_date: column, formula or function
  • second_date: column, formula or function
  • time_unit[optional]: a valid time unit. Default: "day". It can be one of: "year", "month", "day", "hour", "minute", "second".

Example:

1
date_diff([MyDateColumn1], [MyDateColumn2], "hour")

Availability: only MySQL, SQL Server, Redshift and Storage. On MySQL, the only time_unit available is the default one: "day".


time_delta()

Return difference between 2 date-times in specified time unit. The result is a float: e.g. 1.083 for the delta between "2016-10-15 14:00:00" and "2016-10-16 16:00:00"

Parameters:

  • first_date: column, formula or function
  • second_date: column, formula or function
  • time_unit[optional]: a valid time unit. Default: "day". It can be one of: "day", "hour", "minute", "second".

Example:

1
time_delta([MyDateColumn1], [MyDateColumn2], "hour")

Availability: All source types.


now()

Return current date-time.

Availability: All source types.


date()

Create a date from year, month and day data.

Parameters:

  • year: column, formula or function that returns the year of the date to create.
  • month[optional]: column, formula or function that returns the month of the date to create. Default: 1 (January).
  • day[optional]: column, formula or function that returns the day of the date to create. Default: 1 (first day of the month).

Example:

1
date([YearColumn], [MonthColumn])

Availability: All source types except Microsoft SQL Server.


date_add()

Add a certain interval to a date.

Parameters:

  • myDate: column, formula or function containing the initial date.
  • interval_type: type of interval to add. It can be one of: "day", "hour", "minute", "second".
  • nb_intervals: number of intervals to add. A negative value will substract the interval.

Example:

1
2
# to add 10 days to a given date
date_add([myDate], "day", 10)

Availability: All source types except files and APIs.


Date-time filtering

For date/time columns, you can use special functions to compute conditions.

Examples

If your data contains a column [Timestamp], you can filter it on:

  • today
1
[Timestamp] in day(0)
  • yesterday
1
[Timestamp] in day(-1)
  • last 24 hours
1
[Timestamp] >= hour(-24)
  • current month
1
[Timestamp] In month(0)
  • last 2 months with a rolling window (= last 62 days)
1
Timestamp] >= days(-62)
  • current month until yesterday but not including today
1
[Timestamp] >= month(0) and [Timestamp] < day(0)
  • current year to date (from first day of the year to yesterday included)
1
[Timestamp] In year_to_date(0)
  • previous month to date (from first day of previous month to same day as today, last month)
1
[Timestamp] In month_to_date(-1)

Note on date/time data bounds

  • timestamp <= period(arg) means all timestamps until the end of the period included
  • timestamp < period(arg) means all timestamps until before the beginning of the period
  • timestamp >= period(arg) means all timestamps since the beginning of the period included
  • timestamp > period(arg) means all timestamps since after the end of the period

  • period(0) means the current period (e.g. today)

  • period(-1) means the previous period (e.g. yesterday), and so on.

  • period_to_date(0) means the current period until yesterday (included). For year_to_date(0), if today is 2018-06-20, the range starts at 2018-01-01 00:00:00 and ends at 2018-06-19 23:59:59

  • period_to_date(-1) means from first day of previous period to same day as today in previous period. For year_to_date(-1), if today is 2018-06-20, the range starts at 2017-01-01 00:00:00 and ends at 2017-06-19 23:59:59

Available functions

  • minute(arg)
  • hour(arg)
  • day(arg)
  • week(arg)
  • week_to_date(arg)
  • month(arg)
  • month_to_date(arg)
  • quarter(arg)
  • quarter_to_date(arg)
  • year(arg)
  • year_to_date(arg)

String functions

like()

Compare column values to pattern.

Parameters:

  • column: column, formula or function
  • pattern: Pattern to match. Can contain wildcards (% matches zero or more characters, _ matches a single character).

Example:

1
2
3
4
# returns true if zip_code length is five and first two digits are "75"
like([zip_code], "75___")
# returns true if url contains the string "/product/"
like([url], "%/product/%")

Availability: all data source types.


regexp_match()

Return True if a string matches a regular expression else returns False.

Parameters:

  • column: column, formula or function.
  • pattern: Regular expression to match.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# returns true if zip_code length is five and first two digits are "75" and last 3 digits are numbers
regexp_match([zip_code], "^75[0-9]{3}$")
# details:
# ^ matches the begining of the string
# 75 matches the string "75"
# [0-9] matches a decimal between 0 and 9
# {3} indicates that [0-9] must be matched exactly three times
# $ matches the end of the string

# returns true if a string is a date with format YYYY-MM-DD
regexp_match([date_str], "^[0-9]{4}-[0-9]{2}-[0-9]{2}$")

Availability: all data sources types except MS SQLServer.


left()

Returns the n first characters of a string.

Parameters:

  • column: column, formula or function
  • n: number of characters to extract

Example:

1
left([MyColumn], 5)

Availability: all data source types.


Returns the last n characters of a string.

Parameters:

  • column: column, formula or function
  • n: number of characters to extract

Example:

1
right([MyColumn], 5)

Availability: all data source types.


str_len()

Returns the length of a string.

Parameters:

  • column: column, formula or function

Example:

1
str_len([MyColumn])

Availability: all data source types.


lower_case()

Returns column formatted in lower case.

Parameters:

  • column: column, formula or function

Example:

1
lower_case([MyColumn])

Availability: all data source types.


upper_case()

Returns column formatted in upper case.

Parameters:

  • column: column, formula or function

Example:

1
upper_case([MyColumn])

Availability: all data source types.


lpad()

Returns column, left-padded to length nb_chars characters with padding_char, replicated as many times as necessary.

Parameters:

  • column: column, formula or function to pad
  • nb_chars: length of output string
  • padding_char: character to use for padding

Example:

1
lpad([MyColumn], 4, "0")

rpad()

Returns column, right-padded to length nb_chars characters with padding_char, replicated as many times as necessary.

Parameters:

  • column: column, formula or function to pad
  • nb_chars: length of output string
  • padding_char: character to use for padding

Example:

1
rpad([MyColumn], 4, " ")

split()

Splits a string on the specified delimiter and returns the part at the specified position.

Parameters:

  • column: column, formula or function
  • delimiter: delimiter string
  • part: position of the portion to return (counting from 1).

Example:

1
split([MyStringColumn], "/", 2)

Availability: all data source types except Microsoft SQL Server.


substring()

Extract a substring from column values.

Parameters:

  • column: column, formula or function
  • first_char: Index of first char extracted. Starts at 1.
  • nb_chars: Length of extracted substring

Example:

1
substring([MyStringColumn], 1, 5)

Availability: all data source types.


Conversion functions

numeric()

Convert string column values to numeric values. If the string cannot be converted to a numeric value, the result is 0.

Parameters:

  • column: column, formula or function

Example:

1
numeric([MyStringColumn])

Availability: only files.


str()

Convert column values to string values.

Parameters:

  • column: column, formula or function

Example:

1
str([MyColumn])

Availability: all data source types.


round()

Convert column values to their closest integer values. Float values are casted to integer values. Undefined values are casted to zero. For example, round(3.1) returns 3 and round(3.9) returns 4.

Parameters:

  • column: column, formula or function

Example:

1
round([MyColumn])

Availability: all data source types.


to_int()

Convert column values to integer values. Float values are casted to integer values. Undefined values are casted to zero.

to_int floors the value, i.e. it returns the largest integer less than or equal to the input value. Use round if you need to round a value to the closest integer number.

Parameters:

  • column: column, formula or function

Example:

1
to_int([MyColumn])

Availability: all data source types.


datetime_from_string()

Convert a string to a Datetime.

The function will return an error if the passed string does not match the given format in a single row. If you are not sure about the quality of your data, use the function regexp_match to check the format. And if it doesn't match for a row, return null() instead of datetime_from_string.

Parameters:

  • string: column, formula or function.
  • format: the format of the input string to convert. If your datasource is a file, use the Python strftime format (see documentation strftime.org). If your datasource is a PostGreSQL table or a Serenytics storage, use the PostGreSQL syntax (see documentation postgresql.org. If your database is a MySQL table, use the MySQL syntax (see documentation dev.mysql.com).

Example:

In this example, the input column contains string such as: 2013--01--23 21:17

1
2
3
4
5
6
7
8
# for a file
datetime_from_string([date_str], "%Y--%m--%d %H:%M")

# for a PostGreSQL or a Serenytics storage
datetime_from_string([date_str], "YYYY--MM--DD HH24:MI")

# for a MySQL table
datetime_from_string([date_str], "%Y--%m--%d %H:%i")

Availability: All source types except Microsoft SQL Server.


json_extract_path_text()

Extract a value within a string representing a JSON. The parameters specify the path to the value. It returns the empty string if the path does not exist in the JSON.

Parameters:

  • string: column, formula or function.
  • path1: first key of the JSON path
  • [path2]: second key of the JSON path
  • up to 5 keys

Availability: Only Serenytics Storages, Redshift and PostGreSQL.

Example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# if the value is '{"first_name":"john", "last_name": "doe"}', returns 'john'.
json_extract_path_text([json_str], "fist_name")

# if the value is '{"first_name":"john", "last_name": "doe", "address": {"city" : "paris"}}'
# returns 'paris'.
json_extract_path_text([date_str], "address", "city")

# if the value is '{"first_name":"john", "last_name": "doe", "address": {"city" : "paris"}}'
# returns an empty string.
json_extract_path_text([date_str], "address", "zipcode")

Ranking functions

is_first()

Return 1 when current value appears for the first time in column, 0 otherwise. Can be used to work with unique values or remove doublons.

Parameters:

  • column: column, formula or function

Example:

1
is_first([MyColumn])

Availability: only files.


ntile()

Sort rows according to column, split them in nb_buckets having the same number of rows and return the id of the bucket each row falls in. First bucket with smallest values has id 1.

Parameters:

  • column: column, formula or function
  • nb_buckets: number of buckets

Example:

1
ntile([MyColumn], 100)

Availability: all data source types.


rank()

Sort rows according to column in order (either "asc" or "desc") and return the rows ranking. If several rows have the same value, they get the same ranking and there will be a discontinuity in the result. If partition_byis specified, the ranking is computed for each value of this column.

Parameters:

  • column: column, formula or function
  • order: either "asc" or "desc"
  • partition_by[optional]: column, formula or function

Example:

1
rank([MyColumn], "asc")
1
2
# compute the orders rankings, by customer
rank([order_date], "asc", [order_customer_id])

Availability: all data source types.


dense_rank()

Sort rows according to column in order (either "asc" or "desc") and return the rows ranking. If several rows have the same value, they get the same ranking. The dense_rank function differs from rank in one respect: if two or more rows tie, there is no gap in the sequence of ranked values. For example, if two rows are ranked 1, the next rank is 2. If partition_byis specified, the ranking is computed for each value of this column.

Parameters:

  • column: column, formula or function
  • order: either "asc" or "desc"
  • partition_by[optional]: column, formula or function

Example:

1
dense_rank([MyColumn], "desc")
1
2
# compute the orders rankings, by customer
rank([order_date], "asc", [order_customer_id])

Availability: all data source types.


quantile()

Return the value from column at the quantile given by fraction.

Parameters:

  • column: column, formula or function
  • fraction: quantile fraction (e.g. 0.5 for the median)
  • interpolation: interpolation method to use when the desired quantile lies between two data points. Choices are: "linear" or "nearest"

Example:

1
2
# compute median value:
quantile([MyColumn], 0.5, "linear")

Availability: all data source types except MySQL.


Value functions

sum()

Compute the sum of column.

Parameters:

  • column: column, formula or function

Example:

1
sum([MyColumn])

sum_if()

Compute the sum of column for rows where condition is true.

Parameters:

  • column: column, formula or function
  • condition: boolean expression

Example:

1
sum_if([column], [MyOtherColumn] == 3)

avg(), avg_if()

Return the average. See sum and sum_if for more details.

min(), min_if()

Return the min. See sum and sum_if for more details.

max(), max_if()

Return the max. See sum and sum_if for more details.

count(), count_if()

Return the number of data rows. See sum and sum_if for more details.

dcount(), dcount_if()

Return the distinct count (number of different values). See sum and sum_if for more details.


last_value()

Get the last value of value_column when ordering by order_column.

Parameters:

  • value_column: column, formula or function
  • order_column: column, formula or function used to order the values of value_column

Example:

1
2
# get the most recent value of a time serie
last_value([value], [date])

last_value_if()

Get the last value of value_column for rows where condition is true when ordering by order_column.

Parameters:

  • value_column: column, formula or function
  • order_column: column, formula or function used to order the values of value_column
  • condition: boolean expression

Example:

1
2
# get the last value in the previous month
last_value_if([value], [date], [date] in month(-1))

Data-quality functions

is_defined()

Returns False if column is NULL or an empty string, otherwise returns True. Use not is_defined([myCol]) to test if a column is not NULL.

Parameters:

  • column: column, formula or function

Example:

1
is_defined([MyColumn])

is_empty()

Returns True if column is an empty string, False otherwise. Column must be of type String. Use not is_empty([myCol]) to test if a column is not empty.

Parameters:

  • column: column, formula or function

Example:

1
is_empty([MyColumn])

Misc

null()

Returns the NULL value.

This can be used when you need to dcount values of a column, but you want to exclude some rows. You can create a conditional formula that will return null() for the rows to exclude. And then, apply your dcount on this conditional formula. The null rows won't be counted in the dcount.

To check if a column is NULL or empty, it's simpler to use the functions is_defined or is_empty.

Example:

Using a table with a list of purchases, you can count the distinct number of customers who has bought items with a price higher than 100 doing a dcount on the below formula. This will also work if you group the purchases by any dimensions (e.g. grouping by the category of products in a table widget).

1
2
IF [price]>100 THEN [client_id]
ELSE null()