Preparing Your Data

This article covers how to format and save your data before uploading to Emcien.

Choosing a File Format

To use Emcien, you'll need to save your data as a CSV file using one of the following Emcien formats:

The wide format is the most commonly used and is generally more universal. The wide format consists mostly of user-defined columns with no required columns.

The receipt format is usually reserved for transaction data containing unique transaction IDs and many combinations. The receipt format does not support user-defined columns and has several required columns.

Not sure which format is best for you? The Emcien team can help prepare your data. Contact us at [email protected].

Compression

Emcien supports CSV files compressed using GZip (.gz) compression. Compressed CSV files must still use either the wide or receipt format.
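
As an illustration, a CSV file could be compressed with Python's standard gzip module before upload. This is a minimal sketch; the function name gzip_csv and the example file name are hypothetical and not part of Emcien.

```python
import gzip
import shutil


def gzip_csv(csv_path: str) -> str:
    """Compress csv_path with GZip and return the path of the new .gz file."""
    gz_path = csv_path + ".gz"  # e.g. members.wide.csv -> members.wide.csv.gz
    # Stream the bytes so a large file is never held in memory all at once.
    with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path
```

Calling gzip_csv("members.wide.csv") would write members.wide.csv.gz next to the original file and leave the original untouched.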

Character encoding

Emcien supports the following character encodings:

Line endings

Patterns uses either Unix-style line endings (\n) or DOS line endings (\r\n).
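
If a file comes from a tool that writes other line endings, it could be normalized before upload. The sketch below is only an illustration: the function name is hypothetical, and treating a bare carriage return as a line ending to convert is an assumption, not something this article states.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Return data with CRLF and bare CR line endings converted to LF."""
    # Replace CRLF first so the second pass only sees stray CR characters.
    # Assumption: a bare CR is meant as a line ending and should become LF.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Note that this also rewrites line breaks that appear inside quoted field values, so it suits files whose fields contain no embedded line breaks.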

Wide Format

The wide format is the more universal and most commonly used Emcien format. You can use the wide format for multi-dimensional data, such as demographics or configurable products. In the wide format, each transaction is identified by a single row of data.

An example of the wide format is displayed below.

This example contains four transactions, each represented by a single row of data. In this data, each transaction represents a client and their associated demographics data.

Unlike the receipt format, the wide format has no required columns and supports user-defined columns.

Wide Format Details

Headers are required for each column in the wide format. Your file must contain at least two user-defined columns with a maximum of 1,000 total columns.

While the data may contain any UTF-8 characters, the headers must use only lower ASCII characters.
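
The header rules for the wide format can be checked before upload with a short script. The following is a sketch under stated assumptions: the function and constant names are hypothetical, the optional header names are taken from the optional columns table in this section, and date headers carrying an ISO or AUS suffix (see the Date and Time section) are not recognized and would be counted as user-defined.

```python
# Optional wide-format headers named in this article; every other header
# is treated as user-defined. Suffixed date headers are not handled here.
OPTIONAL_WIDE_HEADERS = {
    "date", "transaction_date", "time", "transaction_time",
    "transaction_time_unix", "volume",
}
MAX_COLUMNS = 1000
MIN_USER_DEFINED = 2


def check_wide_headers(headers: list[str]) -> list[str]:
    """Return a list of problems found in a wide-format header row."""
    problems = []
    if len(headers) > MAX_COLUMNS:
        problems.append(f"{len(headers)} columns exceeds the {MAX_COLUMNS} limit")
    for header in headers:
        if not header:
            problems.append("empty header")
        elif not header.isascii():
            problems.append(f"header {header!r} contains non-ASCII characters")
    user_defined = [h for h in headers if h and h not in OPTIONAL_WIDE_HEADERS]
    if len(user_defined) < MIN_USER_DEFINED:
        problems.append("fewer than two user-defined columns")
    if len(set(user_defined)) != len(user_defined):
        problems.append("user-defined headers are not unique")
    return problems
```

An empty list means no problems were found by these particular checks; it does not guarantee that Emcien will accept the file.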

Most columns for wide format data are user-defined. Emcien also offers optional columns, which are listed in the table below. You can use these optional columns to enable certain Emcien features, such as date and time trends. If used, your optional column header must exactly match the column header listed in the table below.

Column Header Required/Optional Format Description
date OR transaction_date OR time OR transaction_time OR transaction_time_unix Optional Date/Time

The date and time when the transaction occurred. Field values must be formatted using ISO or Unix standards. Refer to the Date and Time section for more information.

Field values can contain date only, or date and time. The column heading indicates the date and time format used.

Important: If used, this column must be first.

volume Optional Integer

The volume applied to all attributes of the transaction. This column is used to calculate the strength of connections in your data.

Field values must be non-negative values consisting only of digits.

Important: If you use this column and not the transaction_date, transaction_time, or transaction_time_unix columns, this column must be first.

If you use this column and the transaction_date, transaction_time, or transaction_time_unix column, this column must be second.

<user-defined> Optional String

This column is defined by you. Each user-defined column must use a unique column header.

Important: Your data file must contain at least two user-defined columns.

Field values must be wrapped in double quotes (") and cannot be longer than 32 characters.

All double quotes inside any string should be escaped with another double quote. For example:

"String 3" Roll" should be changed to "String 3"" Roll"
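
The quoting rule for string fields can be sketched in a few lines of Python. The helper name quote_field is hypothetical, and applying the 32-character limit to the value before quoting is an assumption.

```python
MAX_FIELD_LENGTH = 32


def quote_field(value: str) -> str:
    """Wrap value in double quotes, doubling any double quotes it contains."""
    # Assumption: the 32-character limit applies to the unquoted value.
    if len(value) > MAX_FIELD_LENGTH:
        raise ValueError(f"{value!r} is longer than {MAX_FIELD_LENGTH} characters")
    return '"' + value.replace('"', '""') + '"'


print(quote_field('String 3" Roll'))  # prints "String 3"" Roll"
```

Python's csv.writer produces the same quoting when created with quoting=csv.QUOTE_ALL, because it doubles embedded double quotes by default.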

Receipt Format

The receipt format is typically used for transaction data containing many variables and combinations. In the receipt format, categories and their corresponding items are represented vertically and grouped based on a unique transaction ID.

An example of the receipt format is displayed below.

In the above example, two transactions are shown. Each transaction is identified by a unique transaction_id: 10001 and 10002. Transaction 10001 is associated with the purchase of 3 items. Transaction 10002 is associated with the purchase of 2 items.

Unlike the wide format, the receipt format does not support user-defined columns.

Using the Receipt Format for Non-Transaction Data

In addition to transaction data, you can use the receipt format for any data containing unique identifiers and repeating vertical rows.

An example of network data is displayed below.

In the above example, each network event is tracked by a unique identifier. This identifier is used to associate the event with different information in the item_id column.

Receipt Format Details

Headers are required for each column in the receipt format. Two columns, transaction_id and item_id, are required and must be included in your CSV file.

While the data may contain any UTF-8 characters, the headers must use only lower ASCII characters.

The receipt format does not support user-defined columns. Instead, you can use optional columns, which are listed in the table below. You can use these optional columns to enable certain Emcien features, such as date and time trends. If used, your optional column header must exactly match the column header listed in the table below.

Column Header Required/Optional Format Description
transaction_id Required String

A unique identifier for the transaction. This identifier can appear in multiple rows but should be different for each transaction.

Rows sharing the same transaction_id must be adjacent.

Field values must be wrapped in double quotes (") and cannot be longer than 32 characters.

All double quotes inside any string should be escaped with another double quote. For example:

"String 3" Roll" should be changed to "String 3"" Roll"

item_id Required String

A unique identifier for the item. Typically, this column is used for SKU numbers and product numbers.

Field values must be wrapped in double quotes (") and cannot be longer than 32 characters.

All double quotes inside any string should be escaped with another double quote. For example:

"String 3" Roll" should be changed to "String 3"" Roll"

transaction_date OR transaction_time OR transaction_time_unix Optional Date/Time

The date and time when the transaction occurred. Field values must be formatted using ISO or Unix standards. Refer to the Date and Time section for more information.

Field values can contain date only, or date and time. The column heading indicates the date and time format used.

item_category Optional String

A text identifier used to categorize the item. Typically, this column is used for the product category.

Field values must be wrapped in double quotes (") and cannot be longer than 32 characters.

All double quotes inside any string should be escaped with another double quote. For example:

"String 3" Roll" should be changed to "String 3"" Roll"

item_name Optional String

A descriptive label of the item used in the item_id column.

For example, an item_id of RTJ00345 has the associated item_name of Green Striped Lawn Chair.

Field values must be wrapped in double quotes (") and cannot be longer than 32 characters.

All double quotes inside any string should be escaped with another double quote. For example:

"String 3" Roll" should be changed to "String 3"" Roll"

item_volume Optional Float

The volume of the item associated with the transaction. Typically, this volume is the quantity of items purchased.

Field values must be in the standard float format: a string of digits with a decimal point optionally among them. Precision beyond 4 decimal places is not used.

If the field value is NULL or blank, a default value of 1 is used.

item_price Optional Price

The individual price of the item associated with the transaction. Typically, this price is the manufacturer's suggested retail price (MSRP) or other undiscounted price.

Field values must be in dollars, and may begin with an optional $ sign. NULL or blank values are supported. Negative prices are not supported.

transaction_volume Optional Integer

The total volume of the entire transaction.

Note: This column should not be confused with the item_volume column, which is used for the volume associated with specific items.

Field values must be non-negative values consisting only of digits.

If the field value is NULL or blank, a default value of 1 is used.

transaction_price Optional Price

The total price of the entire transaction.

Note: This column should not be confused with the item_price column, which is used for the price associated with specific items.

Field values must be in dollars, and may begin with an optional $ sign. NULL or blank values are supported. Negative prices are not supported.
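
Two of the receipt format rules, the required columns and the adjacency of rows sharing a transaction_id, lend themselves to a quick check before upload. The following is a minimal sketch: the function name and the sample item IDs are hypothetical, and the check reads the whole file as text using Python's csv module.

```python
import csv
import io

REQUIRED_RECEIPT_HEADERS = ("transaction_id", "item_id")


def check_receipt_rows(csv_text: str) -> list[str]:
    """Return problems with required columns and transaction_id adjacency."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    problems = [
        f"missing required column {name}"
        for name in REQUIRED_RECEIPT_HEADERS
        if name not in headers
    ]
    if problems:
        return problems
    finished = set()  # transaction IDs whose block of rows has already ended
    previous = None
    for row_number, row in enumerate(reader, start=1):
        current = row["transaction_id"]
        if current != previous:
            if current in finished:
                problems.append(
                    f"data row {row_number}: transaction_id {current} "
                    "is not adjacent to its earlier rows"
                )
            if previous is not None:
                finished.add(previous)
            previous = current
    return problems


# Hypothetical sample data: transaction 10001 reappears after 10002.
sample = (
    'transaction_id,item_id\n'
    '"10001","A1"\n'
    '"10002","B7"\n'
    '"10001","C3"\n'
)
print(check_receipt_rows(sample))
```

For the sample above, the function reports one problem for the third data row, because transaction 10001 resumes after transaction 10002 has started.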

Date and Time

When using the date or transaction_date column, dates should be recorded in the MM-DD-YYYY or MM/DD/YYYY format. You can zero pad month and day values.

To use the ISO date format (Y-M-D or Y/M/D), add _ISO or -ISO to the end of the date or transaction_date column header.

To use the AUS date format (D-M-Y or D/M/Y), add _AUS or -AUS to the end of the date or transaction_date column header.

When using the time, transaction_time, or transaction_time_unix column, values should be recorded using the ISO-8601 or UNIX time formats.

The ISO-8601 format is YYYY-MM-DDTHH:MM:SSZ
The T and the Z are literal characters. Z stands for Zulu. Each of the other letters represents a digit.

The UNIX time format is the number of seconds since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970.
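
The formats described in this section can be produced with Python's standard datetime module. This is a sketch only; the function names are hypothetical, and the example assumes timestamps are available as timezone-aware datetime values.

```python
from datetime import datetime, timezone


def to_iso_8601(moment: datetime) -> str:
    """Format an aware datetime as YYYY-MM-DDTHH:MM:SSZ, converted to UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def to_unix_time(moment: datetime) -> int:
    """Return whole seconds since 00:00:00 UTC on 1 January 1970."""
    return int(moment.timestamp())


def to_default_date(moment: datetime) -> str:
    """Format the date part as zero-padded MM-DD-YYYY."""
    return moment.strftime("%m-%d-%Y")


moment = datetime(2021, 3, 4, 5, 6, 7, tzinfo=timezone.utc)
print(to_iso_8601(moment))      # prints 2021-03-04T05:06:07Z
print(to_default_date(moment))  # prints 03-04-2021
```

The ISO-8601 string would go in a transaction_time column, the integer from to_unix_time in a transaction_time_unix column, and the MM-DD-YYYY string in a date or transaction_date column.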

Naming Your Data File

Emcien uses a file naming convention to identify the Emcien format used. Files should be named using the following structure:

<filename>.<file type>.<extension>

filename Filenames should use only the following characters:
  • ASCII alphanumeric characters in upper (A-Z) or lower (a-z) case
  • Periods
  • Underscores
  • Hyphens
  • Parentheses

The following characters are not supported:

  • Whitespace
  • Backslash (\)
  • Single quote (')
  • Double quotes (")

file type receipt or wide

Important: If no file type is included, the default receipt format is used. This will result in loading errors for wide format files.

extension .csv or .csv.gz

File Name Examples

area2-sales.receipt.csv Uncompressed sales data from area 2 in the receipt format
members.us.wide.csv Uncompressed U.S. membership data in the wide format
area3-sales.csv.gz Compressed sales data from area 3 in the receipt format
clinical.all.wide.csv.gz Compressed clinical test data in the wide format
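
The naming convention can be checked with a short script before upload. The sketch below is an illustration, not Emcien's own logic: the function name is hypothetical, and it mirrors the rules in this article, including the receipt default when no file type is present.

```python
import re

# Characters this article allows in a file name: ASCII letters and digits,
# periods, underscores, hyphens, and parentheses.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9._()\-]+")


def parse_data_file_name(name: str) -> tuple[str, bool]:
    """Return (file_type, is_compressed) for a data file name.

    Raises ValueError when the name breaks the naming convention.
    """
    if not ALLOWED_NAME.fullmatch(name):
        raise ValueError(f"unsupported characters in {name!r}")
    if name.endswith(".csv.gz"):
        stem, compressed = name[: -len(".csv.gz")], True
    elif name.endswith(".csv"):
        stem, compressed = name[: -len(".csv")], False
    else:
        raise ValueError(f"{name!r} must end in .csv or .csv.gz")
    if stem.endswith(".wide"):
        return "wide", compressed
    # The receipt format is the default when no file type is present.
    return "receipt", compressed


print(parse_data_file_name("members.us.wide.csv"))  # prints ('wide', False)
print(parse_data_file_name("area3-sales.csv.gz"))   # prints ('receipt', True)
```

A wide format file named without the .wide segment parses as receipt here, which is the situation the Important note above warns about.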

Once you've prepared your data, you can upload it to Emcien for analysis. Check out the Loading Your Data article for more information. You can also email us at [email protected] for help getting started.