Pull from Amazon S3¶
Amazon Simple Storage Service (Amazon S3) stores customer data files of any size in any file format.
Amperity can pull data from Amazon S3. A common scenario: a file is output from a customer data source with a consistent datestamp pattern appended to a static file name, added to an Amazon S3 bucket, and then updated on a regular basis. Amazon S3 can be a source for any number of file types and formats. A courier can be configured to ingest multiple file types and formats as a fileset.
This topic describes the steps that are required to pull files in any supported format to Amperity from Amazon S3:
Get details¶
Amperity can be configured to pull data from Amazon S3. This may be done using cross-account role assumption (recommended) or by using IAM credentials.
Use cross-account roles¶
Amperity prefers to pull data from customer-managed cloud storage. This approach ensures that customers can:
Use cross-account role assumption to manage access to data
Directly manage the files that are made available
Modify access without requiring involvement by Amperity; access may be revoked at any time by either Amazon AWS account, after which data sharing ends immediately
Directly troubleshoot incomplete or missing files
Amperity recommends using cross-account role assumption to manage access to customer-managed cloud storage in Amazon S3. This allows managed security policies to control access to data.
After setting up cross-account role assumption, a list of files (by filename and file type), along with any sample files, must be made available to allow for feed creation. These files may be placed directly into the shared location after cross-account role assumption is configured.
Can I use an Amazon AWS Access Point?
Yes, but with the following limitations:
The direction of access is Amperity accessing files that are located in a customer-managed Amazon S3 bucket
A credential-free role-to-role access pattern is used
Traffic is not restricted to VPC-only
To send data to Amperity using cross-account role assumption
The following steps describe how to configure Amperity to use cross-account role assumption to pull data from (or push data to) a customer-managed Amazon S3 bucket.
Important
These steps require configuration changes to customer-managed Amazon AWS accounts and must be done by users with administrative access.
Select the iam-role-to-role credential type.
The Amazon Resource Name (ARN) for Amperity is automatically provided.
Enter the ARN for the IAM role that is used to access the customer-managed Amazon S3 bucket.
The external ID that is used to access the customer-managed Amazon S3 bucket is automatically provided.
Enter the name of the customer-managed Amazon S3 bucket.
Review the sample policy, and then add it as a trust policy to the IAM role that is used to manage access to the customer-managed Amazon S3 bucket. This is unique to your tenant, but will be similar to:
{ "Statement": [ { "Sid": "AllowAmperityAccess", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::485692797166:role/prod/amperity-prod-plugin" }, "Action": "sts:AssumeRole", "Condition": { "StringEquals": { "sts:ExternalId": "fad53f02ec022ddac9c99d4b2f39fc39ad451ed46b841648e1f28c94e1b36a0a" } } } ] }
Use credentials¶
Amazon S3 requires the following configuration details:
The IAM access key.
The IAM secret key.
The Amazon Resource Name (ARN) for a role with cross-account role assumption. (This is the recommended way to define access to customer-managed Amazon S3 buckets.)
The name of the Amazon S3 bucket.
A list of objects (by filename and file type) in the Amazon S3 bucket to be pulled to Amperity.
A sample for each file to simplify feed creation.
Use the iam-credential credential type for this configuration.
Optional workflows¶
Some workflows stage data in Amazon S3 so that it can be pulled to Amperity, such as data that originates in Amazon Redshift or Kinesis Data Firehose.
Amazon Redshift¶
Amazon Redshift is a data warehouse within Amazon Web Services that can handle massive sets of column-oriented data.
You may configure Amazon Redshift to unload data to an Amazon S3 bucket that your organization owns and manages by using the UNLOAD command. For example, to output a single CSV file, use a command similar to:
unload ('select * from table')
to 's3://bucket/from/which/Amperity/pulls/unload/table_'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
csv
parallel off;
and then configure Amperity to pull that CSV file from the customer-managed Amazon S3 bucket.
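Because PARALLEL OFF writes a single file whose name begins with the prefix in the TO clause (for example, table_000; the exact name and extension depend on the UNLOAD options), the courier's entities list can reference that object directly. The following entry is a sketch only, assuming a hypothetical path and that the UNLOAD statement also includes Redshift's HEADER option so the file contains one header row:

{
  "object/type": "file",
  "object/file-pattern": "'/unload/table_000'",
  "object/land-as": {
    "file/header-rows": 1,
    "file/tag": "redshift-unload",
    "file/content-type": "text/csv"
  }
}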
Kinesis Data Firehose¶
Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to Amazon S3.
You may configure any supported data producer to use Kinesis Data Firehose services to automatically send real-time streaming data to Amazon S3, and then make that data available to Amperity. Amperity can be configured to pull the real-time data (in batches) from any Amazon S3 location. There are two options:
Send this data to the Amazon S3 bucket that is part of the Amperity tenant (if that tenant is running on Amazon AWS). This option requires coordination between an administrator for the customer’s Amazon AWS account and a representative from Amperity.
Send this data to a customer-owned Amazon S3 bucket, and then configure Amperity to pull data from that bucket.
To configure Kinesis Data Firehose to send data to the Amperity S3 bucket
Create a cross-account IAM role in the customer’s Amazon AWS account.
This role is required to grant Kinesis Data Firehose access to the Amazon S3 bucket that is part of the Amperity tenant. This role must have s3:PutObjectAcl configured as part of the list of allowed Amazon S3 actions. (A sample permissions policy is shown after this procedure.)
Configure the bucket policy in the Amperity Amazon S3 bucket to allow the IAM role access to the Amazon S3 bucket.
Create a Kinesis Data Firehose delivery stream in the customer’s cloud infrastructure that uses this IAM role.
Configure the delivery stream to send data to the Amperity Amazon S3 bucket.
Configure applications to send data to the delivery stream.
Use Amperity PGP encryption, if required. This may be done with a policy on the Amperity Amazon S3 bucket that is configured by Amperity.
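The following permissions policy is a minimal sketch of what the cross-account IAM role used by the delivery stream might allow, assuming a hypothetical Amperity tenant bucket named amperity-tenant-bucket; the actual bucket name, and any additional actions that are required, are confirmed during coordination with Amperity.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowFirehoseDeliveryToAmperityBucket",
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::amperity-tenant-bucket",
        "arn:aws:s3:::amperity-tenant-bucket/*"
      ]
    }
  ]
}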
Record separators¶
Data records are delivered to Amazon S3 as Amazon S3 objects. If you need to ensure that individual records are available to Amperity in Amazon S3, configure the delivery stream from Kinesis Data Firehose to add a record separator at the end of each data record.
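For example, assuming JSON records (a hypothetical illustration), an object delivered without a separator arrives as one concatenated run of records, while appending a newline to each record keeps the records individually parseable:

Without a record separator:
{"id":1,"email":"a@example.com"}{"id":2,"email":"b@example.com"}

With a newline record separator:
{"id":1,"email":"a@example.com"}
{"id":2,"email":"b@example.com"}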
Filename patterns¶
Recommended filename patterns include:
Using the YYYY/MM/DD/HH format when writing objects to Amazon S3 from Kinesis Data Firehose. This prefix creates a logical hierarchy in the bucket by year, month, day, and hour.
Using the default Amazon S3 object naming pattern, which appends a random string to the end of the object’s filename so that each object name is unique.
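For example, with the YYYY/MM/DD/HH prefix and the default object naming pattern, objects written by a hypothetical delivery stream named my-delivery-stream would land under keys similar to:

2023/05/01/14/my-delivery-stream-1-2023-05-01-14-03-27-<random-string>
2023/05/01/15/my-delivery-stream-1-2023-05-01-15-00-12-<random-string>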
Delivery frequency¶
The Amazon S3 buffer size and buffer interval determine the frequency of delivery. Incoming records are concatenated based on the buffering configuration of the delivery stream; for example, with buffer hints of 5 MB and 300 seconds, an object is written as soon as either 5 MB of data accumulates or 300 seconds elapse, whichever comes first.
Warning
If data fails to deliver to Amazon S3, Kinesis Data Firehose will retry for up to 24 hours. If data fails to deliver within 24 hours, the data will be lost, unless it is successfully delivered to a backup location. (You can re-send data if it’s backed up.)
Delivery failures¶
Kinesis Data Firehose retries failed deliveries for up to 24 hours, which is also its maximum data storage time. Data will be lost if delivery does not succeed within 24 hours. Consider using a secondary Amazon S3 bucket as a backup for data that cannot be lost.
Note
Delivery retries may introduce duplicates.
Add courier¶
A courier brings data from an external system to Amperity. A courier relies on a feed to know which fileset to bring to Amperity for processing.
Tip
You can run a courier without load operations. Use this approach to get sample files to Amperity for use during feed creation, as a feed requires knowing the schema of a file before you can apply semantic tagging and other feed configuration settings.
Example entities list
An entities list defines the list of files to be pulled to Amperity, along with any file-specific details (such as file name, file type, whether header rows are required, and so on).
For example:
[
{
"object/type": "file",
"object/file-pattern": "'/path/to/CustomerRecords.csv'",
"object/land-as": {
"file/header-rows": 1,
"file/tag": "customer-records-2019",
"file/content-type": "text/csv"
}
},
{
"object/type": "file",
"object/file-pattern": "'/path/to/TransactionRecords.csv'",
"object/land-as": {
"file/header-rows": 1,
"file/tag": "transaction-records-2019",
"file/content-type": "text/csv"
}
}
]
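A file that arrives with a datestamp appended to a static file name (as described at the start of this topic) can be matched with a date-aware file pattern. The entry below is a sketch only; it assumes that date components (yyyy, MM, dd) outside the quoted literal segments are resolved against the courier’s run date, so confirm the exact pattern syntax for your tenant:

{
  "object/type": "file",
  "object/file-pattern": "'/path/to/CustomerRecords_'yyyy-MM-dd'.csv'",
  "object/land-as": {
    "file/header-rows": 1,
    "file/tag": "customer-records",
    "file/content-type": "text/csv"
  }
}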
To add a courier for Amazon S3
From the Sources tab, click Add Courier. The Add Source page opens.
Find, and then click the icon for Amazon S3. The Add Courier page opens.
Enter the name of the courier. For example: “Amazon S3”.
From the Credential Type drop-down select a credential type.
Choose iam-role-to-role to use cross-account roles (recommended).
Choose iam-credential to use standard IAM credentials.
Tip
Review the Use cross-account roles and Use credentials sections for more information about each option.
Under Amazon S3 Settings, add the name of the Amazon S3 bucket and prefix.
Under Amazon S3 Settings, configure the Entities List to define each file to be pulled to Amperity.
Under Amazon S3 Settings, set the load operations to a string that is obviously incorrect, such as df-xxxxxx. (You may also set the load operations to empty: {}.)
Tip
If you use an obviously incorrect string, the load operation settings will be saved in the courier configuration. After the schema for the feed is defined and the feed is activated, you can edit the courier and replace the feed ID with the correct identifier.
Caution
If load operations are not set to {}, the validation test for the courier configuration settings will fail.
Click Save.
Get sample files¶
Every Amazon S3 file that is pulled to Amperity must be configured as a feed. Before you can configure each feed you need to know the schema of that file. Run the courier without load operations to bring sample files from Amazon S3 to Amperity, and then use each of those files to configure a feed.
To get sample files
From the Sources tab, open the menu for a courier configured for Amazon S3 with empty load operations, and then select Run. The Run Courier dialog box opens.
Select Load data from a specific day, and then select today’s date.
Click Run.
Important
The courier run will fail, but this process will successfully return a list of files from Amazon S3.
These files will be available for selection as an existing source from the Add Feed dialog box.
Wait for the notification for this courier run to return an error similar to:
Error running load-operations task Cannot find required feeds: "df-xxxxxx"
Add feeds¶
A feed defines how data should be loaded into a domain table, including specifying which columns are required and which columns should be associated with a semantic tag that indicates the column contains customer profile (PII) or transaction data.
Note
A feed must be added for each file that is pulled from Amazon S3, including all files that contain customer records and interaction records, along with any other files that will be used to support downstream workflows.
To add a feed
From the Sources tab, click Add Feed. This opens the Add Feed dialog box.
Under Data Source, select Create new source, and then enter “Amazon S3”.
Enter the name of the feed in Feed Name. For example: “CustomerRecords”.
Tip
The name of the domain table will be “<data-source-name>:<feed-name>”. For example: “Amazon S3:CustomerRecords”.
Under Sample File, select Select existing file, and then choose from the list of files. For example: “filename_YYYY-MM-DD.csv”.
Tip
The list of files that is available from this drop-down menu is sorted from newest to oldest.
Select Load sample file on feed activation.
Click Continue. This opens the Feed Editor page.
Select the primary key.
Apply semantic tags to customer records and interaction records, as appropriate.
Under Last updated field, specify which field best describes when records in the table were last updated.
Tip
Choose Generate an “updated” field to have Amperity generate this field. This is the recommended option unless there is a field already in the table that reliably provides this data.
For feeds with customer records (PII data), select Make available to Stitch.
Click Activate. Wait for the feed to finish loading data to the domain table, and then review the sample data for that domain table from the Data Explorer.
Add load operations¶
After the feeds are activated and domain tables are available, add the load operations to the courier used for Amazon S3.
Example load operations
Load operations must specify each file that will be pulled to Amperity from Amazon S3.
For example:
{
"CUSTOMER-RECORDS-FEED-ID": [
{
"type": "truncate"
},
{
"type": "load",
"file": "customer-records"
}
],
"TRANSACTION-RECORDS-FEED-ID": [
{
"type": "load",
"file": "transaction-records"
}
]
}
To add load operations
From the Sources tab, open the menu for the courier that was configured for Amazon S3, and then select Edit. The Edit Courier dialog box opens.
Edit the load operations for each of the feeds that were configured for Amazon S3 so they have the correct feed ID.
Click Save.
Run courier manually¶
Run the courier again. This time, because the load operations are present and the feeds are configured, the courier will pull data from Amazon S3.
To run the courier manually
From the Sources tab, open the menu for the courier with updated load operations that is configured for Amazon S3, and then select Run. The Run Courier dialog box opens.
Select the load option, either for a specific time period or all available data. Actual data will be loaded to a domain table because the feed is configured.
Click Run.
This time the notification will return a message similar to:
Completed in 5 minutes 12 seconds
Add to courier group¶
A courier group is a list of one (or more) couriers that are run as a group, either ad hoc or as part of an automated schedule. A courier group can be configured to act as a constraint on downstream workflows.
To add the courier to a courier group
From the Sources tab, click Add Courier Group. This opens the Create Courier Group dialog box.
Enter the name of the courier group. For example: “Amazon S3”.
Add a cron string to the Schedule field to define a schedule for the courier group.
A schedule defines the frequency at which a courier group runs. All couriers in the same courier group run as a unit and all tasks must complete before a downstream process can be started. The schedule is defined using cron.
Cron syntax specifies the fixed time, date, or interval at which cron will run. Each line represents a job, and is defined like this:
┌───────── minute (0 - 59)
│ ┌─────────── hour (0 - 23)
│ │ ┌───────────── day of the month (1 - 31)
│ │ │ ┌────────────── month (1 - 12)
│ │ │ │ ┌─────────────── day of the week (0 - 6) (Sunday to Saturday)
│ │ │ │ │
│ │ │ │ │
│ │ │ │ │
* * * * * command to execute
For example, 30 8 * * * represents “run at 8:30 AM every day” and 30 8 * * 0 represents “run at 8:30 AM every Sunday”. Amperity validates your cron syntax and shows you the results. You may also use crontab guru to validate cron syntax.
Set Status to Enabled.
Specify a time zone.
A courier group schedule is associated with a time zone. The time zone determines the point at which a courier group’s scheduled start time begins. The time zone should be aligned with the time zone of the system from which the data is being pulled.
Note
The time zone that is chosen for a courier group schedule should consider every downstream business process that requires the data, as well as the time zone(s) in which the consumers of that data operate.
Set SLA? to False. (You can change this later after you have verified the end-to-end workflows.)
Add at least one courier to the courier group. Select the name of the courier from the Courier drop-down. Click + Add Courier to add more couriers.
Click Add a courier group constraint, and then select a courier group from the drop-down list.
A wait time is a constraint placed on a courier group that defines an extended time window for data to be made available at the source location.
A courier group typically runs on an automated schedule that expects customer data to be available at the source location within a defined time window. However, in some cases, the customer data may be delayed and isn’t made available within that time window.
For each courier group constraint, apply any offsets.
An offset is a constraint placed on a courier group that defines a range of time that is older than the scheduled time, within which a courier group will accept customer data as valid for the current job. Offset times are in UTC.
A courier group offset is typically set to be 24 hours. For example, it’s possible for customer data to be generated with a correct file name and datestamp appended to it, but for that datestamp to represent the previous day because of the customer’s own workflow. An offset ensures that the data at the source location is recognized by the courier as the correct data source.
Warning
An offset affects couriers in a courier group whether or not they run on a schedule.
Click Save.