Pull from Amazon S3¶
Amazon Simple Storage Service (Amazon S3) stores customer data files of any size in any file format.
Amperity can pull data from Amazon S3. Amazon S3 is an efficient cloud storage option that supports a wide variety of file types, file formats, and file sizes and is the most frequently used data source across all Amperity tenants.
Common scenarios include:
One-time uploads of files
Regular uploads of files from upstream systems that cannot connect directly to Amperity
Apache Parquet uploads made available as output from an upstream cloud database into an Amazon S3 bucket
Note
The legacy courier workflow is required for certain use cases: XML and files that contain complex/nested JSON, ingest queries, and couriers that support many feeds.
This topic describes the steps that are required to pull files in any supported format to Amperity from Amazon S3:
Get details¶
The Amazon S3 data source requires the following configuration details:
The name of the S3 bucket from which data will be pulled to Amperity.
For cross-account role assumption you will need the value for the Target Role ARN, which enables Amperity to access the customer-managed Amazon S3 bucket.
Note
The values for the Amperity Role ARN and the External ID fields are provided automatically.
Review the following sample policy, and then add a similar policy to the customer-managed Amazon S3 bucket that allows Amperity access to the bucket. Add this policy as a trust policy to the IAM role that is used to manage access to the customer-managed Amazon S3 bucket.
The policy for the customer-managed Amazon S3 bucket is unique, but will be similar to:
{
  "Statement": [
    {
      "Sid": "AllowAmperityAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::account:role/resource"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "01234567890123456789"
        }
      }
    }
  ]
}
The value for the role ARN is similar to: arn:aws:iam::123456789012:role/prod/amperity-plugin
An external ID is an alphanumeric string between 2 and 1224 characters long (without spaces) and may include the following symbols: plus (+), equal (=), comma (,), period (.), at (@), colon (:), forward slash (/), and hyphen (-).
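The external ID rules and the shape of the sample policy can be checked before the policy is attached to the IAM role. The following is a minimal sketch, not part of Amperity or AWS tooling: the helper names `is_valid_external_id` and `build_trust_policy` are hypothetical, and the character set and length limits are taken only from the description above.

```python
import json
import re

# Character set and length limits follow the external ID description above.
EXTERNAL_ID_PATTERN = re.compile(r"[A-Za-z0-9+=,.@:/-]{2,1224}")


def is_valid_external_id(external_id: str) -> bool:
    """Return True when the string matches the external ID rules described above."""
    return EXTERNAL_ID_PATTERN.fullmatch(external_id) is not None


def build_trust_policy(amperity_role_arn: str, external_id: str) -> dict:
    """Build a dict shaped like the sample policy (hypothetical helper)."""
    if not is_valid_external_id(external_id):
        raise ValueError("external ID does not match the documented format")
    return {
        "Statement": [
            {
                "Sid": "AllowAmperityAccess",
                "Effect": "Allow",
                "Principal": {"AWS": amperity_role_arn},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder values copied from the sample policy; replace them with the
    # Amperity Role ARN and External ID shown in the credential dialog box.
    policy = build_trust_policy(
        "arn:aws:iam::account:role/resource",
        "01234567890123456789",
    )
    print(json.dumps(policy, indent=2))
```

Generating the policy from the two values shown in Amperity avoids hand-editing the JSON, which is where mismatched external IDs tend to creep in.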
Filedrop recommendations¶
You may reference the following sections while configuring this data source:
Using credentials that allow Amperity to access, and then read data from this location
Adding an optional RSA key for public key credentials
Ensuring files are provided in a supported file format
Ensuring files are provided with the correct date format
Supporting the desired file compression and/or archive method
Encrypting files before they are added to the location using PGP encryption; an encryption key must be configured so that files can be decrypted by Amperity prior to loading them
Tip
Use SnapPass to securely share your organization’s credentials and encryption keys with your Amperity representative.
Configure cross-account roles¶
Amperity prefers to pull data from and send data to customer-managed cloud storage.
Amperity requires using cross-account role assumption to manage access to Amazon S3 to ensure that customer-managed security policies control access to data.
This approach ensures that customers can:
Directly manage the IAM policies that control access to data
Directly manage the files that are available within the Amazon S3 bucket
Modify access without requiring involvement by Amperity; access may be revoked at any time by either Amazon AWS account, after which data sharing ends immediately
Directly troubleshoot incomplete or missing files
Note
After setting up cross-account role assumption, a list of files (by filename and file type), along with any sample files, must be made available to allow for feed creation. These files may be placed directly into the shared location after cross-account role assumption is configured.
Can I use an Amazon AWS Access Point?
Yes, but with the following limitations:
The direction of access is Amperity accessing files that are located in a customer-managed Amazon S3 bucket
A credential-free role-to-role access pattern is used
Traffic is not restricted to VPC-only
To configure an S3 bucket for cross-account role assumption
The following steps describe how to configure Amperity to use cross-account role assumption to pull data from (or push data to) a customer-managed Amazon S3 bucket.
Important
These steps require configuration changes to customer-managed Amazon AWS accounts and must be done by users with administrative access.
Open the Sources tab to configure credentials for Amazon S3. Click the Add courier button to open the Add courier dialog box. Do one of the following to select Amazon S3:
From the Credentials dialog box, enter a name for the credential, select the iam-role-to-role credential type, and then select “Create new credential”.
Next configure the settings that are specific to cross-account role assumption. The values for the Amperity Role ARN and External ID fields – the Amazon Resource Name (ARN) for your Amperity tenant and its external ID – are provided automatically. You must provide the values for the Target Role ARN and S3 Bucket Name fields. Enter the target role ARN for the IAM role that Amperity will use to access the customer-managed Amazon S3 bucket, and then enter the name of the Amazon S3 bucket.
Review the following sample policy, and then add a similar policy to the customer-managed Amazon S3 bucket that allows Amperity access to the bucket. Add this policy as a trust policy to the IAM role that is used to manage access to the customer-managed Amazon S3 bucket.
The policy for the customer-managed Amazon S3 bucket is unique, but will be similar to:
{
  "Statement": [
    {
      "Sid": "AllowAmperityAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::account:role/resource"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "01234567890123456789"
        }
      }
    }
  ]
}
The value for the role ARN is similar to: arn:aws:iam::123456789012:role/prod/amperity-plugin
An external ID is an alphanumeric string between 2 and 1224 characters long (without spaces) and may include the following symbols: plus (+), equal (=), comma (,), period (.), at (@), colon (:), forward slash (/), and hyphen (-).
Click Continue to test the configuration (and validate the connection) to the customer-managed Amazon S3 bucket, after which you will be able to continue the steps for adding a courier.
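A connection test can fail simply because the Target Role ARN was pasted incorrectly, for example when the `account` and `resource` placeholders from the sample policy were left in place. The sketch below shows one way to sanity-check the value before entering it. It is an illustration only: `parse_role_arn` and `RoleArn` are hypothetical names, and the pattern assumes the common IAM role ARN layout (empty region field, 12-digit account ID) shown in the example ARN above.

```python
import re
from typing import NamedTuple


class RoleArn(NamedTuple):
    account_id: str
    role_path: str  # path plus role name, such as "prod/amperity-plugin"


# IAM role ARNs leave the region field empty and carry a 12-digit account ID.
ROLE_ARN_PATTERN = re.compile(r"arn:(aws|aws-cn|aws-us-gov):iam::(\d{12}):role/(.+)")


def parse_role_arn(arn: str) -> RoleArn:
    """Split an IAM role ARN into its account ID and role path, or raise ValueError."""
    match = ROLE_ARN_PATTERN.fullmatch(arn.strip())
    if match is None:
        raise ValueError(f"not an IAM role ARN: {arn!r}")
    return RoleArn(account_id=match.group(2), role_path=match.group(3))


if __name__ == "__main__":
    print(parse_role_arn("arn:aws:iam::123456789012:role/prod/amperity-plugin"))
```

A check like this only confirms the format of the value; whether the role exists and trusts the Amperity role is still verified by the connection test in Amperity.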
Add data source and feed¶
Add a data source that pulls data from an Amazon S3 bucket for each file that you want to pull to Amperity.
Browse the Amazon S3 bucket to select a file, and then review the settings for that file. Define the feed schema, and then activate the feed. Run the courier manually, and then review the data that is added to the domain table that is associated with the feed.
To add a data source for an Amazon S3 bucket
Open the Sources page to configure Amazon S3. Click the Add courier button to open the Add courier dialog box. Select Amazon S3. Do one of the following:
Credentials allow Amperity to connect to Amazon S3 and must exist before a courier can be configured to pull data from Amazon S3. Select an existing credential from the Credential drop-down, and then click Continue.
Tip
A courier that has credentials that are configured correctly will show a “Connection successful” status.
Select the file that will be pulled to Amperity, either directly (by going into the Amazon S3 bucket and selecting it) or by providing a filename pattern.
Click Browse to open the File browser. Select the file that will be pulled to Amperity, and then click Accept.
Use a filename pattern to define files that will be loaded on a recurring basis, but will have small changes to the filename over time, such as having a datestamp appended to the filename.
Note
For a new feed, this file is also used as the sample file that is used to define the schema. For an existing feed, this file must match the schema that has already been defined.
Use the PGP credential setting to specify the credentials to use for an encrypted file.
Review the file. The contents of the file may be previewed as a table and in a raw format. Switch between these views using the Table and Raw buttons, and then click Refresh to view the file in that format.
Note
PGP encrypted files can be previewed. Apache Parquet PGP encrypted files must be less than 500 MB to be previewed.
Amperity infers formatting details, and then adds these details to a series of settings located along the left side of the file view. File settings include:
Review the file, and then update these settings, if necessary.
Note
Amperity supports the following file types: Apache Avro, Apache Parquet, CSV, DSV, JSON, NDJSON, PSV, TSV, and XML. Refer to those reference pages for details about each of the individual file formats. Files that contain nested JSON (or “complex JSON”) or XML may require using the legacy courier configuration.
A feed defines the schema for a file that is loaded to Amperity, after which that data is loaded into a domain table and ready for use with workflows within Amperity. There are two options for feeds: use a new feed or use an existing feed.
Use a new feed
To use a new feed, choose the Create new feed option, select an existing source from the Source drop-down or type the name of a new data source, and then enter the name of the feed. After you choose a load type and save the courier configuration, you will configure the feed using the data within the sample file.
Use an existing feed
To use an existing feed, choose the Use existing feed option to use an existing schema. This option requires this file to match all of the feed-specific settings, such as incoming field names, field types, and primary keys. The data within the file may be different.
Load types
The load type defines how data in the file will be loaded to the associated domain table. Use the Truncate and load option to delete all rows in the associated domain table prior to loading data. Use the Load option to load data from the selected file to the associated domain table.
Note
When a file is loaded to a domain table using an existing feed, the file that is loaded must have the same schema as the existing feed. The data in the file may be new.
Use the feed editor to do all of the following:
When finished, click Activate.
Find the courier related to the feed that was just activated, and then run it manually.
On the Sources page, under Couriers, find the courier you want to run and then select Run from the actions menu.
Select a date from the calendar picker that is before today, but after the date on which the file was added to the Amazon S3 bucket.
Leave the load options in the Run courier dialog box unselected, and then click Run.
After the courier has run successfully, inspect the domain table that contains the data that was loaded to Amperity.
After you have verified that the data is correct, you may do any of the following:
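The idea behind a filename pattern, matching files whose names differ only by a datestamp, can be illustrated outside of Amperity. The sketch below uses a Python regular expression and is not Amperity's own pattern syntax. The `customers_YYYY-MM-DD.csv` naming convention and the helper names `datestamp_of` and `files_for` are assumptions made for the example.

```python
import re
from datetime import date, datetime
from typing import List, Optional

# Hypothetical upstream naming convention: customers_YYYY-MM-DD.csv
FILENAME_PATTERN = re.compile(r"customers_(\d{4}-\d{2}-\d{2})\.csv")


def datestamp_of(filename: str) -> Optional[date]:
    """Return the datestamp embedded in a matching filename, otherwise None."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y-%m-%d").date()
    except ValueError:
        # The digits matched the pattern but do not form a real calendar date.
        return None


def files_for(day: date, filenames: List[str]) -> List[str]:
    """Return the filenames whose embedded datestamp equals the given day."""
    return [name for name in filenames if datestamp_of(name) == day]


if __name__ == "__main__":
    listing = ["customers_2024-03-14.csv", "customers_2024-03-15.csv", "orders_2024-03-15.csv"]
    print(files_for(date(2024, 3, 15), listing))
```

Running a check like this against a listing of the bucket before activating a feed can confirm that the upstream system names its files consistently enough for a pattern to work.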
Workflow actions¶
A workflow will occasionally show an error that describes what prevented a workflow from completing successfully. These first appear as alerts in the notifications pane. The alert describes the error, and then links to the Workflows tab.
Open the Workflows page to review a list of workflow actions, choose an action to resolve the workflow error, and then follow the steps that are shown.
You may receive a notifications error for a configured Amazon S3 data source. This appears as an alert in the notifications pane on the Destinations tab. If you receive a notification error, review the details, and then click the View Workflow link to open this notification error in the Workflows page.
On the Workflows page, review the individual steps to determine which step(s) have errors that require your attention, and then click Show Resolutions to review the list of workflow actions that were generated for this error.
A list of individual workflow actions is shown. Review the list to identify which action you should take. Some workflow actions are common across workflows and will often be available, such as retrying a specific task within a workflow or restarting a workflow. These types of actions can often resolve an error. In certain cases, actions are specific and are shown when certain conditions exist in your tenant. These types of actions typically must be resolved and may require steps that must be done upstream or downstream from your Amperity workflow. Amperity provides a series of workflow actions that can help resolve specific issues that may arise with Amazon S3, including:
Select a workflow action from the list of actions, and then review the steps for resolving that error. After you have completed the steps in the workflow action, click Continue to rerun the workflow.
Bad archive¶
Sometimes the contents of an archive are corrupted and cannot be loaded to Amperity.
To resolve this error, do the following.
Upload a new file to Amperity.
After the file is uploaded, return to the workflow action, and then click Resolve to retry this workflow.
Invalid bucket name¶
The name of the Amazon S3 bucket from which Amperity pulls data must be correctly specified in the configuration for the courier on the Sources page.
To resolve this error, verify the name of the Amazon S3 bucket, and then update the configuration in Amperity to match.
Open the AWS management console and verify the name of the Amazon S3 bucket.
Open the Sources page in Amperity, and then open the courier that is associated with this workflow.
Update the courier configuration for the correct Amazon S3 bucket name.
Return to the workflow action, and then click Resolve to retry.
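Many invalid bucket name errors come from copy-and-paste problems such as stray whitespace or uppercase letters. The sketch below checks a name against a subset of the Amazon S3 naming rules for general purpose buckets (3 to 63 characters; lowercase letters, digits, periods, and hyphens; starting and ending with a letter or digit; no adjacent periods; not formatted as an IP address). The function name `bucket_name_problems` is hypothetical, and the check does not cover reserved prefixes and suffixes or confirm that the bucket exists.

```python
import re
from typing import List

# 3-63 characters, lowercase letters, digits, periods, and hyphens,
# starting and ending with a letter or digit.
BUCKET_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")
IP_ADDRESS_PATTERN = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def bucket_name_problems(name: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if name != name.strip():
        problems.append("has leading or trailing whitespace")
    if not BUCKET_NAME_PATTERN.fullmatch(name):
        problems.append(
            "must be 3-63 characters of lowercase letters, digits, periods, "
            "or hyphens, starting and ending with a letter or digit"
        )
    if ".." in name:
        problems.append("contains two adjacent periods")
    if IP_ADDRESS_PATTERN.fullmatch(name):
        problems.append("is formatted like an IP address")
    return problems


if __name__ == "__main__":
    for candidate in ["acme-customer-data", "Acme_Data", " acme-data "]:
        print(repr(candidate), bucket_name_problems(candidate))
```

An empty result only means the name is well formed; the name must still match the bucket shown in the AWS management console.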
Invalid credentials¶
The credentials that are defined in Amperity are invalid.
To resolve this error, verify that the credentials required by this workflow are valid.
Open the Credentials page.
Review the details for the credentials used with this workflow. Update the credentials for Amazon S3 if required.
Return to the workflow action, and then click Resolve to retry this workflow.
Missing file¶
An archive that does not contain a file that is expected to be within it will return a workflow error; Amperity will be unable to complete the workflow until the issue is resolved.
To resolve this error, do the following.
Add the required file to the archive.
or
Update the configuration for the courier that is attempting to load the missing file to not require that file.
After the file is added to the archive or removed from the courier configuration, click Resolve to retry this workflow.
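Before an archive is uploaded again, its contents can be compared with the files the courier expects. The following is a minimal sketch for ZIP archives that uses Python's standard `zipfile` module. The helper name `missing_from_archive` and the example filenames are hypothetical, and the list of expected files must come from your own courier configuration.

```python
import io
import zipfile
from typing import Iterable, List


def missing_from_archive(archive: zipfile.ZipFile, expected: Iterable[str]) -> List[str]:
    """Return the expected member names that are not present in the archive."""
    present = set(archive.namelist())
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Build a small in-memory archive so the example runs without any files on disk.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("customers.csv", "id,email\n1,a@example.com\n")

    with zipfile.ZipFile(buffer) as zf:
        print(missing_from_archive(zf, ["customers.csv", "orders.csv"]))
```

Member names are compared exactly, so a file stored inside a folder in the archive (for example `export/orders.csv`) is reported as missing when only `orders.csv` is expected.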
PGP error¶
A workflow action is created when a file cannot be decrypted using the provided PGP key.
To resolve this error, verify the PGP key.
Open the Sources page.
Review the details for the PGP key.
If the PGP key is correct, verify that the file that is associated with this workflow error was encrypted using the correct PGP key. If necessary, upload a new file.
Return to the workflow action, and then click Resolve to retry this workflow.
Unable to decompress archive¶
An archive that cannot be decompressed will return a workflow error; Amperity will be unable to complete the workflow until the issue is resolved.
This issue may be shown when the name of the archive doesn’t match the name of the configured archive or when Amperity is attempting to decompress a file (and not an archive). In some cases, the contents of the archive file may be the reason why Amperity is unable to decompress the archive.
To resolve this error, do the following.
Verify the configuration for the archive, and then verify the contents of the archive.
Update the configuration, if necessary. For example, when Amperity is attempting to decompress a file, update the configuration to specify a file and not an archive.
In some cases, re-loading the archive to the location from which Amperity is attempting to pull the archive is necessary.
Return to the workflow action, and then click Resolve to retry this workflow.
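One way to tell whether a courier is pointed at a plain file rather than an archive is to inspect the content of the file instead of trusting its extension. The sketch below classifies a byte string as ZIP, gzip, uncompressed tar, or plain using only the Python standard library. The function name `detect_container` is hypothetical, and the check covers only these container types, not every compression or archive method that Amperity supports.

```python
import io
import tarfile
import zipfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def detect_container(data: bytes) -> str:
    """Classify bytes as 'zip', 'gzip', 'tar', or 'plain' by inspecting the content."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "zip"
    if data[:2] == GZIP_MAGIC:
        return "gzip"
    try:
        # Mode "r:" accepts only uncompressed tar data.
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:"):
            return "tar"
    except tarfile.TarError:
        return "plain"


if __name__ == "__main__":
    print(detect_container(b"id,email\n1,a@example.com\n"))
```

If a file that is named like an archive is classified as plain, updating the courier configuration to specify a file and not an archive is likely the right resolution.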