Pull from Azure Blob Storage

Azure Blob Storage is an object storage solution for the cloud that is optimized for storing massive amounts of unstructured data.

Important

Use this data source to pull data to Amperity from Azure Data Lake Storage Gen1 or Azure Data Lake Storage Gen2.

This topic describes the steps that are required to pull files in any supported format to Amperity from Azure Blob Storage:

  1. Get details

  2. Configure credentials

  3. Add data source

Get details

The Azure Blob Storage data source requires the following configuration details:

Detail one.

The name of the container in Azure Blob Storage from which Amperity will pull data.

Detail two.

Credentials that allow Amperity to access the container. This may be done using credentials (a connection string, a shared access signature, or a storage URI) or using Azure Data Share.

Filedrop recommendations

You may reference the following sections while configuring this data source:

Configure credentials

Use one of the following options to configure Amperity to pull data from Azure Blob Storage:

Azure credentials

A source that uses Azure Blob Storage credentials may use a connection string, a shared access signature, or a storage URI.

To configure Amperity to use Microsoft Azure credentials

Step 1.

The URL for the Azure authentication endpoint.

Step 2.

The information needed depends on the selected credential type.

Connection string

This method uses a connection string and the name of the container from which Amperity will pull data.

Shared access signature

This method uses the Microsoft Azure account name, a shared access signature, and the name of the container from which Amperity will pull data.

Tip

When using a shared access signature, create a policy in Microsoft Azure, and then generate a token for the shared access signature against that policy. This allows the policy to expire instead of the token. Manage the expiration of the token by managing the expiration of the policy.

Storage URI

This method uses the storage URI and the name of the container from which Amperity will pull data.
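To see what a connection string carries before pasting it into a credential, it can be split into its parts. The sketch below is illustrative only and assumes the standard Azure Storage connection string layout of semicolon-separated `Key=Value` pairs. The account name and key shown are made-up placeholders, and the helper names are hypothetical.

```python
def parse_connection_string(conn_str: str) -> dict[str, str]:
    """Split 'Key=Value;Key=Value' pairs into a dictionary."""
    parts = {}
    for segment in conn_str.strip().split(";"):
        if not segment:
            continue  # tolerate a trailing semicolon
        # Split on the first '=' only; account keys often end in '=' padding.
        key, sep, value = segment.partition("=")
        if not sep or not key:
            raise ValueError(f"Malformed segment: {segment!r}")
        parts[key] = value
    return parts


def blob_endpoint(parts: dict[str, str]) -> str:
    """Derive the blob service URL from parsed connection string parts."""
    if "BlobEndpoint" in parts:
        return parts["BlobEndpoint"]
    protocol = parts.get("DefaultEndpointsProtocol", "https")
    suffix = parts.get("EndpointSuffix", "core.windows.net")
    return f"{protocol}://{parts['AccountName']}.blob.{suffix}"


# Placeholder values, not a real account.
example = (
    "DefaultEndpointsProtocol=https;AccountName=examplestore;"
    "AccountKey=ZmFrZS1rZXk=;EndpointSuffix=core.windows.net"
)
parts = parse_connection_string(example)
print(blob_endpoint(parts))
```

A check like this can catch a truncated or mistyped connection string before it is saved as a credential.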

Step 3.

Configure a data source and feed to pull data from the configured location, after which the list of files (by filename and file type) will be visible from Amperity.

Azure Data Share

Azure Data Share is a simple and safe service for sharing data in any format and any size with Amperity. Azure Data Share requires no infrastructure setup or management and uses underlying Azure security measures as they are applied to both Azure accounts. Snapshot-based sharing of data can be automated and does not require a special access key.

Amperity prefers to pull data from customer-managed cloud storage. This approach ensures that customers can:

  • Use security policies managed in Azure Data Share to manage access to data

  • Directly manage the files that are made available

  • Modify access without requiring involvement by Amperity; access may be revoked at any time by either Azure account, after which data sharing ends immediately

  • Directly troubleshoot incomplete or missing files

Amperity recommends using Azure Data Share to manage access to customer-managed cloud storage in Azure. This allows managed security policies to control access to data.

After setting up Azure Data Share, a list of files (by filename and file type), along with any sample files, must be made available to allow for feed creation. These files may be placed directly into the shared location after Azure Data Share is configured.

Note

Data that is shared with Amperity via Azure Data Share can be removed from Amperity by submitting a request for data removal to Amperity Customer Support.

To send data to Amperity using Azure Data Share

The following steps describe how to configure Amperity to use Azure Data Share to pull data from customer-managed Azure Blob Storage.

Important

These steps require configuration changes to both customer- and Amperity-managed Azure accounts and must be done by users with administrative access.

Step 1.

The customer sends Amperity an invitation to set up data sharing.

Amperity accepts the invitation to set up data sharing.

Step 2.

The customer determines the location from which data will be shared with Amperity, and then configures the schedule for how frequently snapshots of the data will be shared.

Amperity places shared data into a customer-dedicated Azure Blob Storage instance.

Step 3.

Configure a data source and feed to pull data from the customer-dedicated Azure Blob Storage instance.

Add data source and feed

Add a data source that pulls data from an Azure Blob Storage container for each file that you want to pull to Amperity.

Browse the Azure Blob Storage container to select a file, and then review the settings for that file. Define the feed schema, and then activate the feed. Run the courier manually, and then review the data that is added to the domain table that is associated with the feed.

To add a data source for an Azure Blob Storage container

Step 1.

Open the Sources page to configure Azure Blob Storage.

Click the Add courier button to open the Add courier dialog box.

Select Azure Blob Storage. Do one of the following:

  1. Click the row in which Azure Blob Storage is located. Sources are listed alphabetically.

  2. Search for Azure Blob Storage. Start typing “azu”. The list will filter to show only matching sources.

Step 2.

Credentials allow Amperity to connect to Azure Blob Storage and must exist before a courier can be configured to pull data from Azure Blob Storage. Select an existing credential from the Credential drop-down, and then click Continue.

Tip

A courier that has credentials that are configured correctly will show a “Connection successful” status.

Step 3.

Select the file that will be pulled to Amperity, either directly (by browsing the container and selecting it) or by providing a filename pattern.

Click Browse to open the File browser. Select the file that will be pulled to Amperity, and then click Accept.

Use a filename pattern to define files that will be loaded on a recurring basis, but will have small changes to the filename over time, such as having a datestamp appended to the filename.
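As an illustration of what a filename pattern is meant to capture, the sketch below matches filenames that share a fixed prefix and carry an appended datestamp. It does not use Amperity's pattern syntax, which is not shown in this topic. The underscore separator, the `YYYY-MM-DD` datestamp layout, and the function name are assumptions made for the example.

```python
import re
from datetime import date


def matches_datestamped(filename, prefix, extension):
    """Return the date embedded in names like 'prefix_2024-01-15.csv', else None."""
    pattern = re.compile(
        rf"^{re.escape(prefix)}_(\d{{4}})-(\d{{2}})-(\d{{2}})\.{re.escape(extension)}$"
    )
    match = pattern.match(filename)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # digits are present but do not form a real calendar date


for name in ["customers_2024-01-15.csv", "customers_2024-13-40.csv", "orders_2024-01-15.csv"]:
    print(name, matches_datestamped(name, "customers", "csv"))
```

Only the first filename matches: the second has an impossible date and the third has a different prefix.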

Note

For a new feed, this file is also used as the sample file that is used to define the schema. For an existing feed, this file must match the schema that has already been defined.

Use the PGP credential setting to specify the credentials to use for an encrypted file.

Step 4.

Review the file.

The contents of the file may be previewed as a table and in a raw format. Switch between these views using the Table and Raw buttons, and then click Refresh to view the file in that format.

Note

PGP encrypted files can be previewed. Apache Parquet PGP encrypted files must be less than 500 MB to be previewed.

Amperity infers formatting details, and then adds these details to a series of settings located along the left side of the file view. File settings include:

  • Delimiter

  • Compression

  • Escape character

  • Quote character

  • Header row

Review the file, and then update these settings, if necessary.
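When an inferred delimiter looks wrong, it can help to check the file independently. The sketch below shows one simple way a delimiter could be guessed from a few sample lines: pick the candidate that splits every line into the same number of fields. This is not how Amperity infers settings. It is a deliberately naive heuristic that ignores quote and escape characters, and the candidate list and function name are assumptions.

```python
CANDIDATE_DELIMITERS = [",", "\t", "|", ";"]


def infer_delimiter(sample_lines):
    """Return the candidate that yields a consistent field count greater than 1.

    When several candidates qualify, the one producing the most fields wins.
    Returns None when no candidate splits the sample consistently.
    """
    best, best_fields = None, 1
    for delimiter in CANDIDATE_DELIMITERS:
        # Field count per non-empty line for this candidate.
        counts = {line.count(delimiter) + 1 for line in sample_lines if line}
        if len(counts) == 1:
            fields = counts.pop()
            if fields > best_fields:
                best, best_fields = delimiter, fields
    return best


sample = ["id|name|amount", "1|Ann|10.5", "2|Bob, Jr.|20.0"]
print(repr(infer_delimiter(sample)))
```

In this sample the comma appears in only one line, so it is rejected, and the pipe character is chosen because it splits every line into three fields.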

Note

Amperity supports the following file types: Apache Avro, Apache Parquet, CSV, DSV, JSON, NDJSON, PSV, TSV, and XML.

Refer to those reference pages for details about each of the individual file formats.

Files that contain nested JSON (or “complex JSON”) or XML may require using the legacy courier configuration.

Step 5.

A feed defines the schema for a file that is loaded to Amperity, after which that data is loaded into a domain table and ready for use with workflows within Amperity.

There are two options for feeds: use a new feed or use an existing feed.

Use a new feed

To use a new feed, choose the Create new feed option, select an existing source from the Source drop-down or type the name of a new data source, and then enter the name of the feed.

After you choose a load type and save the courier configuration, you will configure the feed using the data within the sample file.

Use an existing feed

To use an existing feed, choose the Use existing feed option to use an existing schema.

This option requires this file to match all of the feed-specific settings, such as incoming field names, field types, and primary keys. The data within the file may be different.
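Before reusing a feed, the field names in a new file can be compared against the names the feed expects. The sketch below covers field names only, not field types or primary keys. It assumes a comma-delimited file with a header row, and the field names and function name are hypothetical.

```python
import csv
import io


def header_mismatches(csv_text, expected_fields):
    """Compare a CSV header row with the field names an existing feed expects.

    Returns (missing, unexpected): fields the feed expects but the file lacks,
    and fields the file has that the feed does not define.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty list when the file has no rows
    missing = [name for name in expected_fields if name not in header]
    unexpected = [name for name in header if name not in expected_fields]
    return missing, unexpected


sample = "id,email,phone\n1,a@example.com,555-0100\n"
print(header_mismatches(sample, ["id", "email", "zip"]))
```

An empty result for both lists suggests the file's field names line up with the existing feed. Field types and keys still need to be reviewed separately.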

Load types

The load type defines how data in the file will be loaded to the associated domain table.

Use the Truncate and load option to delete all rows in the associated domain table prior to loading data.

Use the Load option to load data from the selected file to the associated domain table.

Note

When a file is loaded to a domain table using an existing feed, the file that is loaded must have the same schema as the existing feed. The data in the file may be new.

Step 6.

Use the feed editor to do all of the following:

  • Set the primary key

  • Choose the field that best represents when the data in the table was last updated; if there is not an obvious choice, use the “Generate an updated field” option.

  • For each field in the incoming data, validate the field name and semantic tag columns in the feed. Make any necessary adjustments.

  • For tables that contain customer records, enable the “Make available to Stitch” option to ensure the values in this data source are used for identity resolution.

When finished, click Activate.

Step 7.

Find the courier related to the feed that was just activated, and then run it manually.

On the Sources page, under Couriers, find the courier you want to run and then select Run from the actions menu.

Select a date from the calendar picker that is before today, but after the date on which the file was added to the Azure Blob Storage container.

Leave the load options in the Run courier dialog box unselected, and then click Run.
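The date window described above (before today, but after the date the file was added) can be written as a simple comparison. The sketch below is illustrative only. The function name is hypothetical, and both ends of the window are treated as exclusive, following the wording of this step.

```python
from datetime import date


def is_valid_run_date(run_date, file_added, today):
    """True when the run date is after the file was added and before today."""
    return file_added < run_date < today


file_added = date(2024, 3, 1)
today = date(2024, 3, 10)
print(is_valid_run_date(date(2024, 3, 5), file_added, today))
print(is_valid_run_date(today, file_added, today))
```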

After the courier has run successfully, inspect the domain table that contains the data that was loaded to Amperity. After you have verified that the data is correct, you may do any of the following:

  • If the data contains customer records, edit the feed and make that data available to Stitch.

  • If the data should be loaded to Amperity on a regular basis, add the courier to a courier group that runs on the desired schedule.

  • If the data will be a foundation for custom domain tables, use Spark SQL to build out that customization.

Workflow actions

A workflow will occasionally show an error that describes what prevented a workflow from completing successfully. These first appear as alerts in the notifications pane. The alert describes the error, and then links to the Workflows tab.

Open the Workflows page to review a list of workflow actions, choose an action to resolve the workflow error, and then follow the steps that are shown.

Step one.

You may receive a notifications error for a configured Azure Blob Storage data source. This appears as an alert in the notifications pane on the Sources page.

If you receive a notification error, review the details, and then click the View Workflow link to open this notification error in the Workflows page.

Step two.

On the Workflows page, review the individual steps to determine which step(s) have errors that require your attention, and then click Show Resolutions to review the list of workflow actions that were generated for this error.

Step three.

A list of individual workflow actions is shown. Review the list to identify which action you should take.

Choose a workflow action from the list of actions.

Some workflow actions are common across workflows and will often be available, such as retrying a specific task within a workflow or restarting a workflow. These types of actions can often resolve an error.

In certain cases, actions are specific and are shown when certain conditions exist in your tenant. These types of actions typically must be resolved and may require steps that must be done upstream or downstream from your Amperity workflow.

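Amperity provides a series of workflow actions that can help resolve specific issues that may arise with Azure Blob Storage, including:

  • Bad archive

  • Invalid credentials

  • Invalid permissions

  • Missing file

  • PGP error

  • Unable to decompress archive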

Step four.

Select a workflow action from the list of actions, and then review the steps for resolving that error.

After you have completed the steps in the workflow action, click Continue to rerun the workflow.

Bad archive

Sometimes the contents of an archive are corrupted and cannot be loaded to Amperity.

To resolve this error, do the following.

  1. Upload a new file to Amperity.

  2. After the file is uploaded, return to the workflow action, and then click Resolve to retry this workflow.

Invalid credentials

The credentials that are defined in Amperity are invalid.

To resolve this error, verify that the credentials required by this workflow are valid.

  1. Open the Credentials page.

  2. Review the details for the credentials used with this workflow. Update the credentials for Azure Blob Storage if required.

  3. Return to the workflow action, and then click Resolve to retry this workflow.

Invalid permissions

Microsoft Azure may be configured to use a shared access signature (SAS) to grant restricted access rights to Microsoft Azure storage resources.

What is a shared access signature (SAS)?

A shared access signature (SAS) grants limited access to storage resources in Microsoft Azure. A SAS may be constrained to access only specific storage resources, have specific permissions to those resources, and be configured to expire after a set amount of time. Every SAS is signed with a key.

The SAS is appended to the URI for a storage resource. The combined URI and SAS become a token that contains a set of query parameters that indicate how a storage resource may be accessed. Use the SAS token to configure Amperity credentials to storage resources in Microsoft Azure.

A SAS token may have invalid permissions in any of the following situations:

  1. The SAS token may be configured incorrectly within Amperity. For example: an extra character within or at the end of the SAS token. Verify the string, and then make any updates that are required for the credentials within Amperity.

  2. The permissions for the SAS token were configured incorrectly. Amperity requires a SAS token to be assigned the following permissions: READ, ADD, CREATE, WRITE, DELETE, and LIST.

  3. The SAS token may have expired or the signing key associated with the SAS token may have been rotated.

    These situations will require generating a new SAS token, and then updating the credentials in Amperity.

Note

If the shared access signature was provisioned by Amperity, please use the “Report a problem” feature in Amperity to contact your Amperity Support team and ask for help resolving this workflow issue.

The “Report a problem” option is available from the menu in the top navigation.

To resolve this error, determine the cause for the invalid permissions error.

  1. Do one (or more) of the following:

    Verify that the SAS token was configured correctly within Amperity.

    Verify the permissions that have been assigned to the SAS token. This can be done from the Microsoft Azure Portal or by using Azure Storage Explorer. The policy for the SAS token must be assigned the following permissions: READ, ADD, CREATE, WRITE, DELETE, and LIST.

    Verify that the SAS token and/or the signing key associated with the SAS token is valid (and has not expired). If either has expired, generate a new SAS token (using a new signing key, if necessary).

  2. After you have determined the cause of the invalid permissions error, make the appropriate updates within Microsoft Azure and/or the credentials for this data source within Amperity.

  3. Return to the workflow action, and then click Resolve to retry this workflow.
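Some of the checks above can be approximated by inspecting the query parameters in the SAS token itself. The sketch below is a local sanity check, not a substitute for verifying the token in the Microsoft Azure Portal. It assumes the standard SAS parameter names (`sp` for permissions, `se` for expiry, `sig` for the signature), the single-letter permission codes `r`, `a`, `c`, `w`, `d`, and `l`, and an expiry written as a full UTC timestamp. A token generated against a stored access policy may omit `sp` and `se`, in which case those checks are skipped. The function name and sample tokens are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs

# Permission letters in a SAS token mapped to the permissions named above.
REQUIRED = {"r": "READ", "a": "ADD", "c": "CREATE", "w": "WRITE", "d": "DELETE", "l": "LIST"}


def check_sas_token(token, now=None):
    """Return a list of problems found in a SAS token's query parameters."""
    now = now or datetime.now(timezone.utc)
    problems = []
    if token != token.strip():
        problems.append("token has leading or trailing whitespace")
    params = parse_qs(token.strip().lstrip("?"))
    if "sig" not in params:
        problems.append("no signature (sig) parameter")
    if "sp" in params:
        granted = set(params["sp"][0])
        missing = [name for letter, name in REQUIRED.items() if letter not in granted]
        if missing:
            problems.append("missing permissions: " + ", ".join(missing))
    if "se" in params:
        raw_expiry = params["se"][0]
        try:
            expiry = datetime.strptime(raw_expiry, "%Y-%m-%dT%H:%M:%SZ")
            expiry = expiry.replace(tzinfo=timezone.utc)
        except ValueError:
            problems.append("unrecognized expiry format: " + raw_expiry)
        else:
            if expiry <= now:
                problems.append("token expired at " + raw_expiry)
    return problems


# Made-up token with too few permissions, a past expiry, and a trailing space.
sample = "sv=2022-11-02&sr=c&sp=rl&se=2024-01-01T00:00:00Z&sig=abc123 "
for problem in check_sas_token(sample, now=datetime(2024, 6, 1, tzinfo=timezone.utc)):
    print(problem)
```

A rotated signing key cannot be detected this way, because the signature can only be validated by Microsoft Azure.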

Missing file

An archive that does not contain a file that the courier expects to find will return a workflow error; Amperity will be unable to complete the workflow until the issue is resolved.

To resolve this error, do the following.

  1. Add the required file to the archive.

    or

    Update the configuration for the courier that is attempting to load the missing file to not require that file.

  2. After the file is added to the archive or removed from the courier configuration, click Resolve to retry this workflow.
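Before uploading an archive again, its contents can be checked locally against the files the courier expects. The sketch below assumes a ZIP archive and uses Python's standard `zipfile` module. The filenames and function name are hypothetical.

```python
import io
import zipfile


def missing_members(archive, expected_names):
    """Return the expected filenames that are not present in a ZIP archive.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        present = set(zf.namelist())
    return [name for name in expected_names if name not in present]


# Build a small in-memory archive for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("customers.csv", "id,email\n1,a@example.com\n")
buffer.seek(0)

print(missing_members(buffer, ["customers.csv", "orders.csv"]))
```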

PGP error

A workflow action is created when a file cannot be decrypted using the provided PGP key.

To resolve this error, verify the PGP key.

  1. Open the Sources page.

  2. Review the details for the PGP key.

    If the PGP key is correct, verify that the file that is associated with this workflow error was encrypted using the correct PGP key. If necessary, upload a new file.

  3. Return to the workflow action, and then click Resolve to retry this workflow.

Unable to decompress archive

An archive that cannot be decompressed will return a workflow error; Amperity will be unable to complete the workflow until the issue is resolved.

This issue may be shown when the name of the archive doesn’t match the name of the configured archive or when Amperity is attempting to decompress a file (and not an archive). In some cases, the contents of the archive file may be the reason why Amperity is unable to decompress the archive.

To resolve this error, do the following.

  1. Verify the configuration for the archive, and then verify the contents of the archive.

    Update the configuration, if necessary. For example, when Amperity is attempting to decompress a file, update the configuration to specify a file and not an archive.

    In some cases, re-loading the archive to the location from which Amperity is attempting to pull the archive is necessary.

  2. Return to the workflow action, and then click Resolve to retry this workflow.
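To tell whether an item is an archive or a single file before changing the configuration, it can be inspected locally. The sketch below uses Python's standard `zipfile` and `tarfile` modules plus the gzip magic bytes. It is a rough local check under the assumption that the archive is ZIP, tar, or gzip. The function name and labels are hypothetical, and this is not how Amperity classifies files.

```python
import tarfile
import zipfile


def classify(path):
    """Roughly classify a local file as an archive, a compressed file, or a plain file."""
    if zipfile.is_zipfile(path):
        return "zip archive"
    if tarfile.is_tarfile(path):
        # Also true for compressed tar archives such as .tar.gz.
        return "tar archive"
    with open(path, "rb") as handle:
        if handle.read(2) == b"\x1f\x8b":  # gzip magic bytes
            return "gzip-compressed file"
    return "plain file"
```

For example, a gzip-compressed CSV file is reported as a compressed file and not as an archive, which suggests the courier should be configured to expect a file.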