Pull from Azure Blob Storage (Legacy)

Azure Blob Storage is an object storage solution for the cloud that is optimized for storing massive amounts of unstructured data.

Important

Use this data source to pull data to Amperity from Azure Data Lake Storage Gen1 or Azure Data Lake Storage Gen2.

This topic describes the steps that are required to pull files in any supported format to Amperity from Azure Blob Storage:

  1. Get details

  2. Add courier

  3. Get sample files

  4. Add feeds

  5. Add load operations

  6. Run courier

  7. Add to courier group

Get details

Amperity can be configured to pull data from Azure Blob Storage. This may be done using Azure Data Share or by using Azure Blob Storage credentials.

Filedrop recommendations

Review Amperity’s filedrop recommendations while configuring this data source.

Use Azure Data Share

Azure Data Share is a simple and safe service for sharing data in any format and any size with Amperity. Azure Data Share requires no infrastructure setup or management and uses underlying Azure security measures as they are applied to both Azure accounts. Snapshot-based sharing of data can be automated and does not require a special access key.

Amperity prefers to pull data from customer-managed cloud storage. This approach ensures that customers can:

  • Use security policies managed in Azure Data Share to manage access to data

  • Directly manage the files that are made available

  • Modify access without requiring involvement by Amperity; access may be revoked at any time by either Azure account, after which data sharing ends immediately

  • Directly troubleshoot incomplete or missing files

Amperity recommends using Azure Data Share to manage access to customer-managed cloud storage in Azure. This allows managed security policies to control access to data.

After setting up Azure Data Share, a list of files (by filename and file type), along with any sample files, must be made available to allow for feed creation. These files may be placed directly into the shared location after Azure Data Share is configured.

To send data to Amperity using Azure Data Share

The following steps describe how to configure Amperity to use Azure Data Share to pull data from customer-managed Azure Blob Storage.

Important

These steps require configuration changes to both customer- and Amperity-managed Azure accounts and must be done by users with administrative access.

  1. The customer sends Amperity an invitation to set up data sharing.

  2. Amperity accepts the invitation to set up data sharing.

  3. The customer determines the location from which data will be shared with Amperity.

  4. The customer schedules snapshots of the data to be shared, including the frequency of data sharing.

  5. Amperity places the shared data into a customer-dedicated Azure Blob Storage instance.

  6. Amperity configures a courier to pull data from that location into Amperity.

Note

Data that is shared with Amperity via Azure Data Share can be removed from Amperity by submitting a request for data removal to Amperity Customer Support.

Use credentials

A source that uses credentials to send data to Amperity from Azure Blob Storage requires that the following information be sent to Amperity via SnapPass:

  1. The URL for the Azure authentication endpoint. This is typically in the format of https://login.microsoftonline.com/<directory_id>/oauth2/token.

  2. The information needed for the selected credential method: shared access credentials, a storage URI, or a connection string. (Example formats are shown after this list.)

  3. The name of the container, the blob prefix, and credential details. (These vary depending on the chosen credential method.)

  4. A list of files (by filename and file type) in Azure Blob Storage to be sent to Amperity.

  5. A sample for each file to simplify feed creation.
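
For reference, the Azure authentication endpoint and connection string typically follow standard Azure formats similar to the examples below. The directory ID, storage account name, and account key shown here are placeholders for values that come from the customer’s Azure subscription.

https://login.microsoftonline.com/<directory_id>/oauth2/token

DefaultEndpointsProtocol=https;AccountName=<storage_account_name>;AccountKey=<storage_account_key>;EndpointSuffix=core.windows.net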

Add courier

A courier brings data from an external system to Amperity.

Tip

You can run a courier with an empty load operation using {} as the value for the load operation. Use this approach to get files to upload during feed creation, as a feed requires knowing the schema of a file before you can apply semantic tagging and other feed configuration settings.

for Azure Data Share

Azure Data Share is a simple and safe service for sharing data in any format and any size with Amperity. Azure Data Share requires no infrastructure setup or management and uses underlying Azure security measures as they are applied to both Azure accounts. Snapshot-based sharing of data can be automated and does not require a special access key.

Review Use Azure Data Share for more information about using Azure Data Share.

To add a courier

  1. From the Sources page, click Add Courier. The Add Source page opens.

  2. Find, and then click the icon for Azure Blob Storage. The Add Courier page opens.

    This automatically selects azure-blob-connection-string as the Credential Type. You may switch to using azure-blob-storage-uri or azure-blob-shared-access-signature.

  3. Under Azure Blob Storage Settings, add the name of the container and the blob prefix.

  4. Under Azure Blob Storage Settings, configure the Entities List to specify each file that will be pulled to Amperity. For example, two files: “CustomerRecords.csv” and “TransactionRecords.csv”.

    [
      {
        "object/type": "file",
        "object/file-pattern": "'/path/to/CustomerRecords.csv'",
        "object/land-as": {
          "file/header-rows": 1,
          "file/tag": "customer-records-2019",
          "file/content-type": "text/csv"
        }
      },
      {
        "object/type": "file",
        "object/file-pattern": "'/path/to/TransactionRecords.csv'",
        "object/land-as": {
          "file/header-rows": 1,
          "file/tag": "transaction-records-2019",
          "file/content-type": "text/csv"
        }
      }
    ]
    
  5. Under Azure Blob Storage Settings, set the load operations to a string that is obviously incorrect, such as df-xxxxxx. (You may also set the load operation to empty: “{}”.) A sketch of placeholder load operations is shown after these steps.

    Tip

    If you use an obviously incorrect string, the load operation settings will be saved in the courier configuration. After the schema for the feed is defined and the feed is activated, you can edit the courier and replace the feed ID with the correct identifier.

    Caution

    If load operations are not set to “{}” or to an obviously incorrect string, the validation test for the courier configuration settings will fail.

  6. Click Save.
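
The following is a minimal sketch of placeholder load operations for step 5, assuming df-xxxxxx stands in for the feed ID (which does not exist yet) and assuming the file value matches the file/tag defined in the entities list:

{
  "df-xxxxxx": [
    {
      "type": "load",
      "file": "customer-records-2019"
    }
  ]
}

After the feed is activated, edit the courier and replace df-xxxxxx with the correct feed ID, as described in Add load operations.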

for Azure credentials

Credential options for accessing Azure Blob Storage include shared access signatures, connection strings, and storage URIs.

To add a courier

  1. From the Sources page, click Add Courier. The Add Source page opens.

  2. Find, and then click the icon for Azure Blob Storage. The Add Courier page opens.

    This automatically selects azure-blob-connection-string as the Credential Type. You may switch to using azure-blob-storage-uri or azure-blob-shared-access-signature.

  3. From the Credential drop-down, select Create a new credential. This opens the Create New Credential dialog box.

  4. Enter a name for the credential, any credential type-specific settings, and then click Save.

    For the “azure-blob-connection-string” credential type, enter the name of the credential and the connection string.

    For the “azure-blob-storage-uri” credential type, enter the name of the credential and the URI for the Azure Blob Storage instance.

    For the “azure-blob-shared-access-signature” credential type, enter the name of the credential, the account name, and the shared access signature.

  5. Under Azure Blob Storage Settings, add the name of the container and the blob prefix.

  6. Under Azure Blob Storage Settings, configure the Entities List to specify each file that will be pulled to Amperity. For example, two files: “CustomerRecords.csv” and “TransactionRecords.csv”.

    [
      {
        "object/type": "file",
        "object/file-pattern": "'/path/to/CustomerRecords.csv'",
        "object/land-as": {
          "file/header-rows": 1,
          "file/tag": "customer-records-2019",
          "file/content-type": "text/csv"
        }
      },
      {
        "object/type": "file",
        "object/file-pattern": "'/path/to/TransactionRecords.csv'",
        "object/land-as": {
          "file/header-rows": 1,
          "file/tag": "transaction-records-2019",
          "file/content-type": "text/csv"
        }
      }
    ]
    
  7. Under Azure Blob Storage Settings, set the load operations to a string that is obviously incorrect, such as df-xxxxxx. (You may also set the load operation to empty: “{}”.)

    Tip

    If you use an obviously incorrect string, the load operation settings will be saved in the courier configuration. After the schema for the feed is defined and the feed is activated, you can edit the courier and replace the feed ID with the correct identifier.

    Caution

    If load operations are not set to “{}” or to an obviously incorrect string, the validation test for the courier configuration settings will fail.

  8. Click Save.

Get sample files

Every Azure Blob Storage file that is pulled to Amperity must be configured as a feed. Before you can configure each feed you need to know the schema of that file. Run the courier without load operations to bring sample files from Azure Blob Storage to Amperity, and then use each of those files to configure a feed.

To get sample files

  1. From the Sources tab, open the menu for a courier configured for Azure Blob Storage with empty load operations, and then select Run. The Run Courier dialog box opens.

  2. Select Load data from a specific day, and then select today’s date.

  3. Click Run.

    Important

    The courier run will fail, but this process will successfully return a list of files from Azure Blob Storage.

    These files will be available for selection as an existing source from the Add Feed dialog box.

  4. Wait for the notification for this courier run to return an error similar to:

    Error running load-operations task
    Cannot find required feeds: "df-xxxxxx"
    

Add feeds

A feed defines how data should be loaded into a domain table, including specifying which columns are required and which columns should be associated with a semantic tag that indicates that the column contains customer profile (PII) or transaction data.

Note

A feed must be added for each file that is pulled from Azure Blob Storage, including all files that contain customer records and interaction records, along with any other files that will be used to support downstream workflows.

To add a feed

  1. From the Sources tab, click Add Feed. This opens the Add Feed dialog box.

  2. Under Data Source, select Create new source, and then enter “Azure Blob Storage”.

  3. Enter the name of the feed in Feed Name. For example: “CustomerRecords”.

    Tip

    The name of the domain table will be “<data-source-name>:<feed-name>”. For example: “Azure Blob Storage:CustomerRecords”.

  4. Under Sample File, select Select existing file, and then choose from the list of files. For example: “filename_YYYY-MM-DD.csv”.

    Tip

    The list of files that is available from this drop-down menu is sorted from newest to oldest.

  5. Select Load sample file on feed activation.

  6. Click Continue. This opens the Feed Editor page.

  7. Select the primary key.

  8. Apply semantic tags to customer records and interaction records, as appropriate.

  9. Under Last updated field, specify which field best describes when records in the table were last updated.

    Tip

    Choose Generate an “updated” field to have Amperity generate this field. This is the recommended option unless there is a field already in the table that reliably provides this data.

  10. For feeds with customer records (PII data), select Make available to Stitch.

  11. Click Activate. Wait for the feed to finish loading data to the domain table, and then review the sample data for that domain table from the Data Explorer.

Add load operations

After the feeds are activated and domain tables are available, add the load operations to the courier used for Azure Blob Storage.

Example load operations

Load operations must specify each file that will be pulled to Amperity from Azure Blob Storage.

For example:

{
  "CUSTOMER-RECORDS-FEED-ID": [
    {
      "type": "truncate"
    },
    {
      "type": "load",
      "file": "customer-records"
    }
  ],
  "TRANSACTION-RECORDS-FEED-ID": [
    {
      "type": "load",
      "file": "transaction-records"
    }
  ]
}

To add load operations

  1. From the Sources tab, open the menu for the courier that was configured for Azure Blob Storage, and then select Edit. The Edit Courier dialog box opens.

  2. Edit the load operations for each of the feeds that were configured for Azure Blob Storage so they have the correct feed ID.

  3. Click Save.

Run courier manually

Run the courier again. This time, because the load operations are present and the feeds are configured, the courier will pull data from Azure Blob Storage.

To run the courier manually

  1. From the Sources tab, open the menu for the courier with updated load operations that is configured for Azure Blob Storage, and then select Run. The Run Courier dialog box opens.

  2. Select the load option, either for a specific time period or all available data. Actual data will be loaded to a domain table because the feed is configured.

  3. Click Run.

    This time the notification will return a message similar to:

    Completed in 5 minutes 12 seconds
    

Add to courier group

A courier group is a list of one (or more) couriers that are run as a group, either ad hoc or as part of an automated schedule. A courier group can be configured to act as a constraint on downstream workflows.

To add the courier to a courier group

  1. From the Sources tab, click Add Courier Group. This opens the Create Courier Group dialog box.

  2. Enter the name of the courier group. For example: “Azure Blob Storage”.

  3. Add a cron string to the Schedule field to define a schedule for the courier group.

    A schedule defines the frequency at which a courier group runs. All couriers in the same courier group run as a unit and all tasks must complete before a downstream process can be started. The schedule is defined using cron.

    Cron syntax specifies the fixed time, date, or interval at which cron will run. Each line represents a job, and is defined like this:

    ┌───────────── minute (0 - 59)
    │ ┌───────────── hour (0 - 23)
    │ │ ┌───────────── day of the month (1 - 31)
    │ │ │ ┌───────────── month (1 - 12)
    │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday)
    │ │ │ │ │
    * * * * * command to execute
    

    For example, 30 8 * * * represents “run at 8:30 AM every day” and 30 8 * * 0 represents “run at 8:30 AM every Sunday”. Amperity validates your cron syntax and shows you the results. You may also use crontab guru to validate cron syntax.

  4. Set Status to Enabled.

  5. Specify a time zone.

    A courier group schedule is associated with a time zone. The time zone determines the point at which a courier group’s scheduled start time begins. A time zone should be aligned with the time zone of the system from which the data is being pulled.

    Note

    The time zone that is chosen for a courier group schedule should consider every downstream business process that requires the data and also the time zone(s) in which the consumers of that data will operate.

  6. Add at least one courier to the courier group. Select the name of the courier from the Courier drop-down. Click + Add Courier to add more couriers.

  7. Click Add a courier group constraint, and then select a courier group from the drop-down list.

    A wait time is a constraint placed on a courier group that defines an extended time window for data to be made available at the source location.

    A courier group typically runs on an automated schedule that expects customer data to be available at the source location within a defined time window. However, in some cases, the customer data may be delayed and isn’t made available within that time window.

  8. For each courier group constraint, apply any offsets.

    An offset is a constraint placed on a courier group that defines a range of time that is older than the scheduled time, within which a courier group will accept customer data as valid for the current job. Offset times are in UTC.

    A courier group offset is typically set to be 24 hours. For example, it’s possible for customer data to be generated with a correct file name and datestamp appended to it, but for that datestamp to represent the previous day because of the customer’s own workflow. An offset ensures that the data at the source location is recognized by the courier as the correct data source.

    Warning

    An offset affects couriers in a courier group whether or not they run on a schedule. Manually run courier groups will not take their schedule into consideration when determining the date range; only the provided input day(s) to load data from are used as inputs.

  9. Click Save.

Workflow actions

A workflow will occasionally show an error that describes what prevented it from completing successfully. These errors first appear as alerts in the notifications pane. The alert describes the error, and then links to the Workflows tab.

Open the Workflows page to review a list of workflow actions, choose an action to resolve the workflow error, and then follow the steps that are shown.

Step one.

You may receive a notification error for a configured Azure Blob Storage data source. This appears as an alert in the notifications pane on the Sources tab.


If you receive a notification error, review the details, and then click the View Workflow link to open this notification error in the Workflows page.

Step two.

On the Workflows page, review the individual steps to determine which step(s) have errors that require your attention, and then click Show Resolutions to review the list of workflow actions that were generated for this error.

Step three.

A list of individual workflow actions are shown. Review the list to identify which action you should take.


Some workflow actions are common across workflows and will often be available, such as retrying a specific task within a workflow or restarting a workflow. These types of actions can often resolve an error.

In certain cases, actions are specific and are shown when certain conditions exist in your tenant. These types of actions typically must be resolved and may require steps that must be done upstream or downstream from your Amperity workflow.

Amperity provides a series of workflow actions that can help resolve specific issues that may arise with Azure Blob Storage. These actions are described in the sections that follow.

Step four.

Select a workflow action from the list of actions, and then review the steps for resolving that error.


After you have completed the steps in the workflow action, click Continue to rerun the workflow.

Bad archive

Sometimes the contents of an archive are corrupted and cannot be loaded to Amperity.

To resolve this error, do the following.

  1. Upload a new file to Amperity.

  2. After the file is uploaded, return to the workflow action, and then click Resolve to retry this workflow.

Invalid credentials

The credentials that are defined in Amperity are invalid.

To resolve this error, verify that the credentials required by this workflow are valid.

  1. Open the Credentials page.

  2. Review the details for the credentials used with this workflow. Update the credentials for Azure Blob Storage if required.

  3. Return to the workflow action, and then click Resolve to retry this workflow.

Invalid permissions

Microsoft Azure may be configured to use a shared access signature (SAS) to grant restricted access rights to Microsoft Azure storage resources.

What is a shared access signature (SAS)?

A shared access signature (SAS) grants limited access to storage resources in Microsoft Azure. A SAS may be constrained to access only specific storage resources, have specific permissions to those resources, and be configured to expire after a set amount of time. Every SAS is signed with a key.

The SAS is appended to the URI for a storage resource. The combined URI and SAS become a token that contains a set of query parameters that indicate how a storage resource may be accessed. Use the SAS token to configure Amperity credentials for storage resources in Microsoft Azure.
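
For illustration only, a container URI with an appended SAS token typically looks similar to the following; the query parameters vary with the SAS type, permissions, start and expiry times, and signing method that were configured:

https://<storage_account>.blob.core.windows.net/<container>?sv=2022-11-02&sr=c&sp=racwdl&se=2026-01-01T00:00:00Z&sig=<signature>

In this hypothetical example, sp=racwdl corresponds to the READ, ADD, CREATE, WRITE, DELETE, and LIST permissions that Amperity requires.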

A SAS token may have invalid permissions in any of the following situations:

  1. The SAS token may be configured incorrectly within Amperity. For example: an extra character within or at the end of the SAS token. Verify the string, and then make any updates that are required for the credentials within Amperity.

  2. The permissions for the SAS token were configured incorrectly. Amperity requires a SAS token to be assigned the following permissions: READ, ADD, CREATE, WRITE, DELETE, and LIST.

  3. The SAS token may have expired or the signing key associated with the SAS token may have been rotated.

    These situations will require generating a new SAS token, and then updating the credentials in Amperity.

Note

If the shared access signature was provisioned by Amperity, please use the “Report a problem” feature in Amperity to contact your Amperity Support team and ask for help resolving this workflow issue.

The “Report a problem” option is available from the menu in the top navigation.

To resolve this error, determine the cause for the invalid permissions error.

  1. Do one (or more) of the following:

    Verify that the SAS token was configured correctly within Amperity.

    Verify the permissions that have been assigned to the SAS token. This can be done from the Microsoft Azure Portal or by using Azure Storage Explorer. The policy for the SAS token must be assigned the following permissions: READ, ADD, CREATE, WRITE, DELETE, and LIST.

    Verify that the SAS token and/or the signing key associated with the SAS token is valid (and has not expired). If either have expired, generate a new SAS token (using a new signing key, if necessary).

  2. After you have determined the cause of the invalid permissions error, make the appropriate updates within Microsoft Azure and/or the credentials for this destination within Amperity.

  3. Return to the workflow action, and then click Resolve to retry this workflow.

Missing file

An archive that does not contain a file that Amperity expects to find within it will return a workflow error; Amperity will be unable to complete the workflow until the issue is resolved.

To resolve this error, do the following.

  1. Add the required file to the archive.

    or

    Update the configuration for the courier that is attempting to load the missing file so that it no longer requires that file.

  2. After the file is added to the archive or removed from the courier configuration, click Resolve to retry this workflow.

PGP error

A workflow action is created when a file cannot be decrypted using the provided PGP key.

To resolve this error, verify the PGP key.

  1. Open the Sources page.

  2. Review the details for the PGP key.

    If the PGP key is correct, verify that the file that is associated with this workflow error was encrypted using the correct PGP key. If necessary, upload a new file.

  3. Return to the workflow action, and then click Resolve to retry this workflow.

Unable to decompress archive

An archive that cannot be decompressed will return a workflow error; Amperity will be unable to complete the workflow until the issue is resolved.

This issue may be shown when the name of the archive doesn’t match the name of the configured archive or when Amperity is attempting to decompress a file (and not an archive). In some cases, the contents of the archive file may be the reason why Amperity is unable to decompress the archive.

To resolve this error, do the following.

  1. Verify the configuration for the archive, and then verify the contents of the archive.

    Update the configuration, if necessary. For example, when Amperity is attempting to decompress a file, update the configuration to specify a file and not an archive.

    In some cases, re-loading the archive to the location from which Amperity is attempting to pull the archive is necessary.

  2. Return to the workflow action, and then click Resolve to retry this workflow.