Send data to Databricks

Databricks provides a unified platform for data and AI that supports large-scale processing for batch and streaming workloads, standardized machine learning lifecycles, and accelerated data science workflows for large datasets.

Important

Amperity does not send data directly to Databricks. Databricks must connect to a location that contains the files against which it will operate. To send data to Databricks indirectly, configure a destination in Amperity to do one of the following:

  1. Send data directly to a Snowflake data warehouse.

  2. Send data to Business Intelligence Connect, and then connect Databricks to Business Intelligence Connect using SSO.

  3. Send a CSV file to an Amazon S3 bucket, after which it may be accessed directly or be picked up by Amazon Redshift.

  4. Send a CSV file to an Azure container, after which it may be accessed directly or be picked up by Azure Synapse Analytics.

  5. Send a CSV file to Google Cloud Storage, after which it may be accessed directly or be picked up by Google BigQuery.

The destination workflow in Amperity may be configured to send data on a regular basis to any of these locations to ensure that the data available to Databricks is up to date.

Databricks use cases

Databricks can interact with Amperity data through a Spark SQL interface or use it as a starting point for downstream workflows, such as:

  1. Inspecting customer files prior to loading them into Amperity.

  2. Investigating SQL that may be required for saved queries.

  3. Investigating SQL that may be required to extend or reshape domain tables.

  4. Validating certain workflows that use the output of the customer 360 database as a starting point.

  5. Providing a starting point for machine learning workflows.

Connect to Amazon AWS

You can connect Databricks to any of the following services in Amazon Web Services (AWS):

Connect to Amazon S3

Amazon Simple Storage Service (Amazon S3) stores customer data files of any size in many file formats.

Amperity can be configured to send data to an Amazon S3 bucket. Databricks can be configured to connect directly to this bucket, and then use Amperity output as a data source.

You may use the Amazon S3 bucket that comes with your Amperity tenant for this step (if your Amperity tenant is running on Amazon AWS). Or you may configure Amperity to send data to an Amazon S3 bucket that your organization manages directly.

To connect Databricks to Amazon S3

Configuring Amperity to send data that is accessible to Databricks from Amazon S3 requires completing a series of short workflows, some of which must be done outside of Amperity.

  1. Send CSV data to Amazon S3 from Amperity.

  2. Connect Databricks to Amazon S3, and then access the data sent from Amperity. (A minimal read example follows this list.)

  3. Validate the workflow within Amperity and the data within Databricks.

  4. Configure Amperity to automate this workflow for a regular (daily) refresh of data.
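The following is a minimal sketch of the Databricks side of this workflow: reading the Amperity CSV output from Amazon S3 in a Python notebook. The bucket name, path, and view name are placeholders; substitute the values from the Amazon S3 destination configured in Amperity, and ensure the cluster has IAM- or key-based access to the bucket.

# Read Amperity CSV output from an Amazon S3 bucket (placeholder names).
df = (
    spark.read
    .option("header", "true")        # Amperity CSV output includes a header row
    .option("inferSchema", "true")
    .csv("s3a://example-amperity-bucket/databricks/customer_profiles/")
)

# Register a temporary view so the data can be queried with Spark SQL.
df.createOrReplaceTempView("customer_profiles")
display(df.limit(10))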

Connect to Amazon Redshift

Amazon Redshift is a data warehouse within Amazon Web Services that can handle massive sets of column-oriented data.

Amperity can be configured to send data to an Amazon S3 bucket, after which Amazon Redshift is configured to load that data. Databricks can be configured to connect to Amazon Redshift and use the Amperity output as a data source.

You may use the Amazon S3 bucket that comes with your Amperity tenant for the intermediate step (if your Amperity tenant is running on Amazon AWS). Or you may configure Amperity to send data to an Amazon S3 bucket that your organization manages directly.

To connect Databricks to Amazon Redshift

Configuring Amperity to send data that is accessible to Databricks from Amazon Redshift requires completing a series of short workflows, some of which must be done outside of Amperity.

  1. Send CSV data to Amazon S3 from Amperity.

  2. Load CSV data from Amazon S3 to Amazon Redshift.

  3. Connect Databricks to Amazon Redshift, and then access the data sent from Amperity. (A minimal read example follows this list.)

  4. Validate the workflow within Amperity and the data within Databricks.

  5. Configure Amperity to automate this workflow for a regular (daily) refresh of data.
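The following is a minimal sketch of the final connection step, assuming the Databricks Redshift connector is available on the cluster. The JDBC URL, table name, and temporary S3 directory are placeholders; on older runtimes the format name may be com.databricks.spark.redshift rather than redshift.

# Read the table that Amazon Redshift loaded from the Amperity CSV (placeholder values).
df = (
    spark.read
    .format("redshift")
    .option("url", "jdbc:redshift://example-cluster.us-west-2.redshift.amazonaws.com:5439/dev?user=example&password=example")
    .option("dbtable", "public.customer_profiles")
    .option("tempdir", "s3a://example-amperity-bucket/redshift-temp/")   # staging area used by the connector
    .option("forward_spark_s3_credentials", "true")                      # reuse the cluster's S3 credentials
    .load()
)
display(df.limit(10))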

Connect to Azure

You can connect Databricks to any of the following services in Azure:

Connect to Azure Blob Storage

Azure Blob Storage is an object storage solution for the cloud that is optimized for storing massive amounts of unstructured data.

Amperity can be configured to send data to an Azure Blob Storage container. Databricks can be configured to connect directly to this container, and then use Amperity output as a data source.

You may use the Azure Blob Storage container that comes with your Amperity tenant for this step (if your Amperity tenant is running on Azure). Or you may configure Amperity to send data to an Azure container that your organization manages directly.

To connect Databricks to Azure Blob Storage

Configuring Amperity to send data that is accessible to Databricks from Azure Blob Storage requires completing a series of short workflows, some of which must be done outside of Amperity.

  1. Send CSV data to an Azure Blob Storage container from Amperity.

  2. Connect Databricks to Azure Blob Storage, and then access the data sent from Amperity. (A minimal read example follows this list.)

  3. Validate the workflow within Amperity and the data within Databricks.

  4. Configure Amperity to automate this workflow for a regular (daily) refresh of data.
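The following is a minimal sketch of the Databricks side of this workflow in a Python notebook. The storage account, container, and secret scope names are placeholders; substitute the values from the Azure Blob Storage destination configured in Amperity.

# Authenticate to the storage account; keep the account key in a secret scope.
spark.conf.set(
    "fs.azure.account.key.examplestorage.blob.core.windows.net",
    dbutils.secrets.get(scope="example-scope", key="storage-account-key"),
)

# Read the Amperity CSV output from the container (placeholder paths).
df = (
    spark.read
    .option("header", "true")
    .csv("wasbs://example-container@examplestorage.blob.core.windows.net/databricks/customer_profiles/")
)
display(df.limit(10))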

Connect to Azure Synapse Analytics

Azure Synapse Analytics is a limitless analytics service that brings together enterprise data warehousing and analytics. Azure Synapse Analytics has four components: SQL analytics, Apache Spark, hybrid data integration, and a unified user experience.

Amperity can be configured to send data to an Azure container, after which Azure Synapse Analytics is configured to load that data. Databricks can be configured to connect to Azure Synapse Analytics and use the Amperity output as a data source.

You may use the Azure Blob Storage container that comes with your Amperity tenant for the intermediate step (if your Amperity tenant is running on Azure). Or you may configure Amperity to send data to an Azure container (Azure Blob Storage or Azure Data Lake Storage) that your organization manages directly.

To connect Databricks to Azure Synapse Analytics

Configuring Amperity to send data that is accessible to Databricks from Azure Synapse Analytics requires completing a series of short workflows, some of which must be done outside of Amperity.

  1. Send CSV data to an Azure Blob Storage container from Amperity.

  2. Load CSV data from the Azure container to Azure Synapse Analytics.

  3. Connect Databricks to Azure Synapse Analytics, and then access the data sent from Amperity. (A minimal read example follows this list.)

  4. Validate the workflow within Amperity and the data within Databricks.

  5. Configure Amperity to automate this workflow for a regular (daily) refresh of data.
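The following is a minimal sketch of the final connection step, assuming the Databricks Azure Synapse (SQL DW) connector is available on the cluster. The JDBC URL, table name, and temporary storage path are placeholders for your environment.

# Read the table that Azure Synapse Analytics loaded from the Amperity CSV (placeholder values).
df = (
    spark.read
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://example-workspace.sql.azuresynapse.net:1433;database=exampledb;user=example;password=example")
    .option("tempDir", "abfss://temp@examplestorage.dfs.core.windows.net/synapse-temp/")   # staging area used by the connector
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.customer_profiles")
    .load()
)
display(df.limit(10))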

Connect to Business Intelligence Connect

Business Intelligence Connect is an Amperity-managed cloud data warehouse that provides an easy-to-access location from which you can use any BI tool to access all of your Amperity data.

Business Intelligence Connect supports using SSO to provide access to individual user accounts. Use the same credentials to access the data warehouse as those used to access Amperity.

Important

BI tools that use the JDBC driver must set the Authenticator setting to externalbrowser to enable browser-based SSO. The location in which this setting is configured varies by BI tool. For example, SQL Workbench appends this setting to the URL for the Business Intelligence Connect data warehouse:

URL/?authenticator=externalbrowser

To connect Databricks to Business Intelligence Connect

  1. Download and install the Snowflake JDBC driver.

  2. Configure Databricks to use the JDBC driver to connect to Business Intelligence Connect.

  3. Enter the following information:

    Driver: The Snowflake JDBC driver.

    URL: The URL for the Business Intelligence Connect data warehouse. This must start with jdbc:snowflake://, be followed by the URL for the data warehouse, and then appended with ?authenticator=externalbrowser.

    For example:

    jdbc:snowflake://ab12345.snowflakecomputing.com/?authenticator=externalbrowser

    Username: The string token for Business Intelligence Connect.

    Password: The personal access token for Business Intelligence Connect.

  4. After the JDBC driver is configured to use SSO, you may begin authoring and running queries from Databricks against data in the Business Intelligence Connect data warehouse. (A minimal example follows.)
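For example, the following is a minimal sketch of a Spark JDBC read against Business Intelligence Connect, assuming the Snowflake JDBC driver is installed on the cluster. The account URL, database, and table names are placeholders; the user and password values correspond to the string token and personal access token described above.

# Query the Business Intelligence Connect data warehouse over JDBC (placeholder values).
df = (
    spark.read
    .format("jdbc")
    .option("driver", "net.snowflake.client.jdbc.SnowflakeDriver")
    .option("url", "jdbc:snowflake://ab12345.snowflakecomputing.com/?authenticator=externalbrowser")
    .option("user", "EXAMPLE_STRING_TOKEN")
    .option("password", "<personal-access-token>")
    .option("query", "SELECT * FROM EXAMPLE_DB.PUBLIC.CUSTOMER_PROFILES LIMIT 10")
    .load()
)
display(df)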

Connect to Google BigQuery

Google BigQuery is a fully managed, serverless data warehouse that provides scalable, cost-effective, fast analysis over petabytes of data using ANSI SQL.

Amperity can be configured to send data to Google Cloud Storage, after which Google BigQuery is configured to load that data. Databricks can be configured to connect to Google BigQuery and use the Amperity output as a data source.

You must configure Amperity to send data to a Cloud Storage bucket that your organization manages directly.

To connect Databricks to Google BigQuery

Configuring Amperity to send data that is accessible to Databricks from Google BigQuery requires completing a series of short workflows, some of which must be done outside of Amperity.

  1. Send CSV data to Cloud Storage from Amperity.

  2. Load CSV data from Cloud Storage to Google BigQuery.

  3. Connect Databricks to Google BigQuery, and then access the data sent from Amperity. (A minimal read example follows this list.)

  4. Validate the workflow within Amperity and the data within Databricks.

  5. Configure Amperity to automate this workflow for a regular (daily) refresh of data.
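The following is a minimal sketch of the final connection step, assuming the spark-bigquery connector that ships with Databricks on Google Cloud. The project, dataset, and table names are placeholders.

# Read the table that Google BigQuery loaded from the Amperity CSV (placeholder names).
df = (
    spark.read
    .format("bigquery")
    .option("table", "example-project.amperity.customer_profiles")
    .load()
)
display(df.limit(10))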

Connect to Snowflake

Snowflake is an analytic data warehouse that is fast, easy to use, and flexible. Snowflake uses a SQL database engine that is designed for the cloud. Snowflake can provide tables as a data source to Amperity.

Amperity can be configured to share data (tables and/or entire databases) directly with Snowflake. Databricks can be configured to connect to a Snowflake data warehouse and use that data as a data source.

Amperity offers an additional service, the Amperity Data Warehouse, which is synchronized to your Amperity tenant. (This currently runs as a Snowflake data warehouse that is accessible only to your Amperity tenant.) Databricks can be configured to connect directly to the Amperity Data Warehouse. Or you may send data directly to Snowflake by configuring the Snowflake destination to send data to your Snowflake tenant.

To connect Databricks to a Snowflake data warehouse

Configuring Amperity to send data that is accessible to Databricks from a Snowflake data warehouse requires completing a series of short workflows, some of which must be done outside of Amperity.

  1. Configure Snowflake objects for the correct database, tables, roles, and users. (Refer to the Amazon S3 or Azure tutorial, as appropriate for your tenant.)

  2. Send data to Snowflake from Amperity. (Refer to the Amazon S3 or Azure tutorial, as appropriate for your tenant.)

  3. Connect Databricks to Snowflake, and then access the data sent from Amperity. (A minimal read example follows this list.)

    Note

    The URL for the Snowflake data warehouse, the Snowflake username, the password, and the name of the Snowflake data warehouse are sent to the Databricks user within a SnapPass link. Request this information from your Amperity representative prior to attempting to connect Databricks to Snowflake.

  4. Validate the workflow within Amperity and the data within Databricks.

  5. Configure Amperity to automate this workflow for a regular (daily) refresh of data.
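The following is a minimal sketch of the final connection step, assuming the Databricks Snowflake connector is available on the cluster. The connection values are placeholders for the details delivered in the SnapPass link; store the credentials in a Databricks secret scope rather than in the notebook.

# Connection details for the Snowflake data warehouse (placeholder values).
options = {
    "sfUrl": "ab12345.snowflakecomputing.com",
    "sfUser": dbutils.secrets.get(scope="example-scope", key="snowflake-user"),
    "sfPassword": dbutils.secrets.get(scope="example-scope", key="snowflake-password"),
    "sfDatabase": "EXAMPLE_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "EXAMPLE_WH",
}

# Read a table shared from Amperity.
df = (
    spark.read
    .format("snowflake")
    .options(**options)
    .option("dbtable", "CUSTOMER_PROFILES")
    .load()
)
display(df.limit(10))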

Note

Snowflake can be configured to run in AWS or Azure. When using the Amperity Data Warehouse, you will use the same cloud platform as your Amperity tenant. When using your own instance of Snowflake, you should use the same Amazon S3 bucket or Azure Blob Storage container that is included with your tenant when configuring Snowflake for data sharing, but then connect Databricks directly to your own instance of Snowflake.