Connect Databricks to Amazon Redshift
Some organizations store their data in Amazon Redshift and use Databricks to make that data available to data scientists, engineers, developers, and data analysts. Those teams use a combination of Databricks SQL, R, Scala, and Python to build models and tools that support external BI applications and domain-specific tools, which help end users consume the data through the interface they are most comfortable with.
You may send an Apache Parquet, Apache Avro, CSV, or JSON file from Amperity to Amazon S3, load that data to Amazon Redshift, and then connect to that data from Databricks.
What is Amazon Redshift?
Amazon Redshift is a data warehouse located within Amazon Web Services that can handle massive sets of column-oriented data.
Add workflow
Amperity can be configured to send data to Amazon S3, after which Amazon Redshift is configured to load that data from Amazon S3. Databricks can be configured to connect to Amazon Redshift and use the Amperity output as a data source.
Important
You may use the Amazon S3 bucket that comes with your Amperity tenant for the intermediate step (if your Amperity tenant is running on Amazon Web Services). Alternatively, you may configure Amperity to send data to an Amazon S3 bucket that your organization manages directly.
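The load from Amazon S3 into Amazon Redshift is typically done with a Redshift COPY statement. The following is a minimal sketch, not the method Amperity prescribes: the helper name `build_copy_statement`, the table name, the bucket path, and the IAM role ARN are all hypothetical placeholders, and the CSV clause assumes the file has a single header row.

```python
# Hypothetical helper: builds a Redshift COPY statement for a file that
# Amperity delivered to Amazon S3. Every name and path here is a placeholder.

# One COPY format clause per file type that Amperity can send.
# The CSV clause assumes one header row (an assumption, adjust as needed).
FORMAT_CLAUSES = {
    "csv": "FORMAT AS CSV IGNOREHEADER 1",
    "json": "FORMAT AS JSON 'auto'",
    "avro": "FORMAT AS AVRO 'auto'",
    "parquet": "FORMAT AS PARQUET",
}


def build_copy_statement(table, s3_path, iam_role_arn, file_format="csv"):
    """Return a COPY statement that loads s3_path into table."""
    key = file_format.lower()
    if key not in FORMAT_CLAUSES:
        raise ValueError(f"Unsupported file format: {file_format}")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"{FORMAT_CLAUSES[key]};"
    )


if __name__ == "__main__":
    print(
        build_copy_statement(
            table="public.amperity_customers",  # placeholder table
            s3_path="s3://example-bucket/amperity/customers.csv",  # placeholder
            iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
        )
    )
```

The statement is only built here, not executed. Run it from whichever SQL client your organization uses for Amazon Redshift, with a role that is allowed to read the bucket.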
To connect Databricks to Amazon Redshift
Configuring Amperity to send data that Databricks can access from Amazon Redshift requires completing a series of short workflows, some of which must be done outside of Amperity.
Use a query to return the data you want to send to Databricks.
Send an Apache Parquet, Apache Avro, CSV, or JSON file to Amazon S3 from Amperity.
Load CSV data from Amazon S3 to Amazon Redshift.
Connect Databricks to Amazon Redshift, and then access the data sent from Amperity.
Validate the workflow within Amperity and the data within Databricks.
Configure Amperity to automate this workflow for a regular (daily) refresh of data.
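The Databricks connection step can be sketched in PySpark. This is an illustration under stated assumptions rather than a definitive configuration: it assumes the Redshift data source available in recent Databricks runtimes (format name `redshift`, with the `url`, `dbtable`, `tempdir`, and `aws_iam_role` options), and the function names, JDBC URL, table, bucket, and role ARN are hypothetical. Confirm the format name and options against the documentation for your Databricks runtime.

```python
# Sketch only: option names assume the Databricks Redshift data source.
# All values passed in are placeholders for your own environment.


def redshift_read_options(jdbc_url, table, temp_dir, iam_role_arn):
    """Collect the reader options for one Redshift table."""
    return {
        "url": jdbc_url,               # JDBC URL of the Redshift cluster
        "dbtable": table,              # table loaded from the Amperity file
        "tempdir": temp_dir,           # S3 staging location used by the connector
        "aws_iam_role": iam_role_arn,  # role Redshift assumes to reach tempdir
    }


def read_amperity_table(spark, options):
    """Return a DataFrame for the Redshift table described by options."""
    return spark.read.format("redshift").options(**options).load()


# In a Databricks notebook, where `spark` is already defined:
#
#   opts = redshift_read_options(
#       "jdbc:redshift://example-host:5439/dev?user=USER&password=PASSWORD",
#       "public.amperity_customers",
#       "s3a://example-bucket/redshift-temp/",
#       "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
#   )
#   df = read_amperity_table(spark, opts)
#   df.count()  # compare with the row count of the Amperity query
```

Comparing the row count in Databricks with the row count returned by the Amperity query is one simple way to approach the validation step.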