Connect Databricks to Amazon S3

Some organizations store their data in Amazon S3, and then use Databricks to give the data scientists, engineers, developers, and data analysts within their organization access to that data. Using a combination of Databricks SQL, R, Scala, and/or Python, those teams build models and tools that support external BI applications and domain-specific tools, which help end users consume the data through the interface they are most comfortable with.

You may send an Apache Parquet, Apache Avro, CSV, or JSON file from Amperity to Amazon S3, and then connect to that data from Databricks.

What is Amazon S3?

Amazon Simple Storage Service (Amazon S3) stores customer data files of any size in any file format.

Add workflow

Amperity can be configured to send data to Amazon S3, after which Databricks can be configured to connect to Amazon S3 and use the Amperity output as a data source.

Important

You must configure Amperity to send data to an Amazon S3 bucket that your organization manages directly.

To connect Databricks to Amazon S3

Configuring Amperity to send data that Databricks can access from Amazon S3 requires completing a series of short workflows, some of which must be done outside of Amperity.

Step 1.

Use a query to return the data you want to send to Databricks.

Step 2.

Send an Apache Parquet, Apache Avro, CSV, or JSON file to Amazon S3 from Amperity.

Step 3.

Connect Databricks to the CSV data in Amazon S3, and then access the data sent from Amperity.
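
For example, once the file lands in Amazon S3, you might read it into a DataFrame from a Databricks notebook. The following is a minimal PySpark sketch, not a definitive implementation: the bucket and path are placeholders, and it assumes the cluster can already reach the bucket (for example, through an instance profile or a Unity Catalog external location). The spark and display objects are predefined in Databricks notebooks.

    # Minimal sketch: read the CSV output that Amperity sent to Amazon S3.
    # The bucket and path below are hypothetical; substitute your own.
    csv_path = "s3://your-amperity-output-bucket/path/to/output/"

    df = (
        spark.read
        .option("header", "true")       # adjust if your output does not include a header row
        .option("inferSchema", "true")  # or supply an explicit schema for production use
        .csv(csv_path)
    )

    df.printSchema()
    display(df.limit(10))  # display() is available in Databricks notebooks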

Step 4.

Validate the workflow within Amperity and the data within Databricks.
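
As one way to validate within Databricks, you might compare the row count and spot-check a key column against what the query returned in Amperity. This sketch assumes df is the DataFrame loaded in the previous step; amperity_id is a placeholder column name, so substitute a column from your own query.

    from pyspark.sql import functions as F

    # Compare this count against the number of rows returned by the query in Amperity.
    row_count = df.count()
    print(f"Rows received from Amperity: {row_count}")

    # Spot-check a key column for duplicates and nulls; "amperity_id" is hypothetical.
    df.select(
        F.countDistinct("amperity_id").alias("distinct_ids"),
        F.sum(F.col("amperity_id").isNull().cast("int")).alias("null_ids"),
    ).show()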

Step 5.

Configure Amperity to automate this workflow for a regular (daily) refresh of data.
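
On the Databricks side, one optional way to pick up the daily refresh is Auto Loader, which incrementally ingests only the files it has not already processed. The following sketch assumes Amperity writes each refresh to the same S3 prefix; the paths and table name are placeholders, and a scheduled batch re-read of the prefix works just as well.

    # Optional sketch: incrementally ingest each daily refresh with Auto Loader.
    # All paths and the table name are hypothetical placeholders.
    source_path = "s3://your-amperity-output-bucket/path/to/output/"
    schema_path = "s3://your-amperity-output-bucket/_schemas/amperity_output/"
    checkpoint_path = "s3://your-amperity-output-bucket/_checkpoints/amperity_output/"

    (
        spark.readStream
        .format("cloudFiles")                                # Databricks Auto Loader
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", schema_path)
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)                          # run as a scheduled batch-style job
        .toTable("amperity_output")                          # hypothetical Delta table name
    )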