The official Databricks/Azure documentation for accessing Azure Blob Storage suggests the following ways to authenticate:
All of those methods require hard-coded credentials somewhere in the code or the environment, which is something I would like to avoid, especially if the code is stored in a git repository.
Fortunately, there is a better way: using managed identities. This approach is not well documented, but it works just as well:
1. Locate the managed identity named dbmanagedidentity (it lives in the managed resource group of the Azure Databricks workspace) and get the client id of this managed identity.
2. Assign the Storage Blob Data Contributor role to this managed identity, scoped to the storage account you want to access.
3. Use the following code to access the storage account:
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
from io import StringIO
import pandas as pd

account_url = "https://databricksfabian.blob.core.windows.net/"
container_name = "test"
blob_name = "example.csv"

# DefaultAzureCredential picks up the managed identity available on the Databricks cluster
credential = DefaultAzureCredential()
blob_service_client = BlobServiceClient(account_url=account_url, credential=credential)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

# Download the blob as a UTF-8 string and load it into a pandas DataFrame
data = blob_client.download_blob(encoding='utf8').readall()
df = pd.read_csv(StringIO(data))
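If the cluster has more than one identity attached, or DefaultAzureCredential does not resolve the right one, you can pin the credential to the client id from step 1. Here is a minimal sketch, assuming a placeholder client id and a hypothetical output blob name; the account and container names are the same as above. Writing works the same way once the role assignment is in place:

from azure.identity import ManagedIdentityCredential
from azure.storage.blob import BlobServiceClient

# Client id of dbmanagedidentity from step 1 (placeholder value)
managed_identity_client_id = "<managed-identity-client-id>"

# ManagedIdentityCredential targets exactly one identity instead of probing a credential chain
credential = ManagedIdentityCredential(client_id=managed_identity_client_id)
blob_service_client = BlobServiceClient(
    account_url="https://databricksfabian.blob.core.windows.net/",
    credential=credential,
)

# Upload a small CSV back to the same container (hypothetical blob name)
blob_client = blob_service_client.get_blob_client(container="test", blob="example_out.csv")
blob_client.upload_blob("col1,col2\n1,2\n", overwrite=True)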
Alternatively, to read the file directly with Spark, use the following code:
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider")
spark.conf.set("fs.azure.account.oauth2.msi.tenant", "<tenant-id>")
spark.conf.set("fs.azure.account.oauth2.client.id", "<managed-identity-client-id>")
df = spark.read.format("csv").load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<file-name>")