
Selenium chrome driver on databricks driver

Nov 9, 2022

On the Databricks community, I repeatedly see problems with installing Selenium on the Databricks driver. Running Selenium on Databricks may seem surprising, but sometimes we need to grab datasets that sit behind fancy authentication, and Selenium is the most accessible tool for that. Of course, always check the simpler alternatives first. For example, if we only need to download an HTML file, we can use SparkContext.addFile() or just the requests library. If we need to parse HTML without simulating user actions or downloading complicated pages, we can use BeautifulSoup. Please remember that Selenium runs on the driver only (workers are not utilized), so for the Selenium part a single-node cluster is the preferred setting.
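To illustrate the "check the simpler alternatives first" advice, here is a minimal sketch of a plain download plus HTML parsing with no browser involved. It uses only Python's standard library (urllib and html.parser) as a stand-in for requests and BeautifulSoup, and the names LinkCollector, extract_links and fetch_html are made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list:
    """Return all link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download a page with a plain HTTP GET (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    sample = '<html><body><a href="/data.csv">data</a><a href="/about">about</a></body></html>'
    print(extract_links(sample))
```

If a page can be handled like this, Selenium is probably not needed. It becomes useful when the content only appears after logging in or after simulated user actions.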

Installation

The easiest solution is to use apt-get to install the Ubuntu packages, but the version in the Ubuntu repository is often outdated. Recently that solution stopped working for me, so I decided to take a different approach and get the driver and binaries from chromium-browser-snapshots (https://commondatastorage.googleapis.com/chromium-browser-snapshots/index.html). The installation script downloads the newest version of the browser binaries and the driver. Everything is saved to the /tmp/chrome directory. We must also set the Chrome home directory to /tmp/chrome/chrome-user-data-dir. Sometimes Chromium complains about missing libraries, which is why we also install libgbm-dev. All of these steps are wrapped in a bash file.
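The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original script: the bucket layout (a Linux_x64/LAST_CHANGE file holding the newest revision, and the archive names chrome-linux.zip and chromedriver_linux64.zip) is assumed, the function names are hypothetical, and the script assumes it runs as root with curl and unzip available.

```python
from pathlib import Path

# Assumed layout of the public snapshot bucket (not confirmed by the article).
SNAPSHOT_BASE = (
    "https://commondatastorage.googleapis.com/chromium-browser-snapshots/Linux_x64"
)


def build_install_script(base_dir: str = "/tmp/chrome") -> str:
    """Return the text of a bash script that installs Chromium and its driver."""
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        "",
        f'mkdir -p "{base_dir}"',
        "",
        "# Look up the newest snapshot revision number.",
        f'VERSION=$(curl -fsS "{SNAPSHOT_BASE}/LAST_CHANGE")',
        "",
        "# Download and unpack the browser binaries and the matching driver.",
        "for ARCHIVE in chrome-linux.zip chromedriver_linux64.zip; do",
        f'  curl -fsS -o "{base_dir}/$ARCHIVE" "{SNAPSHOT_BASE}/$VERSION/$ARCHIVE"',
        f'  unzip -o -q "{base_dir}/$ARCHIVE" -d "{base_dir}"',
        "done",
        "",
        "# Directory used as the Chrome home (user data) directory.",
        f'mkdir -p "{base_dir}/chrome-user-data-dir"',
        "",
        "# Chromium sometimes complains about missing libraries.",
        "apt-get update",
        "apt-get install -y libgbm-dev",
    ]
    return "\n".join(lines) + "\n"


def write_install_script(path: str, base_dir: str = "/tmp/chrome") -> Path:
    """Write the script to a local path and mark it executable."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(build_install_script(base_dir))
    target.chmod(0o755)
    return target
```

Generating the script text from Python keeps /tmp/chrome in one place, so the same path can later be reused when pointing Selenium at the binaries. On Databricks, the returned text could be saved with dbutils.fs.put instead of write_install_script. Where the file should live depends on how the cluster is configured.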

Written by Hubert Dudek
