
Use All Spark Workers’ CPUs to Read from a REST API

Hubert Dudek
3 min read · Feb 9, 2025


Recently, we explored how to create our own data sources in Spark by using syntax like

spark.read.format("myrestdatasource")

to read from a REST API. You can find the first article about creating your own data sources here.

Now, we’ll examine how to use all worker CPUs to read simultaneously from a REST API.

Key Idea

Spark reads in parallel when we return multiple InputPartition objects from DataSourceReader.partitions(). Each partition is handed to a separate task that calls read(partition), so multiple requests hit the REST API simultaneously.

In addition to the read method, we need to implement a partitions method in our custom reader class, for instance:

def partitions(self):
    """
    Return a list of InputPartition objects.
    """
    # Hard-code a couple of partitions for demonstration.
    # In production, you'd create them dynamically.
    return [
        InputPartition({'postIdRange': (1, 20)}),
        InputPartition({'postIdRange': (21, 40)}),
    ]

We return a list of partitions. For each partition, we can specify parameters, such as a from/to range or an offset/limit.
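
Each read(partition) call can then use those parameters to fetch only its own slice of the data. Here is a minimal sketch of such a read method, assuming the partition value carries the postIdRange tuple from above; the endpoint URL and the id/title/body fields are placeholders you would swap for your own REST API and schema:

def read(self, partition):
    import requests  # imported inside the method so it is available on every worker

    start, end = partition.value['postIdRange']
    for post_id in range(start, end + 1):
        # Hypothetical endpoint; replace with your real REST API.
        response = requests.get(f"https://jsonplaceholder.typicode.com/posts/{post_id}")
        response.raise_for_status()
        post = response.json()
        # Yield tuples matching the data source schema,
        # e.g. "id INT, title STRING, body STRING".
        yield (post["id"], post["title"], post["body"])

Because each task processes its own partition, twenty requests can run on one core while another twenty run on a different core, instead of all forty going through a single task.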

We can take it one step further by detecting how many cores we can utilize and then generating the number of partitions to match the number of CPUs…
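
A minimal sketch of that idea, assuming a classic (non-Spark-Connect) cluster where sparkContext.defaultParallelism approximates the total number of worker cores, and where total_ids is a hypothetical count of records to fetch:

from pyspark.sql import SparkSession
from pyspark.sql.datasource import InputPartition

def partitions(self):
    # defaultParallelism is typically the total number of cores across the workers.
    num_slots = SparkSession.getActiveSession().sparkContext.defaultParallelism

    total_ids = 40  # hypothetical number of post IDs to fetch
    chunk = max(1, total_ids // num_slots)

    # Split the ID range 1..total_ids into roughly one chunk per core.
    return [
        InputPartition({'postIdRange': (start, min(start + chunk - 1, total_ids))})
        for start in range(1, total_ids + 1, chunk)
    ]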
