Use All Spark Workers’ CPUs to Read from a REST API
Recently, we explored how to create our own data sources in Spark by using syntax like
spark.read.format("myrestdatasource")
to read from a REST API. You can find the first article about creating your own data sources here.
Now, we’ll examine how to use all worker CPUs to read simultaneously from a REST API.
Key Idea
Spark will read in parallel if we define multiple InputPartition objects in DataSourceReader.partitions(). Each partition is handed to a separate task, which calls read(partition), so several requests can hit the REST API at the same time.
In addition to the read method, we need to implement a partitions method in our custom reader class, for instance:
def partitions(self):
    """
    Return a list of InputPartition objects, one per parallel read task.
    """
    # Hard-code a couple of partitions for demonstration.
    # In production, you'd create them dynamically.
    return [
        InputPartition({'postIdRange': (1, 20)}),
        InputPartition({'postIdRange': (21, 40)}),
    ]
We return a list of partitions. For each partition, we can specify parameters, such as a from/to range or an offset/limit.
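To make the pairing concrete, here is a minimal sketch of a reader that consumes these partitions. The class name RestDataSourceReader, the JSONPlaceholder URL, and the output columns are illustrative assumptions rather than the exact code from the first article; the point is that each task receives one partition and fetches only its own ID range.

from pyspark.sql.datasource import DataSourceReader, InputPartition
import requests

class RestDataSourceReader(DataSourceReader):  # hypothetical reader class
    def partitions(self):
        # Two fixed partitions; Spark schedules one task per partition.
        return [
            InputPartition({'postIdRange': (1, 20)}),
            InputPartition({'postIdRange': (21, 40)}),
        ]

    def read(self, partition):
        # Each task sees exactly one partition and fetches only its own range.
        start, end = partition.value['postIdRange']
        for post_id in range(start, end + 1):
            # Assumed endpoint: the JSONPlaceholder demo API.
            post = requests.get(
                f"https://jsonplaceholder.typicode.com/posts/{post_id}"
            ).json()
            yield (post["userId"], post["id"], post["title"], post["body"])

With two partitions and at least two executor cores available, the two ID ranges are fetched concurrently instead of one after the other.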
We can take it one step further by detecting how many cores are available and generating a matching number of partitions…
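A minimal sketch of that idea, assuming the ID range and the desired parallelism are passed in as data source options and stored on the reader as self.options when the DataSource constructs it. The option names numPartitions, firstId, and lastId are hypothetical, and os.cpu_count() is only a local fallback (it sees the driver's CPUs, not the cluster's).

import os
from pyspark.sql.datasource import InputPartition

def partitions(self):
    # Assumed options, e.g. spark.read.format(...).option("numPartitions", "8")
    num_partitions = int(self.options.get("numPartitions", os.cpu_count() or 2))
    first_id = int(self.options.get("firstId", 1))
    last_id = int(self.options.get("lastId", 100))

    total = last_id - first_id + 1
    chunk = -(-total // num_partitions)  # ceiling division

    parts = []
    for i in range(num_partitions):
        start = first_id + i * chunk
        if start > last_id:
            break
        end = min(start + chunk - 1, last_id)
        parts.append(InputPartition({'postIdRange': (start, end)}))
    return parts

Setting numPartitions to the total number of executor cores (for instance, the value of spark.sparkContext.defaultParallelism) gives each core its own slice of the ID range to fetch.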