Use All Spark Workers’ CPUs to Read from a REST API
Recently, we explored how to create our own data sources in Spark by using syntax like
spark.read.format("myrestdatasource")
to read from a REST API. You can find the first article about creating your own data sources here.
Now, we’ll examine how to use all worker CPUs to read simultaneously from a REST API.
Key Idea
Spark will read in parallel if we define multiple InputPartition objects in DataSourceReader.partitions(). Each partition is handed to a separate task, which calls read(partition), so several requests can hit the REST API at the same time.
In addition to the read method, we need to implement a partitions method in our custom reader class, for instance:
def partitions(self):
    """
    Return a list of InputPartition objects, one per parallel read task.
    """
    # Hard-code a couple of partitions for demonstration.
    # In production, you'd create them dynamically.
    return [
        InputPartition({'postIdRange': (1, 20)}),
        InputPartition({'postIdRange': (21, 40)}),
    ]
We return a list of partitions. For each partition, we can specify parameters, such as a from/to range or an offset/limit.
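To make the pairing concrete, here is a minimal sketch of a reader that consumes these partitions. The class name RestDataSourceReader, the JSONPlaceholder URL, and the output columns are illustrative assumptions rather than the exact code from the first article; the point is that each task receives one partition and fetches only its own ID range.

from pyspark.sql.datasource import DataSourceReader, InputPartition
import requests

class RestDataSourceReader(DataSourceReader):  # hypothetical reader class
    def partitions(self):
        # Two fixed partitions; Spark schedules one task per partition.
        return [
            InputPartition({'postIdRange': (1, 20)}),
            InputPartition({'postIdRange': (21, 40)}),
        ]

    def read(self, partition):
        # Each task sees exactly one partition and fetches only its own range.
        start, end = partition.value['postIdRange']
        for post_id in range(start, end + 1):
            # Assumed endpoint: the JSONPlaceholder demo API.
            post = requests.get(
                f"https://jsonplaceholder.typicode.com/posts/{post_id}"
            ).json()
            yield (post["userId"], post["id"], post["title"], post["body"])

With two partitions and at least two executor cores available, the two ID ranges are fetched concurrently instead of one after the other.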
We can take it one step further by detecting how many cores are available and generating a matching number of partitions…
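A minimal sketch of that idea, assuming the ID range and the desired parallelism are passed in as data source options and stored on the reader as self.options when the DataSource constructs it. The option names numPartitions, firstId, and lastId are hypothetical, and os.cpu_count() is only a local fallback (it sees the driver's CPUs, not the cluster's).

import os
from pyspark.sql.datasource import InputPartition

def partitions(self):
    # Assumed options, e.g. spark.read.format(...).option("numPartitions", "8")
    num_partitions = int(self.options.get("numPartitions", os.cpu_count() or 2))
    first_id = int(self.options.get("firstId", 1))
    last_id = int(self.options.get("lastId", 100))

    total = last_id - first_id + 1
    chunk = -(-total // num_partitions)  # ceiling division

    parts = []
    for i in range(num_partitions):
        start = first_id + i * chunk
        if start > last_id:
            break
        end = min(start + chunk - 1, last_id)
        parts.append(InputPartition({'postIdRange': (start, end)}))
    return parts

Setting numPartitions to the total number of executor cores (for instance, the value of spark.sparkContext.defaultParallelism) gives each core its own slice of the ID range to fetch.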