Datalake Integration
Corridor can connect to different kinds of data lakes, where data may be stored as Parquet, ORC, Avro, Hive tables, or in any other format. To support this, the user can define a custom file handler that contains the logic to read data from and write data to the data lake.
Example
This example focuses on creating a data source handler that reads from Hive tables. The user needs to inherit from the base class DataSourceHandler and define two functions: read_from_location and write_to_location.
from corridor_api.config.handlers import DataSourceHandler


class HiveTable(DataSourceHandler):
    """
    Consider a case where every data table is a table in the Hive metastore.
    The table `location` identifier is the table name.
    """

    name = 'hive'
    write_format = 'parquet'

    def read_from_location(self, location, nrow=None):
        # Use findspark to locate the Spark installation if it is available;
        # fall back to a plain pyspark import otherwise.
        try:
            import findspark
            findspark.init()
            import pyspark
        except ImportError:
            import pyspark

        spark = pyspark.sql.SparkSession.builder.getOrCreate()
        data = spark.table(location)
        if nrow is not None:
            data = data.limit(nrow)
        return data

    def write_to_location(self, data, location, mode='error'):
        # `mode` follows Spark's save-mode semantics: 'error' (the default),
        # 'overwrite', 'append', or 'ignore'.
        return data.write.format(self.write_format).mode(mode).saveAsTable(location)
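
As a quick sanity check, the handler can also be exercised directly. The sketch below assumes the source table already exists in the Hive metastore; the table names are hypothetical, and once the handler is registered Corridor is expected to call these methods on your behalf:

# Minimal sketch exercising the handler directly.
# The table names below are hypothetical placeholders.
handler = HiveTable()

# Read a preview of the first 100 rows from a table.
preview = handler.read_from_location('sales.transactions', nrow=100)
preview.show()

# Write the preview back out under a different table name.
handler.write_to_location(preview, 'sales.transactions_preview', mode='overwrite')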
Configuration
Once the handler class is created, it can be set up in api_config.py as shown below (assuming the handler is defined in a file called hive_table_handler.py alongside api_config.py).
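
A minimal sketch of such a configuration, assuming Corridor discovers custom handlers through a DATA_SOURCE_HANDLERS setting; that setting name is an illustrative assumption, so consult the Corridor API configuration reference for the exact key:

# api_config.py
# Hypothetical sketch: the DATA_SOURCE_HANDLERS setting name is an
# assumption about how Corridor registers custom data source handlers.
from hive_table_handler import HiveTable

DATA_SOURCE_HANDLERS = [HiveTable]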