I previously configured an AWS Glue job. I have now added an option in my CDK code to enable, disable, or pause job bookmarks when the job is created, using the --job-bookmark-option parameter.
I have verified in the AWS console that the job bookmark setting shows "Enable" when the option above is set to enable at job creation.
I am now trying to read a CSV file from S3 in the Glue job, but even with bookmarking enabled, the job reads the whole CSV file on every run.
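For reference, here is a minimal sketch of the value this setting is expected to carry. Glue reads it from the job's default arguments under the key --job-bookmark-option, and the accepted values are job-bookmark-enable, job-bookmark-disable, and job-bookmark-pause. The helper name bookmark_default_arguments and the short mode names are hypothetical, used only for illustration; they are not part of CDK or Glue.

```python
# Hypothetical helper: maps a short mode name to the Glue default argument.
# Only the "--job-bookmark-option" key and the three values are Glue's own.
_BOOKMARK_VALUES = {
    "enable": "job-bookmark-enable",
    "disable": "job-bookmark-disable",
    "pause": "job-bookmark-pause",
}


def bookmark_default_arguments(mode: str) -> dict:
    """Return the default-arguments entry for the given short mode name."""
    try:
        value = _BOOKMARK_VALUES[mode]
    except KeyError:
        raise ValueError(f"unknown bookmark mode: {mode!r}") from None
    return {"--job-bookmark-option": value}
```

Such a dictionary would typically be merged into the job's default arguments in the CDK definition. How that is wired up depends on the construct in use.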
Glue Job Code
from awsglue.context import GlueContext
from pyspark.context import SparkContext

class MyGlueTesJob:
    @classmethod
    def execute(cls, sparkContext: SparkContext, parameters: dict):
        glue_context = GlueContext(sparkContext)
        csvFilePath = parameters.get("csvFilePath")
        jobBookMarkKey = parameters.get("jobBookMarkKey")
        print("S3 Path:", csvFilePath)
        print("Bookmark Key:", jobBookMarkKey)
        if not csvFilePath:
            raise ValueError("S3 path not found in parameters.")
        # Read data from S3 using Glue DynamicFrame
        dynamic_frame = glue_context.create_dynamic_frame.from_options(
            connection_type="s3",
            format="csv",
            connection_options={
                "paths": [csvFilePath],
                "jobBookmarkKeys": [jobBookMarkKey],
                "jobBookmarkKeysSortOrder": "asc"  # Specify the sort order
            },
            transformation_ctx="glueTest"
        )
        dataframe = dynamic_frame.toDF()
        # Print DataFrame content
        print("Content of the DataFrame:")
        dataframe.show()
        print("Parameters:", parameters)
        # Return job result
        return "JobResult"
ETL Script
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

class CustomGlueJob:
    def main(self):
        # Resolve the job arguments passed on the command line
        args = getResolvedOptions(sys.argv, [
            self.JOB_NAME, self.JOB_CLASS, self.ARGS,
            self.JOB_ID, self.OPTIONAL_ARGS, self.MODULE
        ])
        print(args)
        glueContext = GlueContext(self.spark_context)
        job = Job(glueContext)
        # Bookmark state is initialized here and persisted by job.commit()
        job.init(args[self.JOB_NAME], args)
        self._jvm.GlueJob.init(self._glue_context, (self._jvm.PythonUtils.toScalaMap(args)))
        result = GlueJob.execute(self.spark_context, args[self.JOB_CLASS], args[self.MODULE_NAME], args[self.JOB_ARGS])
        self._jvm.GlueJob.commit((self._jvm.PythonUtils.toScalaMap(result.toMap())), (self._jvm.PythonUtils.toScalaMap({})))
        job.commit()

glue_job = CustomGlueJob()
glue_job.main()
I have a few wrapper classes and an API to run the job. I may have missed something while editing the code to remove sensitive data for this question, but everything runs fine with no runtime errors. However, bookmarking is not working as expected.
What I am expecting:
- First run : Prints All CSV data
- Second run : Should not print anything as no new data is added and bookmarking is enabled
- Now add 2 new rows to the CSV file
- Third run : Prints only 2 new Rows from CSV as bookmarking is enabled
What I am seeing:
- First run : Prints All CSV data
- Second run : Prints All CSV data
- Now add 2 new rows to the CSV file
- Third run : Prints All CSV data
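As a point of comparison for the third run: for S3 sources, Glue job bookmarks track whole objects by their last-modified time rather than rows inside a file, and jobBookmarkKeys applies to JDBC sources. Appending rows to an existing CSV rewrites the object, so the entire file would count as new. The sketch below is a simplified plain-Python simulation of that file-level behavior, not Glue's actual implementation; S3Object and files_to_process are hypothetical names. It does not account for the second run re-reading an unchanged file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Object:
    """Hypothetical stand-in for an S3 object and its contents."""
    key: str
    last_modified: int  # epoch seconds
    rows: tuple


def files_to_process(objects, bookmark_ts):
    """Return objects modified after the bookmark, plus the updated bookmark."""
    fresh = [o for o in objects if o.last_modified > bookmark_ts]
    new_ts = max([bookmark_ts] + [o.last_modified for o in fresh])
    return fresh, new_ts


# Run 1: no bookmark yet, so the file is processed.
data = S3Object("data.csv", 100, ("r1", "r2", "r3"))
run1, ts = files_to_process([data], 0)

# Run 2: the file is unchanged, so nothing is processed.
run2, ts = files_to_process([data], ts)

# Two rows are appended: the object is rewritten with a newer timestamp.
data = S3Object("data.csv", 200, data.rows + ("r4", "r5"))
run3, ts = files_to_process([data], ts)

# Run 3 sees the whole file again, not only the two new rows.
rows_run3 = [row for obj in run3 for row in obj.rows]
```

Under this model, getting only the new rows on the third run would require writing them to a new object under the same prefix instead of appending to the existing file.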
What is the --job-bookmark-option that you are setting? If it is job-bookmark-enable, then something in the code isn't right, just not sure what at the moment.
"--job-bookmark-option": 'job-bookmark-enable'. I think you used 'enable'.