
I previously configured an AWS Glue job. Now I have added an option in my CDK to enable/disable/pause bookmarking while creating the job, using the --job-bookmark-option parameter.

I have verified in the AWS console that Job Bookmarking shows "Enable" when the above option is set to enable while creating the job.

Now I am trying to read a CSV file from S3 in the Glue job, but even with bookmarking enabled my job reads the whole CSV file on every run.

Glue Job Code

from awsglue.context import GlueContext
from pyspark.context import SparkContext

class MyGlueTesJob:
    @classmethod
    def execute(cls, sparkContext: SparkContext, parameters: dict):

        glue_context = GlueContext(sparkContext)

        csvFilePath = parameters.get("csvFilePath")
        jobBookMarkKey = parameters.get("jobBookMarkKey")

        print("S3 Path:", csvFilePath)
        print("Bookmark Key:", jobBookMarkKey)

        if not csvFilePath:
            raise ValueError("S3 path not found in parameters.")

        # Read data from S3 using Glue DynamicFrame
        dynamic_frame = glue_context.create_dynamic_frame.from_options(
            connection_type="s3",
            format="csv",
            connection_options={
                "paths": [csvFilePath],
                "jobBookmarkKeys": [jobBookMarkKey],
                "jobBookmarkKeysSortOrder": "asc"  # Specify the sort order
            },
            transformation_ctx="glueTest"
        )

        dataframe = dynamic_frame.toDF()

        # Print DataFrame content
        print("Content of the DataFrame:")
        dataframe.show()

        print("Parameters:", parameters)

        # Return job result
        return "JobResult"
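Bookmark state is only persisted when job.commit() runs after a read whose transformation_ctx matches; if the job never commits, or the job itself is deleted and recreated, every run starts from scratch. A toy simulation of that init/read/commit cycle in plain Python (a hypothetical FakeBookmark class, not the Glue API):

```python
class FakeBookmark:
    """Toy model of a Glue job bookmark: state survives only if commit() is called."""

    def __init__(self):
        self._saved = set()    # persisted state (survives across "runs")
        self._pending = set()  # state accumulated during the current run

    def read(self, files):
        """Return only files not covered by the last committed bookmark."""
        new = [f for f in files if f not in self._saved]
        self._pending = self._saved | set(new)
        return new

    def commit(self):
        self._saved = self._pending


bm = FakeBookmark()
first = bm.read(["a.csv", "b.csv"])   # both files are new -> both returned
bm.commit()                            # bookmark advances
second = bm.read(["a.csv", "b.csv"])  # nothing new -> []
# Without commit() (or if the job is destroyed and recreated between runs),
# the next run would see all files again.
```

This is the behavior the question expects between run one and run two; if the second run still returns everything, the commit is not taking effect for that transformation_ctx, or the bookmark store itself is being reset.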


ETL Script

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions


class CustomGlueJob:
    def main(self):
        args = getResolvedOptions(sys.argv, [
            self.JOB_NAME, self.JOB_CLASS, self.ARGS,
            self.JOB_ID, self.OPTIONAL_ARGS, self.MODULE
        ])
        print(args)
        glueContext = GlueContext(self.spark_context)
        job = Job(glueContext)
        job.init(args[self.JOB_NAME], args)
        self._jvm.GlueJob.init(self._glue_context, self._jvm.PythonUtils.toScalaMap(args))
        result = GlueJob.execute(self.spark_context, args[self.JOB_CLASS], args[self.MODULE_NAME], args[self.JOB_ARGS])
        self._jvm.GlueJob.commit(self._jvm.PythonUtils.toScalaMap(result.toMap()), self._jvm.PythonUtils.toScalaMap({}))
        job.commit()

glue_job = CustomGlueJob()
glue_job.main()

I have a few wrapper classes and an API to run the job. I may have missed something while editing the code to remove sensitive data for this question, but everything runs fine with no runtime errors; bookmarking is just not working as expected.

What I am expecting:

  • First run: prints all CSV data
  • Second run: prints nothing, since no new data was added and bookmarking is enabled
  • Now add 2 new rows to the CSV file
  • Third run: prints only the 2 new rows, since bookmarking is enabled

What I am seeing:

  • First run: prints all CSV data
  • Second run: prints all CSV data
  • Now add 2 new rows to the CSV file
  • Third run: prints all CSV data
  • Just to clarify: with Glue bookmarks and files in S3, it is tracking whether a file has been processed or not, not the actual records within that file. So you would never get your expected third-run output of only the 2 new rows; it would print all the rows again, because modifying the file causes it to be reprocessed. The documentation makes note of that: docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
    – Tim Mylott
    Commented May 6 at 18:30
  • What about the second run? It should print nothing, as I didn't change anything, but it is printing all the data again. And since, as you mentioned, it reprocesses a modified file, how can I verify the bookmark functionality in that case: read folder -> add new files -> read folder again (only the new file should be considered)? Commented May 6 at 19:34
  • I would have expected your second run to return nothing and your third run to return everything again, so to me it appears the bookmark isn't working for you. What value are you setting for the --job-bookmark-option parameter? If it is job-bookmark-enable then something in the code isn't right; I'm just not sure what at the moment.
    – Tim Mylott
    Commented May 6 at 20:15
  • In your CDK code use "--job-bookmark-option": 'job-bookmark-enable'. I think you used 'enable'. Commented May 6 at 20:35
  • I have verified that "--job-bookmark-option": 'job-bookmark-enable' is set properly Commented May 7 at 4:31
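As the first comment notes, for S3 sources the bookmark tracks files (by path and timestamp), not rows, so an appended-to file is reprocessed in full. A minimal local sketch of that semantics (hypothetical files_to_process helper, not the awsglue API):

```python
def files_to_process(current, processed):
    """Return files whose (path, mtime) has not been seen in a committed bookmark.

    current / processed are dicts mapping file path -> last-modified timestamp.
    """
    return {f: mtime for f, mtime in current.items() if processed.get(f) != mtime}


# First run: nothing processed yet, so the whole file is read.
run1 = files_to_process({"s3://bucket/data.csv": 100}, {})
# Second run: file unchanged, so nothing should be read.
run2 = files_to_process({"s3://bucket/data.csv": 100}, {"s3://bucket/data.csv": 100})
# Third run: rows were appended, the timestamp changed, so the WHOLE file
# is read again -- not just the 2 new rows.
run3 = files_to_process({"s3://bucket/data.csv": 200}, {"s3://bucket/data.csv": 100})
```

Under this model the expected outputs are: run 1 everything, run 2 nothing, run 3 everything again. To test bookmarks cleanly, add new files to the folder between runs instead of modifying an existing file.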

1 Answer


I had a similar issue and found that, even though you are using job versioning in the CDK, it keeps recreating the job and thus deleting the job bookmarks.
I tried different approaches but ended up implementing custom job bookmarks in code, as opposed to letting AWS Glue handle it. On each CDK cycle it wipes out the job bookmarks, even with job versioning on.
Apparently, it does a silent destroy and then a deploy.
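For reference, the option has to reach the job as a default argument with the exact value job-bookmark-enable, as the comments above point out. A minimal sketch of the relevant default_arguments mapping (the surrounding CDK CfnJob construct is omitted; treat this as a config fragment, not a full stack):

```python
# Hypothetical sketch: the default_arguments passed to the Glue job in CDK.
# The value must be the full string "job-bookmark-enable", not just "enable";
# "job-bookmark-disable" and "job-bookmark-pause" are the other valid values.
default_arguments = {
    "--job-bookmark-option": "job-bookmark-enable",
}
```

Even with this set correctly, if a deploy replaces the job resource rather than updating it, the bookmark state stored for the old job is lost, which matches the behavior described in this answer.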

