
I previously configured an AWS Glue job. Now I have added an option in my CDK to enable/disable/pause bookmarking while creating the job, using the --job-bookmark-option parameter.

I have verified in the AWS console that Job Bookmarking shows "Enable" when the above option is set to enable while creating the job.

Now I am trying to read a CSV file from S3 in the Glue job, but even with bookmarking enabled my job reads the whole CSV file on every run.

Glue Job Code

from awsglue.context import GlueContext
from pyspark.context import SparkContext

class MyGlueTesJob:
    @classmethod
    def execute(cls, sparkContext: SparkContext, parameters: dict):

        glue_context = GlueContext(sparkContext)

        csvFilePath = parameters.get("csvFilePath")
        jobBookMarkKey = parameters.get("jobBookMarkKey")

        print("S3 Path:", csvFilePath)
        print("Bookmark Key:", jobBookMarkKey)

        if not csvFilePath:
            raise ValueError("S3 path not found in parameters.")

        # Read data from S3 using Glue DynamicFrame
        dynamic_frame = glue_context.create_dynamic_frame.from_options(
            connection_type="s3",
            format="csv",
            connection_options={
                "paths": [csvFilePath],
                "jobBookmarkKeys": [jobBookMarkKey],
                "jobBookmarkKeysSortOrder": "asc"  # Specify the sort order
            },
            transformation_ctx="glueTest"
        )

        dataframe = dynamic_frame.toDF()

        # Print DataFrame content
        print("Content of the DataFrame:")
        dataframe.show()

        print("Parameters:", parameters)

        # Return job result
        return "JobResult"
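Bookmark state is only persisted when job.commit() runs after a read whose transformation_ctx matches; if the job never commits, or the job itself is deleted and recreated, every run starts from scratch. A toy simulation of that init/read/commit cycle in plain Python (a hypothetical FakeBookmark class, not the Glue API):

```python
class FakeBookmark:
    """Toy model of a Glue job bookmark: state survives only if commit() is called."""

    def __init__(self):
        self._saved = set()    # persisted state (survives across "runs")
        self._pending = set()  # state accumulated during the current run

    def read(self, files):
        """Return only files not covered by the last committed bookmark."""
        new = [f for f in files if f not in self._saved]
        self._pending = self._saved | set(new)
        return new

    def commit(self):
        self._saved = self._pending


bm = FakeBookmark()
first = bm.read(["a.csv", "b.csv"])   # both files are new -> both returned
bm.commit()                            # bookmark advances
second = bm.read(["a.csv", "b.csv"])  # nothing new -> []
# Without commit() (or if the job is destroyed and recreated between runs),
# the next run would see all files again.
```

This is the behavior the question expects between run one and run two; if the second run still returns everything, the commit is not taking effect for that transformation_ctx, or the bookmark store itself is being reset.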


ETL Script

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions


class CustomGlueJob:
    def main(self):
        args = getResolvedOptions(sys.argv, [
            self.JOB_NAME, self.JOB_CLASS, self.ARGS,
            self.JOB_ID, self.OPTIONAL_ARGS, self.MODULE
        ])
        print(args)
        glueContext = GlueContext(self.spark_context)
        job = Job(glueContext)
        job.init(args[self.JOB_NAME], args)
        self._jvm.GlueJob.init(self._glue_context, self._jvm.PythonUtils.toScalaMap(args))
        result = GlueJob.execute(self.spark_context, args[self.JOB_CLASS], args[self.MODULE_NAME], args[self.JOB_ARGS])
        self._jvm.GlueJob.commit(self._jvm.PythonUtils.toScalaMap(result.toMap()), self._jvm.PythonUtils.toScalaMap({}))
        job.commit()

glue_job = CustomGlueJob()
glue_job.main()

I have a few wrapper classes and an API to run the job. I may have missed something while editing the code to remove sensitive data for this question, but everything runs fine with no runtime errors; bookmarking is just not working as expected.

What I am expecting:

  • First run: prints all CSV data
  • Second run: prints nothing, since no new data was added and bookmarking is enabled
  • Now add 2 new rows to the CSV file
  • Third run: prints only the 2 new rows, since bookmarking is enabled

What I am seeing:

  • First run: prints all CSV data
  • Second run: prints all CSV data
  • Now add 2 new rows to the CSV file
  • Third run: prints all CSV data
  • Just to clarify: with Glue bookmarks and files in S3, it is tracking whether a file has been processed or not, not the actual records within that file. So you would never get your expected third-run output of only the 2 new rows; it would print all the rows again, because modifying the file causes it to be reprocessed. The documentation makes note of that: docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
    – Tim Mylott
    Commented May 6 at 18:30
  • What about the second run? It should print nothing, as I didn't change anything, but it is printing all the data again. And since, as you mentioned, it reprocesses a modified file, how can I verify the bookmark functionality in that case: read folder -> add new files -> read folder again (only the new file should be considered)? Commented May 6 at 19:34
  • I would have expected your second run to return nothing and your third run to return everything again, so to me it appears the bookmark isn't working for you. What value are you setting for the --job-bookmark-option parameter? If it is job-bookmark-enable then something in the code isn't right; I'm just not sure what at the moment.
    – Tim Mylott
    Commented May 6 at 20:15
  • In your CDK code use "--job-bookmark-option": 'job-bookmark-enable'. I think you used 'enable'. Commented May 6 at 20:35
  • I have verified that "--job-bookmark-option": 'job-bookmark-enable' is set properly Commented May 7 at 4:31
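As the first comment notes, for S3 sources the bookmark tracks files (by path and timestamp), not rows, so an appended-to file is reprocessed in full. A minimal local sketch of that semantics (hypothetical files_to_process helper, not the awsglue API):

```python
def files_to_process(current, processed):
    """Return files whose (path, mtime) has not been seen in a committed bookmark.

    current / processed are dicts mapping file path -> last-modified timestamp.
    """
    return {f: mtime for f, mtime in current.items() if processed.get(f) != mtime}


# First run: nothing processed yet, so the whole file is read.
run1 = files_to_process({"s3://bucket/data.csv": 100}, {})
# Second run: file unchanged, so nothing should be read.
run2 = files_to_process({"s3://bucket/data.csv": 100}, {"s3://bucket/data.csv": 100})
# Third run: rows were appended, the timestamp changed, so the WHOLE file
# is read again -- not just the 2 new rows.
run3 = files_to_process({"s3://bucket/data.csv": 200}, {"s3://bucket/data.csv": 100})
```

Under this model the expected outputs are: run 1 everything, run 2 nothing, run 3 everything again. To test bookmarks cleanly, add new files to the folder between runs instead of modifying an existing file.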

1 Answer


I had a similar issue and found that, even though you are using job versioning in the CDK, it keeps recreating the job and thus deleting the job bookmarks.
I tried different approaches but ended up implementing custom job bookmarks in code, as opposed to letting AWS Glue handle it. On each CDK cycle it wipes out the job bookmarks, even with job versioning on.
Apparently, it does a silent destroy and then a deploy.
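For reference, the option has to reach the job as a default argument with the exact value job-bookmark-enable, as the comments above point out. A minimal sketch of the relevant default_arguments mapping (the surrounding CDK CfnJob construct is omitted; treat this as a config fragment, not a full stack):

```python
# Hypothetical sketch: the default_arguments passed to the Glue job in CDK.
# The value must be the full string "job-bookmark-enable", not just "enable";
# "job-bookmark-disable" and "job-bookmark-pause" are the other valid values.
default_arguments = {
    "--job-bookmark-option": "job-bookmark-enable",
}
```

Even with this set correctly, if a deploy replaces the job resource rather than updating it, the bookmark state stored for the old job is lost, which matches the behavior described in this answer.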

