Simplifying the Creation of Spark
Pipelines with Yaetos
Meetup @ Spark Barcelona
Arthur Prévot - 2022-06-28
Table of Contents
● Spark
● What is Yaetos?
● Jobs, Pipelines and Manifest
● Setup and Test
● From Prototype to Production
● Job Parameters
● Workflow
● Other Features
● Demo
Spark is Amazing
● Pushing for data lake architecture
○ ↘ 💰 storage cost, ↗ effort to organize data
● Strong support for various programming languages and SQL
○ Allowing software dev best practices
● Ability to run locally
● Pushing towards more open source
More work needed to operationalize it
● It needs external tooling
○ Compute resources
○ Storage
○ Scheduling
● It needs extra code to deal with data-eng problems
○ Dataset dependencies
○ Idempotence
○ Unit-testing
○ Dataset cataloging…
● It is often seen as overkill for small jobs
○ -> pandas good enough
What is YAETOS?
An open source tool for data engineers, scientists, and analysts to easily create data pipelines in Python and SQL and put them in production in AWS.
It integrates all the tools necessary to create a data stack, relying only on open source engines (Spark or Pandas) and hosted services (AWS). It can be set up in minutes.
Used in prod at Adevinta and The Hotels Network, 100+ datasets updated daily for 2+ years.
Like a Swiss Army Knife
Inputs and outputs (same connector types on both sides):
● Datasets on disk (S3, local)
● Unstructured data on disk
● MySQL DB tables
● Postgres DB tables
● Redshift DB tables
● API services (Salesforce, Stripe, or others)
Engine:
● Spark
● Pandas
Resources:
● AWS EMR
Scheduling:
● AWS Data Pipeline
With Room for Expansion
Good candidates for addition:
Engine:
● Spark
● Pandas
Resources:
● EMR
● Kubernetes (candidate)
● AWS Lambda (candidate)
Scheduling:
● AWS Data Pipeline
● Airflow (candidate)
Definition of a Job
Three flavors, from simplest to most flexible (more flexibility means less simplicity):
● ex1_sql_job.sql (simplest; a sketch follows below): python jobs/launcher.py --job_name=examples/ex1_sql_job.sql
● ex1_pyspark_job.py: python jobs/ex1_pyspark_job.py
● ex1_unframed_job.py (most flexible): python jobs/ex1_unframed_job.py
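For the SQL flavor, the job file is essentially a SELECT statement over the job's named inputs, with the launcher handling loading and saving. A minimal, hypothetical sketch (table and column names are illustrative and would map to the inputs declared in the manifest):

-- Hypothetical ex1_sql_job.sql: inputs are exposed as tables named after
-- the input keys declared for this job in the manifest (illustrative).
SELECT
    session_id,
    COUNT(*) AS n_events
FROM some_events
GROUP BY session_id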
Job Details
ex1_pyspark_job.py is made of (a minimal sketch follows below):
● Framework part: loading dataframes + more
● Transform code
● Params + link to more params
● Framework part: command line handling + execution
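A minimal sketch of what such a job can look like, based on the pattern used in the Yaetos examples; the exact import path, class names, and argument keys are assumptions and may differ from the current API:

from yaetos.etl_utils import ETL_Base, Commandliner  # import path assumed

class Job(ETL_Base):
    def transform(self, some_events):
        # Transform code: inputs arrive as dataframes, named after the
        # input keys declared for this job (here "some_events").
        return (some_events
                .groupBy('session_id')
                .count()
                .withColumnRenamed('count', 'n_events'))

if __name__ == "__main__":
    # Params + link to more params (the manifest), then command line + execution.
    Job_Args = {'job_param_file': 'conf/jobs_metadata.yml'}
    Commandliner(Job, **Job_Args)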
The Jobs Manifest
● List of jobs with job metadata (I/O, dependencies, scheduling info)
● Can contain hundreds of jobs
● Can be split across several files if needed (e.g. per company department)
● Human readable, computer parseable
File: jobs_metadata.yml (a sketch follows below)
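A minimal sketch of what two manifest entries can look like, one depending on the other; the field names (py_job, sql_file, inputs, output, dependencies, frequency) and paths are illustrative and may not match Yaetos's exact schema:

jobs:
  examples/ex1_pyspark_job.py:
    py_job: jobs/examples/ex1_pyspark_job.py
    inputs:
      some_events: {path: 's3://mybucket/events/', type: csv}   # bucket is a placeholder
    output: {path: 's3://mybucket/session_kpis/', type: parquet}
    frequency: 24h
  examples/ex1_sql_job.sql:
    sql_file: jobs/examples/ex1_sql_job.sql
    dependencies: [examples/ex1_pyspark_job.py]
    output: {path: 's3://mybucket/report/', type: csv}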
From the Manifest to Job Files
Each entry in jobs_metadata.yml points to the job file it runs:
● ex_sql_job.sql
● ex_pyspark_job.py
● ex_pandas_job.py
Definition of a Pipeline
Pipeline = job running with its dependencies, as defined in the job manifest
How to get it set up?
Running this in a terminal
$ pip install yaetos
$ yaetos setup --project=my_yaetos_jobs
To run sample jobs
host $ cd my_yaetos_jobs/
host $ yaetos launch_docker_bash
# Running 1 job
guest $ python jobs/example/ex0_extraction_job.py
# Running 1 pipeline
guest $ python jobs/example/ex1_framework_job.py --dependencies
Where does it live?
… in a folder, ready for GitHub -> shareable
The files to create pipelines:
● Manifest
● Job code (python or SQL)
● Job unit-tests (optional)
Jobs can run locally or be pushed to the cloud
How to Inspect a Job in Jupyter
From Prototype to Production
Same command, different argument. No updates to job code
host $ yaetos launch_docker_bash
guest $ python path/to/some_job.py # i.e. local
guest $ python path/to/some_job.py --deploy=EMR
guest $ python path/to/some_job.py --deploy=EMR_Scheduled
Running a Pipeline Locally
(diagram: local file system + CPU)
On execution:
1. Load input dataset from FS
2. Load secrets from FS
3. Process data
4. Save output to FS
5. Repeat with next dependent job if any
$ python path/to/some_job.py
Running a Pipeline in the Cloud
(diagram: EMR cluster, S3 file system, AWS Secrets)
On execution:
1. Zip repo files (without secrets)
2. Send zip to S3
3. Send secrets to AWS Secrets
4. Create EMR cluster in AWS if required
5. Load input datasets from S3
6. Load secrets from AWS
7. Process data in EMR
8. Save output to S3
9. Repeat with next dependent job if any
10. Kill cluster
$ python path/to/some_job.py --deploy=EMR
Running a Pipeline in the Cloud on a Schedule
(diagram: EMR cluster, S3 file system, AWS Secrets, AWS Data Pipeline)
On execution:
1. Zip repo files (without secrets)
2. Send zip to S3
3. Send secrets to AWS Secrets
4. Configure schedule in AWS Data
Pipeline (deactivate previous if any)
When scheduled time reached:
5. Create EMR cluster in AWS
6. Load input datasets from S3
7. Load secrets from AWS
8. Process data in EMR
9. Save output to S3
10. Repeat with next dependent job if any
11. Kill cluster
Repeat at next scheduled time
$ python path/to/some_job.py --deploy=EMR_Scheduled
Running a Pipeline in the Cloud on a Schedule, cont’d
Where do I track my jobs in the cloud?
In the standard AWS UIs for each service:
● Resources: EMR
● Storage: S3
● Scheduling: AWS Data Pipeline
Job Parameters
Parameters available per job include (an annotated sketch follows below):
● Input/Output
● Copy to Redshift
● Dependencies
● Size and number of machines
● Scheduling info
● Email alert if failing
● Custom params
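Expanding on the manifest sketch above, a hypothetical entry covering these parameters; all field names are illustrative placeholders rather than Yaetos's exact schema:

jobs:
  examples/ex1_pyspark_job.py:
    py_job: jobs/examples/ex1_pyspark_job.py
    inputs:                                                       # Input
      some_events: {path: 's3://mybucket/events/', type: csv}
    output: {path: 's3://mybucket/session_kpis/', type: parquet}  # Output
    copy_to_redshift: {table: 'public.session_kpis'}              # copy to Redshift
    dependencies: [examples/ex0_extraction_job.py]                # upstream jobs
    emr_instance_count: 3                                         # size and number of machines
    frequency: 24h                                                # scheduling info
    emails: ['data-team@example.com']                             # alert if the job fails
    my_custom_param: some_value                                   # custom param, available in the job code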
Workflow
● (Optional) Write unit-tests
● Write the transform locally
○ Test against the unit-tests, or against a local copy of the dataset
○ Put all parameters in the job for faster iteration
● Test in the cloud
● PR in GitHub, merge, deploy to the scheduling tool
○ Suggestion: put the important parameters in the manifest (jobs_metadata.yml)
○ Use "--mode=prod_EMR" to use the parameters associated with production in the cloud (such as the base_path, the database schema to use…); see the example command after this list
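Putting the flags from the slides together, a hypothetical production deployment command (assuming --deploy and --mode can be combined in one call; the job path is illustrative):

guest $ python jobs/examples/ex1_pyspark_job.py --deploy=EMR_Scheduled --mode=prod_EMR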
Pipeline Unit-Test
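A generic pytest-style sketch of a transform unit-test, reusing the job sketch shown earlier; Yaetos ships its own test scaffolding, so the DummyJob bypass below is an assumption, not the framework's official pattern:

import pytest
from pyspark.sql import SparkSession
from jobs.examples.ex1_pyspark_job import Job  # module path assumed

@pytest.fixture(scope="session")
def spark():
    # Small local Spark session, enough to unit-test transforms.
    return SparkSession.builder.master("local[1]").appName("unit_tests").getOrCreate()

class DummyJob(Job):
    def __init__(self):
        # Bypass the framework setup so transform() can be called in isolation.
        pass

def test_transform_counts_events_per_session(spark):
    some_events = spark.createDataFrame(
        [(1, "click"), (1, "view"), (2, "click")],
        ["session_id", "action"])
    rows = DummyJob().transform(some_events).collect()
    assert sorted(r["n_events"] for r in rows) == [1, 2]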
Other Features
● Fairly clean logs
● Secret management (see conf/connections.cfg)
● Automation of folder structure to store previous versions with timestamps
● Support for idempotent incremental pipelines (daily)
● Support for GitOps: Git hash in logs + prompt if code is not clean before publishing
● Inferred schemas documented automatically in yaml file in repo
● Unit-testing
● Saving and loading ML model files instead of datasets
● More example jobs available in main repo
Demo
More Details
● https://medium.com/@arthurprevot/yaetos-data-framework-description-ddc71caf6ce
● Standalone repo (framework + sample jobs):
https://github.com/arthurprevot/yaetos
● Jobs-only repo (sample jobs only, framework from the pip-installed yaetos package):
https://github.com/arthurprevot/yaetos_jobs
Found it interesting?
● Please help make it more visible -> add a "star" on https://github.com/arthurprevot/yaetos
● Get in touch if you have questions, at arthur@yaetos.com
● Lots of room for improvement, any help welcome!
Thank you!
Questions?
