Simplifying the Creation of Spark
Pipelines with Yaetos
Meetup @ Spark Barcelona
Arthur Prévot - 2022-06-28
Table of Contents
● Spark
● What is Yaetos?
● Jobs, Pipelines and Manifest
● Setup and Test
● From Prototype to Production
● Job Parameters
● Workflow
● Other Features
● Demo
Spark is Amazing
● Pushing for data lake architecture
○ ↘ 💰 storage cost, ↗ effort to organize data
● Strong support for various programming languages and SQL
○ Allowing software dev best practices
● Ability to run locally
● Pushing towards more open source
More work needed to operationalize it
● It needs external tooling
○ Compute resources
○ Storage
○ Scheduling
● It needs extra code to deal with data-eng problems
○ Dataset dependencies
○ Idempotence
○ Unit-testing
○ Dataset cataloging…
● It is often seen as overkill for small jobs
○ -> pandas good enough
What is YAETOS?
An open source tool for data engineers, scientists, and analysts to easily create data pipelines in Python and SQL and put them in production in AWS.
It integrates all the tools necessary to create a data stack, relying only on open source engines (Spark or Pandas) and hosted services (AWS). It can be set up in minutes.
Used in prod at Adevinta and The Hotels Network, 100+ datasets updated daily for 2+ years.
Like a Swiss Army Knife
Inputs and outputs (same connector types on both sides):
● Datasets on disk (S3, local)
● Unstructured data on disk
● MySQL DB tables
● Postgres DB tables
● Redshift DB tables
● API services (Salesforce, Stripe, or others)
Engine:
● Spark
● Pandas
Resources:
● AWS EMR
Scheduling:
● AWS Data Pipeline
With Room for Expansion
Good candidates for addition:
Engine:
● Spark
● Pandas
Resources:
● EMR
● Kubernetes (candidate)
● AWS Lambda (candidate)
Scheduling:
● AWS Data Pipeline
● Airflow (candidate)
Definition of a Job
Three flavors, from simplest to most flexible (more flexibility means less simplicity):
● ex1_sql_job.sql (simplest; a sketch follows below): python jobs/launcher.py --job_name=examples/ex1_sql_job.sql
● ex1_pyspark_job.py: python jobs/ex1_pyspark_job.py
● ex1_unframed_job.py (most flexible): python jobs/ex1_unframed_job.py
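For the SQL flavor, the job file is essentially a SELECT statement over the job's named inputs, with the launcher handling loading and saving. A minimal, hypothetical sketch (table and column names are illustrative and would map to the inputs declared in the manifest):

-- Hypothetical ex1_sql_job.sql: inputs are exposed as tables named after
-- the input keys declared for this job in the manifest (illustrative).
SELECT
    session_id,
    COUNT(*) AS n_events
FROM some_events
GROUP BY session_id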
Job Details
ex1_pyspark_job.py is made of (a minimal sketch follows below):
● Framework part: loading dataframes + more
● Transform code
● Params + link to more params
● Framework part: command line handling + execution
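A minimal sketch of what such a job can look like, based on the pattern used in the Yaetos examples; the exact import path, class names, and argument keys are assumptions and may differ from the current API:

from yaetos.etl_utils import ETL_Base, Commandliner  # import path assumed

class Job(ETL_Base):
    def transform(self, some_events):
        # Transform code: inputs arrive as dataframes, named after the
        # input keys declared for this job (here "some_events").
        return (some_events
                .groupBy('session_id')
                .count()
                .withColumnRenamed('count', 'n_events'))

if __name__ == "__main__":
    # Params + link to more params (the manifest), then command line + execution.
    Job_Args = {'job_param_file': 'conf/jobs_metadata.yml'}
    Commandliner(Job, **Job_Args)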
The Jobs Manifest
● List of jobs with job metadata (I/O, dependencies, scheduling info)
● Can contain hundreds of jobs
● Can be split across several files if needed (e.g. per company department)
● Human readable, computer parseable
File: jobs_metadata.yml (a sketch follows below)
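A minimal sketch of what two manifest entries can look like, one depending on the other; the field names (py_job, sql_file, inputs, output, dependencies, frequency) and paths are illustrative and may not match Yaetos's exact schema:

jobs:
  examples/ex1_pyspark_job.py:
    py_job: jobs/examples/ex1_pyspark_job.py
    inputs:
      some_events: {path: 's3://mybucket/events/', type: csv}   # bucket is a placeholder
    output: {path: 's3://mybucket/session_kpis/', type: parquet}
    frequency: 24h
  examples/ex1_sql_job.sql:
    sql_file: jobs/examples/ex1_sql_job.sql
    dependencies: [examples/ex1_pyspark_job.py]
    output: {path: 's3://mybucket/report/', type: csv}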
From the Manifest to Job Files
Each entry in jobs_metadata.yml points to the job file it runs:
● ex_sql_job.sql
● ex_pyspark_job.py
● ex_pandas_job.py
Definition of a Pipeline
Pipeline = job running with its dependencies, as defined in the job manifest
How to get it set up?
Running this in a terminal
$ pip install yaetos
$ yaetos setup --project=my_yaetos_jobs
To run sample jobs
host $ cd my_yaetos_jobs/
host $ yaetos launch_docker_bash
# Running 1 job
guest $ python jobs/example/ex0_extraction_job.py
# Running 1 pipeline
guest $ python jobs/example/ex1_framework_job.py --dependencies
Where does it live?
… in a folder, ready for GitHub -> shareable
The files to create pipelines:
● Manifest
● Job code (python or SQL)
● Job unit-tests (optional)
Jobs can run locally or be pushed to the cloud
How to Inspect a Job in Jupyter
From Prototype to Production
Same command, different argument. No updates to job code
host $ yaetos launch_docker_bash
guest $ python path/to/some_job.py # i.e. local
guest $ python path/to/some_job.py --deploy=EMR
guest $ python path/to/some_job.py --deploy=EMR_Scheduled
Running a Pipeline Locally
(diagram: local file system + CPU)
On execution:
1. Load input dataset from FS
2. Load secrets from FS
3. Process data
4. Save output to FS
5. Repeat with next dependent job if any
$ python path/to/some_job.py
Running a Pipeline in the Cloud
(diagram: EMR cluster, S3 file system, AWS Secrets)
On execution:
1. Zip repo files (without secrets)
2. Send zip to S3
3. Send secrets to AWS Secrets
4. Create EMR cluster in AWS if required
5. Load input datasets from S3
6. Load secrets from AWS
7. Process data in EMR
8. Save output to S3
9. Repeat with next dependent job if any
10. Kill cluster
$ python path/to/some_job.py --deploy=EMR
Running a Pipeline in the Cloud on a Schedule
(diagram: EMR cluster, S3 file system, AWS Secrets, AWS Data Pipeline)
On execution:
1. Zip repo files (without secrets)
2. Send zip to S3
3. Send secrets to AWS Secrets
4. Configure schedule in AWS Data
Pipeline (deactivate previous if any)
When scheduled time reached:
5. Create EMR cluster in AWS
6. Load input datasets from S3
7. Load secrets from AWS
8. Process data in EMR
9. Save output to S3
10. Repeat with next dependent job if any
11. Kill cluster
Repeat at next scheduled time
$ python path/to/some_job.py --deploy=EMR_Scheduled
Running a Pipeline in the Cloud on a Schedule, cont’d
Where do I track my jobs in the cloud?
In the standard AWS UIs for each service:
● Resources: EMR
● Storage: S3
● Scheduling: AWS Data Pipeline
Job Parameters
Parameters available per job include (an annotated sketch follows below):
● Input/Output
● Copy to Redshift
● Dependencies
● Size and number of machines
● Scheduling info
● Email alert if failing
● Custom params
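Expanding on the manifest sketch above, a hypothetical entry covering these parameters; all field names are illustrative placeholders rather than Yaetos's exact schema:

jobs:
  examples/ex1_pyspark_job.py:
    py_job: jobs/examples/ex1_pyspark_job.py
    inputs:                                                       # Input
      some_events: {path: 's3://mybucket/events/', type: csv}
    output: {path: 's3://mybucket/session_kpis/', type: parquet}  # Output
    copy_to_redshift: {table: 'public.session_kpis'}              # copy to Redshift
    dependencies: [examples/ex0_extraction_job.py]                # upstream jobs
    emr_instance_count: 3                                         # size and number of machines
    frequency: 24h                                                # scheduling info
    emails: ['data-team@example.com']                             # alert if the job fails
    my_custom_param: some_value                                   # custom param, available in the job code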
Workflow
● (Optional) Write unit-tests
● Write the transform locally
○ Test against the unit-tests, or against a local copy of the dataset
○ Put all parameters in the job for faster iteration
● Test in the cloud
● PR in GitHub, merge, deploy to the scheduling tool
○ Suggestion: put the important parameters in the manifest (jobs_metadata.yml)
○ Use "--mode=prod_EMR" to use the parameters associated with production in the cloud (such as the base_path, the database schema to use…); see the example command after this list
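Putting the flags from the slides together, a hypothetical production deployment command (assuming --deploy and --mode can be combined in one call; the job path is illustrative):

guest $ python jobs/examples/ex1_pyspark_job.py --deploy=EMR_Scheduled --mode=prod_EMR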
Pipeline Unit-Test
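A generic pytest-style sketch of a transform unit-test, reusing the job sketch shown earlier; Yaetos ships its own test scaffolding, so the DummyJob bypass below is an assumption, not the framework's official pattern:

import pytest
from pyspark.sql import SparkSession
from jobs.examples.ex1_pyspark_job import Job  # module path assumed

@pytest.fixture(scope="session")
def spark():
    # Small local Spark session, enough to unit-test transforms.
    return SparkSession.builder.master("local[1]").appName("unit_tests").getOrCreate()

class DummyJob(Job):
    def __init__(self):
        # Bypass the framework setup so transform() can be called in isolation.
        pass

def test_transform_counts_events_per_session(spark):
    some_events = spark.createDataFrame(
        [(1, "click"), (1, "view"), (2, "click")],
        ["session_id", "action"])
    rows = DummyJob().transform(some_events).collect()
    assert sorted(r["n_events"] for r in rows) == [1, 2]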
Other Features
● Fairly clean logs
● Secret management (see conf/connections.cfg)
● Automation of folder structure to store previous versions with timestamps
● Support for idempotent incremental pipelines (daily)
● Support for GitOps: Git hash in logs + prompt if code is not clean before publishing
● Inferred schemas documented automatically in yaml file in repo
● Unit-testing
● Saving and loading ML model files instead of datasets
● More example jobs available in main repo
Demo
More Details
● https://medium.com/@arthurprevot/yaetos-data-framework-description-ddc71caf6ce
● Standalone repo (framework + sample jobs):
https://github.com/arthurprevot/yaetos
● Jobs-only repo (sample jobs only, framework from the pip-installed yaetos package):
https://github.com/arthurprevot/yaetos_jobs
Found it interesting?
● Please help make it more visible -> add a "star" on https://github.com/arthurprevot/yaetos
● Get in touch if you have questions, at arthur@yaetos.com
● Lots of room for improvement, any help welcome!
Thank you!
Questions?
