Euclid Archive System from IAL
perspective - Lessons learnt
Input for splinter on EAS Data Archive
February 5-7 2014, Munich
Martin Melchior and Marco Soldati
Terminology (1)
• DRMS (Distributed Resource Management
System): scheduler for cluster, grid, cloud
• Submission Host: Host through which users
get access to the scheduler (DRMS)
• Execution Hosts: Computing nodes; the
Submission Host MAY be an Execution Host
• Job: one task sent to the DRMS
Terminology (2)
• IAL: Infrastructure abstraction layer
• TaskScheduler API: Current DRM API (by IAL)
• File Access Protocols: File, FTP, HTTP, SFTP
Dataflow in IAL Mock
[Diagram, built up step by step over five slides]
Legend:
• File storage (FTP, HTTP, File, …)
• Database (RDB, XMLDB, OODB, …)
• Software-Component
• Pipeline Task
Diagram labels:
• Science Community
• Euclid Metadata Archive System (EMA)
• SDCFR: IAL, SDC-xx EAS
• SDCCH: IAL, SDC-EAS, Computing Infrastructure
• Submission Host: Task Scheduler, Submission Host Storage
• Execution Host: Task Executor, Pipeline Task, Execution Host Storage
URLs shown, in order of appearance across the five builds:
• http://euclid-archive.fr/level0/raw_20140207.fits
• http://euclid-archive.ch/level0/raw_20140207.fits
• file://data/sdc-eas/level0/raw_20140207.fits
• sftp://data/sub_workspace/level0/raw_20140207.fits
• file://data/sub_host/level0/raw_20140207.fits
• file://mnt/exec_workspace/level0/raw_2014
• file://exec_workspace/level0/raw_2014
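The dataflow on these slides can be summarised in a few lines of Python. This is only a sketch: the URLs are copied from the slides (the last two appear truncated in the source and are kept that way), while the assignment of each URL to a storage and the names `HOPS` and `summarize` are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# (storage holding the file, URL used to reach it).
# URLs come from the slides; the grouping by storage is an assumption.
# The last two URLs are truncated in the source and are left as-is.
HOPS = [
    ("SDC-FR EAS", "http://euclid-archive.fr/level0/raw_20140207.fits"),
    ("SDC-EAS", "http://euclid-archive.ch/level0/raw_20140207.fits"),
    ("SDC-EAS", "file://data/sdc-eas/level0/raw_20140207.fits"),
    ("Submission Host Storage", "sftp://data/sub_workspace/level0/raw_20140207.fits"),
    ("Submission Host Storage", "file://data/sub_host/level0/raw_20140207.fits"),
    ("Execution Host Storage", "file://mnt/exec_workspace/level0/raw_2014"),
    ("Execution Host Storage", "file://exec_workspace/level0/raw_2014"),
]


def summarize(hops):
    """Count distinct URLs, copy jobs and protocols for one file."""
    storages = []
    for storage, _ in hops:
        if storage not in storages:
            storages.append(storage)
    urls = {url for _, url in hops}
    schemes = sorted({urlsplit(url).scheme for _, url in hops})
    # Every move from one storage to the next costs one copy job.
    return {"urls": len(urls), "copy_jobs": len(storages) - 1, "schemes": schemes}


summary = summarize(HOPS)
print(summary)
```

Under these assumptions, one file is addressed by seven URLs across three protocols and is copied three times before the Pipeline Task reads it.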
Lessons Learnt
1a. Having the correct URL at the right time is
error-prone
1b. URLs need to be changed in all XML data
objects
Abstraction of file handling is required!
2. Creating three copies of a file is too much!
Reduce!
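To illustrate lesson 1b, the sketch below rewrites file references in an XML data object after a copy step. The XML layout (`DataProduct`, `FileRef`) and the helper `rewrite_refs` are hypothetical, since the slides do not show the Euclid data model; only the two URL prefixes are taken from the dataflow slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical data object: element names are invented for illustration.
DATA_OBJECT = """
<DataProduct>
  <Image><FileRef>http://euclid-archive.ch/level0/raw_20140207.fits</FileRef></Image>
  <Mask><FileRef>http://euclid-archive.ch/level0/raw_20140207.fits</FileRef></Mask>
</DataProduct>
"""


def rewrite_refs(xml_text, old_prefix, new_prefix, tag="FileRef"):
    """Replace old_prefix with new_prefix in every <tag> element."""
    root = ET.fromstring(xml_text.strip())
    changed = 0
    for ref in root.iter(tag):
        if ref.text and ref.text.startswith(old_prefix):
            ref.text = new_prefix + ref.text[len(old_prefix):]
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


rewritten, changed = rewrite_refs(
    DATA_OBJECT, "http://euclid-archive.ch/", "file://data/sdc-eas/"
)
print(changed)
print(rewritten)
```

A rewrite like this has to run after every copy step and for every reference in every data object, which is where a missed or stale URL can slip in.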
1. File Handling Abstraction
[Diagram: clients at the top, Euclid File Access Service in the middle, file storages at the bottom]
Diagram labels:
• Submission Host: IAL/COORS/…, Task Scheduler
• Execution Host: Task Executor, Pipeline Task
• Euclid Metadata Archive System (EMA)
• Euclid File Access Service (EuFAS™)
• SDC-xx IAL, SDC-xx EAS, SDC-EAS
• Execution Host Storage, Submission Host Storage
Requirements on EuFAS:
• Lookup and retrieve files by properties (i.e. unique ID)
• Replicate data on request and/or based on rules
• Add (and remove) files
• Register physical file locations in EMA
• Provide file handling framework/library for “Pipeline Tasks”
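The requirements on EuFAS could translate into an interface along the following lines. This is a minimal in-memory sketch, not an existing API: the class `FileAccessService`, its method names and the file ID `raw_20140207` are all hypothetical. The `locations` dictionary stands in for the registry of physical locations in EMA, and rule-based replication is left out.

```python
from dataclasses import dataclass, field


@dataclass
class FileAccessService:
    """Hypothetical in-memory stand-in for EuFAS."""

    # file_id -> set of physical URLs (plays the role of the EMA registry)
    locations: dict = field(default_factory=dict)

    def add(self, file_id, url):
        """Register a physical location for a file."""
        self.locations.setdefault(file_id, set()).add(url)

    def remove(self, file_id, url):
        """Unregister a location; forget the file once no location is left."""
        urls = self.locations.get(file_id, set())
        urls.discard(url)
        if not urls:
            self.locations.pop(file_id, None)

    def lookup(self, file_id, scheme=None):
        """Return known URLs for a file, optionally filtered by protocol."""
        urls = sorted(self.locations.get(file_id, ()))
        if scheme is not None:
            urls = [u for u in urls if u.startswith(scheme + "://")]
        return urls

    def replicate(self, file_id, target_prefix):
        """Record a replica under target_prefix (a real service would also move the bytes)."""
        source = self.lookup(file_id)[0]
        name = source.rsplit("/", 1)[-1]
        replica = target_prefix.rstrip("/") + "/" + name
        self.add(file_id, replica)
        return replica


fas = FileAccessService()
fas.add("raw_20140207", "http://euclid-archive.fr/level0/raw_20140207.fits")
replica = fas.replicate("raw_20140207", "file://data/sdc-eas/level0")
local = fas.lookup("raw_20140207", scheme="file")
print(local)
```

With such a layer, a Pipeline Task would ask for a file by ID and preferred protocol instead of carrying a URL through every XML data object.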
2. Reduce number of copies
[Diagram: the dataflow diagram without URLs, repeated over four slides]
Legend:
• File storage (FTP, HTTP, File, …)
• Database (RDB, XMLDB, OODB, …)
• Software-Component
• Pipeline Task
Diagram labels:
• Science Community
• SDC-xx IAL, SDC-xx EAS
• SDC: SDC-EAS, IAL, Computing Infrastructure
• Euclid Metadata Archive System (EMA)
• Submission Host: Task Scheduler, Submission Host Storage
• Execution Host: Task Executor, Pipeline Task, Execution Host Storage


Editor's Notes

  1. Hello. Access to EAS from IAL. Lessons learnt from IAL mockup integration. Outstanding issues.
  2. Brief overview of terms.
     • DRMS: scheduler, sometimes used as overall term for computing infrastructure.
     • Submission Host: typically SSH connection.
     • Execution Host: computing nodes. In some setups the submission host is an execution host at the same time.
     • Job: computing entity. Task: Euclid job.
  3. IAL: Infrastructure Abstraction Layer, not Interface Abstraction Layer.
  4. Red, green and yellow shapes are file storages (binary files).
     Two SDCs:
     • SDC-FR is only sketched. The EAS file archive is used to access binary files.
     • SDC-CH consists of the IAL and the computing infrastructure.
     The computing infrastructure is divided into submission host and execution host.
     IAL connects to the submission host to submit jobs. The Task Scheduler provides an implementation-independent abstraction.
     The submission host/Task Scheduler distributes jobs to multiple execution hosts.
     The task is executed and creates results that IAL needs to register in the Euclid Metadata Archive System.
     EMA is on the left side.
     Let's look into the data flow.
     • Let's assume data becomes available at SDC-FR; someone from the science community can access it through http/ftp/….
     • IAL on SDC-CH needs the data locally and triggers a replicate job.
     • Data moves to SDC-EAS.
     • Now the science community can get the data from two places, once from SDC-FR and once from SDC-CH.
     • Next, IAL makes the data available to the Computing Infrastructure. For performance reasons IAL should use some local protocol (here file://) rather than HTTP to read the file. Currently, the submission host is writable through sftp.
     • Now the file is copied to a dedicated workspace on the Submission Host Storage.
     • Once the data is available, IAL starts the Task Scheduler, which triggers the Task Executor.
     • The Task Executor may have to copy the files over to the Execution Host (Grid setup). Currently we support the file protocol only. This might not be sufficient.
     • The file is copied.
     • The Pipeline Task may need another protocol or path to get to the data.
     Seven different URLs! Three copy jobs.
  9. 1b. Our XML data model contains references to binary data in many places. These references need to be changed or added all along the processing chain.
  10. Rearrangement of the previous drawing: colored boxes at the bottom, clients of data storages at the top, and a new layer (Euclid File Access Service) in between, hopefully based on some existing software. The File Access Service needs to provide some functionality (the list is by far not exhaustive).
  11. How to reduce the number of copies: merge the file storages.
      • Red circle: merge SDC-EAS and Submission host. IAL would not have to copy files anymore. Requires a fast connection from the computing infrastructure to the archive system.
      • Green circle: merge Submission and Execution host storage. The Task Executor does not need to copy files around. Requires a parallel-write file system and a fast network connection.
      • Purple circle: merge SDC-EAS, Submission and Execution Host Storage. Best solution regarding performance. Puts high demands on the archive system: support for parallel writes, support for access over the internet, a fast network connection between computing infrastructure and archive, petabytes of storage.
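The effect of the three merge options on the number of copy jobs can be checked with a small sketch. The storage chain follows the dataflow described in the notes; the function `copy_jobs`, the option names and the assumption that replication from SDC-FR is still needed in every case are illustrative only.

```python
# Storages a file passes through, in order (taken from the dataflow notes).
CHAIN = [
    "SDC-FR EAS",
    "SDC-EAS",
    "Submission Host Storage",
    "Execution Host Storage",
]

# Merge options, named after the circles in the note.
OPTIONS = {
    "none": [],
    "red": [{"SDC-EAS", "Submission Host Storage"}],
    "green": [{"Submission Host Storage", "Execution Host Storage"}],
    "purple": [{"SDC-EAS", "Submission Host Storage", "Execution Host Storage"}],
}


def copy_jobs(chain, merged_groups):
    """Count hops in the chain that cross from one physical storage to another."""

    def group_of(storage):
        for index, group in enumerate(merged_groups):
            if storage in group:
                return ("merged", index)
        return ("single", storage)

    return sum(1 for a, b in zip(chain, chain[1:]) if group_of(a) != group_of(b))


counts = {name: copy_jobs(CHAIN, groups) for name, groups in OPTIONS.items()}
print(counts)
```

Under these assumptions the red and green options each save one copy, and the purple option leaves only the replication between SDCs.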