EAS Data Flow lessons learnt
- 1. Euclid Archive System from IAL
perspective - Lessons learnt
Input for splinter on EAS Data Archive
February 5-7 2014, Munich
Martin Melchior and Marco Soldati
- 2. Terminology (1)
• DRMS (Distributed Resource Management
System): scheduler for cluster, grid, cloud
• Submission Host: Host through which users
get access to the scheduler (DRMS)
• Execution Hosts: Computing nodes, the
Submission host MAY be an Execution Host
• Job: one task sent to the DRMS
- 3. Terminology (2)
• IAL: Infrastructure abstraction layer
• TaskScheduler API: Current DRM API (by IAL)
• File Access Protocols: File, FTP, HTTP, SFTP
- 4. Dataflow in IAL Mock
[Diagram: two SDCs, SDC-FR (only sketched) and SDC-CH, each with an IAL and an EAS file storage (labelled SDC-xx EAS and SDC-EAS), the Euclid Metadata Archive System (EMA) and the Science Community. The SDC-CH Computing Infrastructure consists of a Submission Host (Task Scheduler, Submission Host Storage) and an Execution Host (Task Executor, Pipeline Task, Execution Host Storage). Legend: file storage (FTP, HTTP, File, …); database (RDB, XMLDB, OODB, …); software component; pipeline task.]
URLs shown:
• http://euclid-archive.fr/level0/raw_20140207.fits
- 5. Dataflow in IAL Mock
[Diagram: same layout and legend as slide 4, next step of the data flow.]
URLs shown:
• http://euclid-archive.fr/level0/raw_20140207.fits
• http://euclid-archive.ch/level0/raw_20140207.fits
- 6. Dataflow in IAL Mock
[Diagram: same layout and legend as slide 4, next step of the data flow.]
URLs shown:
• http://euclid-archive.fr/level0/raw_20140207.fits
• http://euclid-archive.ch/level0/raw_20140207.fits
• file://data/sdc-eas/level0/raw_20140207.fits
• sftp://data/sub_workspace/level0/raw_20140207.fits
- 7. Dataflow in IAL Mock
[Diagram: same layout and legend as slide 4, next step of the data flow.]
URLs shown:
• http://euclid-archive.fr/level0/raw_20140207.fits
• http://euclid-archive.ch/level0/raw_20140207.fits
• file://data/sdc-eas/level0/raw_20140207.fits
• sftp://data/sub_workspace/level0/raw_20140207.fits
• file://data/sub_host/level0/raw_20140207.fits
• file://mnt/exec_workspace/level0/raw_2014
- 8. Dataflow in IAL Mock
[Diagram: same layout and legend as slide 4, final step of the data flow.]
URLs shown:
• http://euclid-archive.fr/level0/raw_20140207.fits
• http://euclid-archive.ch/level0/raw_20140207.fits
• file://data/sdc-eas/level0/raw_20140207.fits
• sftp://data/sub_workspace/level0/raw_20140207.fits
• file://data/sub_host/level0/raw_20140207.fits
• file://mnt/exec_workspace/level0/raw_2014
• file://exec_workspace/level0/raw_2014
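The data flow built up on slides 4 to 8 can be summarised as a small trace. The following is a minimal Python sketch, not IAL code: the `Storage` type and the helper functions are hypothetical, and the grouping of URLs per storage is an assumption based on the speaker notes. The URLs themselves are the ones shown on the slides, including the two truncated execution-host paths, which are kept as shown.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Storage:
    """One physical file storage and every URL under which it exposes the file."""
    name: str
    urls: tuple


# Assumption: which URL belongs to which storage is our reading of the slides.
CHAIN = [
    Storage("SDC-FR EAS", (
        "http://euclid-archive.fr/level0/raw_20140207.fits",
    )),
    Storage("SDC-CH EAS (SDC-EAS)", (
        "http://euclid-archive.ch/level0/raw_20140207.fits",
        "file://data/sdc-eas/level0/raw_20140207.fits",
    )),
    Storage("Submission Host Storage", (
        "sftp://data/sub_workspace/level0/raw_20140207.fits",
        "file://data/sub_host/level0/raw_20140207.fits",
    )),
    Storage("Execution Host Storage", (
        "file://mnt/exec_workspace/level0/raw_2014",  # truncated on the slide
        "file://exec_workspace/level0/raw_2014",      # truncated on the slide
    )),
]


def copy_jobs(chain):
    """Each step from one storage to the next is one physical copy."""
    return [(src.name, dst.name) for src, dst in zip(chain, chain[1:])]


def all_urls(chain):
    """Every URL under which the same file is addressed along the chain."""
    return [url for storage in chain for url in storage.urls]


def schemes(chain):
    """Distinct access protocols involved, taken from the URL schemes."""
    return sorted({urlsplit(url).scheme for url in all_urls(chain)})


if __name__ == "__main__":
    print(f"{len(all_urls(CHAIN))} URLs, {len(copy_jobs(CHAIN))} copy jobs")
    print("protocols:", ", ".join(schemes(CHAIN)))
```

Counting the entries reproduces the figures from the notes: one file, seven URLs, three copy jobs.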
- 9. Lessons Learnt
1a. Having the correct URL at the right time is quite error-prone
1b. URLs need to be changed in all XML data objects
Abstraction of file handling is required!
2. Creating three copies of a file is too much!
Reduce!
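To illustrate lesson 1b, here is a minimal Python sketch of what "changing URLs in all XML data objects" means in practice. The XML layout (`DataProduct`, `Frame`, `Mask`, `DataUrl`) is purely hypothetical and is not the Euclid data model; only the two archive URL prefixes come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical data object that references the same binary file in two places.
SAMPLE = """<DataProduct>
  <Frame><DataUrl>http://euclid-archive.fr/level0/raw_20140207.fits</DataUrl></Frame>
  <Mask><DataUrl>http://euclid-archive.fr/level0/raw_20140207.fits</DataUrl></Mask>
</DataProduct>"""


def rewrite_urls(xml_text, old_prefix, new_prefix, tag="DataUrl"):
    """Replace old_prefix by new_prefix in every <tag> element.

    Returns the rewritten XML and the number of references changed.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for elem in root.iter(tag):
        if elem.text and elem.text.startswith(old_prefix):
            elem.text = new_prefix + elem.text[len(old_prefix):]
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


if __name__ == "__main__":
    new_xml, count = rewrite_urls(
        SAMPLE, "http://euclid-archive.fr/", "http://euclid-archive.ch/"
    )
    print(count, "references rewritten")
    print(new_xml)
```

Every copy step along the processing chain would need such a rewrite for every data object, and a single missed reference leaves a stale URL behind. This is the motivation for hiding physical locations behind a file access abstraction.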
- 10. 1. File Handling Abstraction
[Diagram: clients of the data storages at the top (IAL/COORS/… and Task Scheduler on the Submission Host; Task Executor and Pipeline Task on the Execution Host; Euclid Metadata Archive System, EMA; SDC-xx IAL), the Euclid File Access Service (EuFAS™) as a layer in between, and the file storages at the bottom (SDC-xx EAS, SDC-EAS, Submission Host Storage, Execution Host Storage).]
Requirements on EuFAS:
• Look up and retrieve files by properties (e.g. unique ID)
• Replicate data on request and/or based on rules
• Add (and remove) files
• Register physical file locations in EMA
• Provide a file handling framework/library for “Pipeline Tasks”
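The requirements above can be sketched as a small interface. This is an illustrative Python sketch under stated assumptions, not a proposed design: `InMemoryEMA`, `FileAccessService` and all method names are hypothetical, the metadata archive is replaced by an in-memory dictionary, and `replicate` only records the new location instead of transferring data.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryEMA:
    """Stand-in for the metadata archive: file ID -> set of physical URLs."""
    locations: dict = field(default_factory=dict)

    def register(self, file_id, url):
        self.locations.setdefault(file_id, set()).add(url)

    def unregister(self, file_id, url):
        urls = self.locations.get(file_id, set())
        urls.discard(url)
        if not urls:
            self.locations.pop(file_id, None)


class FileAccessService:
    """Hypothetical EuFAS facade: clients use file IDs, never physical URLs."""

    def __init__(self, ema, local_prefix):
        self.ema = ema
        self.local_prefix = local_prefix  # URL prefix of this site's storage

    def add(self, file_id, url):
        """Add a file and register its physical location."""
        self.ema.register(file_id, url)

    def remove(self, file_id, url):
        """Remove one physical location of a file."""
        self.ema.unregister(file_id, url)

    def lookup(self, file_id):
        """Return a URL for the file, preferring a local copy if one exists."""
        urls = sorted(self.ema.locations.get(file_id, ()))
        if not urls:
            raise KeyError(file_id)
        local = [u for u in urls if u.startswith(self.local_prefix)]
        return (local or urls)[0]

    def replicate(self, file_id, relative_path):
        """Replicate on request; the sketch only registers the new location."""
        self.lookup(file_id)  # raises KeyError for unknown files
        target = self.local_prefix + relative_path
        self.ema.register(file_id, target)
        return target


if __name__ == "__main__":
    service = FileAccessService(InMemoryEMA(), "http://euclid-archive.ch/")
    service.add("raw_20140207", "http://euclid-archive.fr/level0/raw_20140207.fits")
    service.replicate("raw_20140207", "level0/raw_20140207.fits")
    print(service.lookup("raw_20140207"))
```

With such a layer, a Pipeline Task would ask for `raw_20140207` and receive whichever location suits the host it runs on, so the XML data objects would only need to carry the file ID.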
- 11. 2. Reduce number of copies
[Diagram: Science Community, SDC-xx IAL, SDC-xx EAS, SDC-EAS, Euclid Metadata Archive System (EMA) and the SDC Computing Infrastructure with Submission Host (IAL, Task Scheduler, Submission Host Storage) and Execution Host (Task Executor, Pipeline Task, Execution Host Storage). Legend: file storage (FTP, HTTP, File, …); database (RDB, XMLDB, OODB, …); software component; pipeline task.]
- 12. 2. Reduce number of copies
[Diagram: same components and legend as slide 11.]
- 13. 2. Reduce number of copies
[Diagram: same components and legend as slide 11.]
- 14. 2. Reduce number of copies
[Diagram: same components and legend as slide 11.]
Editor's Notes
- Hello. Access to EAS from IAL. Lessons learnt from IAL mockup integration. Outstanding issues.
- Brief overview of terms. DRMS: scheduler, sometimes used as an overall term for the computing infrastructure. Submission Host: typically an SSH connection. Execution Host: computing nodes. In some setups the submission host is an execution host at the same time. Job: computing entity. Task: Euclid job.
- IAL stands for Infrastructure Abstraction Layer, not Interface Abstraction Layer.
- Red, green and yellow shapes are file storages (binary files). Two SDCs: SDC-FR is only sketched. The EAS file archive is used to access binary files. SDC-CH consists of the IAL and the computing infrastructure. The computing infrastructure is divided into submission host and execution host. IAL connects to the submission host to submit jobs. The Task Scheduler provides an implementation-independent abstraction. The submission host/Task Scheduler distributes jobs to multiple execution hosts. The task is executed and creates results that IAL needs to register in the Euclid Metadata Archive System. EMA is on the left side.
  Let’s look into the data flow. Let’s assume data becomes available at SDC-FR; someone from the science community can access it through http/ftp/…. IAL on SDC-CH needs the data locally and triggers a replicate job. Data moves to SDC-EAS. Now the science community can get the data from two places, once from SDC-FR and once from SDC-CH. Next, IAL makes the data available to the Computing Infrastructure. For performance reasons IAL should use some local protocol (here file://) rather than HTTP to read the file. Currently, the submission host is writable through sftp. Now the file is copied to a dedicated workspace on the Submission Host Storage. Once the data is available, IAL starts the Task Scheduler, which triggers the Task Executor. The Task Executor may have to copy the files over to the Execution Host (Grid setup). Currently we support the file protocol only. This might not be sufficient. The file is copied. The Pipeline Task may need another protocol or path to get to the data. Seven different URLs! Three copy jobs.
- 1b. Our XML data model contains references to binary data in quite some places. These references need to be changed or added all along the processing chain.
- Rearrangement of the previous drawing: colored boxes at the bottom, clients of the data storages at the top, and a new layer (Euclid File Access Service) in between (hopefully based on some existing software). The File Access Service needs to provide some functionality (the list is by far not exhaustive).
- How to reduce the number of copies: merge the file storages.
  Red circle: merge SDC-EAS and Submission Host storage. IAL would not have to copy files anymore. Requires a fast connection from the computing infrastructure to the archive system.
  Green circle: merge Submission and Execution Host storage. The Task Executor does not need to copy files around. Requires a parallel-write file system and a fast network connection.
  Purple circle: merge SDC-EAS, Submission and Execution Host storage. Best solution regarding performance. Puts high demands on the archive system: support for parallel writes, support for access over the internet, a fast network connection between computing infrastructure and archive, petabytes of storage.
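The effect of the three merge options on the number of copy jobs can be worked out with a short Python sketch. The storage names and the `copy_jobs` helper are hypothetical, and the sketch assumes that the replication from the remote SDC into the local archive remains necessary in every option, which the notes do not state explicitly.

```python
# Storages the file passes through, in order (names are illustrative).
CHAIN = [
    "SDC-FR EAS",
    "SDC-EAS",
    "Submission Host Storage",
    "Execution Host Storage",
]

# Merge options from the notes: storages in one set share a single file system.
OPTIONS = {
    "no merge": [],
    "red": [{"SDC-EAS", "Submission Host Storage"}],
    "green": [{"Submission Host Storage", "Execution Host Storage"}],
    "purple": [{"SDC-EAS", "Submission Host Storage", "Execution Host Storage"}],
}


def copy_jobs(chain, merged_groups):
    """Count the steps along the chain that still need a physical copy."""
    group_of = {}
    for index, group in enumerate(merged_groups):
        for name in group:
            group_of[name] = index

    def shared(a, b):
        # Two storages share a file system only if both are in the same group.
        return a in group_of and group_of.get(a) == group_of.get(b)

    return sum(1 for a, b in zip(chain, chain[1:]) if not shared(a, b))


if __name__ == "__main__":
    for label, groups in OPTIONS.items():
        print(f"{label}: {copy_jobs(CHAIN, groups)} copy job(s)")
```

Under these assumptions the red and green options each save one copy and the purple option saves two, at the price of the higher demands on the archive system listed in the notes.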