Customize and Secure the Runtime and Dependencies of Your Procedural Languages Using PL/Container. Greenplum Summit at PostgresConf US 2018. Hubert Zhang and Jack Wu.
Glusto is a framework for developing distributed system tests using Python. It combines commonly used tools like SSH, REST, and unit test frameworks into a single interface. Tests can be written using standard unittest or pytest formats and run from the command line. Glusto provides features like remote access, configuration handling, and test discovery/execution across multiple nodes defined in a YAML configuration file. The document provides instructions on installing Glusto and glustolibs-gluster, writing tests with Glusto features, and running tests via the Glusto CLI.
This document describes how to set up a Google Cloud virtual machine to run Prosit, a tool for peptide MS/MS and retention time prediction. It provides step-by-step instructions for installing the necessary software, downloading pre-trained Prosit models, and running examples. Setup is estimated to take around 20 minutes. The document recommends the cheapest GPU option (a Tesla P100), since Prosit does not heavily utilize the GPU during prediction; at least 8 CPU cores and 100GB of RAM are suggested. Benchmarking showed that 100,000 peptides can be predicted in 10 minutes on a Tesla P100 VM, so 1 million peptides would take around 100 minutes.
This document discusses benchmarking deep learning frameworks like Chainer. It begins by defining benchmarks and their importance for framework developers and users. It then examines examples like convnet-benchmarks, which objectively compares frameworks on metrics like elapsed time. It discusses challenges in accurately measuring elapsed time for neural network functions, particularly those with both Python and GPU components. Finally, it introduces potential solutions like Chainer's Timer class and mentions the DeepMark benchmarks for broader comparisons.
This document provides an agenda and overview for a Gluster tutorial presentation. It includes sections on Gluster basics, initial setup using test drives and VMs, extra Gluster features like snapshots and quota, and tips for maintenance and troubleshooting. Hands-on examples are provided to demonstrate creating a Gluster volume across two servers and mounting it as a filesystem. Terminology around bricks, translators, and the volume file are introduced.
Docker is an open-source project to easily create lightweight, portable, self-sufficient containers from any application. The same container that a developer builds and tests on a laptop can run at scale, in production, on VMs, bare metal, OpenStack clusters, public clouds and more.
The document discusses debugging Node.js applications in production environments at Netflix, which has strict uptime requirements. It describes techniques used such as collecting stack traces from running processes using perf and visualizing them in flame graphs to identify performance bottlenecks. It also covers configuring Node.js to dump core files on errors to enable post-mortem debugging without affecting uptime. The techniques help Netflix reduce latency, increase throughput, and fix runtime crashes and memory leaks in production Node.js applications.
Building a full kernel takes time but is often necessary during development or when backporting patches. The nature of the kernel makes it easy to distribute its build across multiple cheap machines. This presentation explains how to set up a build farm, balancing cost, size, and performance. Willy Tarreau, HAProxy.
The document discusses synchronizing media playback across multiple devices using GStreamer. It describes how GStreamer uses pipelines and clocks to synchronize playback. Specifically, it explains how to set up a pipeline to use a network clock shared between devices to synchronize the absolute time and running time, ensuring media plays at the same time on each device. Examples of implementing synchronized playback using playbin, gst-rtsp-server, and the Aurena project are also provided.
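The key to this approach is that every device slaves its local clock to a shared network clock and agrees on a common base time, so the running time (clock time minus base time) is identical everywhere. The arithmetic can be sketched in plain Python; the class and function names below are illustrative, not the GStreamer API (in real GStreamer you would use a net time provider on the server and a net client clock on each player):

```python
class DeviceClock:
    """Models a device whose local clock drifts from the master clock by a
    fixed offset. A network clock (like GStreamer's net client clock)
    estimates that offset over the network and removes it."""

    def __init__(self, offset_ns):
        self.offset_ns = offset_ns

    def local_now(self, master_now_ns):
        # What this device's raw clock reads when the master reads master_now_ns.
        return master_now_ns + self.offset_ns

    def synchronized_now(self, master_now_ns):
        # After clock slaving, the estimated offset is subtracted back out,
        # so all devices agree on the current time.
        return self.local_now(master_now_ns) - self.offset_ns


def running_time(sync_now_ns, base_time_ns):
    # GStreamer-style running time: synchronized clock time minus the shared
    # base time distributed out-of-band to every device.
    return sync_now_ns - base_time_ns


base_time = 1_000_000_000       # ns, chosen once by the "server"
master_now = 4_500_000_000      # ns, current master clock reading

clocks = [DeviceClock(0), DeviceClock(250_000), DeviceClock(-40_000)]
positions = [running_time(c.synchronized_now(master_now), base_time)
             for c in clocks]
print(positions)  # every device computes the same media position: 3.5 s
```

Because all devices derive the same running time, each one renders the same media sample at the same wall-clock instant, which is exactly what playbin with a shared network clock achieves.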
This document provides an introduction and overview of arbiter volumes in Gluster distributed file systems. It begins with background on Gluster and replicate (AFR) volumes. It then discusses how split-brains can occur in replica volumes and how client quorums help prevent this. The document introduces arbiter volumes as a way to provide the same consistency as 3-way replication while using less space. It explains how arbiter volumes are created and work, focusing on the arbitration logic and role of the arbiter brick. Brick sizing strategies and monitoring of arbiter volumes are also covered.
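As an illustration of the creation step described above, an arbiter volume is created like a regular replica-3 volume with the `arbiter` keyword; the third brick listed in each replica set becomes the arbiter. The hostnames and brick paths below are placeholders:

```shell
# The third brick of each set becomes the arbiter: it stores file names and
# metadata only, not file data, so it can live on a much smaller disk while
# still providing the quorum needed to avoid split-brain.
gluster volume create myvol replica 3 arbiter 1 \
    server1:/bricks/brick1 server2:/bricks/brick1 server3:/bricks/arbiter
gluster volume start myvol
```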
Slides presented at Percona Live Europe Open Source Database Conference 2019, Amsterdam, 2019-10-01. Imagine a world where all Wikipedia articles disappear due to human error or a software bug. Sounds unreal? According to some estimates, it would take in excess of hundreds of millions of person-hours to write them again. To prevent that scenario from ever happening, our SRE team at Wikimedia recently refactored the relational database recovery system. In this session, we will discuss how we back up 550TB of MariaDB data without impacting the 15 billion page views per month we receive. We will cover our initial plans to replace the old infrastructure, how we achieved recovering 2TB databases in less than 30 minutes while maintaining per-table granularity, as well as the different types of backups we implemented. Lastly, we will talk about lessons learned, what went well, how our original plans changed, and future work.
This document discusses using Kubernetes to cluster Raspberry Pi devices running TensorFlow. It begins by introducing Kubernetes, TensorFlow, and the Raspberry Pi. It then covers setting up a Kubernetes cluster across multiple Raspberry Pis, including installing Docker, configuring the master and nodes, and deploying networking. Next, it discusses deploying TensorFlow jobs in a distributed manner across the Kubernetes cluster using strategies like in-graph replication. It also proposes using Docker images and Ansible scripts to simplify and automate the cluster setup. Finally, it outlines how the cluster could be used for applications involving hyperparameter tuning, scaling ML APIs, and ensemble/data parallelism with TensorFlow.
This document provides an introduction and overview of Java garbage collection (GC) tuning and the Java Mission Control tool. It begins with information about the speaker, Leon Chen, including his background and patents. It then outlines the Java and JVM roadmap and upcoming features. The bulk of the document discusses GC tuning concepts like heap sizing, generation sizing, footprint vs throughput vs latency. It provides examples and recommendations for GC logging, analysis tools like GCViewer and JWorks GC Web. The document is intended to outline Oracle's product direction and future plans for Java GC tuning and tools.
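To make the heap sizing, generation sizing, and GC-logging discussion concrete, these are the kinds of JVM flags such tuning revolves around. The values are illustrative, not recommendations, and the `-Xlog` line uses the JDK 9+ unified logging syntax:

```shell
# Fixed heap (min = max) avoids resize pauses; the pause-time goal is how
# G1 trades throughput against latency. The resulting gc.log can be fed
# into analysis tools such as GCViewer.
java -Xms4g -Xmx4g \
     -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=200 \
     -Xlog:gc*:file=gc.log \
     -jar app.jar
```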
How to Burn Multi-GPUs using CUDA stress test memo (2017/05/20) SAKURA Internet, Inc. / SAKURA Internet Research Center. Senior Researcher / Naoto MATSUMOTO
Slides used for an internal training. This explains how to generate Flame Graphs using Java Flight Recorder dumps. There is also an example to use Linux "perf_events" to generate a Java Mixed-Mode Flame Graph.
This talk describes CRIU (Checkpoint/Restore In Userspace), software used to checkpoint, restore, and live-migrate Linux containers and processes. It describes live migration, compares it with VM live migration, and shows other uses for checkpoint/restore.
RCU (Read-Copy Update) is a technique for sharing data in memory across readers and writers without blocking. It allows for multiple concurrent readers that access shared data, while also allowing for writers to safely modify data without blocking readers. RCU has been widely adopted in the Linux kernel, with over 10,000 uses, helping it scale to large numbers of cores. RCU works by making copies of data that writers modify, and only making the new version visible to readers after all pre-existing readers have finished accessing the old data.
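The copy-then-publish discipline described above can be illustrated in a few lines of Python. This is a toy sketch of the pattern, not the kernel API: readers dereference a shared pointer without taking any lock, and the writer never mutates the published object in place; it copies it, modifies the copy, then atomically swaps the pointer so every reader sees either the complete old version or the complete new one:

```python
class RcuCell:
    """Toy read-copy-update cell. Readers are never blocked; writers
    publish a fresh copy via an atomic reference swap."""

    def __init__(self, value):
        self._current = value          # the published version

    def read(self):
        # Analogous to rcu_read_lock()/rcu_dereference(): just load the
        # current pointer. Reference loads are atomic in CPython.
        return self._current

    def update(self, mutate):
        # Writers serialize among themselves; readers are never blocked.
        new = dict(self._current)      # read and copy
        mutate(new)                    # update the copy
        self._current = new            # publish: atomic pointer swap
        # Real RCU would now wait for a "grace period" before freeing the
        # old version, so pre-existing readers can finish with it. Here,
        # Python's reference counting does that bookkeeping for us.


cell = RcuCell({"routes": 1})
snapshot = cell.read()                 # a pre-existing reader holds the old version
cell.update(lambda d: d.update(routes=2))
print(snapshot["routes"], cell.read()["routes"])  # old reader: 1, new reader: 2
```

The essential property, which the kernel implementation preserves across many CPUs, is that no reader ever observes a half-updated structure, and no reader ever waits on a writer.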
This document discusses Linux containers and checkpoint/restore (C/R) functionality. It provides an overview of different types of virtualization including containers and virtual machines. It then focuses on C/R, describing how it allows saving and restoring process states. It outlines the history and key components of C/R, including how it works, interfaces it uses, and features supported in the Linux kernel to enable C/R. It also discusses testing and future plans for C/R.
This document provides an overview of lightweight virtualization using Linux containers and Docker. It begins by explaining the problems of deploying applications across different environments and targets, and how containers can help solve this issue similarly to how shipping containers standardized cargo transportation. It then discusses what Linux containers are, how they provide isolation using namespaces and cgroups. It introduces Docker and how it builds on containers to further simplify deployment by allowing images to be easily built, shared, and run anywhere through standard formats and tools.
Docker provides a standardized way to build, ship, and run Linux containers. It uses Linux kernel features like namespaces and cgroups to isolate containers and make them lightweight. Docker allows building container images using Dockerfiles and sharing them via public or private registries. Images can be pulled and run anywhere. Docker aims to make containers easy to use and commoditize the container technology provided by Linux containers (LXC).
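The build-ship-run workflow described above centers on the Dockerfile. A minimal hypothetical example follows; the base image, file names, and command are placeholders, not taken from the talk:

```dockerfile
# Start from a shared, versioned base image (layers are cached and reused)
FROM python:3.11-slim
WORKDIR /app
# Copy the dependency list first, so this layer is only rebuilt when it changes
COPY requirements.txt .
RUN pip install -r requirements.txt
# Then copy the application code itself
COPY . .
CMD ["python", "app.py"]
```

An image built from this with `docker build -t myorg/myapp .` can be pushed to a public or private registry with `docker push` and then pulled and run anywhere Docker is installed, which is the "build once, run anywhere" point of the standardized format.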
- containerd overview
- Upcoming features in v1.4
- External plugins

https://kccnceu20.sched.com/event/ZexS/containerd-deep-dive-akihiro-suda-ntt-wei-fu-alibaba
PGConf.ASIA 2019 Bali - 10 September 2019 Speaker: Alexander Kukushkin Room: ACID Title: PostgreSQL on K8S at Zalando: Two+ Years in Production
"Lightweight virtualization", also called "OS-level virtualization", is not new. On Linux it evolved from VServer to OpenVZ, and, more recently, to Linux Containers (LXC). It is not Linux-specific; on FreeBSD it's called "Jails", while on Solaris it's "Zones". Some of those have been available for a decade and are widely used to provide VPS (Virtual Private Servers), cheaper alternatives to virtual machines or physical servers. But containers have other purposes and are increasingly popular as the core components of public and private Platform-as-a-Service (PAAS), among others. Just like a virtual machine, a Linux Container can run (almost) anywhere. But containers have many advantages over VMs: they are lightweight and easier to manage. After operating a large-scale PAAS for a few years, dotCloud realized that with those advantages, containers could become the perfect format for software delivery, since that is how dotCloud delivers from their build system to their hosts. To make it happen everywhere, dotCloud open-sourced Docker, the next generation of the containers engine powering its PAAS. Docker has been extremely successful so far, being adopted by many projects in various fields: PAAS, of course, but also continuous integration, testing, and more.
This document summarizes Noah Watkins' presentation on building a distributed shared log using Ceph. The key points are: 1) Noah discusses how shared logs are challenging to scale due to the need to funnel all writes through a total ordering engine, which bottlenecks performance. 2) CORFU is introduced as a shared log design that decouples I/O from ordering by striping the log across flash devices and using a sequencer to assign positions. 3) Noah then explains how the components of CORFU can be mapped onto Ceph, using RADOS object classes, librados, and striping policies to implement the shared log without requiring custom hardware interfaces. 4) ZLog is presented as an implementation of this design on top of Ceph.
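The decoupling of ordering from I/O in points 1-3 can be sketched in a few lines: a sequencer hands out log positions without touching data, and a deterministic striping function maps each position onto a storage device, so appends spread across devices in parallel. All names here are illustrative, not the ZLog or librados API:

```python
import itertools


class Sequencer:
    """Hands out monotonically increasing log positions. It performs no
    data I/O, so ordering is cheap and is not the throughput bottleneck."""

    def __init__(self):
        self._next = itertools.count()

    def next_position(self):
        return next(self._next)


class StripedLog:
    """Stripes log entries across N devices (standing in for flash units
    or RADOS objects); each slot is written exactly once."""

    def __init__(self, num_devices):
        self.devices = [dict() for _ in range(num_devices)]
        self.seq = Sequencer()

    def append(self, entry):
        pos = self.seq.next_position()               # ordering decision
        dev = self.devices[pos % len(self.devices)]  # striping: I/O goes wide
        dev[pos] = entry                             # write-once slot
        return pos

    def read(self, pos):
        # Reads bypass the sequencer entirely and go straight to the device.
        return self.devices[pos % len(self.devices)].get(pos)


log = StripedLog(num_devices=3)
positions = [log.append(f"entry-{i}") for i in range(6)]
print(positions)       # positions are totally ordered: [0, 1, 2, 3, 4, 5]
print(log.read(4))     # but entry 4 lives on device 4 % 3 = 1
```

In the Ceph mapping the talk describes, the per-position write-once check would live in a RADOS object class rather than a Python dict, but the division of labor is the same.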
In this presentation I talk about our motivation to converting our microservices to run on Kubernetes. I discuss many of the technical challenges we encountered along the way, including networking issues, Java issues, monitoring and alerting, and managing all of our resources!
This document provides an introduction and overview of Docker and containers. It discusses that Docker is an open source tool that allows applications to be packaged with all their dependencies and run as isolated processes on any machine. Containers provide lightweight virtualization that improves efficiency by sharing resources but still isolating processes. The document outlines how Docker uses containers powered by Linux namespaces and cgroups to package and deploy applications easily and consistently across environments.
This document provides a summary of a presentation on becoming an accidental PostgreSQL database administrator (DBA). It covers topics like installation, configuration, connections, backups, monitoring, slow queries, and getting help. The presentation aims to help those suddenly tasked with DBA responsibilities to not panic and provides practical advice on managing a PostgreSQL database.
It’s important to be able to figure out what’s going on when things go wrong in your Node.js production application. Tools are needed to investigate memory leaks, crashes and other "interesting" events in production. The post-mortem community working group (https://github.com/nodejs/post-mortem) is working on these problems. Come and learn about the key issues being worked on, and the progress of the working group so far, as illustrated through examples and code.
Historically, sharing a Linux server entailed all kinds of untenable compromises. In addition to the security concerns, there was simply no good way to keep one application from hogging resources and messing with the others. The classic “noisy neighbor” problem made shared systems the bargain-basement slums of the Internet, suitable only for small or throwaway projects. Serious use-cases traditionally demanded dedicated systems. Over the past decade virtualization (in conjunction with Moore’s law) has democratized the availability of what amount to dedicated systems, and the result is hundreds of thousands of websites and applications deployed into VPS or cloud instances. It’s a step in the right direction, but still has glaring flaws. Most of these websites are just piles of code sitting on a server somewhere. How did that code get there? How can it be scaled? Secured? Maintained? It’s anybody’s guess. There simply isn’t enough SysAdmin talent in the world to meet the demands of managing all these apps with anything close to best practices without a better model. Containers are a whole new ballgame. Unlike VMs, you skip the overhead of running an entire OS for every application environment. There’s also no need to provision a whole new machine to have a place to deploy, meaning you can spin up or scale your application with orders of magnitude more speed and accuracy.
This document provides an overview of Kubernetes 101. It begins with asking why Kubernetes is needed and provides a brief history of the project. It describes containers and container orchestration tools. It then covers the main components of Kubernetes architecture including pods, replica sets, deployments, services, and ingress. It provides examples of common Kubernetes manifest files and discusses basic Kubernetes primitives. It concludes with discussing DevOps practices after adopting Kubernetes and potential next steps to learn more advanced Kubernetes topics.
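As an illustration of the manifest files mentioned above, a minimal Deployment and Service might look like the following; the names, label, and image are placeholders, not examples from the talk:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-web
spec:
  replicas: 3                  # desired number of pod copies
  selector:
    matchLabels:
      app: hello-web
  template:                    # the pod template the replica set stamps out
    metadata:
      labels:
        app: hello-web
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: hello-web
spec:
  selector:
    app: hello-web             # routes traffic to pods carrying this label
  ports:
    - port: 80
      targetPort: 80
```

Applied with `kubectl apply -f`, the Deployment keeps three pods running and the Service gives them a stable virtual address, which is the pods/replica-sets/deployments/services chain the overview walks through.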
This document discusses Docker and containers. It begins with an introduction to Docker and the container model. It explains that containers provide isolation using namespaces and cgroups. Containers deploy applications efficiently by sharing resources and deploying anywhere due to standardization. The document then covers building images with Dockerfiles for reproducible builds. It concludes by discussing Docker's future including networking, metrics, logging, plugins and orchestration.