Global Interpreter Lock: Episode I - Break the Seal
- 2. PyCon APAC 2015
Introduction
• Global Interpreter Lock[1]
• giant lock[2]
• GIL in CPython[5] protects:
• interpreter state, thread state, ...
• reference count
• “a guarantee”: some CPython features and extensions depend on this agreement
• other implementations
• fine-grained lock[3]
• lock-free[4]
- 3. PyCon APAC 2015
GIL over Multi-Processor[6]
We want to write efficient programs.
To achieve higher throughput, we usually divide a program into several independent logic segments and execute them simultaneously on a multi-processor (MP) architecture using multi-threading.
Unfortunately, only one of the threads executes at a time if they compete for the same GIL.
Some people are working on removing the giant lock, which is a difficult job[7][8][9]. Until that day comes, we need to learn how to live with the GIL.
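The effect described on this slide can be observed with a small timing sketch. It is written for Python 3 (the slides use Python 2) and assumes a CPython build that has a GIL; the names `count_down`, `timed` and `N` are illustrative and do not come from the talk. On such a build the two-thread run is usually no faster than the sequential run, although the exact numbers depend on the machine.

```python
import threading
import time

N = 2000000  # loop iterations per task (arbitrary, illustrative value)

def count_down(n):
    # Pure-Python, CPU-bound loop: the running thread holds the GIL.
    while n > 0:
        n -= 1

def timed(fn):
    # Return the wall-clock seconds spent in fn().
    begin = time.perf_counter()
    fn()
    return time.perf_counter() - begin

def sequential():
    count_down(N)
    count_down(N)

def threaded():
    threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

seq_time = timed(sequential)
thr_time = timed(threaded)
print('sequential : %.3f s' % seq_time)
print('two threads: %.3f s' % thr_time)
```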
- 4. PyCon APAC 2015
Brainless Solution
multi-process
• Embarrassingly parallel[10]
• no dependency between those parallel tasks
• IPC[11]-required parallel task
• share states with other peers
• Examples:
• multiprocessing[12], pp[13], pyCSP[14]
- 5. PyCon APAC 2015
Example[15]
multiprocessing: process pool
import os
from multiprocessing import Pool

def worker(i):
    print 'pid=%d ppid=%d i=%d' % (os.getpid(), os.getppid(), i)

print 'pid=%d' % os.getpid()
pool = Pool(processes=4)        # fork four worker processes
pool.map(worker, xrange(10))    # distribute ten tasks over the workers
pool.terminate()
Round 1:
pid=11326
pid=11327 ppid=11326 i=0
pid=11328 ppid=11326 i=1
pid=11328 ppid=11326 i=3
pid=11329 ppid=11326 i=2
pid=11329 ppid=11326 i=5
pid=11329 ppid=11326 i=6
pid=11329 ppid=11326 i=7
pid=11329 ppid=11326 i=8
pid=11327 ppid=11326 i=4
pid=11328 ppid=11326 i=9
nondeterministic[16]:
the same input, different output
Round 2:
pid=11372
pid=11373 ppid=11372 i=0
pid=11373 ppid=11372 i=2
pid=11374 ppid=11372 i=1
pid=11376 ppid=11372 i=3
pid=11374 ppid=11372 i=4
pid=11374 ppid=11372 i=7
pid=11373 ppid=11372 i=6
pid=11376 ppid=11372 i=8
pid=11375 ppid=11372 i=5
pid=11375 ppid=11372 i=9
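The two rounds differ only in which worker picks up which task and when it prints. A Python 3 sketch of the same pool, under the assumption of a POSIX system where the "fork" start method is available, shows that the values returned by `Pool.map` still come back in input order. The helper `run_pool` and the squared return value are illustrative additions, not part of the talk's example.

```python
import multiprocessing as mp
import os

def worker(i):
    # Return the worker's pid together with a result, instead of printing.
    return (os.getpid(), i * i)

def run_pool(n_tasks, n_procs=4):
    # Assumption: the "fork" start method exists (POSIX systems).
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=n_procs) as pool:
        # map() returns results in input order, whichever worker ran them.
        return pool.map(worker, range(n_tasks))

if __name__ == "__main__":
    results = run_pool(10)
    print([value for _, value in results])
    print('%d distinct worker pids' % len(set(pid for pid, _ in results)))
```

The assignment of tasks to pids may change from run to run, but the list of values does not.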
- 6. PyCon APAC 2015
Example
multiprocessing: further observations (1/2)
=> What if I define the target function after the pool is initialized?
import os
from multiprocessing import Pool

print 'pid=%d' % os.getpid()
pool = Pool(processes=4)        # workers are forked here

# defined after the fork: the workers' memory copies do not contain it
def worker(i):
    print 'pid=%d ppid=%d i=%d' % (os.getpid(), os.getppid(), i)

pool.map(worker, xrange(10))
pool.terminate()
• multiprocessing uses unnamed pipes to handle IPC
• Workers are forked when the pool is initialized
• so workers can “see” the target function only if it was defined before the fork (each worker gets a copy of the parent's memory as of the fork)
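The fork-time snapshot can be sketched without multiprocessing at all, using only `os.fork()` and unnamed pipes. This is a Python 3 illustration for POSIX systems; `fork_probe` and `late_worker` are hypothetical names, and the code models the idea rather than reproducing what `Pool` does internally. The child is forked first, the parent defines a function afterwards, and the child is then asked whether it can see that function.

```python
import os

def fork_probe(name):
    # Fork a child that later reports whether `name` exists in its globals.
    go_r, go_w = os.pipe()          # parent -> child: "check now"
    ans_r, ans_w = os.pipe()        # child -> parent: the answer
    pid = os.fork()
    if pid == 0:
        try:
            os.close(go_w)
            os.close(ans_r)
            os.read(go_r, 1)        # block until the parent gives the signal
            os.write(ans_w, b"1" if name in globals() else b"0")
        finally:
            os._exit(0)             # never fall through into the parent's code
    os.close(go_r)
    os.close(ans_w)
    return pid, go_w, ans_r

pid, go_fd, answer_fd = fork_probe("late_worker")

def late_worker(i):                 # defined only after the child was forked
    return i

os.write(go_fd, b"g")               # the parent now has late_worker; ask the child
child_saw_late_worker = os.read(answer_fd, 1) == b"1"
os.waitpid(pid, 0)
os.close(go_fd)
os.close(answer_fd)
print('child sees late_worker:', child_saw_late_worker)
```

The child answers from its own copy of memory taken at fork time, so the definition made later in the parent is not visible to it.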
- 7. PyCon APAC 2015
Example
multiprocessing: further observations (2/2)
Output:
pid=12093
Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
Process PoolWorker-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
...ignored...
AttributeError: 'module' object has no attribute 'worker'
...ignored...
pid=12101 ppid=12093 i=4
pid=12101 ppid=12093 i=5
pid=12101 ppid=12093 i=6
pid=12101 ppid=12093 i=7
pid=12101 ppid=12093 i=8
pid=12101 ppid=12093 i=9
^CProcess PoolWorker-6:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
task = get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 374, in get
racquire()
KeyboardInterrupt
• tasks 0~3 are lost
• the process hangs until Ctrl+C is pressed
• the traceback after ^C comes from worker #6
• workers #1~4 were terminated by the exception; replacement workers are forked afterwards
- 8. PyCon APAC 2015
Example
overhead of IPC and GIL battle[17]
comparison
import time
from multiprocessing import Process
from threading import Thread
from multiprocessing import Queue as MPQ
from Queue import Queue

MAX = 1000000

def test_(w_class, q_class):
    # producer: runs in a separate process or thread
    def worker(queue):
        for i in xrange(MAX):
            queue.put(i)

    q = q_class()
    w = w_class(target=worker, args=(q,))

    begin = time.time()
    w.start()
    # consumer: runs in the main thread
    for i in xrange(MAX):
        q.get()
    w.join()
    end = time.time()

    return end - begin

def test_sthread():
    # baseline: put and get in a single thread
    q = Queue()

    begin = time.time()
    for i in xrange(MAX):
        q.put(i)
        q.get()
    end = time.time()

    return end - begin

print 'mprocess: %.6f' % test_(Process, MPQ)
print 'mthread: %.6f' % test_(Thread, Queue)
print 'sthread: %.6f' % test_sthread()
Output:
mprocess: 14.225408
mthread: 7.759567
sthread: 2.743325
• The API of multiprocessing is similar to that of threading[18]
• mprocess: IPC is the most costly
• mthread vs. sthread: the overhead of the GIL battle
- 9. PyCon APAC 2015
Example
pp remote node
Server:
$ ppserver.py -w 1 -p 10000 &
[1] 16512
$ ppserver.py -w 1 -p 10001 &
[2] 16514
$ ppserver.py -w 1 -p 10002 &
[3] 16516
$ netstat -nlp
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:10000 0.0.0.0:* LISTEN 16512/python
tcp 0 0 0.0.0.0:10001 0.0.0.0:* LISTEN 16514/python
tcp 0 0 0.0.0.0:10002 0.0.0.0:* LISTEN 16516/python
$ pstree -p $$
bash(11971)-+-ppserver.py(16512)---python(16513)
|-ppserver.py(16514)---python(16515)
|-ppserver.py(16516)---python(16517)
`-pstree(16547)
• -w: # of workers
• -p: port to listen on while waiting for remote jobs
• the python child processes in the pstree output are the workers
- 10. PyCon APAC 2015
Example
pp local node
Output:
pid=16633
pid=16634 ppid=16633 i=0
pid=16513 ppid=16512 i=1
pid=16517 ppid=16516 i=2
pid=16515 ppid=16514 i=3
pid=16513 ppid=16512 i=4
pid=16517 ppid=16516 i=5
pid=16515 ppid=16514 i=6
pid=16634 ppid=16633 i=7
pid=16517 ppid=16516 i=8
pid=16513 ppid=16512 i=9
import os
import pp
import time
import random

print 'pid=%d' % os.getpid()

def worker(i):
    print 'pid=%d ppid=%d i=%d' % (os.getpid(), os.getppid(), i)
    time.sleep(random.randint(1, 3))

servers = ('127.0.0.1:10000', '127.0.0.1:10001', '127.0.0.1:10002')
job_server = pp.Server(1, ppservers=servers)

jobs = list()
for i in xrange(10):
    job = job_server.submit(worker, args=(i,), modules=('time', 'random'))
    jobs.append(job)

for job in jobs:
    job()
• pp.Server(1, ...): # of workers
• a pp worker collects stdout
• calling job() in submission order determines the result order (deterministic)
• accumulative, beware of RSIZE of remote node
• A pp local node is an execution node too. It dispatches jobs to itself first.
• the pid=16634 lines were computed by the local node
- 11. PyCon APAC 2015
Example
ppserver.py gives some exceptions
Exception:
Exception in thread client_socket:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/bin/ppserver.py", line 176, in crun
ctype = mysocket.receive()
File "/usr/local/lib/python2.7/dist-packages/pptransport.py", line 196, in receive
raise RuntimeError("Socket connection is broken")
RuntimeError: Socket connection is broken
Don’t worry. Expected.
- 12. PyCon APAC 2015
Release the GIL
• Especially suitable for processor-bound tasks
• Examples:
• ctypes[19]
• Python/C extension[20][21]
• Cython[22]
• Pyrex[23]
- 13. PyCon APAC 2015
Example
ctypes (1/2)
import threading

duration = 10

def internal_busy():
    import time

    # consume CPU resource inside the interpreter
    count = 0
    begin = time.time()
    while True:
        if time.time() - begin > duration:
            break
        count += 1

    print 'internal_busy(): count = %u' % count

def external_busy():
    from ctypes import CDLL
    from ctypes import c_uint

    libbusy = CDLL('./busy.so')
    busy_wait = libbusy.busy_wait
    # specify input/output types (strongly recommended)
    busy_wait.argtypes = [c_uint]
    busy_wait.restype = None    # busy_wait() returns void

    busy_wait(duration)

print 'two internal busy threads, CPU utilization cannot over 100%'
t1 = threading.Thread(target=internal_busy)
t1.start()
t2 = threading.Thread(target=internal_busy)
t2.start()
t1.join()
t2.join()

print 'with one external busy thread, CPU utilization gains to 200%'
t1 = threading.Thread(target=internal_busy)
t1.start()
t2 = threading.Thread(target=external_busy)
t2.start()
t1.join()
t2.join()
C source of busy.so:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

void busy_wait(unsigned int duration)
{
    uint64_t count = 0;
    time_t begin = time(NULL);

    /* consume CPU resource outside the interpreter */
    while(1) {
        if(time(NULL) - begin > duration)
            break;
        count++;
    }

    printf("busy_wait(): count = %" PRIu64 "\n", count);
}
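The same ctypes pattern can be tried without building busy.so. The Python 3 sketch below assumes a POSIX system and calls `strlen` from the C library as a stand-in for `busy_wait`; the choice of `strlen` is an illustration, not part of the talk. Functions loaded through `ctypes.CDLL` run with the GIL released, which is why the external busy thread above can use a second CPU.

```python
import ctypes
import ctypes.util

# Assumption: POSIX. If the C library cannot be located by name,
# CDLL(None) exposes the symbols already loaded into the process.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

strlen = libc.strlen
# Specify input/output types, as recommended on the slide.
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t

length = strlen(b"GIL")
print('strlen(b"GIL") = %d' % length)
```

For a C function that returns `void`, such as `busy_wait`, `restype` would be set to `None`.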
- 14. PyCon APAC 2015
Example
ctypes (2/2)
Output:
two internal busy threads, CPU utilization cannot over 100%
internal_busy(): count = 12911610
internal_busy(): count = 16578663
with one external busy thread, CPU utilization gains to 200%
internal_busy(): count = 45320393
busy_wait(): count = 3075909775
Atop Display:
CPU | sys 46% | user 72% | irq 0% | idle 82% | wait 0% |
cpu | sys 26% | user 39% | irq 1% | idle 35% | cpu001 w 0% |
cpu | sys 20% | user 33% | irq 0% | idle 46% | cpu000 w 1% |
Atop Display:
CPU | sys 1% | user 199% | irq 0% | idle 0% | wait 0% |
cpu | sys 1% | user 99% | irq 0% | idle 0% | cpu000 w 0% |
cpu | sys 0% | user 100% | irq 0% | idle 0% | cpu001 w 0% |
- 15. PyCon APAC 2015
Example
Python/C extension (1/3)
static PyObject *with_lock(PyObject *self, PyObject *args)
{
    unsigned int duration;

    /* require an unsigned integer argument (busy duration) */
    if(!PyArg_ParseTuple(args, "I", &duration))
        return NULL;

    busy_wait(duration);

    /* return None */
    Py_INCREF(Py_None);
    return Py_None;
}

static PyObject *without_lock(PyObject *self, PyObject *args)
{
    unsigned int duration;

    if(!PyArg_ParseTuple(args, "I", &duration))
        return NULL;

    /* release the GIL before being busy */
    PyThreadState *_save;
    _save = PyEval_SaveThread();
    busy_wait(duration);
    PyEval_RestoreThread(_save);    /* reacquire the GIL */

    Py_INCREF(Py_None);
    return Py_None;
}

/* METH_VARARGS: accept positional args. */
static PyMethodDef busy_methods[] = {
    {"with_lock", with_lock, METH_VARARGS, "Busy wait for a given duration with GIL"},
    {"without_lock", without_lock, METH_VARARGS, "Busy wait for a given duration without GIL"},
    {NULL, NULL, 0, NULL}
};

/* exported symbol name: "init" followed by the module name */
PyMODINIT_FUNC initbusy(void)
{
    /* "busy" is the module name */
    if(Py_InitModule("busy", busy_methods) == NULL)
        PyErr_SetString(PyExc_RuntimeError, "failed to Py_InitModule");
}
Compilation:
$ cat Makefile
busy.so: busy.c
	$(CC) -o $@ -fPIC -shared -I/usr/include/python2.7 busy.c
$ make
- 16. PyCon APAC 2015
Example
Python/C extension (2/3)
import threading

duration = 10

def internal_busy():
    import time

    count = 0
    begin = time.time()
    while True:
        if time.time() - begin > duration:
            break
        count += 1

    print 'internal_busy(): count = %u' % count

def external_busy_with_lock():
    # imported from the busy.so extension
    from busy import with_lock

    with_lock(duration)

def external_busy_without_lock():
    # imported from the busy.so extension
    from busy import without_lock

    without_lock(duration)

print 'two busy threads compete for GIL, CPU utilization cannot over 100%'
t1 = threading.Thread(target=internal_busy)
t1.start()
t2 = threading.Thread(target=external_busy_with_lock)
t2.start()
t1.join()
t2.join()

print 'with one busy thread released GIL, CPU utilization gains to 200%'
t1 = threading.Thread(target=internal_busy)
t1.start()
t2 = threading.Thread(target=external_busy_without_lock)
t2.start()
t1.join()
t2.join()
- 17. PyCon APAC 2015
Example
Python/C extension (3/3)
Output:
two busy threads compete for GIL, CPU utilization cannot over 100%
busy_wait(): count = 3257960533
internal_busy(): count = 45524
with one busy thread released GIL, CPU utilization gains to 200%
internal_busy(): count = 48049276
busy_wait(): count = 3271300229
Atop Display:
CPU | sys 2% | user 100% | irq 0% | idle 99% | wait 0% |
cpu | sys 0% | user 100% | irq 0% | idle 0% | cpu001 w 0% |
cpu | sys 1% | user 0% | irq 0% | idle 99% | cpu000 w 0% |
Atop Display:
CPU | sys 2% | user 198% | irq 0% | idle 0% | wait 0% |
cpu | sys 0% | user 100% | irq 0% | idle 0% | cpu000 w 0% |
cpu | sys 1% | user 98% | irq 0% | idle 0% | cpu001 w 0% |
- 18. PyCon APAC 2015
Cooperative Multitasking
• Only applicable to IO-bound tasks
• Single process, single thread
• no other thread, no GIL battle
• Execute code exactly when it is needed
• Examples:
• generator[24]
• pyev[25]
• gevent[26]
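The generator approach from the list above can be sketched as a tiny round-robin scheduler. This is a Python 3 illustration with hypothetical names (`task`, `run`); it is not taken from the talk or from the cited coroutine material. Each task runs until it reaches `yield`, then hands control back voluntarily, so a single thread interleaves the tasks and there is no GIL battle.

```python
from collections import deque

def task(name, steps, log):
    for i in range(steps):
        log.append('%s:%d' % (name, i))
        yield                       # cooperative hand-over to the scheduler

def run(tasks):
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)           # run the task up to its next yield
        except StopIteration:
            continue                # the task is finished, drop it
        queue.append(current)       # otherwise schedule it again

log = []
run([task('a', 2, log), task('b', 3, log)])
print(log)
```

In a real IO-bound program the `yield` points would sit where the task would otherwise block on IO.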
- 19. PyCon APAC 2015
Example
pyev
import pyev
import signal
import sys

def alarm_handler(watcher, revents):
    sys.stdout.write('.')
    sys.stdout.flush()

def timeout_handler(watcher, revents):
    loop = watcher.loop
    loop.stop()

def int_handler(watcher, revents):
    loop = watcher.loop
    loop.stop()

if __name__ == '__main__':
    loop = pyev.Loop()

    # after 0.0 second, then repeat every 1.0 second
    alarm = loop.timer(0.0, 1.0, alarm_handler)
    alarm.start()

    # fire once after 10.0 seconds and stop the loop
    timeout = loop.timer(10.0, 0.0, timeout_handler)
    timeout.start()

    # stop the loop on ctrl+c
    sigint = loop.signal(signal.SIGINT, int_handler)
    sigint.start()

    loop.start()
Case 1 Output (11 dots):
...........
Case 2 Output:
..^C
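libev Timer:
(after)|(repeat)|(repeat)|(repeat)|...
An event is raised at the end of each interval.
In the example:
• the first event is raised after 0.0 second
• then an event is raised every 1.0 second
• 11 events are raised in total before the 10.0-second timeout stops the loop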
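The count of 11 can be checked with a small model of the (after, repeat) schedule. The function `fire_times` is a hypothetical helper written for this note in Python 3; it is not part of pyev or libev. It assumes that an alarm due at the same instant as the 10.0-second timeout still fires, which matches the 11 dots shown in the output.

```python
def fire_times(after, repeat, stop_at):
    # Model of a timer: first event at `after`, then every `repeat` seconds,
    # up to and including `stop_at`. A repeat of 0 means a one-shot timer.
    times = []
    t = after
    while t <= stop_at:
        times.append(t)
        if repeat <= 0:
            break
        t += repeat
    return times

alarm_events = fire_times(0.0, 1.0, 10.0)      # the alarm watcher
timeout_events = fire_times(10.0, 0.0, 10.0)   # the one-shot timeout watcher
print(len(alarm_events), 'alarm events:', alarm_events)
print('timeout at:', timeout_events)
```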
- 20. PyCon APAC 2015
Example
pyev: further observations
# the watcher is never bound to a name
loop.timer(0.0, 1.0, alarm_handler).start()

loop.start()
Output:
Exception SystemError: 'null argument to internal routine' in Segmentation fault (core dumped)
timeout = loop.timer(0.0, 1.0, alarm_handler)
timeout.start()

# the name `timeout` is rebound here, so the first watcher is no longer referenced
timeout = loop.timer(10.0, 0.0, timeout_handler)
timeout.start()

loop.start()

alarm = loop.timer(0.0, 1.0, alarm_handler)
alarm.start()
# the name `sigint` is rebound two lines below, so this timer is no longer referenced
sigint = loop.timer(10.0, 0.0, timeout_handler)
sigint.start()
sigint = loop.signal(signal.SIGINT, int_handler)
sigint.start()
loop.start()
Output:
...........Exception SystemError: 'null argument to internal routine' in Segmentation fault (core dumped)
manual of ev[27]:
you are responsible for allocating the
memory for your watcher structures
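Why an unreferenced watcher disappears can be illustrated with a toy model in plain Python 3. `ToyLoop` and `ToyWatcher` are hypothetical classes, and the use of `weakref.WeakSet` is a modelling assumption made for this sketch; it says nothing about how pyev is implemented. The loop here does not keep its watchers alive, so a watcher survives only while the caller holds a reference to it.

```python
import gc
import weakref

class ToyWatcher(object):
    def __init__(self, loop, name):
        self.loop = loop
        self.name = name

    def start(self):
        self.loop._watchers.add(self)

class ToyLoop(object):
    def __init__(self):
        # Assumption of this model: the loop holds only weak references.
        self._watchers = weakref.WeakSet()

    def timer(self, name):
        return ToyWatcher(self, name)

    def alive(self):
        gc.collect()                # make the result independent of refcounting
        return sorted(w.name for w in self._watchers)

loop = ToyLoop()
loop.timer('dropped').start()       # never bound to a name: freed right away
kept = loop.timer('kept')
kept.start()
first = loop.alive()

kept = loop.timer('rebound')        # rebinding the name frees the 'kept' watcher
kept.start()
second = loop.alive()
print(first, second)
```

In the crashing snippets above the event loop presumably still expects the freed watcher to exist; keeping every watcher bound to its own name avoids that.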
- 21. PyCon APAC 2015
Example
gevent
import gevent
from gevent import signal
import signal as o_signal
import sys

if __name__ == '__main__':
    ctx = dict(stop_flag=False)

    # ctrl+c sets the flag instead of interrupting the loop
    def int_handler():
        ctx['stop_flag'] = True
    gevent.signal(o_signal.SIGINT, int_handler)

    count = 0
    while not ctx['stop_flag']:
        sys.stdout.write('.')
        sys.stdout.flush()

        # yield control for one second
        gevent.sleep(1)

        count += 1
        if count > 10:
            break
Case 1 Output:
...........
Case 2 Output:
..^C
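For readers without gevent installed, the same loop can be sketched with the standard-library asyncio, which is swapped in here as a different cooperative framework and is not what the talk uses. The Python 3 sketch keeps the `ctx['stop_flag']` idea from the slide, shortens the sleep so it finishes quickly, and leaves out the SIGINT handler; `ticker` and `ticks` are illustrative names.

```python
import asyncio

async def ticker(ctx, ticks, interval):
    count = 0
    while not ctx['stop_flag']:
        ticks.append('.')
        # Cooperative yield point, the counterpart of gevent.sleep(1).
        await asyncio.sleep(interval)
        count += 1
        if count > 10:
            break

ctx = dict(stop_flag=False)
ticks = []
asyncio.run(ticker(ctx, ticks, 0.01))
print(''.join(ticks))
```

As in Case 1 above, the loop stops by itself after 11 dots when nothing sets the stop flag.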
- 22. PyCon APAC 2015
Interpreter as an Instance
• Rough idea, not a concrete solution yet
• C program, single process, multi-thread
• can still share state at a relatively low cost
• Allocate memory space for the interpreter context
• that is, let Py_Initialize() accept an address at which to place the instance context
- 23. PyCon APAC 2015
Conclusion
• How to live well with the GIL?
• Multi-process
• Release the GIL
• Cooperative Multitasking
• Perhaps, Interpreter as an Instance
- 24. PyCon APAC 2015
References
[1]: http://en.wikipedia.org/wiki/Global_Interpreter_Lock
[2]: http://en.wikipedia.org/wiki/Giant_lock
[3]: http://en.wikipedia.org/wiki/Fine-grained_locking
[4]: http://en.wikipedia.org/wiki/Non-blocking_algorithm
[5]: https://wiki.python.org/moin/GlobalInterpreterLock
[6]: http://en.wikipedia.org/wiki/Multiprocessing
[7]: https://docs.python.org/2/faq/library.html#can-t-we-get-rid-of-the-global-interpreter-lock
[8]: http://www.artima.com/weblogs/viewpost.jsp?thread=214235
[9]: http://dabeaz.blogspot.tw/2011/08/inside-look-at-gil-removal-patch-of.html
[10]: http://en.wikipedia.org/wiki/Embarrassingly_parallel
[11]: http://en.wikipedia.org/wiki/Inter-process_communication
[12]: https://docs.python.org/2/library/multiprocessing.html
[13]: http://www.parallelpython.com/
[14]: https://code.google.com/p/pycsp/
[15]: https://github.com/penvirus/gil1
[16]: http://en.wikipedia.org/wiki/Nondeterministic_algorithm
[17]: http://www.dabeaz.com/python/GIL.pdf
[18]: https://docs.python.org/2/library/threading.html
[19]: https://docs.python.org/2/library/ctypes.html
[20]: https://docs.python.org/2/c-api/
[21]: https://docs.python.org/2/c-api/init.html#releasing-the-gil-from-extension-code
[22]: http://cython.org/
[23]: http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/
[24]: http://www.dabeaz.com/coroutines/Coroutines.pdf
[25]: http://pythonhosted.org/pyev/
[26]: http://www.gevent.org/
[27]: http://linux.die.net/man/3/ev