multiprocessing in python - what gets inherited by forkserver process from parent process?
I am trying to use forkserver and I encountered NameError: name 'xxx' is not defined in the worker processes.
I am using Python 3.6.4, but the documentation should be the same. From https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods it says:
The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
Also, it says:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
So apparently the large object that my worker processes need to work on was not inherited by the server process and then passed down to the workers. Why did that happen? I wonder what exactly gets inherited by the forkserver process from the parent process.
This is what my code looks like:
import multiprocessing
import (a bunch of other modules)

def worker_func(nameList):
    global largeObject
    for item in nameList:
        # get some info from largeObject using item as index
        # do some calculation
    return [item, info]

if __name__ == '__main__':
    result = []
    largeObject  # This is my large object, it's read-only and no modification will be made to it.
    nameList  # Here is a list variable that I will need to get info for each item in it from the largeObject
    ctx_in_main = multiprocessing.get_context('forkserver')
    print('Start parallel, using forking/spawning/?:', ctx_in_main.get_context())
    cores = ctx_in_main.cpu_count()
    with ctx_in_main.Pool(processes=4) as pool:
        for x in pool.imap_unordered(worker_func, nameList):
            result.append(x)
Thank you!
Best,
Answers
Theory
Below is an excerpt from the Bojan Nikolic blog:
Modern Python versions (on Linux) provide three ways of starting the separate processes:
Fork()-ing the parent process and continuing with the same process image in both parent and child. This method is fast, but potentially unreliable when the parent state is complex.
Spawning the child processes, i.e., fork()-ing and then execv to replace the process image with a new Python process. This method is reliable but slow, as the process image is reloaded afresh.
The forkserver mechanism, which consists of a separate Python server that has a relatively simple state and which is fork()-ed when a new process is needed. This method combines the speed of fork()-ing with good reliability (because the parent being forked is in a simple state).
Forkserver
The third method, forkserver, is illustrated below. Note that children retain a copy of the forkserver state. This state is intended to be relatively simple, but it is possible to adjust this through the multiprocessing API via the set_forkserver_preload() method.
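For concreteness, here is a minimal sketch of how a start method is selected through a context object (my own illustration, not from the blog); the method names are the ones the standard library accepts, and 'fork' is Unix-only:

# start_methods.py (hypothetical demo file)
import multiprocessing

def child(x):
    return x * x

if __name__ == '__main__':
    # Each context exposes the full multiprocessing API, pinned to one start method.
    for method in ('fork', 'spawn', 'forkserver'):
        ctx = multiprocessing.get_context(method)
        with ctx.Pool(processes=2) as pool:
            print(method, pool.map(child, [1, 2, 3]))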
Practice
Thus, if you want something to be inherited by child processes from the parent, it must be specified in the forkserver state by means of set_forkserver_preload(modules_names), which sets the list of module names to try to load in the forkserver process. I give an example below:
# inherited.py
large_obj = {"one": 1, "two": 2, "three": 3}
# main.py
import multiprocessing
import os
from time import sleep

from inherited import large_obj

def worker_func(key: str):
    print(f'PID={os.getpid()}, obj id={id(large_obj)}')
    sleep(1)
    return large_obj[key]
if __name__ == '__main__':
    result = []
    ctx_in_main = multiprocessing.get_context('forkserver')
    ctx_in_main.set_forkserver_preload(['inherited'])
    cores = ctx_in_main.cpu_count()
    with ctx_in_main.Pool(processes=cores) as pool:
        for x in pool.imap(worker_func, ["one", "two", "three"]):
            result.append(x)
    for res in result:
        print(res)
Output:
# The PIDs are different but the address is always the same
PID=18603, obj id=139913466185024
PID=18604, obj id=139913466185024
PID=18605, obj id=139913466185024
And if we don't use preloading
...
ctx_in_main = multiprocessing.get_context('forkserver')
# ctx_in_main.set_forkserver_preload(['inherited'])
cores = ctx_in_main.cpu_count()
...
# The PIDs are different, the addresses are different too
# (but sometimes they can coincide)
PID=19046, obj id=140011789067776
PID=19047, obj id=140011789030976
PID=19048, obj id=140011789030912
So after an inspiring discussion with Alex I think I have sufficient info to address my question: what exactly gets inherited by the forkserver process from the parent process?
Basically, when the server process starts, it imports your main module, and everything before if __name__ == '__main__' gets executed. That is why my code doesn't work: large_object is nowhere to be found in the server process, nor in any of the worker processes that fork from the server process.
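One can check this by putting a print at module level and watching which PIDs execute it; a small diagnostic sketch (probe.py is a file name I made up):

# probe.py - where does module-level code run under forkserver?
import multiprocessing
import os

# This line runs in every process that imports this module.
print(f'module level executed in PID={os.getpid()}')

def worker(_):
    return os.getpid()

if __name__ == '__main__':
    ctx = multiprocessing.get_context('forkserver')
    with ctx.Pool(processes=2) as pool:
        print('worker PIDs:', sorted(set(pool.map(worker, range(4)))))

On my understanding, the module-level print fires once in the main process and once in the server process, but not once per worker, because the workers are forked from the server with the module already imported.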
Alex's solution works because large_object now gets imported into both the main process and the server process, so every worker forked from the server gets large_object as well. Combined with set_forkserver_preload(modules_names), all the workers might even get the same large_object, from what I saw. The reason for using forkserver is explicitly spelled out in the Python documentation and in Bojan's blog:
When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.
The forkserver mechanism, which consists of a separate Python server that has a relatively simple state and which is fork()-ed when a new process is needed. This method combines the speed of fork()-ing with good reliability (because the parent being forked is in a simple state).
So using forkserver is more about erring on the safe side here.
On a side note, if you use fork as the start method, you don't need to import anything, since every child process gets a copy of the parent process's memory (or a reference, if the system uses copy-on-write; please correct me if I am wrong). In that case, using global large_object gives you access to large_object in worker_func directly.
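To illustrate that point, a minimal sketch of the fork start method (my own example, assuming Linux, where fork is available):

# fork_globals.py - globals are inherited across fork
import multiprocessing
import os

# Built once at module level, before the pool forks; with 'fork' the
# workers see this object through the inherited address space.
large_object = {i: i * i for i in range(1000000)}

def worker_func(key):
    # No import or pickling needed: the global already lives in the child's memory.
    return os.getpid(), id(large_object), large_object[key]

if __name__ == '__main__':
    ctx = multiprocessing.get_context('fork')
    with ctx.Pool(processes=2) as pool:
        for pid, obj_id, value in pool.imap_unordered(worker_func, [1, 2, 3]):
            print(f'PID={pid}, obj id={obj_id}, value={value}')

One caveat on the copy-on-write point: CPython's reference counting writes to object headers, so even read-only access from the workers gradually dirties (and therefore copies) the shared pages.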
The forkserver approach might not be suitable for me, though, because the issue I am facing is memory overhead. All the operations that get me large_object in the first place are memory-consuming, so I don't want any unnecessary resources in my worker processes.
If I put all those calculations directly into inherited.py as Alex suggested, they would be executed twice (once when I import the module in main and once when the server imports it; maybe even more when worker processes are born?). That is fine if I just want a single-threaded, safe process that workers can fork from, but since I am trying to get the workers to inherit only large_object and no unnecessary resources, it won't work. Putting those calculations inside if __name__ == '__main__' in inherited.py won't work either, since then none of the processes will execute them, including main and server.
So, in conclusion, if the goal is to get workers to inherit minimal resources, I am better off breaking my code in two: run calculation.py first, pickle the large_object, exit the interpreter, and then start a fresh interpreter to load the pickled large_object. After that I can just go nuts with either fork or forkserver.
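A sketch of that two-stage layout (the file names and the helper are my own, and the computation is a stand-in):

# calculation.py - stage 1: build the expensive object, pickle it, exit
import pickle

def build_large_object():
    # stand-in for the memory-hungry computation
    return {"one": 1, "two": 2, "three": 3}

if __name__ == '__main__':
    with open('large_object.pkl', 'wb') as f:
        pickle.dump(build_large_object(), f)

# run_workers.py - stage 2: a fresh interpreter loads only the finished
# object and then forks, so workers inherit nothing else
import multiprocessing
import pickle

with open('large_object.pkl', 'rb') as f:
    large_object = pickle.load(f)

def worker_func(key):
    return large_object[key]

if __name__ == '__main__':
    ctx = multiprocessing.get_context('fork')  # or 'forkserver'
    with ctx.Pool(processes=2) as pool:
        print(pool.map(worker_func, ["one", "two", "three"]))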