I have some simple multiprocessing code:
from multiprocessing import Pool
import time

def worker(data):
    # Each worker just sleeps, so the processes stay alive long enough
    # to measure their memory usage.
    time.sleep(20)

if __name__ == "__main__":
    numprocs = 10
    pool = Pool(numprocs)
    a = ['a' for i in range(1000000)]          # one large list (~1M elements)
    b = [a + [] for i in range(100)]           # 100 copies of a
    data1 = [b + [] for i in range(numprocs)]  # numprocs copies of b
    data2 = [data1 + []] + ['1' for i in range(numprocs - 1)]  # one copy of all of data1, rest tiny
    data3 = [['1'] for i in range(numprocs)]   # essentially empty, for baseline overhead
    #data = data1
    #data = data2
    data = data3
    result = pool.map(worker, data)
b is just a large list. data is a list of length numprocs that is passed to pool.map, so I expect numprocs processes to be forked and each element of data to be passed to one of them.
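(As a sanity check of that assumption, not part of the measurement itself, the worker can report its process id; with 10 elements and 10 workers that each block in sleep, you should typically see 10 distinct PIDs. The print is purely for illustration.)

import os

def worker(data):
    # Report which process handled this element, then idle.
    print(os.getpid())
    time.sleep(20)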
I test 3 different data objects: data1 and data2 have practically the same total size, but with data1 each process gets a copy of the same large object, whereas with data2 one process gets all of data1 and the others get just a '1' (basically nothing). data3 is basically empty and measures the base overhead of forking the processes.
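One rough way to check that data1 and data2 really are about the same size is to compare their serialized sizes, since pool.map pickles each element before sending it to a child (this snippet is only a sketch and is not part of the script above):

import pickle

# Approximate the bytes sent to the children: pool.map pickles each element.
for name, payload in (('data1', data1), ('data2', data2), ('data3', data3)):
    total = sum(len(pickle.dumps(item)) for item in payload)
    print(name, total / 1024.0 / 1024.0, 'MB')

Note that pickle deduplicates repeated references, so the serialized size can be much smaller than the in-memory size; it is only useful here for comparing the payloads against each other.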
Problem:
The overall memory used is vastly different between data1 and data2. I measure the amount of additional memory used by the last line (pool.map()) and I get:
data1: ~8GB
data2: ~0.8GB
data3: ~0GB
Shouldn't data1 and data2 use the same amount of memory, since the total amount of data passed to the children is the same? What is going on?
I measure memory usage from the Active field of /proc/meminfo on a Linux machine (MemTotal - MemFree gives the same answer).
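For reference, the measurement is taken roughly like this (a sketch; the field names and kB units follow the standard /proc/meminfo format):

def meminfo_kb(field):
    # Return the value of a /proc/meminfo field in kB, e.g. 'Active' or 'MemFree'.
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith(field + ':'):
                return int(line.split()[1])
    raise KeyError(field)

before = meminfo_kb('Active')
result = pool.map(worker, data)
after = meminfo_kb('Active')
print('pool.map used ~%.2f GB' % ((after - before) / 1024.0 / 1024.0))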