Problem
How much load is put on an operating system when your application create and destroys a large number of processes.
Simulation
To simulate a condition when an application creates a massive number of processes I'm going to use a parallel tool. With a help of it we will run a test to simulate multiple creation and destroying of processes.
Basic pre-configuration
# we don't want any IO to the disk [1] # mount -t tmpfs -o size=10m tmpfs /tmp/tmpfs.d/
The command below is going to simulate a condition where 100 processes will be started simultaneously Every process lives only for a short time before it finishes and dies. Once a process is terminated a new one will be started to replace it.
# cd /tmp/tmpfs.d/ # time seq 1 10000 | parallel --joblog /tmp/tmpfs.d/joblog.txt -j100 echo '{} $(date)' ">" log.{}.txt real 4m33.859s user 0m28.870s sys 1m39.590s
We can monitor the resource utilization on the system with a help of these 2 simple commands.
Number of process
# while true; do ps -AHf | wc -l ; sleep .1; done | uniq 178 176 177 176 ...
CPU utilization
# top | grep Cpu ... # the average results were Cpu(s): 23.4%us, 76.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st ... # the stats during the last seconds Cpu(s): 21.5%us, 78.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu(s): 26.2%us, 73.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu(s): 24.8%us, 75.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu(s): 25.1%us, 74.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu(s): 23.1%us, 76.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu(s): 22.3%us, 77.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu(s): 22.8%us, 77.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu(s): 22.4%us, 77.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st Cpu(s): 12.3%us, 42.9%sy, 0.0%ni, 44.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st Cpu(s): 0.3%us, 2.3%sy, 0.0%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu(s): 1.0%us, 2.3%sy, 0.0%ni, 96.3%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st Cpu(s): 0.3%us, 2.0%sy, 0.0%ni, 97.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu(s): 0.3%us, 2.0%sy, 0.0%ni, 97.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Analysis
The above stats show that the CPU was primary busy executing operating system level code. The user and sys times confirms that only for 28s the processor was executing a user level code. Majority of the time the Linux kernel was performing all necessary tasks to create or destroy of processes.
Looking at the system calls we can see that the most expensive calls were 'clone' and 'close'. Interestingly the 'close' took longer then 'clone' but this may be truth because the process we were staring over and over was only 'echo
# strace -r -o strace.r.txt parallel --ungroup -j1 echo '{} $(date)' ">" log.{}.txt ::: 1 2
Syscall relative timestamps
cat strace.r.txt | egrep -i -A1 '0\.00[2-9]' -- 0.004394 --- SIGCHLD (Child exited) @ 0 (0) --- 0.000042 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7fdaa838d4c0}, NULL, 8) = 0 -- 0.003015 close(7) = 0 0.000076 rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7fdaa838d4c0}, {SIG_DFL, [], SA_RESTORER, 0x7fdaa838d4c0}, 8) = 0 -- 0.005035 --- SIGCHLD (Child exited) @ 0 (0) --- 0.000043 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7fdaa838d4c0}, NULL, 8) = 0 -- 0.002960 close(7) = 0 0.000074 rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7fdaa838d4c0}, {SIG_DFL, [], SA_RESTORER, 0x7fdaa838d4c0}, 8) = 0 -- 0.003542 --- SIGCHLD (Child exited) @ 0 (0) --- 0.000042 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7fdaa838d4c0}, NULL, 8) = 0 # truncated
# cat strace.r.txt | egrep -i -A1 'clone' 0.000102 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdaa8ca89d0) = 25086 0.003318 close(7) = 0 -- 0.000085 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdaa8ca89d0) = 25087 0.002620 close(7) = 0 -- 0.000085 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdaa8ca89d0) = 25088 0.002617 close(7) = 0 -- 0.000097 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdaa8ca89d0) = 25089 0.003037 close(7) = 0 # truncated
# strace -c parallel --ungroup -j1 echo '{} $(date)' log.{}.txt ::: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 72.01 0.108059 2573 42 clone ### on average 65% 26.97 0.040474 33 1211 3 wait4 0.92 0.001379 1 1152 1 select 0.02 0.000033 1 33 chmod 0.02 0.000029 0 268 12 open 0.02 0.000025 1 48 gettimeofday 0.02 0.000024 0 498 73 lseek 0.02 0.000024 0 233 1 rt_sigaction 0.02 0.000024 0 294 fcntl 0.00 0.000000 0 213 read 0.00 0.000000 0 1 write 0.00 0.000000 0 401 close ... ------ ----------- ----------- --------- --------- ---------------- 100.00 0.150071 5978 975 total
References
No comments:
Post a Comment