Sometimes an AWS Lightsail Linux instance just stops responding
Some time ago, I wrote a series of posts on how to run a .NET Core app on an AWS Lightsail Linux instance. Everything worked nicely, but about once a month my web site stopped responding, and I could not connect to the instance at all to diagnose the issue. All I could do was restart the AWS instance. At first I thought it might be an AWS problem, or perhaps some issue in .NET. I updated everything I could, but the problem persisted. When it happened the last time, I decided to check the kernel logs, and I found this:
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216567] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/www.example.com.service,task=dotnet,pid=511,uid=1001
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216597] Out of memory: Killed process 511 (dotnet) total-vm:3007440kB, anon-rss:105260kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:712kB oom_score_adj:0
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.241642] oom_reaper: reaped process 511 (dotnet), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
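For reference: on a systemd-based distribution (like the Ubuntu images Lightsail offers), messages like these can be pulled out of the kernel log with something along these lines; plain dmesg works as well:

# Search the kernel log for OOM-killer activity
journalctl -k | grep -i "out of memory"
# Or, without journald, with human-readable timestamps:
dmesg -T | grep -i oom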
Quick research showed that an out-of-memory event is quite bad, and most of the time the system will not survive it. Initially, I thought I was hitting it because there were many requests and the system just needed a tiny bit more memory to process them all. Lightsail instances do not have a swap file, and I thought this was the source of my problems. But I have a rule: “After I have some plausible explanation, I have to verify that there is no contradiction in all available facts.”
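The “no swap” part, by the way, is easy to confirm with standard Linux tools:

swapon --show    # prints nothing when no swap device or file is configured
free -h          # the "Swap:" row shows 0B total on a swapless instance

And the facts were right there in the same kernel log. Firstly, I found this per-task memory table: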
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216494] Tasks state (memory values in pages):
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216495] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216497] [ 160] 0 160 27779 995 212992 0 -250 systemd-journal
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216500] [ 178] 0 178 624 145 45056 0 0 bpfilter_umh
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216503] [ 225] 0 225 2213 944 61440 0 -1000 systemd-udevd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216505] [ 323] 0 323 70052 4499 94208 0 -1000 multipathd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216507] [ 384] 102 384 22560 1009 77824 0 0 systemd-timesyn
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216509] [ 448] 100 448 6688 851 77824 0 0 systemd-network
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216512] [ 450] 101 450 5977 1768 86016 0 0 systemd-resolve
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216513] [ 484] 0 484 58611 418 90112 0 0 accounts-daemon
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216515] [ 485] 0 485 637 185 49152 0 0 acpid
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216518] [ 490] 0 490 1394 550 49152 0 0 cron
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216519] [ 493] 103 493 1975 717 53248 0 -900 dbus-daemon
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216521] [ 501] 0 501 6580 2741 90112 0 0 networkd-dispat
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216523] [ 504] 104 504 56127 515 98304 0 0 rsyslogd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216525] [ 505] 0 505 308790 1939 188416 0 0 amazon-ssm-agen
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216527] [ 508] 0 508 4207 443 73728 0 0 systemd-logind
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216529] [ 510] 0 510 98110 1010 126976 0 0 udisksd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216530] [ 511] 1001 511 751860 26315 729088 0 0 dotnet
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216532] [ 512] 0 512 950 518 45056 0 0 atd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216534] [ 532] 0 532 1098 425 49152 0 0 agetty
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216536] [ 540] 0 540 717 392 40960 0 0 agetty
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216537] [ 573] 0 573 16417 393 94208 0 0 nginx
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216539] [ 574] 33 574 16783 1130 98304 0 0 nginx
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216541] [ 605] 0 605 26286 2672 102400 0 0 unattended-upgr
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216543] [ 681] 0 681 3046 743 57344 0 -1000 sshd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216545] [ 806] 0 806 311945 3530 212992 0 0 ssm-agent-worke
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216547] [ 58199] 0 58199 183613 4356 253952 0 -900 snapd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216549] [ 126266] 0 126266 58181 398 86016 0 0 polkitd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216552] [ 840256] 0 840256 107913 14004 286720 0 0 fwupd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216554] [ 841890] 0 841890 654 121 45056 0 0 apt.systemd.dai
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216556] [ 841894] 0 841894 654 380 45056 0 0 apt.systemd.dai
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216558] [ 841925] 0 841925 77338 13629 389120 0 0 unattended-upgr
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216560] [ 842324] 107 842324 1697 179 53248 0 0 uuidd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216562] [ 843169] 0 843169 70163 1060 135168 0 0 packagekitd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216564] [ 843196] 0 843196 87834 23683 393216 0 0 unattended-upgr
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216566] [ 843232] 0 843232 87834 23118 294912 0 0 unattended-upgr
After I added up the total_vm column, I got only 2,477,307. If those were bytes, that would be just around 2.5 megabytes, which is obviously not enough to trigger out-of-memory. If they were kilobytes, it would be about 2.5 GB. The instance has only 512 MB of memory, so if my theory were correct, the processes could not allocate much more than that.
But then I found out that total_vm is measured in pages, and a page on Intel-compatible CPUs is 4096 bytes, or 4 kilobytes. I also found “process 511 (dotnet) total-vm:3007440kB” in the log. So it was clear that dotnet alone had allocated about 3 GB of virtual memory, and no reasonably sized swap file would save a 512 MB instance from a process that grows that large. As a result, I had to drop my theory about the missing swap file.
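The arithmetic is easy to double-check: the dotnet row in the table above shows a total_vm of 751,860 pages, and at 4 kB per page that is exactly the 3,007,440 kB from the OOM message:

getconf PAGESIZE               # 4096 bytes per page on x86-64
echo $((751860 * 4))           # 3007440 kB, matches "total-vm:3007440kB"
echo $((751860 * 4 / 1024))    # about 2936 MB, i.e. roughly 3 GB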
It is obvious that my application leaks memory. The right solution would be to fix the application, but anybody who deals with a big application developed by a big team knows that this is not an easy thing to do: you fix one memory leak, and they add two more :) And to verify that there are no memory leaks, you would need tests for all possible scenarios, which is a lot of work. Obviously, it is easier to simply restart your web site regularly, or whenever it has consumed too much memory.
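Just as a sketch of the idea (the details come in the next post): if the site runs as a systemd unit, a nightly restart can be as simple as a root cron job. The unit name www.example.com.service below is taken from the OOM log above; adjust it to whatever your service is actually called:

# Append a nightly 04:00 restart of the site's systemd unit to root's crontab
( sudo crontab -l 2>/dev/null; echo '0 4 * * * /usr/bin/systemctl restart www.example.com.service' ) | sudo crontab -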
In my case, I use a third-party blogging platform, and to be honest, I don’t really have time to hunt down the source of that memory leak, so I chose the simpler solution and decided to restart my web site regularly. I will explain all the details in the next post.