Sometimes my AWS Lightsail Linux instance stops responding entirely

Some time ago, I wrote a series of posts on how to run a .NET Core app on an AWS Lightsail Linux instance. Everything worked nicely, but about once a month my web site stopped responding, and I could not connect to my instance at all to diagnose the issue. All I could do was restart the instance. At first I thought it could be an AWS problem, or perhaps some issue in .NET. I updated everything I could, but the problem persisted. When it happened the last time, I decided to check the kernel logs, and I found this:

Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216567] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/www.example.com.service,task=dotnet,pid=511,uid=1001
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216597] Out of memory: Killed process 511 (dotnet) total-vm:3007440kB, anon-rss:105260kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:712kB oom_score_adj:0
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.241642] oom_reaper: reaped process 511 (dotnet), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
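If you run into the same symptoms, a quick way to check is to search the kernel log for OOM messages. Here is a small Python sketch of that (it assumes a systemd-based distribution where journalctl -k prints the kernel log; grepping /var/log/kern.log or the dmesg output works just as well):

import subprocess

# Read the kernel log via journalctl and keep only the OOM-related lines.
kernel_log = subprocess.run(["journalctl", "-k", "--no-pager"],
                            capture_output=True, text=True, check=True).stdout

for line in kernel_log.splitlines():
    if "Out of memory" in line or "oom-kill" in line or "oom_reaper" in line:
        print(line)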

A quick bit of research showed that an out-of-memory event is quite bad, and most of the time the system will not survive it. Initially, I thought I was hitting it because there were many requests and the system just needed a tiny bit more memory to process them all. Lightsail instances do not have a swap file, and I thought that was the source of my problems. But I have a rule: “Once I have a plausible explanation, I have to verify that it does not contradict any of the available facts.” First, I found this:

Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216494] Tasks state (memory values in pages):
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216495] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216497] [    160]     0   160    27779      995   212992        0          -250 systemd-journal
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216500] [    178]     0   178      624      145    45056        0             0 bpfilter_umh
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216503] [    225]     0   225     2213      944    61440        0         -1000 systemd-udevd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216505] [    323]     0   323    70052     4499    94208        0         -1000 multipathd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216507] [    384]   102   384    22560     1009    77824        0             0 systemd-timesyn
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216509] [    448]   100   448     6688      851    77824        0             0 systemd-network
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216512] [    450]   101   450     5977     1768    86016        0             0 systemd-resolve
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216513] [    484]     0   484    58611      418    90112        0             0 accounts-daemon
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216515] [    485]     0   485      637      185    49152        0             0 acpid
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216518] [    490]     0   490     1394      550    49152        0             0 cron
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216519] [    493]   103   493     1975      717    53248        0          -900 dbus-daemon
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216521] [    501]     0   501     6580     2741    90112        0             0 networkd-dispat
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216523] [    504]   104   504    56127      515    98304        0             0 rsyslogd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216525] [    505]     0   505   308790     1939   188416        0             0 amazon-ssm-agen
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216527] [    508]     0   508     4207      443    73728        0             0 systemd-logind
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216529] [    510]     0   510    98110     1010   126976        0             0 udisksd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216530] [    511]  1001   511   751860    26315   729088        0             0 dotnet
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216532] [    512]     0   512      950      518    45056        0             0 atd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216534] [    532]     0   532     1098      425    49152        0             0 agetty
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216536] [    540]     0   540      717      392    40960        0             0 agetty
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216537] [    573]     0   573    16417      393    94208        0             0 nginx
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216539] [    574]    33   574    16783     1130    98304        0             0 nginx
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216541] [    605]     0   605    26286     2672   102400        0             0 unattended-upgr
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216543] [    681]     0   681     3046      743    57344        0         -1000 sshd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216545] [    806]     0   806   311945     3530   212992        0             0 ssm-agent-worke
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216547] [  58199]     0 58199   183613     4356   253952        0          -900 snapd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216549] [ 126266]     0 126266    58181      398    86016        0             0 polkitd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216552] [ 840256]     0 840256   107913    14004   286720        0             0 fwupd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216554] [ 841890]     0 841890      654      121    45056        0             0 apt.systemd.dai
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216556] [ 841894]     0 841894      654      380    45056        0             0 apt.systemd.dai
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216558] [ 841925]     0 841925    77338    13629   389120        0             0 unattended-upgr
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216560] [ 842324]   107 842324     1697      179    53248        0             0 uuidd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216562] [ 843169]     0 843169    70163     1060   135168        0             0 packagekitd
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216564] [ 843196]     0 843196    87834    23683   393216        0             0 unattended-upgr
Feb 10 07:31:18 ip-1-2-3-4 kernel: [1512878.216566] [ 843232]     0 843232    87834    23118   294912        0             0 unattended-upgr

After adding up the total_vm column, I got only 2,477,307. If that were bytes, it would be just around 2.5 megabytes, which is obviously not enough to trigger an out-of-memory condition. If it were kilobytes, it would be about 2.5 GB. The instance has only 512 MB of memory, so if my theory were correct, the processes should not have been able to allocate much more than that.

But later I found that total_vm is measured in pages, and a page on Intel-compatible CPUs is 4096 bytes, or 4 kilobytes. Then I noticed “process 511 (dotnet) total-vm:3007440kB” in the log above. It is clear that dotnet alone had allocated about 3 GB of virtual memory, and as a result I had to drop my theory about the missing swap file.
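To double-check the units, here is the arithmetic as a small Python sketch (751,860 is the total_vm of the dotnet row in the task table above, and 2,477,307 is my sum of the whole column):

PAGE_SIZE_KB = 4  # x86/x86_64 pages are 4096 bytes = 4 kB

dotnet_total_vm_pages = 751_860          # total_vm of pid 511 (dotnet) from the task table
print(dotnet_total_vm_pages * PAGE_SIZE_KB)   # 3007440 -> matches "total-vm:3007440kB" in the OOM report

all_tasks_total_vm_pages = 2_477_307     # my sum of the total_vm column
print(all_tasks_total_vm_pages * PAGE_SIZE_KB / 1024 / 1024)  # ~9.45 GB of virtual memory in total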

It is obvious that my application leaks memory. The right solution would be to fix the application, but anyone who has dealt with a big application developed by a big team knows that this is not an easy thing to do. You fix one memory leak, and they add two more :) And to verify that there are no memory leaks, you would need tests for every possible scenario, which is a lot of work. It is simply easier to restart your web site regularly, or whenever it has consumed too much memory.
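For example, if the site runs as a systemd service (the name www.example.com.service shows up in the OOM log above), one common approach is to let systemd cap the service’s memory and restart it automatically. This is only a rough sketch, not necessarily what I ended up doing: the drop-in path and the 300M threshold are made up for illustration, and on older cgroup v1 systems MemoryLimit= may be needed instead of MemoryMax=.

# /etc/systemd/system/www.example.com.service.d/override.conf  (hypothetical drop-in)
[Service]
# Kill only the dotnet service when it grows too large, instead of starving the whole 512 MB instance.
MemoryMax=300M
# Bring the service back automatically after it gets killed.
Restart=always
RestartSec=5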

In my case, I use a third-party blogging platform and, to be honest, I don’t really have the time to track down the source of that memory leak, so I chose the simpler option and decided to restart my web site regularly. I will explain all the details in the next post.