Running .NET Core app on AWS Lightsail Linux instance. Part 6

Previous part is here. This post explains how to restart your web site at specific intervals without returning errors to clients. You can scroll down to the solution if you are not interested in my thoughts and what I tried.

Thoughts and research

As I mentioned in this post, I decided to restart my web site periodically to avoid an out-of-memory issue. And I would like to mention that it wasn’t easy to do. Obviously, restarting a web site is super simple: all you need to do is restart its service. But that means your web site will be unavailable for some time. Depending on the size of the web site, we are talking about seconds or even minutes. In my case it would probably be one or two seconds, which is not a big deal, but I wanted to find a proper solution.

Initially I was messing with the Kestrel server, thinking about how to solve the problem there. By default, when the service is stopping, dotnet will finish all pending requests and then quit. Just when I thought I was close to a solution, I remembered that normally the web browser does not interact with Kestrel; it interacts with the web server (nginx in my case). And while dotnet is shutting down, it rejects new connections, so nginx immediately returns HTTP status code 502, which looks very ugly. I could expose Kestrel to the internet, but that is not the recommended way. Moreover, in that case you can serve only one web site, and from the beginning I was planning to serve at least two, so it wasn’t an option for me.

Initially I thought it would be easy to configure nginx to retry the connection on error, which would leave enough time to shut down and restart. It turns out that is not possible. Nginx will simply try all configured servers, and if none of them is ready, it immediately returns a 502 error (or whatever error it got during the connection attempt). And trust me, I spent quite a lot of time searching the internet. There are flimsy solutions with recursive error pages and Lua, and I didn’t like them at all.

And I would like to point out that this problem does not exist on Windows. IIS handles it natively and without any issues; as a result, your site can restart every minute without affecting clients. Let me explain how it works. During a restart, IIS starts a second copy of your web site, and from that point all new requests are processed by the new copy. All old requests are finished by the old copy, assuming they complete within a specific timeout. Once all old requests have finished, IIS terminates the first copy and all requests are processed only by the second one.

Then I thought about writing another proxy server that would mimic the IIS behavior, but that would require writing some code, and making it production ready would take quite some time. And because I was thinking in terms of IIS, it delayed me for quite a bit. But after thinking for a while, I finally understood how it is supposed to work.

Solution

The idea is simple. Just before starting the graceful shutdown of the main web site, we start a temporary copy of the same web site on a different port. It processes all requests while the main web site is restarting. Once the main web site has started and is ready, we gracefully shut down the temporary copy.

And obviously your web site should not do anything unusual on shutdown: it should keep its persistent data in external storage rather than save anything locally during shutdown.

The last bit we need is a way to know when the new site is ready to process requests, and as it turns out, that is already implemented by the ASP.NET team.

Ok, first modify the /etc/systemd/system/www.example.com.service file to have this content:

[Unit]
Description=Example service

[Service]
Type=notify
WorkingDirectory=/var/www/www.example.com
ExecStartPost=+/usr/bin/systemctl stop www.example.com.backup
ExecStart=/usr/bin/dotnet /var/www/www.example.com/Project.dll
ExecStop=+/usr/bin/systemctl start www.example.com.backup
Restart=always
# Restart the service after 1 second if the dotnet process crashes:
RestartSec=1
RuntimeMaxSec=2days
KillSignal=SIGTERM
SyslogIdentifier=www-example-com
User=www.example.com
Environment=ASPNETCORE_ENVIRONMENT=Production
Environment=DOTNET_PRINT_TELEMETRY_MESSAGE=false

[Install]
WantedBy=multi-user.target

I will explain the changes. Type=notify establishes a communication channel between systemd and the dotnet process. The runtime will notify systemd when it is ready and when it has finished shutting down.

ExecStartPost=+/usr/bin/systemctl stop www.example.com.backup stops the temporary copy after the main web site has started and is ready to process requests.

ExecStop=+/usr/bin/systemctl start www.example.com.backup starts the temporary copy and ensures that it is ready before the main site is stopped.

RuntimeMaxSec=2days tells systemd how long to keep this service running before stopping it by timeout. Normally it would simply stop the service, but because we have Restart=always it will restart it instead: it executes the ExecStop command and eventually the ExecStartPost command.

KillSignal=SIGTERM is a bit tricky. Normally Microsoft recommends KillSignal=SIGINT. But if SIGINT is used together with the UseSystemd function, the runtime terminates the web site instantly without waiting for pending requests. The source code here states that the runtime will shut down gracefully only on the SIGTERM signal and that other signals “won't cause a graceful shutdown of the systemd service”. It also took me some time to figure this out.

Please note the + symbol before the commands. Without it, the command is executed as the user specified in the User line of the service, and regular users do not have the rights to start and stop services. “If the executable path is prefixed with "+" then the process is executed with full privileges.”

The next step is to create the file /etc/systemd/system/www.example.com.backup.service with the following content:

[Unit]
Description=Example service backup

[Service]
Type=notify
WorkingDirectory=/var/www/www.example.com
ExecStart=/usr/bin/dotnet /var/www/www.example.com/Project.dll --urls=http://localhost:5001/
RuntimeMaxSec=60
Restart=no
KillSignal=SIGTERM
SyslogIdentifier=www-example-com-backup
User=www.example.com
Environment=ASPNETCORE_ENVIRONMENT=Production
Environment=DOTNET_PRINT_TELEMETRY_MESSAGE=false

[Install]
WantedBy=multi-user.target

As you can see, there are not many changes here. It is a copy of /etc/systemd/system/www.example.com.service with a few differences. First, we don’t need to restart this service at all. Second, we limit its run time to 60 seconds. Obviously, we also change SyslogIdentifier and the description. Quite simple.

The next step is to modify the CreateHostBuilder function in Program.cs. You need to insert UseSystemd between Host.CreateDefaultBuilder(args) and .ConfigureWebHostDefaults(webBuilder =>, like this:

Host.CreateDefaultBuilder(args)
    .UseSystemd()
    .ConfigureWebHostDefaults(webBuilder =>

UseSystemd is implemented in the Microsoft.Extensions.Hosting.Systemd package.
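If your project does not reference that package yet, you can add it with the usual command (pick the version that matches your runtime):

dotnet add package Microsoft.Extensions.Hosting.Systemd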

Then you need to run sudo systemctl daemon-reload. Please note that you don’t need to call sudo systemctl enable www.example.com.backup, because the temporary web site should not start when the system restarts.
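For reference, the full sequence I have in mind looks roughly like this, assuming the service names used above (adjust them to your setup):

sudo systemctl daemon-reload
sudo systemctl restart www.example.com
sudo systemctl status www.example.com www.example.com.backup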

By default, dotnet will wait 5 seconds on shutdown to process the remaining requests. If you need it to wait longer, add this line to the ConfigureServices function in Startup.cs:

services.Configure<HostOptions>(opts => opts.ShutdownTimeout = TimeSpan.FromSeconds(50));

Obviously, this time should be shorter than the RuntimeMaxSec value of the backup service. Also, by default TimeoutStopSec is 90 seconds on most distributions, so plan accordingly or increase that value.
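If you do need a longer stop timeout, a minimal sketch of the extra line in the [Service] section of the unit file could look like this (120 seconds is just an example value, not something from my setup):

# give the site up to 120 seconds to finish requests before systemd sends SIGKILL
TimeoutStopSec=120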

Ok, and here is the last step: tell nginx to look for the temporary web site. Edit the file /etc/nginx/sites-enabled/www.example.com and add this before the server line:

upstream backend {
    server localhost:5000 max_fails=0;
    server localhost:5001 backup;
}

Then change the proxy_pass value to http://backend; and remember the semicolon at the end. You can use a different name instead of backend; it just has to be the same in the upstream line and in proxy_pass.
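For illustration, the relevant part of the server block could end up looking roughly like this; your existing file probably has more proxy settings, so keep those and only change the proxy_pass target:

server {
    listen 80;
    server_name www.example.com;

    location / {
        proxy_pass         http://backend;
        proxy_http_version 1.1;
        proxy_set_header   Host $host;
        proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}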

The first line, with port 5000, says that this is our main server. max_fails=0 states that even if a connection fails, the server should not be dropped and should be tried again on the next request. You can change it to 1; then, if it fails once during fail_timeout seconds, it will be considered unavailable for fail_timeout seconds and nginx will not even try it. fail_timeout is 10 seconds by default. You can play with these settings; they are described here.
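For example, that stricter variant of the first line (with the default fail_timeout written out explicitly) would look like this:

server localhost:5000 max_fails=1 fail_timeout=10s;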

The second line, with port 5001, tells nginx that this server should be tried only when the first one fails. If the first server is working fine, nginx will never use the second one. But without the backup keyword, every second request would go to the temporary web site, and because it is stopped most of the time, the request would fail and nginx would re-route it to the first server. That is just a waste of CPU time.

You can use this technique to restart the site when it has consumed a certain amount of memory. I’m not sure whether this is possible with systemd alone, but you can have a cron job that periodically checks memory usage and restarts the service.
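Here is a rough sketch of such a check, assuming the service name used above and a made-up 300 MB limit; treat it as a starting point rather than a tested script:

#!/bin/sh
# Hypothetical memory check: restart the service if its RSS exceeds the limit.
SERVICE=www.example.com
LIMIT_KB=$((300 * 1024))

PID=$(systemctl show --property MainPID --value "$SERVICE")
if [ -n "$PID" ] && [ "$PID" != "0" ]; then
    # VmRSS in /proc/<pid>/status is reported in kB
    RSS_KB=$(awk '/VmRSS/ {print $2}' "/proc/$PID/status")
    if [ -n "$RSS_KB" ] && [ "$RSS_KB" -gt "$LIMIT_KB" ]; then
        systemctl restart "$SERVICE"
    fi
fi

You could then run it from root’s crontab, for example every five minutes with */5 * * * * /usr/local/bin/check-example-memory.sh (the path and schedule are placeholders).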

I hope this helps someone.