Overview
Google Cloud Platform (GCP) does not expose the lifecycle hooks and scale-in control APIs that third-party load balancers such as NetScaler need in order to implement connection draining in the GCP environment.
This article discusses a potential solution for achieving connection draining with NetScaler on Google Cloud Platform.
Backend autoscaling in NetScaler
The backend autoscaling solution using NetScaler is built around a server farm. On GCP, this server farm is a managed service called a managed instance group (MIG).
Every MIG is associated with a scaling policy that the customer defines and that the cloud uses to make scaling decisions, which gives the group its elasticity. Typical parameters that determine the load are CPU utilization and throughput.
The MIG is fronted by NetScaler, which acts as the load balancer. A scale-in operation shuts down one or more servers in the farm when the perceived load falls below the threshold levels, and a scale-out operation adds one or more servers to the farm when the load rises above them.
Today, NetScaler periodically polls the backend server farm to detect changes and updates its local load balancing configuration to match the latest state of the servers it queries.
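The polling step above amounts to comparing the membership seen on the previous poll with the membership seen on the current one. The following is a minimal shell sketch of that comparison, not NetScaler's actual implementation. The function name diff_members and the one-name-per-line file format are assumptions made for illustration.

```shell
#!/bin/bash
# diff_members PREVIOUS_FILE CURRENT_FILE
# Hypothetical helper: each file lists one server name per line.
# Prints "add NAME" for servers that appeared since the previous poll
# and "remove NAME" for servers that are no longer in the group.
diff_members() {
    local prev cur
    prev="$(mktemp)"
    cur="$(mktemp)"
    # comm(1) requires sorted input; -u also drops duplicate names.
    sort -u "$1" > "$prev"
    sort -u "$2" > "$cur"
    # Lines only in the current list are new servers.
    comm -13 "$prev" "$cur" | sed 's/^/add /'
    # Lines only in the previous list have left the group.
    comm -23 "$prev" "$cur" | sed 's/^/remove /'
    rm -f "$prev" "$cur"
}
```

The current list could be produced on each poll by a command such as gcloud compute instance-groups managed list-instances. A server reported as removed has usually already been shut down by the time the poll notices it, which is why the rest of this article focuses on delaying the shutdown itself.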
What is connection draining?
Connection draining lets the connections that a server is already handling run to completion before that server is removed from the farm. This is typically accomplished by delaying the shutdown of the server, by various means, so that its existing connections can complete.
The other aspect of connection draining is that new connections that would have been hashed to the departing server must be hashed to a different server.
Problem with scale-in operation on GCP
Because NetScaler learns of membership changes only by polling, GCP can shut down a server during a scale-in before NetScaler has stopped sending traffic to it. As a result, all the existing sessions being handled by that server can be terminated abruptly, resetting the client-server connections. Some applications (such as gaming and finance) are sensitive to this kind of loss and require connection draining so that connection terminations are handled gracefully.
Unfortunately, GCP does not provide hooks that would allow graceful termination, although it implements connection draining internally for its native load balancers.
Delayed shutdown of the application server
Note: The examples below use the Apache HTTP server, controlled by the httpd service's systemd unit file. Adjust the solution accordingly if you use a different service.
Method 1
The systemd unit for the httpd service (typically apache.service) can be modified to add an ExecStop directive that delays the stop by up to 120 seconds.
An ExecStop directive runs the command that follows it. If that command is sleep, it runs before the Apache service is stopped. The position of the directive matters, because the directives are executed in the order in which they appear.
See the sample unit file below. The added line is the ExecStop directive that runs /bin/sleep.
[Unit]
Description=The Apache HTTP Server
After=network.target remote-fs.target nss-lookup.target
Documentation=https://httpd.apache.org/docs/2.4/

[Service]
Type=forking
Environment=APACHE_STARTED_BY_SYSTEMD=true
ExecStart=/usr/sbin/apachectl start
# Added line: delay the stop so that existing connections can complete.
ExecStop=/bin/sleep 120
ExecStop=/usr/sbin/apachectl stop
ExecReload=/usr/sbin/apachectl graceful
PrivateTmp=true
Restart=on-abort

[Install]
WantedBy=multi-user.target
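As an alternative to editing the packaged unit file, the same change can be applied through a systemd drop-in. The sketch below is one possible way to do that, under stated assumptions: the function name write_drain_dropin, the file name drain.conf, and the 30-second margin are hypothetical choices. The drop-in also raises TimeoutStopSec, because systemd's default stop timeout is 90 seconds and a 120-second sleep may otherwise be cut short.

```shell
#!/bin/bash
# write_drain_dropin UNIT_NAME DELAY_SECONDS [SYSTEMD_DIR]
# Hypothetical helper: writes a drop-in that inserts a sleep before the
# unit's normal stop command. SYSTEMD_DIR defaults to /etc/systemd/system.
write_drain_dropin() {
    local unit="$1" delay="$2" base="${3:-/etc/systemd/system}"
    local dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/drain.conf" <<EOF
[Service]
# An empty assignment clears the ExecStop list inherited from the main unit.
ExecStop=
ExecStop=/bin/sleep $delay
ExecStop=/usr/sbin/apachectl stop
# Let the stop sequence outlast the delay (margin of 30 s is an assumption).
TimeoutStopSec=$((delay + 30))
EOF
}
```

A typical invocation would be write_drain_dropin apache.service 120, followed by systemctl daemon-reload so that systemd picks up the new drop-in. Clearing ExecStop first matters: without the empty assignment, the drop-in's commands would be appended after the original stop command, and the sleep would run too late.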
Method 2
A shutdown script can be defined in the metadata section of the VM in the GCP console. Such shutdown scripts are plugged into the server's shutdown process. However, GCP places the shutdown script defined in the metadata after the http service in the shutdown sequence.
As a result, the http service is shut down before the shutdown script runs. To mitigate this problem, the shutdown order of the services/units needs to be altered so that shutdown scripts run before the http service is stopped.
The sample unit file below does this with a Before= ordering on google-shutdown-scripts.service. Because systemd stops units in the reverse of their start order, the shutdown scripts then run before the http service is stopped.
[Unit]
Description=The Apache HTTP Server
After=network.target remote-fs.target nss-lookup.target
# Added line: stop this unit only after the GCP shutdown scripts have run.
Before=google-shutdown-scripts.service
Documentation=https://httpd.apache.org/docs/2.4/

[Service]
Type=forking
Environment=APACHE_STARTED_BY_SYSTEMD=true
ExecStart=/usr/sbin/apachectl start
ExecStop=/usr/sbin/apachectl stop
ExecReload=/usr/sbin/apachectl graceful
PrivateTmp=true
Restart=on-abort

[Install]
WantedBy=multi-user.target
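With that ordering in place, the shutdown script itself only needs to hold the shutdown sequence open. The following is a minimal sketch of such a script, not a definitive implementation. Rather than sleeping for a fixed time, it returns as soon as no established connections remain on the web port, or when a time limit is reached. The function names, the 5-second poll interval, and the use of ss to count connections are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical GCP shutdown script: wait for established connections on
# the web port to finish, up to a time limit.

# Count established TCP connections whose local port is $1.
# ss(8): -H drops the header, -t selects TCP, -n skips name resolution.
count_established() {
    ss -Htn state established "( sport = :$1 )" | wc -l
}

# drain_connections PORT MAX_SECONDS [POLL_SECONDS]
# Returns 0 once no connections remain, 1 if the limit is reached first.
drain_connections() {
    local port="$1" max_seconds="$2" poll="${3:-5}"
    local waited=0
    while [ "$waited" -lt "$max_seconds" ]; do
        if [ "$(count_established "$port")" -eq 0 ]; then
            return 0
        fi
        sleep "$poll"
        waited=$((waited + poll))
    done
    return 1
}

# When used as the actual shutdown script, end the file with a call such as:
#   drain_connections 80 120
```

The script can be attached to a VM with gcloud compute instances add-metadata and the --metadata-from-file shutdown-script=PATH option. Returning early when the connection count reaches zero keeps scale-in fast for idle servers while still giving busy servers the full window.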
Conclusion
Though small, both approaches require altering scripts or configuration in the image, which is typically in the administrative domain of the customer. With either approach, shutdown can be delayed by up to 120 seconds.
Note that true connection draining may still not be achieved within this time window for workloads with longer-lived connections.
For full connection draining support, NetScaler needs full control over the shutdown timing of the backend servers. However, the current approach can be used to prevent abrupt termination of short-lived connections.