Recently, one of our client’s servers has been rebooting nightly and taking down the application running on it. Rancher should automatically start the application once the server is back up, but the rancher-agent wasn’t starting by itself, so Rancher couldn’t see the server’s status. We had to manually log into the server and run:
sudo /usr/bin/docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.0.2 <URL>
Here is how you can fix this using systemd, the built-in service manager in CoreOS.

NOTE: Anywhere you see <URL> in this post, it stands for your Rancher host URL. You can get this URL from your Rancher application UI under Infrastructure > Hosts > Add Host > Custom.

The Install

To set up an autorun process, we need to create a service file called rancher-agent.service under the /etc/systemd/system directory:
sudo nano /etc/systemd/system/rancher-agent.service
Enter the following into the file:

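[Unit]
Description=Rancher Agent
After=docker.service
Requires=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
SuccessExitStatus=137
ExecStartPre=/usr/bin/docker pull rancher/agent:v1.0.2
ExecStart=/usr/bin/docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.0.2 <URL>
ExecStop=/usr/bin/docker stop rancher-agent

[Install]
WantedBy=multi-user.target

The [Unit] section holds metadata about this service and makes sure it only starts after Docker.
The [Service] section has the details on how to start and stop the service.
The [Install] section tells systemctl enable which target should pull the service in at boot. Without it, enabling the unit has no effect.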

Before we can get into the details of the [Service] section, we need to understand how the Rancher Agent works.

When you run the command sudo /usr/bin/docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.0.2 <URL> a few things happen:

  1. The rancher/agent container that starts up exits shortly afterwards with exit code 137.
  2. A new rancher/agent container starts in its place, and it is always named rancher-agent.
  3. A new rancher agent instance will start up (an image of rancher/agent-instance:v0.8.1). This is the network agent that communicates with your Rancher host application.
  4. If you have a load balancer configured (via Rancher application), it will start up under the image rancher/agent-instance:v0.8.3.
  5. Any other docker containers that should be on this host will start firing up.
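
A short aside on the exit code in step 1: by convention, a status of 137 is 128 + 9, which is what a shell reports when a process is terminated by SIGKILL (signal 9). We only observed the code itself, so the idea that the first agent container is killed this way is an assumption on our part. The snippet below is a minimal sketch, unrelated to Rancher, that reproduces the same status with a throwaway child shell:

```shell
# Start a child shell that sends SIGKILL (signal 9) to itself.
# The "|| status=$?" form captures the status even under "set -e".
status=0
sh -c 'kill -9 $$' || status=$?

# A process terminated by a signal is reported as 128 + the signal number.
echo "exit status: $status"
echo "signal number: $((status - 128))"
```

On a Linux host this prints an exit status of 137 and a signal number of 9, which is why the unit file has to list 137 as an acceptable result.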

So with all those things in mind, here is what the [Service] script is doing:

Type=oneshot           # Sets the type of service. A oneshot service is expected to exit shortly after starting.
RemainAfterExit=yes    # Causes systemd to consider the unit active if the start action exited successfully.
SuccessExitStatus=137  # rancher/agent ends its first process with exit code 137; this stops systemd from treating that as a failure to start.
ExecStartPre=/usr/b... # Pulls rancher/agent:v1.0.2 from Docker Hub before starting, so the image is present locally. (This is optional.)
ExecStart=/usr/bin/... # This is the actual command to start the process.
ExecStop=/usr/bin/d... # This is the stop command. NOTE: We're stopping the second rancher/agent, which has the name rancher-agent.

Now that we have our service script file written, we need to enable it to start after a reboot by running this command:
sudo systemctl enable /etc/systemd/system/rancher-agent.service
Keep in mind that this only sets the service up to start at boot.
To start it right now, run the following command:
sudo systemctl start rancher-agent.service
The command produces no output, but you can check the status of your service by running:
sudo systemctl status rancher-agent.service
If you go to your Rancher Application, you should also see your host there.
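
If you prefer to confirm things from the command line as well, the checks can be grouped into one small function. This is only a sketch: check_rancher_agent is a name we made up for illustration, and it assumes the unit is called rancher-agent.service and that the container ends up named rancher-agent, as described above.

```shell
# Hypothetical helper (the name is ours): run read-only checks against
# the unit and the container it is expected to leave running.
check_rancher_agent() {
    unit="rancher-agent.service"

    # Prints "enabled" when the unit is set up to start at boot.
    systemctl is-enabled "$unit"
    # Prints "active" when systemd considers the start action successful.
    systemctl is-active "$unit"
    # Show the 20 most recent log lines recorded for the unit.
    sudo journalctl -u "$unit" --no-pager -n 20
    # List containers whose name matches rancher-agent, running or stopped.
    sudo docker ps -a --filter "name=rancher-agent"
}
```

Run check_rancher_agent after starting the service, and again after a reboot, to see whether the agent came back on its own.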

Troubleshooting

While working on this, we ran into many issues that we won’t get into here. The important thing to know is how to work with systemd to reload and test your service file.

Whenever you edit the rancher-agent.service file, make sure to run the following steps:

  1. Stop the current service by running:
    sudo systemctl stop rancher-agent.service
  2. Reload the service file(s) by running:
    sudo systemctl daemon-reload
  3. Start the service by running:
    sudo systemctl start rancher-agent.service
  4. Check the status:
    sudo systemctl status rancher-agent.service
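
The four steps above can also be wrapped in a small shell function so they always run in the same order. This is a sketch rather than anything systemd provides: reload_unit and the SYSTEMCTL variable are names we invented for illustration.

```shell
# Hypothetical helper, not part of systemd: reload_unit <unit-name>
# SYSTEMCTL is an illustrative override; set SYSTEMCTL=echo to preview
# the commands instead of running them.
reload_unit() {
    unit="$1"
    cmd="${SYSTEMCTL:-sudo systemctl}"

    # 1. Stop the current service; carry on if it was not running.
    $cmd stop "$unit" || true
    # 2. Make systemd re-read the unit files from disk.
    $cmd daemon-reload || return 1
    # 3. Start the service with the updated definition.
    $cmd start "$unit" || return 1
    # 4. Print the resulting status.
    $cmd status "$unit"
}
```

After saving a change to the service file, call reload_unit rancher-agent.service. The stop step is allowed to fail so the function still works the first time, before the service has ever been started.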

Hopefully this will help you troubleshoot your issues and get this back up and running.

 

Feel free to leave a comment below with any questions and we’ll do our best to answer them.
