Last Friday I casually updated our Jenkins server because it’s the right thing to do. I didn’t notice that Jenkins itself was about to be upgraded as well; next time I will pay attention to the packages proposed for upgrade…

So what went wrong?

Probably the one thing that should never break: our provisioning system. We had no build workers for a good two days because we were unable to provision new VMs.

Why did it fail? There is a mix of reasons but, to be concise, the upgrade of the Jclouds plugin from 2.8 to 2.14 failed. It didn’t fail with an « I cannot upgrade » sort of error but in a more subtle manner: a couple of hours after the upgrade, we noticed that our jobs were stuck and the logs were full of errors.

For 4 years, we have relied heavily on Jclouds to provision the Rackspace VMs for our build system. Thanks to this plugin, we assign each job a specific environment (with the « Restrict where this project can be run » parameter) and this label is linked to the Jclouds configuration. So when a job needs to run with a given label, Jclouds creates, on demand, a new VM with this label. That’s very cost effective because when nobody pushes code, the VMs can be deleted. Plus, as we get fresh VMs several times a day, our build system needs to be 100% reproducible.
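For Pipeline jobs, the counterpart of that label restriction is the label passed to the node step. A minimal sketch, assuming a hypothetical label named ubuntu-docker (the real one is whatever is set on the Jclouds template):

```groovy
// 'ubuntu-docker' is a hypothetical label; use the one set on your Jclouds template.
node('ubuntu-docker') {
    // Everything in this block runs on an agent carrying that label.
    // NODE_NAME is set by Jenkins and expanded here by the shell.
    sh 'echo "Running on ${NODE_NAME}"'
}
```

If no agent with that label is available, the job waits in the queue, which is the point at which a cloud plugin can provision one.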

This worked great for about 4 years and we never really had to worry about that. The way it worked was simple:

  • A new VM was created with a given OS (Ubuntu) and a given version (14.04) on a given hardware.
  • After boot, Jclouds ran an Init script (a shell script) that prepared the environment (install docker, configure our back-up registry, configure ssh & the Jenkins user).
  • At the end, Jenkins copied the master ssh keys to the agent (to clone sources).

The upgrade broke because a lot of things changed in Jclouds, especially in credentials management, and, as I learned from the maintainer, « Init script » is not really reliable with OpenStack-based providers. In addition to that, we faced a couple of other bugs due to Jclouds misconfiguration.

Problem solved, recommended Rackspace configuration

Disclaimer: this was tested and works with Jclouds 2.14 and Jenkins 2.32.3. I hope this will stay true for a while but I cannot guarantee it will work forever.

Jclouds is very powerful but it comes with a drawback: it’s very hard to find the right options for your use case. So I will detail what we did to get it to work on Rackspace.

First, the Jclouds configuration. You need to set your name, API key & all. Verify that everything works fine with « Test Connection ».

Then, for the given provider, you need to create at least one template (type of VM). This is where you configure:

  • The hardware and the related number of executors (i.e. the number of parallel jobs)
  • The OS flavor and version
  • The label (important: it is the same one you will use in your job definition)

And finally, the most important part, the « Misc. options ». For Rackspace, we only managed to get something that works with:

  • Init Script => none
  • User Data => cloudinit yaml file (more on that below)
  • Jenkins Credentials => configured according to cloudinit
  • Use Pre-existing Jenkins User => checked (managed by cloudinit)
  • Admin credentials => none (managed by cloudinit)
  • Wait for slave to phone home => checked + tailored Phone home timeout depending on your cloudinit complexity
  • Use config drive => checked (without it cloudinit won’t work)
  • Assign Public IP => unchecked (if checked, leads to an error: Floating IPs are required by options, but the extension is not available!)

As you can see, a great deal of the configuration is delegated to cloudinit. The cloudinit script we crafted looks like:

#cloud-config

packages:
 - openjdk-7-jdk

users:
- name: jenkins
  homedir: /jenkins
  ssh-authorized-keys:
    - "ssh-rsa AAAA..."
  shell: /bin/bash

runcmd:
  - curl -o /tmp/install-docker.sh https://get.docker.com
  - sh /tmp/install-docker.sh
  - usermod -aG docker jenkins
  - touch /tmp/known_hosts
  - ssh-keyscan -H tuleap.net >> /tmp/known_hosts
  - ssh-keyscan -H pkg.tuleap.net >> /tmp/known_hosts
  - ssh-keyscan -H -p 29418 gerrit.tuleap.net >> /tmp/known_hosts
  - ssh-keyscan -H github.com >> /tmp/known_hosts
  - install -o jenkins -g jenkins -m 0600 /tmp/known_hosts ~jenkins/.ssh/known_hosts

phone_home:
  url: https://ci.tuleap.org/jenkins/jclouds-phonehome/

It’s rather straightforward but there are a few pitfalls you need to be aware of:

  • Users are provisioned early in the process. In our case, as we install docker from get.docker.com, the « docker » group did not exist yet, so we got an error when trying to configure it in the users section. We solved that by running usermod -aG docker jenkins manually in runcmd
  • You need to install the JDK or the JRE yourself (packages section), otherwise the Jenkins agent won’t start
  • The public ssh key you configure for your Jenkins user must correspond to the private key configured under « Jenkins Credentials » in the Jclouds template

Final step, jobs configuration

With the given configuration everything is fine: you will get your VM provisioned with your environment and the Jenkins agent running. However, your ssh-based jobs will fail because you didn’t deploy any private key. One possible strategy is the write_files statement of cloudinit: DON’T DO THAT.

First, it’s hard to configure (yeah I did it…) but there is worse: your ssh private keys will be written everywhere in the Jenkins logs. There is a better alternative: use Jenkins Credentials.

Credentials is a really good feature hidden under a cumbersome UI. It lets you define credentials (sic) and reuse them in your jobs. Credentials can be defined globally, for system usage only, or per folder. If you are using Pipelines, it’s very important that you set the credential ID to a meaningful string (for instance jenkins-gerrit-tuleap-net).

Once the credentials were defined, we had to update our jobs in 3 ways:

  • For legacy « UI defined » jobs, next to where you configure the repository URL, select the right credential to use
  • For Pipeline jobs that use the checkout step, specify the credentialsId in userRemoteConfigs:
            checkout(
                scm: [
                    $class: 'GitSCM',
                    userRemoteConfigs: [[
                        url: 'ssh://jenkins@gerrit.tuleap.net:29418/tuleap.git',
                        credentialsId: 'jenkins-gerrit-tuleap-net'
                    ]]
                ]
            )
    
  • For Pipeline jobs that need to run regular ssh commands, you need to use sshagent:
                parallel 'Gerrit': {
                    sshagent(['jenkins-gerrit-tuleap-net']) {
                        sh 'git push gerrit HEAD:master'
                        sh 'git push gerrit HEAD:security'
                    }
                }
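Both checkout and sshagent have to run on an agent, so in a scripted Pipeline they end up inside a node block. A sketch of how the pieces could fit together, assuming a hypothetical ubuntu-docker label and using git ls-remote as a stand-in for any ssh command:

```groovy
// 'ubuntu-docker' is a hypothetical label; use the one set on your Jclouds template.
node('ubuntu-docker') {
    // Clone over ssh using the credential stored in Jenkins.
    checkout(
        scm: [
            $class: 'GitSCM',
            userRemoteConfigs: [[
                url: 'ssh://jenkins@gerrit.tuleap.net:29418/tuleap.git',
                credentialsId: 'jenkins-gerrit-tuleap-net'
            ]]
        ]
    )

    // The key is loaded into an ssh-agent for the commands inside this block.
    sshagent(['jenkins-gerrit-tuleap-net']) {
        sh 'git ls-remote ssh://jenkins@gerrit.tuleap.net:29418/tuleap.git'
    }
}
```

The same credential ID serves both steps, which is why a meaningful ID pays off.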
    

I hope that our experience with this issue will be useful to someone else. I’m not sure we gained much by switching to cloudinit (except the fact that it works again, obviously) but at least our jobs are cleaner in terms of credential usage.

As a final note, I’d like to thank Rackspace for their fanatical support as well as their support of Open Source through the Tuleap project. I also want to thank the Jclouds maintainer, Fritz Elfert, for his prompt guidance.