class: center, middle, inverse

# Scaling and monitoring your puppetserver for thousands of clients

## Includes all pitfalls!

---

## $ whoami

* Tim 'bastelfreak' Meusel
* DevOps Engineer at GoDaddy EMEA
* Puppet Contributor since 2012
* Merging stuff on Vox Pupuli since 2015
* Vox Pupuli PMC member

???

---

.left-column[
# starting point
## Puppetserver
]
.right-column[
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

* Who of you runs puppetserver?
* This is how it all starts, a single box

---

.left-column[
# starting point
## Puppetserver
## PuppetDB
]
.right-column[
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

* Who of you runs puppetserver?
* This is how it all starts, a single box

---

.left-column[
# starting point
## Puppetserver
## PuppetDB
## PostgreSQL
]
.right-column[
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

---

.left-column[
# starting point
## Puppetserver
## PuppetDB
## PostgreSQL
## PuppetBoard
]
.right-column[
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

---

.left-column[
# starting point
## Puppetserver
## PuppetDB
## PostgreSQL
## PuppetBoard
## Foreman
]
.right-column[
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

---

.left-column[
# starting point
## Puppetserver
## PuppetDB
## PostgreSQL
## PuppetBoard
## Foreman
## gnatsd
]
.right-column[
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

* messaging daemon for choria -> mcollective replacement

---

class: center, middle, inverse

## Coworker: Puppet agents are slow, can you please check the monitoring?

---

class: center, middle, inverse

## Coworker: Puppet agents are slow, can you please check the monitoring?

## Uhm, which monitoring?

???

---

.logo[
]

.footnote[© [camptocamp](https://www.camptocamp.com/de/actualite/an-orchestrated-puppet-infrastucture-with-docker-and-rancher/)]

---

.logo[
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

* I will now talk about the improvements that we did
* They are ordered the way we did them, not the optimal way
* People should probably start with metrics...

---

.left-column[
## Improvements
### Puppetserver
]
.right-column[
* Uses jruby within a JVM
* One jruby instance compiles one catalog at a time
* More instances => more catalogs per minute
* Use [theforeman/puppet](https://forge.puppet.com/theforeman/puppet) to configure puppetserver

```yaml
---
puppet::server_max_active_instances: "%{facts.processors.count}"
```
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

If there is nothing on a server except for puppetserver, we can give it as many jruby instances as cores
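A sketch for a shared box, assuming PuppetDB and PostgreSQL live on the same host (the value is just an example, not from the talk): pin a smaller fixed number instead of the processor fact:

```yaml
---
# leave a few cores for PuppetDB/PostgreSQL on a shared box
puppet::server_max_active_instances: 6
```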
---

.left-column[
## Improvements
### Puppetserver
]
.right-column[
* Puppetserver can cache code by loading it from disk to RAM
* \+ Slightly reduced compilation time for each catalog
* \- You should clean the cache after each deploy

```yaml
---
puppet::server_environment_class_cache_enabled: true
```

* Cleaning the cache:

```sh
curl -i --cert $(puppet config print hostcert) \
  --key $(puppet config print hostprivkey) \
  --cacert $(puppet config print cacert) \
  -X DELETE \
  https://$(puppet config print server):8140/puppet-admin-api/v1/environment-cache?environment=production
```
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

It can be purged globally or per env. Cache can be configured for each env

---

.left-column[
## Improvements
### Puppetserver
]
.right-column[
* [theforeman/puppet](https://forge.puppet.com/theforeman/puppet) creates the `development` and `production` environments
* r10k purges unknown environments
* [theforeman/puppet](https://forge.puppet.com/theforeman/puppet) restarts puppetserver if it creates an environment
* I have no git branches named `development` or `production` in my control repo...
* Each environment deploy led to a restarted puppetserver for weeks

```yaml
---
# don't create development/production env
puppet::server_environments: []
# don't create /etc/puppetlabs/code/environments/common
puppet::server_common_modules_path: ''
```
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

---

.left-column[
## Improvements
### Puppetserver
]
.right-column[
* The JVM has a configurable minimal and maximal amount of memory to allocate
* Memory is shared across all jruby instances (and other threads)
* Puppet docs suggest that minimal=maximal memory
* That is based on Java 6 docs, so probably outdated
* Required memory per instance depends entirely on the codebase (modules)
* 2GB per instance seems to work fine for my setup

```puppet
# How do I do this with hiera?
$cpu_count_twice = $facts['processors']['count'] * 2
$cpu_count = $facts['processors']['count'] * 1
class { 'puppet':
  server_jvm_min_heap_size => "${cpu_count}G",
  server_jvm_max_heap_size => "${cpu_count_twice}G",
}
```
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

higher minimal heap size => higher startup time
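To answer the "How do I do this with hiera?" comment: hiera can interpolate facts but cannot do arithmetic on them, so in plain hiera you would hard-code the sizes per role or host. A minimal sketch (the values are assumptions for a 16-core box, not from the talk):

```yaml
---
# hiera interpolation can't multiply facts,
# so spell the heap sizes out per role/host
puppet::server_jvm_min_heap_size: '16G'
puppet::server_jvm_max_heap_size: '32G'
```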
---

.left-column[
## Improvements
### Puppetserver
### Foreman
]
.right-column[
* Foreman supports caching out of the box
* We use [saz/memcached](https://forge.puppet.com/saz/memcached) to provision [memcached](https://memcached.org/)

```yaml
# 50GB of cache
memcached::max_memory: 51200
foreman::plugin::memcache::hosts:
  - 127.0.0.1
```

```puppet
include memcached
include foreman::plugin::memcache
```
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

* Foreman still feels slow
* memcached is a memory object caching system
* Helps caching $things for foreman
* Is required if you ever run a distributed foreman

---

.left-column[
## Improvements
### Puppetserver
### Foreman
]
.right-column[
* `passenger-status` says it only runs with a single process...
* [theforeman/foreman](https://forge.puppet.com/theforeman/foreman) uses [puppetlabs/apache](https://forge.puppet.com/puppetlabs/apache) to configure [passenger](https://www.phusionpassenger.com/library/admin/apache/overall_status_report.html)

```yaml
---
apache::mod::passenger::passenger_max_pool_size: "%{facts.processors.count}"
apache::mod::passenger::passenger_min_instances: "%{facts.processors.count}"
```
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

after this we see database issues in the foreman logfile

---

.left-column[
## Improvements
### Puppetserver
### Foreman
]
.right-column[
* We use [theforeman/foreman](https://forge.puppet.com/theforeman/foreman) to manage foreman
* A tuned puppetserver results in more requests to foreman

```yaml
---
# default is 5
foreman::db_pool: 20
foreman::keepalive: true
foreman::max_keepalive_requests: 1000
foreman::keepalive_timeout: 180
```
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

* The puppetserver isn't running at full load, but agents are slow
* maybe the node.rb is slow as hell?
* keepalive is active by default, but only for 5 seconds / 100 requests
* keeps the connection to puppetserver open and reduces overhead (handshake/TLS)
* We just killed our database with too many connections

---

.left-column[
## Improvements
### Puppetserver
### Foreman
### PostgreSQL
]
.right-column[
* This deserves a dedicated conference (there actually is one)
* Attend the [PostgresConf](https://postgresconf.org/conferences/) or the PostgreSQL DevRoom at [FOSDEM](https://fosdem.org/2019/)
* Ask people in #postgresql on freenode
* Use at least postgres 10 and rely on the upstream repos if possible
* Don't use hard drives, SATA/SAS SSDs and NVMe SSDs are the way to go
* Execute pgtune

```sh
pgtune -i /var/lib/pgsql/10/data/postgresql.conf -o postgresql5.conf
```

```puppet
postgresql::server::config_entry { 'max_connections':
  value => 400,
}
```
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

* databases are so complex
* Upstream postgres supports Debian and RHEL/IBM based operating systems
* postgres10 supports parallel/multithreaded queries
* puppetdb doesn't support postgres10 yet, but it works. support comes with the next minor release
* we had no issues with garbage collection

---

.left-column[
## Improvements
### Puppetserver
### Foreman
### PostgreSQL
### PuppetDB
]
.right-column[
* Simple JVM service that scales well with the number of threads
* It's a RESTful service that stores data in PostgreSQL

```yaml
puppetdb::server::java_args:
  '-Xmx': '8192m'
  '-Xms': '2048m'
puppetdb::server::node_ttl: '14d'
puppetdb::server::node_purge_ttl: '14d'
puppetdb::server::report_ttl: '999d'
# default is 50
puppetdb::server::max_threads: 100
# default is processorcount / 2
puppetdb::server::command_threads: "%{facts.processors.count}"
# default is 4, have your database in mind
puppetdb::server::concurrent_writes: 8
puppetdb::server::automatic_dlo_cleanup: true
```
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

---

.left-column[
## Scaling
### The Idea
]
.right-column[
.logo[
]
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

Depending on the amount of resources / agent run interval, a single puppetserver just doesn't have enough resources

Scale out instead of scale up

Master of Masters: the agent on all other masters requests a catalog from the MoM

We don't want a load balancer as that would be a single point of failure
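One knob worth checking before scaling out (my assumption, not something from the talk): a longer agent run interval directly lowers the number of catalog compiles per hour. A sketch, assuming theforeman/puppet exposes the agent's runinterval (check the module's README):

```yaml
---
# fewer runs per hour => fewer catalog compiles;
# the agent default is 30 minutes, 1h is just an example
puppet::runinterval: '1h'
```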
---

.left-column[
## Scaling
### The Idea
### tk-jetty9
]
.right-column[
[Puppetserver docs](https://github.com/puppetlabs/trapperkeeper-webserver-jetty9/blob/master/doc/jetty-config.md):

* `selector-threads`: This sets the number of selectors that the webserver will dedicate to processing events on connected sockets for unencrypted HTTP traffic. No known upper limit
* `ssl-selector-threads`: same as `selector-threads`, just for HTTPS. "Defaults to the number of virtual cores on the host divided by 2, with a minimum of 1 and maximum of 4"
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

---

.left-column[
## Scaling
### The Idea
### tk-jetty9
]
.right-column[
.logo[
]
]

.footnote[© [knowyourmeme.com](https://knowyourmeme.com/memes/wat)]

---

.left-column[
## Scaling
### The Idea
### tk-jetty9
### The Setup
]
.right-column[
* Deploy nginx on each puppetserver host to terminate TLS
* Increase the default thread count from 1 to $more...
* Bind puppetserver to localhost with http
* And don't require TLS client certificates
* Set up consul for dynamic load balancing across all puppetservers
* Soonish available at [github.com/bastelfreak/puppetcontrolrepo](https://github.com/bastelfreak/puppetcontrolrepo#puppet-all-in-one-stack-on-centos-7)
]

.footnote[[@bastelsblog](https://twitter.com/bastelsblog) for [@voxpupuliorg](https://twitter.com/voxpupuliorg)]

???

* consul provides DNS load balancing
* We can provide a health script to check if puppetserver still works, otherwise consul disables the node
* repo has unit and acceptance tests
* Explain why DNS round robin via normal DNS servers sucks

---

.left-column[
## Metrics
]
.right-column[
* Puppetserver can expose metrics to Graphite
* A Graphite stack looks very complicated
* Puppetserver has a metrics API and exposes [JMX data](https://www.oracle.com/technetwork/articles/java/javamanagement-140525.html) via [Jolokia](https://jolokia.org/)
* We can load a Prometheus exporter into the JVM that exposes metrics for a Prometheus instance to scrape

```yaml
puppet::server_jvm_extra_args:
  - '-javaagent:prometheus.jar=127.0.0.1:9020:config.yaml'
```
]

???

JMX data and JVM debugging is a complex topic

in theory JMX is writeable, you can trigger heapdumps, didn't work

the prometheus javaagent is developed and supported by the prometheus team

puppetdb/puppetserver expose custom + standard JVM metrics

works for puppetdb as well

foreman metrics have been available for a few weeks, didn't test

don't kill your puppetserver by polling data too often
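The `config.yaml` in the javaagent line is the jmx_exporter configuration. A minimal sketch plus a matching Prometheus scrape job (file layout, job name and interval are assumptions, not from the talk):

```yaml
# config.yaml for the jmx_exporter javaagent: export all MBeans, lowercased names
lowercaseOutputName: true
rules:
  - pattern: '.*'
```

```yaml
# prometheus.yml: keep the scrape interval relaxed so polling
# doesn't add load to the puppetserver JVM
scrape_configs:
  - job_name: 'puppetserver'
    scrape_interval: 60s
    static_configs:
      - targets: ['127.0.0.1:9020']
```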
---

.left-column[
## Metrics
### Generic JMX
]
.right-column[
]

---

.left-column[
## Metrics
### Generic JMX
]
.right-column[
]

---

.left-column[
## Metrics
### Puppetserver
]
.right-column[
]

---

.left-column[
## Metrics
### Puppetserver
]
.right-column[
]

---

.left-column[
## Summary
]
.right-column[
Scaling a Puppetserver stack

* It's a complex distributed system with many tunables and pitfalls
* Start with proper monitoring instead of guessing
* Best practice [controlrepo](https://github.com/bastelfreak/puppetcontrolrepo#puppet-all-in-one-stack-on-centos-7) with all tunables, explanations, unit/acceptance tests
* Contact: [tim@bastelfreak.de](mailto:tim@bastelfreak.de) or bastelfreak on freenode
* [Collection of related talks](https://github.com/bastelfreak/talks)

### Thanks for your attention!
]