Large deployments of K8s
========================

For large-scale deployments, consider the following configuration changes:

* Tune [ansible settings](http://docs.ansible.com/ansible/intro_configuration.html)
  for the `forks` and `timeout` vars to fit the large number of nodes being deployed.

* Override the containers' `foo_image_repo` vars to point to an intranet registry.

* Set ``download_run_once: true`` and/or ``download_localhost: true``.
  See download modes for details.

* Adjust the `retry_stagger` global var as appropriate. It should keep the load
  on the delegate (the first K8s master node) sane when retrying failed
  push or download operations.

* Tune the parameters for DNS-related applications. Those are
  ``dns_replicas``, ``dns_cpu_limit``,
  ``dns_cpu_requests``, ``dns_memory_limit`` and ``dns_memory_requests``.
  Note that limits must always be greater than or equal to requests.

* Tune CPU/memory limits and requests. Those are located in the roles' defaults
  and are named like ``foo_memory_limit``, ``foo_memory_requests``,
  ``foo_cpu_limit`` and ``foo_cpu_requests``. Note that 'Mi' memory units for K8s
  are submitted as 'M' when applied to ``docker run``, and K8s cpu units
  have the 'm' dropped for docker as well. This conversion is required because
  docker does not understand K8s units well.

* Tune ``kubelet_status_update_frequency`` to increase the reliability of kubelet, and
  ``kube_controller_node_monitor_grace_period``,
  ``kube_controller_node_monitor_period`` and
  ``kube_controller_pod_eviction_timeout`` for better Kubernetes reliability.
  See [Kubernetes Reliability](kubernetes-reliability.md) for details.

* Tune network prefix sizes. Those are ``kube_network_node_prefix``,
  ``kube_service_addresses`` and ``kube_pods_subnet``.

* Add calico-rr nodes if you are deploying with Calico or Canal. Nodes recover
  from host/network interruptions much more quickly with calico-rr. Note that
  the calico-rr role must be on a host without the kube-master or kube-node role
  (the etcd role is okay).

* Check out the
  [Inventory](getting-started.md#building-your-own-inventory)
  section of the Getting started guide for tips on creating a large-scale
  Ansible inventory.

* Set ``etcd_events_cluster_setup: true`` to store events in a separate
  dedicated etcd instance.

For example, when deploying 200 nodes, you may want to run ansible with
``--forks=50`` and ``--timeout=600``, and define ``retry_stagger: 60``.
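
As a rough illustration, the variable overrides for such a deployment could be collected in a single vars file. This is only a sketch under stated assumptions: the file location named in the first comment is hypothetical, and every value except ``retry_stagger: 60`` is illustrative, not a recommendation. Only variable names mentioned in this document are used.

```yaml
# Hypothetical location: a group_vars file in your inventory (path is an assumption).

# Stagger retries of failed push or download operations
# (value taken from the 200-node example in this document).
retry_stagger: 60

# Enable one of the download modes referred to in this document.
download_run_once: true

# DNS sizing. Illustrative values only; limits must be >= requests.
dns_replicas: 3
dns_cpu_requests: 100m
dns_cpu_limit: 300m
dns_memory_requests: 100Mi
dns_memory_limit: 300Mi

# Store events in a separate dedicated etcd instance.
etcd_events_cluster_setup: true
```

The ``--forks=50`` and ``--timeout=600`` settings from the example are command-line options, so they are not part of this file.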