Performance Testing OpenStack Telemetry – Ceilometer and Gnocchi

With the arrival of Gnocchi in TripleO-deployed OpenStack clouds, we decided it would be a good time to develop some tooling and tests to help characterize OpenStack Telemetry services with Gnocchi as a backend.

First off, we checked which telemetry services we needed to monitor and found that the Gnocchi and Aodh processes had to be added to Browbeat's collectd config files.  The Grafana dashboards were also updated to display the newly collected metrics.  Both changes landed in a Browbeat commit.
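
For illustration, the kind of change involved is adding the new daemons to collectd's processes plugin. A minimal sketch, assuming collectd's ProcessMatch syntax (the actual stanza in Browbeat's templates may list different processes), looks like this:

<Plugin processes>
  # Track the Gnocchi and Aodh daemons alongside the existing telemetry processes
  ProcessMatch "gnocchi-metricd" "gnocchi-metricd"
  ProcessMatch "gnocchi-statsd" "gnocchi-statsd"
  ProcessMatch "aodh-evaluator" "aodh-evaluator"
  ProcessMatch "aodh-notifier" "aodh-notifier"
  ProcessMatch "aodh-listener" "aodh-listener"
</Plugin>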

Gnocchi as a Backend

Check that Gnocchi is your cloud’s configured Ceilometer backend by grepping for meter_dispatchers in ceilometer.conf:

[root@overcloud-controller-0 ~]# egrep ^meter_dispatchers /etc/ceilometer/ceilometer.conf
meter_dispatchers=gnocchi

If it is not configured, you can run the adjustment-ceilometer playbook to configure Gnocchi:

[akrzos@bithead ansible]$ ansible-playbook -i hosts browbeat/adjustment-ceilometer.yml -e "ceilometer_backend=gnocchi"

Stressing Telemetry

Next, we needed a way to stress the Telemetry services.  We decided this could be accomplished by adjusting the Ceilometer polling interval, booting persisting instances, pausing for a few minutes, and then repeating until failures occur or system resources are exhausted.  The pauses let us step the resulting resource utilization so we can see how much resource consumption occurs as the number of instances in the cloud increases.  To change your Ceilometer polling interval with Browbeat, simply run the adjustment-ceilometer playbook with the interval you want to test:

[akrzos@bithead ansible]$ ansible-playbook -i hosts browbeat/adjustment-ceilometer.yml -e "ceilometer_interval=60"
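
To verify the new interval took effect, check Ceilometer's polling definition on a controller; depending on the release the interval lives in pipeline.yaml or polling.yaml, so a check along these lines should show the updated value:

[root@overcloud-controller-0 ~]# egrep 'interval' /etc/ceilometer/pipeline.yaml
      interval: 60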

The Scenario

Once you are all set, review the persisting-instances Browbeat configuration used to stress your cloud.  Depending on your environment's size and hardware, you may want to adjust several parameters in this scenario (a hedged example entry follows the list):

  • sleep_after – the amount of time we pause after completing each iteration of booting instances
  • scenarios – we added 10 scenarios so we would boot 10 x 20 instances (200 total)
  • image_name – we used cirros as the default to allow for really tiny instances
  • flavor_name – we used m1.xtiny, which is added to your overcloud during the Browbeat install
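
Here is a hedged sketch of what one of the persisting-instances scenario entries can look like in the Browbeat configuration.  Only sleep_after, image_name, and flavor_name come from the list above; the scenario name and the times/concurrency values are illustrative placeholders, and the exact keys depend on your Browbeat version:

  - name: nova-boot-persist-01     # one of the 10 repeated scenario entries
    enabled: true
    image_name: cirros
    flavor_name: m1.xtiny
    times: 20                      # instances booted in this iteration
    concurrency: 5                 # placeholder; tune for your cloud
    sleep_after: 300               # seconds to pause before the next iteration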

The cirros image combined with m1.xtiny lets you scale up the number of instances on a cloud that has few compute resources while remaining representative of a large cloud with many instances.  Do note that achieving a high density of instances with few computes means you are over-committing the hardware.  The point of this test is to stress the control-plane telemetry services on the overcloud controllers, not the performance of a workload inside an instance on a compute node.  Typically, all performance bets are off whenever you over-commit your hardware.
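
If you need to create an equivalent tiny flavor by hand, something along these lines works; the exact specs Browbeat uses for m1.xtiny may differ:

[akrzos@bithead ansible]$ openstack flavor create m1.xtiny --ram 64 --disk 1 --vcpus 1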

Results

For our hardware, we reduced the Ceilometer polling interval from 600s to 60s to increase the amount of stress placed on Ceilometer/Gnocchi.  We ran the scenario and reviewed the Rally output for booting the tiny instances, as well as the system resource utilization.

Hardware:

  • 1 OSPd Node
  • 3 Controllers
  • 13 Computes

Software:

  • TripleO-deployed OSP Newton
  • Gnocchi 2.2.1

Rally Results

[Figure: Rally results (rall-results.png)]

As you can see, none of the boot scenarios failed.  The load duration did, however, increase over time as the system's CPU and disks became more and more utilized.

System Performance:

CPU – In the CPU graphs below, you can see that as more instances are booted, more CPU is used on the controllers.  The system is well utilized by the end of this workload.
[Figure: controllers-cpu-boot-200.png]
Digging into what is using more CPU on overcloud-controller-0, we can quickly see that httpd and gnocchi-metricd are using most of the CPU:
[Figure: controller0-httpd-boot-200.png]
[Figure: Controller0-gnocchi-boot-200.png]
Gnocchi’s API service is hosted in httpd with this setup and thus we see the httpd CPU usage.
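
You can confirm this on a controller by checking for the Gnocchi and httpd processes; a quick look such as the following (process names can vary by release) shows what is running:

[root@overcloud-controller-0 ~]# ps -ef | egrep 'gnocchi|httpd' | grep -v grep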

Memory – Looking at memory, we do not see any dramatic growth on any of the controllers during the time frame of this scenario.  Memory becomes more interesting for this environment if you scale to more instances.
[Figure: controllers-memory-boot-200.png]

Disk – The disks on the overcloud controllers grew dramatically in IOPS throughout the scenario:
[Figure: controllers-disk-iops-boot-200.png]
As the disk IOPS graphs above make evident, the disks are heavily utilized.  We also examined the percentage of time the disks spent performing operations on the controllers:
[Figure: controller0-disk--boot-200.png]

More Instances!! – We actually repeated the above test until the cloud became unusable, to find any other sore spots.  Eventually, memory does become a problem:
[Figure: controllers-memory-continous-boot200.png]
The memory issues appeared with around 460 instances booted, and the source of the extensive memory growth was identified as ceilometer-collector:
[Figure: controller0-ceilometer-collector-continous boot.png]
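
If you want to spot this kind of growth on your own controllers, sorting processes by resident memory is a quick way to do it:

[root@overcloud-controller-0 ~]# ps -eo rss,vsz,comm --sort=-rss | head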

What's Next?

Well, for one thing, we would like to add benchmarks of Gnocchi itself to Browbeat, either through a separate workload provider or through Rally itself.  This would allow us to provide additional performance metrics that are representative of a user's experience when reviewing the time-series data Gnocchi aggregates.  We could run these benchmarks periodically as we bump the scale of the cloud, letting us see how the telemetry services perform as system resources are consumed.

The amount of resources Gnocchi used in this environment was greater than we expected.  Do keep in mind that we ran the Ceilometer polling interval 10x faster (60s) than the default (600s).  It is also possible that, out of the box, Gnocchi is tuned for metric-processing latency rather than minimal resource utilization.  We are currently exploring the options in gnocchi.conf to reduce this load as well.
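
As an example of the kind of knob we mean, the number of gnocchi-metricd worker processes is configurable in gnocchi.conf.  The sketch below is illustrative only; option names and sections can differ between Gnocchi releases, and we have not settled on recommended values:

[metricd]
# Fewer metricd workers trade metric-processing latency for lower CPU usage (illustrative value)
workers = 2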

There are other Gnocchi configurations we could test, such as Ceph or Swift for the storage backend; we tested with the default storage driver, which is file.  Booting instances attached to networks and consuming other resources provided by the cloud would also generate additional load and would make for another great Browbeat workload.
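
You can see which storage driver a deployment is using with a quick grep of gnocchi.conf (the option name is taken from the Gnocchi documentation, so treat this as a sketch):

[root@overcloud-controller-0 ~]# egrep '^driver' /etc/gnocchi/gnocchi.conf
driver = file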

Lastly, we intend to continue performance testing the telemetry services, since we fully understand and believe in the power of collecting, storing, normalizing, and presenting performance data.
