Leveraging the Power of Elasticsearch in the Performance World

 


Data is power in the world of performance engineering. Be it test results, system metrics, tunable knobs, or record-shattering benchmarks, data is everywhere. Read on to learn how we at Red Hat are leveraging the power of Elasticsearch with Browbeat to lead the next revolution in performance engineering.

 

Do Numbers Mean Anything?

Test data and benchmarks can be used to gain valuable insight into how a system performs under given conditions. The words 'given conditions' are really important here: saying "I got 2 Gbps of throughput from my network interface card" doesn't quantify performance meaningfully. Without details about the system architecture, CPUs, line rate, and a whole host of other factors, the number 2 Gbps doesn't speak for itself.

 

The Future of Performance Engineering is NOW

Performance engineers have been using spreadsheets to record, track, and analyze data forever. As powerful as they are, anyone who has worked with spreadsheets knows how cumbersome they are to organize. With Browbeat, we have taken technologies that are already out there, like Elasticsearch, and integrated them into our framework, creating a convenient way to store, organize, and analyze test results without the hassle of CSVs and spreadsheets.

 

Think that is cool? Wait until you hear how we solve the problem of coming up with names or tags like 'Westmere-32-core-Mellanox-Connectx3-40Gbps-Neutron-api-workers-32-Run-1-20160714-164017' to associate test result data with the 'given conditions'.

We use Ansible to gather data about how the system and software under test are configured, so you can simply query Elasticsearch for something like "neutron_workers: 32" and it will show you all results that match that criterion. Since each test result is bundled with metadata about how the environment was configured, no effort is required from the user to associate test results with environment conditions. With this level of automation in place, an engineer can do more, faster and smarter, by spending time on the actual analysis and interpretation of results rather than on organizing and tracking them.
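For a concrete feel of what such a query looks like, here is a minimal sketch using the Python Elasticsearch client. The host, index pattern, and field names (metadata.neutron_workers, test_name, duration) are assumptions for illustration; the exact names depend on how the facts are flattened and indexed in your deployment.

from elasticsearch import Elasticsearch

# Hypothetical host and field names; adjust to your own deployment.
es = Elasticsearch(["http://elk.example.com:9200"])

response = es.search(
    index="browbeat-rally-*",
    body={
        "size": 100,
        "query": {"match": {"metadata.neutron_workers": 32}},
    },
)

for hit in response["hits"]["hits"]:
    source = hit["_source"]
    # 'test_name' and 'duration' are assumed field names, for illustration only.
    print(source.get("test_name"), source.get("duration"))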

 

Tools of the Trade

Let's dig a bit deeper into how we package a solution in Browbeat using Ansible, Elasticsearch, and Kibana to gain quick and priceless insights into OpenStack performance. For starters, let's briefly define the technologies used in this solution:

Ansible, an automation platform for configuring and managing servers

Elasticsearch, an enterprise class data store and search engine

Kibana, browser-based analytics and search dashboard for Elasticsearch

Currently, Browbeat can index and analyze data from Rally and Shaker test results; now is the time to see how we do it.

The entire engineering effort behind this work can be broken down into three parts: gathering configuration details of the cloud as metadata, massaging the tool results, and indexing the data in Elasticsearch so it can be visualized via Kibana. Let's look at each of these parts in detail.

 

Gathering Metadata

This task is accomplished using Ansible facts: information gathered from remote systems by Ansible. By default, Ansible gathers a lot of data about the system it is talking to, including but not limited to network interfaces, operating system, kernel, and CPU architecture. While this is a lot of useful data, the missing piece is the configuration of the software under test, which in our case is OpenStack. Since we weren't getting that piece for free from Ansible, we wrote playbooks that go through the OpenStack configuration files on all nodes and set those configuration parameters as Ansible facts. With the OpenStack configuration loaded into Ansible facts, we dump the facts from all the nodes onto the node on which Browbeat is running.
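As a rough sketch of the idea (this is not the actual playbook shipped with Browbeat; the host group, config path, and fact name are assumptions), a task that reads one option out of neutron.conf and registers it as a fact could look like this:

---
# Illustrative only: read a Neutron option on each controller and expose it
# as an Ansible fact so it can later be dumped as metadata.
- hosts: controller
  tasks:
    - name: Read Neutron API worker count from neutron.conf
      command: crudini --get /etc/neutron/neutron.conf DEFAULT api_workers
      register: neutron_api_workers
      changed_when: false

    - name: Expose the value as a fact for the metadata dump
      set_fact:
        openstack_neutron_api_workers: "{{ neutron_api_workers.stdout }}"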

Although we initially tried passing all of the Ansible facts as metadata, we soon realized that this overwhelmed Elasticsearch by creating tons of unique keys, which seemed like a bad idea anyway. So we wrote a Python class to massage the facts JSON: flatten it out and grab only the facts of interest into three different JSON files, one each for hardware metadata (hardware details of the nodes), software metadata (OpenStack configuration), and environment metadata (number of controllers, etc.). The Elasticsearch connector class adds the data in each of these files as metadata to every single test result, which gives us the power to associate results with the 'given conditions', i.e. the way the cloud was set up. The idea is to re-run the playbook that gathers this metadata every time the configuration changes; if nothing changes, the same files can be reused for every run of Browbeat.
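The flattening itself is conceptually simple. A stripped-down sketch of the idea (not the actual Browbeat metadata class; the fact names are made up) looks something like this:

# Illustrative sketch: flatten nested Ansible facts and keep only the keys
# of interest for one of the metadata files.
def flatten(d, parent_key="", sep="_"):
    items = {}
    for key, value in d.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

# Sample facts; the real ones come from the Ansible fact dump.
facts = {
    "ansible_processor": {"count": 2, "cores": 16},
    "neutron": {"DEFAULT": {"api_workers": "32"}},
}

INTERESTING = {"ansible_processor_count", "neutron_DEFAULT_api_workers"}
software_metadata = {k: v for k, v in flatten(facts).items() if k in INTERESTING}
print(software_metadata)
# {'ansible_processor_count': 2, 'neutron_DEFAULT_api_workers': '32'}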

 

Massaging Tool Results

Sending results from a tool to Elasticsearch as-is is a bad idea, since it can diminish the value of the data by limiting its usability and searchability. We went through several prototypes of how we index the JSON output from each tool, and each iteration improved the value we get out of the result data. By 'value' I mean how granular the searches can be and whether we can build the visualizations for the use cases we care about. Querying by action name or scenario name, visualizing a scenario with stats for each atomic action, and keeping count of all errors and successful results have all been made possible by the manipulations we apply to the native JSON output by Rally.
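To give a feel for what 'one record per atomic action' means, here is a rough sketch of the transformation. This is not Browbeat's actual Rally class, and the shape of the Rally JSON is simplified, so treat the field names as assumptions:

import json

def explode_atomic_actions(rally_json_path, run_metadata):
    """Turn each Rally iteration into one record per atomic action,
    each bundled with the cloud metadata gathered earlier."""
    records = []
    with open(rally_json_path) as f:
        scenarios = json.load(f)
    for scenario in scenarios:                        # one entry per scenario
        for iteration in scenario.get("result", []):  # one entry per iteration
            for action, duration in iteration.get("atomic_actions", {}).items():
                record = dict(run_metadata)           # hardware/software/env metadata
                record.update({
                    "action": action,
                    "duration": duration,
                    "failed": bool(iteration.get("error")),
                })
                records.append(record)
    return records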

It's a whole new story with Shaker. Like Rally's, the JSON output by Shaker was not constructed with Elasticsearch indexing in mind. Moreover, the fact that Shaker results have to be represented as 'throughput vs. time' makes it even more challenging, given that the JSON doesn't present data with a unique timestamp per data point. The JSON contains throughput sampled over 60 seconds at a 1-second interval as a list, presented as one record per guest involved in the network testing. With our first prototype we were able to send result data to Elasticsearch and retrieve it successfully, but when we visualized it in Kibana we were not seeing what we hoped to see. Instead of a 'throughput vs. time' chart, we saw a single vertical bar representing the average of all the throughput values. On further investigation, we understood that with the way our data was modeled, we were limited by what Kibana could do with it.

We found a slick workaround in the form of fake timestamps. We pre-populate a set of timestamps during each Browbeat run, associate each throughput value in a record with one of these timestamps, and pass the data along to Elasticsearch. On visualizing the data in Kibana, we now see a 'throughput vs. time' chart: Kibana still shows the average value per timestamp, but there are now multiple timestamps, one per data point. A few other things became easier to visualize with the fake-timestamp model as well. For example, consider a Shaker test with concurrency set to 4. In this scenario, Shaker launches 4 master-slave VM pairs and runs throughput testing concurrently. The results JSON contains one record per master VM (server) in the concurrency set, and by using the same set of timestamps for each record we can see aggregate and average throughput charts for the 4 VMs, thereby getting a true sense of the network capacity of the cloud.
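Conceptually, the workaround boils down to something like the sketch below (illustrative rather than the actual Browbeat code; the field names and unit are assumptions):

from datetime import datetime, timedelta

def fake_timestamps(samples, interval_seconds=1, start=None):
    """Pair each throughput sample with a synthetic, monotonically increasing
    timestamp so Kibana can draw a throughput vs. time chart."""
    start = start or datetime.utcnow()
    return [
        {
            "timestamp": (start + timedelta(seconds=i * interval_seconds)).isoformat(),
            "throughput_mbps": value,
        }
        for i, value in enumerate(samples)
    ]

# 60 one-second samples from one Shaker record; every concurrent VM's record
# reuses the same start time so Kibana can aggregate across VMs.
points = fake_timestamps([948.2, 955.7, 951.1] * 20)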

 

Indexing the Results in Elasticsearch and Visualizing them via Kibana

Well, this last step is the moment of truth. We have a Python class that takes the result data passed to it, bundles it with the metadata generated earlier, and pushes the record to Elasticsearch. The index is of the form "[browbeat-rally-]YYYY.MM.DD" or "[browbeat-shaker-]YYYY.MM.DD", depending on the tool. So every record passed to Elasticsearch has metadata about the cloud, metadata about the tool run (concurrency, times, etc. in the case of Rally; progression, Heat template, etc. in the case of Shaker), and the actual result data. Rally results are passed as one record per atomic action and Shaker results as one record per data point (that's a lot of records per test, but it lets us build the visualizations we need).
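A minimal sketch of that indexing step, again illustrative rather than the actual Elastic connector class (the host, document type, and metadata layout are assumptions):

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elk.example.com:9200"])  # your Elasticsearch instance

def index_result(result, metadata, tool="rally"):
    """Bundle one result record with the cloud metadata and push it to a
    per-tool, per-day index such as browbeat-rally-2016.08.22."""
    index = "browbeat-{}-{}".format(tool, datetime.utcnow().strftime("%Y.%m.%d"))
    body = dict(result)
    body["metadata"] = metadata
    # doc_type was still required by the Elasticsearch 2.x releases current at
    # the time this was written; drop it on modern versions.
    es.index(index=index, doc_type="result", body=body)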

The real value of this work is realized when trying to analyze test results and draw inferences from them, since the tools in their native form only report results for a particular run, and do so without any reference to the environment it was run in. With Browbeat we can track trends and identify bottlenecks by tuning parameters in OpenStack and visualizing the resulting performance in Kibana.

 

Let us now look at a few dashboards that were built from result data that Browbeat sent over to Elasticsearch.

 

[Kibana dashboard: Keystone authenticate response times vs. concurrency]

The dashboard above tracks response times for the Keystone authenticate scenario against concurrency when Keystone is running under HTTPD. We could easily isolate the test runs that had Keystone running under HTTPD because the Keystone configuration was passed as metadata.

 

 

[Kibana dashboard: response times for Neutron router create/list atomic actions at different API worker counts]

The chart above shows the response times for the various atomic actions involved in creating and listing a Neutron router, with Neutron API workers tuned to 16, 32, 48, and 64 (left to right). A worker count of 16 provides lower average response times (lower is better), contrary to the general belief that optimal performance is obtained by tuning the worker count to the core count (32 in this case). Again, this comparison was only made possible because we pass the Neutron configuration as metadata to Elasticsearch.

 

Let's look at a sample Kibana visualization for Shaker.

[Kibana dashboard: Shaker throughput vs. time across concurrent VMs]

In the chart above, we queried Elasticsearch for the results of a scenario with 8 VMs concurrently blasting traffic across two compute nodes while Neutron was using the iptables hybrid firewall driver. It is worth noting that we were able to plot the sum of the throughput across all VMs because the results from each VM were indexed with exactly the same timestamps, allowing Kibana to aggregate them.

 

I am a Browbeat user and I do not care about what happens under the hood. What should be my workflow to make use of the Elasticsearch integration?

In the Browbeat configuration file, you need to enable Elasticsearch and provide the IP address and port of your Elasticsearch instance.
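The relevant block of browbeat-config.yaml looks roughly like the excerpt below; the exact key names may differ between versions, so check the sample configuration shipped with Browbeat:

# Illustrative excerpt of browbeat-config.yaml
elasticsearch:
  enabled: true
  host: elk.example.com
  port: 9200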

Before kicking off the Browbeat test suite, you need to run:

ansible-playbook -i hosts gather/site.yml

This playbook dumps all the metadata we need, which is then processed by the metadata Python class. After running the playbook, you start Browbeat the usual way:

./browbeat.py rally shaker

Whenever you tune parameters in OpenStack or make any other changes on the nodes, be sure to run the playbook again; that ensures Browbeat always indexes results with the most recent, correct metadata.

Some lessons learned

  1. Spend enough time modeling your data and getting the mapping right, since that can mean the difference between getting 100% of the value from your data and getting no value at all, even though the data is sitting right there in Elasticsearch.
  2. Always visualize the data in Kibana to make sure you are seeing what you intend to see. Visualizing the data is the easiest way to confirm that you got the data modeling and mapping right.
  3. It is more than likely that you won't get the mapping right the first time; it is an iterative process that needs some refining.
  4. Data is good. Metadata is great. But when you bundle them, the possibilities are infinite.

Hope you have a great time playing with performance data!

14 thoughts on "Leveraging the Power of Elasticsearch in the Performance World"

  1. Hi guys, nice work. This is what I was looking for, but I'm not interested in the Ansible facts (hardware stuff); I just want the Rally output in Elasticsearch with your modifications (the two Python classes: ES and Rally). I have my own ES + Kibana, and Rally running from a virtualenv.
    I git pulled your code, modified browbeat-config.yaml to add the IP of my ES, and ran "browbeat.py rally". I had some issues with some modules but installed them with pip.
    After I installed the modules, I noticed that during the run you call an Ansible playbook to publish stuff to Grafana. Can this step be "switched off"?
    This part:

    2016-08-19 15:49:54,152 - browbeat.Grafana - INFO - 20160819-154021-browbeat-create-and-list-users-10-iteration-0 - Grafana Dashboard openstack-general-system-performance URL: http://1.1.1.1:3000/dashboard/db/openstack-general-system-performance?from=1471621602966&to=1471621794151&var-Cloud=openstack
    2016-08-19 15:49:54,152 - browbeat.Grafana - INFO - Snapshot command: ansible-playbook -i ansible/hosts ansible/browbeat/snapshot-general-performance-dashboard.yml -e "grafana_ip=1.1.1.1 grafana_port=3000 from=1471621602966 to=1471621794151 results_dir=results/20160819-154021/keystonebasic/create-and-list-users/20160819-154021-browbeat-create-and-list-users-10-iteration-0 var_cloud=openstack"


  2. Thanks for that. Any idea why I get this error?

    (rally)root@rally:~/browbeat# cat ./results/20160822-072227/browbeat-Rally-run.log
    2016-08-22 07:22:27,545 - browbeat.Rally - DEBUG - --------------------------------
    2016-08-22 07:22:27,545 - browbeat.Rally - DEBUG - task_file: rally/authenticate/keystone-cc.yml
    2016-08-22 07:22:27,545 - browbeat.Rally - DEBUG - scenario_args: {'sla_max_failure': 0, 'sla_max_avg_duration': 6, 'concurrency': 64, 'sla_max_seconds': 30, 'times': 500}
    2016-08-22 07:22:27,545 - browbeat.Rally - DEBUG - result_dir: results/20160822-072227/authenticate/authentic-keystone
    2016-08-22 07:22:27,545 - browbeat.Rally - DEBUG - test_name: 20160822-072227-browbeat-authentic-keystone-64-iteration-0
    2016-08-22 07:22:27,545 - browbeat.Rally - DEBUG - --------------------------------
    2016-08-22 07:25:37,738 - browbeat.Rally - ERROR - Cannot find task_id
    2016-08-22 07:25:37,739 - browbeat.Rally - INFO - Current number of Rally scenarios executed:1
    2016-08-22 07:25:37,740 - browbeat.Rally - INFO - Current number of Rally tests executed:1
    2016-08-22 07:25:37,740 - browbeat.Rally - INFO - Current number of Rally tests passed:0
    2016-08-22 07:25:37,740 - browbeat.Rally - INFO - Current number of Rally test failures:1
    2016-08-22 07:25:37,740 - browbeat.Rally - INFO - Skipping authentic-neutron scenario enabled: false
    2016-08-22 07:25:37,740 - browbeat.Rally - INFO - Skipping authentic-nova scenario enabled: false
    2016-08-22 07:25:37,740 - browbeat.Rally - INFO - Benchmark: cinder
    2016-08-22 07:25:37,741 - browbeat.Rally - DEBUG - Default Concurrencies: [2]
    2016-08-22 07:25:37,741 - browbeat.Rally - DEBUG - Default Times: 6
    2016-08-22 07:25:37,741 - browbeat.Rally - INFO - Skipping create-attach-volume-centos scenario enabled: false
    2016-08-22 07:25:37,741 - browbeat.Rally - INFO - Running Scenario: create-attach-volume-cirros
    2016-08-22 07:25:37,741 - browbeat.Rally - DEBUG - Scenario File: rally/cinder/cinder-create-and-attach-volume-cc.yml
    2016-08-22 07:25:37,741 - browbeat.Rally - DEBUG - Overriding Scenario Args: {'image_name': 'cirros', 'flavor_name': 'm1.tiny'}
    2016-08-22 07:25:37,741 - browbeat.Rally - DEBUG - Created result directory: results/20160822-072227/cinder/create-attach-volume-cirros
    2016-08-22 07:25:37,741 - browbeat.Rally - DEBUG - --------------------------------
    2016-08-22 07:25:37,741 - browbeat.Rally - DEBUG - task_file: rally/cinder/cinder-create-and-attach-volume-cc.yml
    2016-08-22 07:25:37,741 - browbeat.Rally - DEBUG - scenario_args: {'image_name': 'cirros', 'flavor_name': 'm1.tiny', 'concurrency': 2, 'times': 6}
    2016-08-22 07:25:37,741 - browbeat.Rally - DEBUG - result_dir: results/20160822-072227/cinder/create-attach-volume-cirros
    2016-08-22 07:25:37,742 - browbeat.Rally - DEBUG - test_name: 20160822-072227-browbeat-create-attach-volume-cirros-2-iteration-0
    2016-08-22 07:25:37,742 - browbeat.Rally - DEBUG - --------------------------------
    2016-08-22 07:28:48,796 - browbeat.Rally - ERROR - Cannot find task_id


  3. I have image and flavour set as per my platform but I get this error:

    (rally)root@rally:~/browbeat# cat results/20160822-151341/plugins/netcreate-boot/20160822-151341-browbeat-netcreate-boot-8-iteration-0.log
    Option “verbose” from group “DEFAULT” is deprecated for removal. Its value may be silently ignored in the future.
    2016-08-22 15:31:18.392 5329 RALLYDEBUG rally.cli.cliutils [-] INFO logs from urllib3 and requests module are hide.
    2016-08-22 15:31:18.392 5329 RALLYDEBUG rally.cli.cliutils [-] urllib3 insecure warnings are hidden.
    2016-08-22 15:31:18.392 5329 RALLYDEBUG rally.cli.cliutils [-] ERROR log from boto module is hide.
    2016-08-22 15:31:18.392 5329 INFO rally.common.plugin.discover [-] Loading plugins from directories rally/rally-plugins/netcreate-boot/*
    2016-08-22 15:31:18.396 5329 WARNING rally.common.plugin.discover [-] Failed to load module with plugins rally/rally-plugins/netcreate-boot/netcreate_boot.py: 'module' object has no attribute 'set'
    2016-08-22 15:31:18.396 5329 ERROR rally.common.plugin.discover [-] 'module' object has no attribute 'set'
    2016-08-22 15:31:18.396 5329 ERROR rally.common.plugin.discover Traceback (most recent call last):
    2016-08-22 15:31:18.396 5329 ERROR rally.common.plugin.discover File "/opt/rally/local/lib/python2.7/site-packages/rally/common/plugin/discover.py", line 80, in load_plugins
    2016-08-22 15:31:18.396 5329 ERROR rally.common.plugin.discover imp.load_module(plugin, fp, pathname, descr)
    2016-08-22 15:31:18.396 5329 ERROR rally.common.plugin.discover File "rally/rally-plugins/netcreate-boot/netcreate_boot.py", line 11, in
    2016-08-22 15:31:18.396 5329 ERROR rally.common.plugin.discover scenario.Scenario):
    2016-08-22 15:31:18.396 5329 ERROR rally.common.plugin.discover File "rally/rally-plugins/netcreate-boot/netcreate_boot.py", line 12, in NeutronPlugin
    2016-08-22 15:31:18.396 5329 ERROR rally.common.plugin.discover @types.set(image=types.ImageResourceType,
    2016-08-22 15:31:18.396 5329 ERROR rally.common.plugin.discover AttributeError: 'module' object has no attribute 'set'
    2016-08-22 15:31:18.396 5329 ERROR rally.common.plugin.discover

    Also I don’t see anything in elasticsearch in the log. ES is working:
    (rally)root@rally:~/browbeat# curl http://IP:9200/_cat/indices
    yellow open .kibana 1 1 3 0 12.8kb 12.8kb
    yellow open data-2016.08.22 5 1 38 0 39.7kb 39.7kb
    The index contains some test messages sent by me with filebeat+logstash.

    Is there any log regarding Elastic.py ?


  4. Could you be using an older version of our code? We recently switched from using set to convert:

    https://github.com/openstack/browbeat/blob/master/rally/rally-plugins/netcreate-boot/netcreate_boot.py#L24

    Also, the Elastic code is very lightweight. Have you had a Rally Scenario succeed? So you are seeing zero results hit your Elastic instance?

    You should see an index browbeat-rally-YYYY.MM.DD

    Joe

    Also we are on #openstack-browbeat on IRC!


  5. Strange, I git pulled last Friday…
    Yeah, I had some passes:

    (rally)root@rally:~/browbeat# cat results/20160822-151341.report
    Browbeat Report Card
      Rally:
        authenticate:
          tests:
            - Test name: authentic-keystone-64-iteration-0
              Time: 12.184587001800537
              status: pass
        cinder:
          tests:
            - Test name: create-attach-volume-cirros-2-iteration-0
              Time: 627.0361371040344
              status: pass
        keystonebasic:
          tests:
            - Test name: create-and-list-users-10-iteration-0
              Time: 19.010510206222534
              status: pass
        neutron:
          tests:
            - Test name: create-list-port-8-iteration-0
              Time: 40.29532790184021
              status: pass
        nova:
          tests:
            - Test name: boot-snapshot-delete-8-iteration-0
              Time: 272.20582914352417
              status: pass
            - Test name: boot-list-cirros-8-iteration-0
              Time: 14.306036949157715
              status: fail
        plugins:
          tests:
            - Test name: netcreate-boot-8-iteration-0
              Time: 0.9705228805541992
              status: fail
            - Test name: subnet-router-create-8-iteration-0
              Time: 0.9785258769989014
              status: fail

    I have now changed the code to your new one. I'm trying to run it now but I keep getting Service Unavailable (HTTP 503); I think my keystone has some issues :(.
    Facepalm: elastic was set to false :(( Truly sorry for that…. I'll get back to you ASAP.


  6. Hi guys,
    I managed to make it work with Elasticsearch, with minor tweaks to Rally.py and elastic.py to ignore the metadata_files:, because I don't use them and from what I saw they are mostly used by Ansible.
    How did you manage to get the "max raw" and "min raw" in Kibana in your example? Or is it simpler to do it in Grafana? I would prefer Grafana 🙂
    I checked all the stuff in filebeat-dashboards.zip and there is nothing about a browbeat-* index; it's about logstash-*.


  7. I'm not using Ansible to monitor the hosts on which OpenStack resides; I prefer collectd + Prometheus + Grafana. I use Browbeat only for the OpenStack tests.
    I bypass the metadata like this:
    a "try" after the combine_metadata in lib/Elastic.py
    and a:
    except : pass
    return result
    at the end.

    I need to have a look at how I can remove the metadata from your visualization 😦 or create one from scratch in Grafana.
    Is it possible to include in the JSON that you push to Elasticsearch the fields from the "Response Times" table in the iteration-0.log?

