
[Optional] Lab Task 5: Scenario 2 - Closed Loop Automation with Splunk

In this hands-on learning lab, you will explore how to leverage Splunk and Closed-Loop Automation to enhance network observability and automate incident response.

**Use Case: Automated QoS Policy Deployment Based on Network Telemetry**

In this use-case, telemetry data is continuously collected from switches to monitor traffic patterns. If bandwidth congestion is detected, an automated remediation process is triggered.

Workflow:

  1. Data Generation: After configuring model-driven telemetry, the Catalyst 8000V transmits telemetry data, including interface bandwidth utilization, to Splunk.
  2. Monitoring & Detection: Splunk analyzes the data. If an interface exceeds 5 Mbps of transmitted data, it flags congestion.
  3. Automation Trigger: Upon detecting congestion, Splunk sends an alert to GitLab CI/CD via a webhook.
  4. Remediation: A GitLab pipeline is executed, pushing a configuration change to the affected switch and applying a QoS policy that limits the bandwidth.

By the end of this lab, you will have a complete closed-loop automation setup, integrating Splunk’s analytics capabilities with Cisco’s Model-Driven Telemetry (MDT) and automated remediation workflows.

Credentials

| Controller | DNS | Username (e.g. pod01) | Password |
| --- | --- | --- | --- |
| Splunk | https://198.18.133.50:8000 | pod<pod-number> | Cisco123! |
| Telegraf | 198.18.134.22 | | |

Introduction to Splunk & Closed Loop Automation

Splunk

Splunk (now part of Cisco) is a powerful data analytics platform that enables organizations to search, monitor, and analyze machine-generated data. It is widely used in network engineering to provide insights into network performance, security threats, and operational health.

  • Real-time Monitoring: Collects and processes data from multiple sources in real-time.

  • Security Insights: Detects anomalies and security threats.

  • Troubleshooting & Analytics: Helps engineers troubleshoot and optimize network performance.

  • Automation & Alerts: Automates responses to network events and triggers alerts.

Data Flow in Splunk

The high-level data flow can be divided into three parts: first, data is ingested from various sources; it is then automatically indexed; finally, users perform their search and analysis via the dashboard.

  1. Data Ingestion: Forwarders collect data from logs, syslogs, SNMP traps, APIs, or cloud sources.
  2. Indexing: Data is parsed, structured, and stored in indexers.
  3. Search & Analysis: Users interact via the search head to query, visualize, and alert on data.

Closed Loop Automation

Closed-loop automation is a system that continuously monitors and adjusts processes based on real-time data, minimizing human intervention. It is widely used in networking, IT operations, and industrial automation to enhance efficiency, resilience, and self-healing capabilities.

Step 1: Splunk Basics - Accessing Metrics

In essence, Splunk is used for log and event management. It helps analyze, monitor, and visualize data from various sources, including network devices.

Splunk deals primarily with events, which are timestamped records of activities or changes that occur in a system.

However, it can also deal with metrics. In this scenario we are only dealing with metrics from the networking device.

Events vs. Metrics

  • Events: Represent discrete occurrences, such as a syslog message indicating an interface down or an authentication failure.

  • Metrics: Represent numerical data over time, such as CPU utilization or bandwidth usage.
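
To make the distinction concrete, below is an illustrative pair of searches: the first returns raw events from Splunk's internal index, while the second aggregates a numeric metric over time with mstats. The metrics index and metric name shown here are the ones used later in this lab; adjust them if your setup differs.

bash
index=_internal | head 5

| mstats max("Cisco-IOS-XE-interfaces-oper:interfaces/interface/statistics.tx_kbps") WHERE "index"="cisco_telemetry" span=1m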

Search & Reporting Application

Splunk offers various methods to analyze and visualize network telemetry data. Several applications can be installed on top of the core Splunk platform to provide deeper insights into the underlying data.

One essential application is the Search & Reporting application, which lets you search your data, create data models and pivots, save your searches and pivots as reports, configure alerts, and create dashboards. This app is provided by default.

Let’s start (or even continue) your Splunk journey:

  1. Log in to the Splunk Web Interface.

  2. You should already see the Search & Reporting Application as this is set as your default application.

  3. If you click the Apps ⬇️ menu item in the top-left corner, a drop-down menu will show all installed applications that can be used to interact with the collected data.
    In our scenario, we will only work with the Search & Reporting Application.

Try out Splunk’s search capabilities

Since our Splunk instance does not yet contain external data, we can test the search bar by displaying the first events from the instance itself.

Copy and paste this command into the search bar and you will see the first 10 events from Splunk's internal index (for example, from the Splunk daemon logs). Feel free to explore the different filters for now.

markup
POD01
index=_internal | head 10

Step 2: Configuring Model-Driven Telemetry to Telegraf

Now, let’s send telemetry data from our switches to Splunk. A common architecture for this scenario is to use a data collector such as Telegraf.

We therefore first stream our telemetry data from IOS XE and NX-OS to Telegraf using model-driven telemetry. From there, we forward the important metrics to Splunk.

About Cisco Model-Driven Telemetry (MDT)

Cisco’s Model-Driven Telemetry (MDT) is a streaming-based telemetry framework that enables network devices to continuously push structured data to a collector. Instead of periodically polling devices for data, MDT allows for real-time monitoring and analysis, reducing network overhead and improving observability.
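
For orientation, a periodic MDT subscription on IOS XE has roughly the following shape. This is an illustrative sketch only (the subscription ID, XPath filter, and update period are examples; the receiver points at the lab's Telegraf collector), and the lab devices are already configured, so no action is needed here:

bash
telemetry ietf subscription 101
 encoding encode-kvgpb
 filter xpath /interfaces-ios-xe-oper:interfaces/interface/statistics
 stream yang-push
 update-policy periodic 1000
 receiver ip address 198.18.134.22 57400 protocol grpc-tcp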

Luckily, we have already pushed our telemetry configuration to our Nexus and Catalyst switches. But let’s double-check that the configuration was applied correctly and that the connection to our Telegraf service has been established:

IOS XE

Connect to your Catalyst switch via CML and run these commands.

As you will see, we are sending several metrics out:

  • Statistics of each interface
  • Memory utilization of the device
  • CPU utilization of the device
bash
POD01
show run | begin telemetry
show telemetry connection all

NX-OS

Connect to your Nexus switch via CML and run these commands.

As you will see, the Nexus switch is only streaming interface metrics.

bash
POD01
show running-config | section telemetry
show telemetry transport all

Telegraf.conf

Telegraf is a free server-based agent for collecting and sending all metrics and events from databases, systems, and IoT sensors.

Using Telegraf is simple. You just need to install it as a service and run it with your configuration.

Our telegraf.conf configuration file is shown below.

The configuration is provided for your information only. No action is required on your part, as this service is already up and running.

bash
POD01
[[inputs.cisco_telemetry_mdt]]
transport = "grpc"
service_address = ":57400"

# Splunk HTTP Event Collector (HEC) Output Plugin
[[outputs.http]]
url = "https://198.18.133.50:8088/services/collector"
data_format = "splunkmetric"
insecure_skip_verify = true
splunkmetric_hec_routing = true
splunkmetric_multimetric = false
splunkmetric_omit_event_tag = true

# Splunk HEC Token
[outputs.http.headers]
Content-Type = "application/json"
Authorization = "Splunk <token>"
X-Splunk-Request-Channel = "<token>"
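
For context, the splunkmetric output above makes Telegraf POST JSON metric payloads to Splunk's HTTP Event Collector. A roughly equivalent manual request would look like the illustrative sketch below (the token, host, and metric values are placeholders; nothing needs to be sent by hand in this lab):

bash
curl -k https://198.18.133.50:8088/services/collector \
  -H "Authorization: Splunk <your-hec-token>" \
  -H "Content-Type: application/json" \
  -d '{"event": "metric", "host": "pod01-telegraf", "fields": {"metric_name": "example.tx_kbps", "_value": 4200, "name": "GigabitEthernet2"}}'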

Step 3: Generating Network Traffic

Now that we have established a running pipeline, let’s see whether data is coming into Splunk.
But before that, we need to generate network traffic so that there is data to graph.

In order to generate data we will use iPerf, which is an open-source speed test and network performance measurement tool.
iPerf operates on a client-server model, requiring two devices to conduct a network performance test.

  • The server acts as a listener, waiting for incoming connections.
  • The client initiates the test by connecting to the server and sending traffic for measurement.

Let’s configure this on our Linux clients in CML:

Connect to datacenter-client01 via CML (Console) and execute the following command to start iPerf in server mode, listening on port 5201:

bash
POD01
datacenter-client01:~$ iperf -s -p 5201

Then, connect via CML (Console) to the branch-client01 and execute this command which will send data to the server with the stated IP address and port.

  • -c 10.0.1.10: destination IP address of the server
  • -t 1500: the client stops sending data after 1500 seconds
  • -p 5201: destination port
  • -b 10M: target bandwidth of 10 Mbit/s
  • -i 1: print statistics every second
bash
POD01
branch-client:~$ iperf -c 10.0.1.10 -p 5201 -t 1500 -b 10M -i 1

Step 4: Checking Incoming Data in Splunk Analytics Dashboard

Now we can check whether that data is arriving in Splunk. In the Search & Reporting application, go to the Analytics menu to see all incoming metrics.

Since all lab users are sending the data to the same Splunk instance, you are able to view all incoming data.

Therefore, let’s filter out the data which belongs to your POD.

Since we are interested in the throughput of interfaces G2 (to the firewall) and G3 (to the client) on our Catalyst router, let’s filter the tx (transmit) data on interface G2, for example. Select the following, as shown below:

  • tx in kbps
  • Aggregation to Max
  • We set another filter on the field name to select the interface.

You should already see a short data curve, which should reach roughly 10,000 kbps within 10-15 minutes.

In the meantime, we can open the query in the search bar as well. To do that, just click on the three dots within the diagram and click on Open in Search:

Side note: You could also save this visualization directly to a Splunk Dashboard and create/add/remove several visualizations. If you are interested and have enough time, feel free to explore this functionality and click on Save to Dashboard Studio.

For your information: 

With the mstats command you can analyze metrics. This command performs statistics on the measurement, metric_name, and dimension fields in metric indexes. You can use mstats in historical searches and real-time searches.
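
As a quick illustration (assuming the cisco_telemetry metrics index used later in this lab), you could first discover which metric names are available before building an mstats query:

bash
| mcatalog values(metric_name) WHERE "index"="cisco_telemetry"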

Define the time range: You can adjust the time range of the data which you are currently viewing. In the screenshot below you see the incoming data from the last 15 minutes.

Step 5: Preparing the Loop with GitLab

Create a new playbook called 04_react-on-trigger.yml in the dnac/playbooks directory.

This playbook will update the interface and set the "qos5mbit" custom field to true.

This action will trigger the rule in the Jinja template to apply the "LIMIT-5MBPS" service policy on the interface, helping to regulate traffic flow.

04_react-on-trigger.yml
bash
POD01
- name: Update qos5mbit custom field on GigabitEthernet3
  hosts: localhost
  connection: local
  gather_facts: no
  vars:
    netbox_url: "http://198.18.134.22:9000"
    api_token: "{{ lookup('env', 'api_token') }}"
  tasks:
    - name: Update interface custom field qos5mbit to true
      netbox.netbox.netbox_device_interface:
        netbox_url: "{{ netbox_url }}"
        netbox_token: "{{ api_token }}"
        data:
          device: "{{ device_name }}"
          name: "{{ interface_name }}"
          custom_fields:
            qos5mbit: true
        state: present

    - name: Show update result
      debug:
        msg: "Interface {{ interface_name }} on device {{ device_name }} updated with qos5mbit=true"
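
If you want to test this playbook outside the pipeline, you could run it manually from the dnac directory. This is only an illustrative sketch: it assumes the Ansible virtual environment is active and that your NetBox API token is exported as the api_token environment variable (the pipeline below retrieves it from Vault instead):

bash
export api_token=<your-netbox-api-token>
cd dnac
ansible-playbook -i hosts playbooks/04_react-on-trigger.yml \
  --extra-vars "device_name=POD01-CAT8KV-01" \
  --extra-vars "interface_name=GigabitEthernet3"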

This pipeline defines a single stage, deploy_config_dnac, with two jobs: deploy_config_dnac and deploy_config_dnac_trigger.

Both jobs use the cbeye592/ltrato-2600:dnac Docker image and retrieve specific secrets from Vault, authenticated via an ID token.

In the deploy_config_dnac job, the pipeline activates a Python virtual environment, adjusts permissions for the dnac directory, navigates into it, and executes an Ansible playbook named 03_deploy-template.yml targeting the device POD01-CAT8KV-01. This job runs only if the HOSTNAME_TRIGGER variable is unset or empty.

Conversely, the deploy_config_dnac_trigger job performs the same initial steps but runs two Ansible playbooks:

04_react-on-trigger.yml with parameters for a specific device and interface, followed by 03_deploy-template.yml.

This job executes when the HOSTNAME_TRIGGER variable is set.

.gitlab-ci.yml
bash
POD01
stages:
  - deploy_config_dnac

deploy_config_dnac:
  stage: deploy_config_dnac
  tags:
    - docker-runner
  image: cbeye592/ltrato-2600:dnac
  id_tokens:
    VAULT_ID_TOKEN:
      aud: https://198.18.133.99:8200
  secrets:
    DNAC_HOST:
      vault: DNAC/DNAC_HOST@pod01
      file: false
      token: $VAULT_ID_TOKEN
    DNAC_VERIFY:
      vault: DNAC/DNAC_VERIFY@pod01
      file: false
      token: $VAULT_ID_TOKEN
    DNAC_USERNAME:
      vault: DNAC/DNAC_USERNAME@pod01
      file: false
      token: $VAULT_ID_TOKEN
    DNAC_PASSWORD:
      vault: DNAC/DNAC_PASSWORD@pod01
      file: false
      token: $VAULT_ID_TOKEN
    api_token:
      vault: NETBOX/api_token@pod01
      file: false
      token: $VAULT_ID_TOKEN
  before_script:
    - source /root/ansible/bin/activate
    - chmod -R 700 dnac
    - cd dnac
  script:
    - ansible-playbook -i hosts playbooks/03_deploy-template.yml --extra-vars "device_name=POD01-CAT8KV-01"
  rules:
    - if: '$HOSTNAME_TRIGGER == null || $HOSTNAME_TRIGGER == ""'

deploy_config_dnac_trigger:
  stage: deploy_config_dnac
  tags:
    - docker-runner
  image: cbeye592/ltrato-2600:dnac
  id_tokens:
    VAULT_ID_TOKEN:
      aud: https://198.18.133.99:8200
  secrets:
    DNAC_HOST:
      vault: DNAC/DNAC_HOST@pod01
      file: false
      token: $VAULT_ID_TOKEN
    DNAC_VERIFY:
      vault: DNAC/DNAC_VERIFY@pod01
      file: false
      token: $VAULT_ID_TOKEN
    DNAC_USERNAME:
      vault: DNAC/DNAC_USERNAME@pod01
      file: false
      token: $VAULT_ID_TOKEN
    DNAC_PASSWORD:
      vault: DNAC/DNAC_PASSWORD@pod01
      file: false
      token: $VAULT_ID_TOKEN
    api_token:
      vault: NETBOX/api_token@pod01
      file: false
      token: $VAULT_ID_TOKEN
  before_script:
    - source /root/ansible/bin/activate
    - chmod -R 700 dnac
    - cd dnac
  script:
    - ansible-playbook -i hosts playbooks/04_react-on-trigger.yml --extra-vars "device_name=POD01-CAT8KV-01" --extra-vars "interface_name=GigabitEthernet3"
    - ansible-playbook -i hosts playbooks/03_deploy-template.yml --extra-vars "device_name=POD01-CAT8KV-01"
  rules:
    - if: '$HOSTNAME_TRIGGER'

Step 6: Connecting the full Loop - Creating a Splunk Alert

Splunk Alerts notify when certain conditions are met, such as high interface utilization or security threats.

What Splunk Alerts Do
  • Trigger actions when specific conditions occur (e.g., CPU usage > 80%).
  • Send notifications via email, webhook, or script execution.
  • Automate incident response and network troubleshooting.

Let’s create our rule-based Alert

1. Go to Splunk and, if you are not in Search already, open the Search view.

2. Since we need to alter our search query a bit, copy and paste the following mstats query into the search bar (replace any previous text!).

  • Line 1: With mstats max(...) AS transmit_tx, we assign the maximum value of the interface tx field to a new field named transmit_tx. We filter on the index and aggregate the results by hostname and name.
  • Line 2: Here we remove all span metadata fields from the results.
  • Line 3: This is our rule-based filter, which keeps only the results matching the conditions. As you can see, we filter the data points on the transmit_tx field, which must be above 5000.

Be sure to use the hostname of the Catalyst router in your POD, e.g. POD<pod-no>-CAT8KV-01.

bash
POD01
| mstats max("Cisco-IOS-XE-interfaces-oper:interfaces/interface/statistics.tx_kbps") AS transmit_tx WHERE "index"="cisco_telemetry" span=10s BY hostname, name
| fields - _span*
| search transmit_tx > 5000 AND hostname="POD<pod-no>-CAT8KV-01" AND name="GigabitEthernet2"

3. When done, click on Save As > Alert and fill in the following fields according to the screenshots:

We configure a scheduled alert whose condition is evaluated every minute. If the number of results is greater than 0, it means that on interface G2 of our Catalyst the transmit_tx value is higher than 5000.

We also add a Throttle to suppress triggering for 4 hours after the first results are received.

Then we also need to add a Trigger Action, which sends data to GitLab.

Trigger Actions: In the alert creation form, we will select the application Better Webhook where we can send an alert with a custom payload to a webhook service.

1. URL: This is the target URL to which Splunk sends the RESTful notification. We want to send it to our GitLab project, and for that we need the project ID. Go to GitLab, find the project ID in your project repository, and insert it into the URL accordingly.

bash
POD01
https://198.18.133.99/api/v4/projects/<your-project-id>/pipeline

2. Body Format: Copy and paste the payload below. This is the JSON format required by GitLab. In our case, we send the hostname as the unique identifier.

json
POD01
{
    "ref": "main",
    "variables": [
        {
            "key": "HOSTNAME_TRIGGER",
            "value": "$result.hostname$"
        }
    ]
}

3. Credential: GitLab token. This is already provided and pre-configured.
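
For reference, the alert's webhook essentially issues the same REST call you could make manually against GitLab's pipeline API. The sketch below is illustrative only (the project ID, token, and hostname value are placeholders, and the lab's pre-configured credential may authenticate differently):

bash
curl -k -X POST "https://198.18.133.99/api/v4/projects/<your-project-id>/pipeline" \
  -H "PRIVATE-TOKEN: <your-gitlab-token>" \
  -H "Content-Type: application/json" \
  -d '{"ref": "main", "variables": [{"key": "HOSTNAME_TRIGGER", "value": "POD<pod-no>-CAT8KV-01"}]}'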

5. After the alert has been saved, click on View Alert.

Step 7: Check if the Trigger works

Now that we have configured our closed-loop automation including the trigger, let’s test whether everything works as expected. It may already have been triggered, or it will be soon.

You can check the iperf logs on branch-client01.

Or you can check in Splunk, either via Analytics or by running this search:

Again, be sure to change the POD number accordingly.

bash
POD01
| mstats max("Cisco-IOS-XE-interfaces-oper:interfaces/interface/statistics.tx_kbps") prestats=true WHERE "index"="cisco_telemetry" AND "hostname"="POD<podno>-CAT8KV-01" AND "name"="GigabitEthernet2" span=10s
| timechart max("Cisco-IOS-XE-interfaces-oper:interfaces/interface/statistics.tx_kbps") AS Max span=10s
| fields - _span*
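
You could also confirm on the GitLab side that the alert actually started a pipeline, either in the project's CI/CD > Pipelines view or via the API. The call below is an illustrative sketch (project ID and token are placeholders):

bash
curl -k -H "PRIVATE-TOKEN: <your-gitlab-token>" \
  "https://198.18.133.99/api/v4/projects/<your-project-id>/pipelines?per_page=5"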