The Mobility Service

The mobility service is the portion of the Virginia Tech network experience where end devices connect wirelessly. Today, this is limited to Wi-Fi, but that may change in the future.

About this documentation

Target Audience

This documentation is primarily internal to the team managing/supporting the mobility service. It is also a convenient way to share ideas outside of the core team, and to encourage open development.

Contributing/feedback

All contributions and feedback, on the documentation and the service itself, are welcome.

Administrative Unit Assessments

NI&S is required to set measurable goals on an annual basis. These are the goals for the wireless service for the year 2021.

Service Objective

Design and build robust and resilient IT infrastructure in support of Virginia Tech's expansion and growth.

Measure: indoor coverage

Track Wi-Fi coverage in indoor programmed spaces.

Target

Provide comprehensive indoor Wi-Fi coverage with at least -65dBm signal strength for all programmed spaces.

Progress

We already design for this. However, measuring it in a meaningful way is tricky. Some options include:

  1. Spot checking areas manually
  2. Deploying end-to-end testing beacons
  3. Counting tickets

Each of these has its drawbacks. Options 1 and 2 are guaranteed to miss areas, especially corner cases that were missed in the design phase. Option 3 is not practical, as it is not possible to programmatically extract that data from tickets.

Measure: capacity

Track Wi-Fi capacity in indoor programmed spaces.

Target

Track Wi-Fi capacity to ensure a wireless client to access point ratio of 25:1 or better.
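As a rough illustration of this target, the check below flags any AP whose associated client count exceeds 25. It is only a sketch: the AP names and counts are made-up sample data, `MAX_CLIENTS_PER_AP` simply encodes the 25:1 target, and how the counts would actually be collected is left open.

```python
MAX_CLIENTS_PER_AP = 25  # the 25:1 target stated above


def over_capacity(clients_per_ap: dict[str, int], limit: int = MAX_CLIENTS_PER_AP) -> list[str]:
    """Return the names of APs whose client count exceeds the limit, sorted."""
    return sorted(ap for ap, count in clients_per_ap.items() if count > limit)


# Hypothetical sample data; real counts would come from the monitoring system.
sample = {"ap-example-1": 12, "ap-example-2": 31, "ap-example-3": 25}
print(over_capacity(sample))
```

An AP sitting at exactly 25 clients meets the target, so only counts above the limit are reported.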

Progress

This is another tricky one to measure, mostly because our monitoring tools suck. AirWave and NetInsight are both going away in the next year(ish), and will be replaced with Central On-Prem (COP). We will re-evaluate when COP is deployed.

It is worth noting that clients per radio is perhaps not the best metric:

  • The 2.4GHz space is noisier than 5GHz, and as such would need a lower threshold for an equivalent experience.
  • A client connecting at 12Mbps uses more resources than a client connecting at 900Mbps.
  • A client streaming Netflix 4k uses more resources than an idle client.

Suggestion: let's track peak airtime utilization instead. We need to find what is a good target and how to measure this.
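To make the suggestion a little more concrete, here is a sketch of what "peak airtime utilization" could look like once samples are available. Everything here is an assumption: the radio names, the sample values, and especially the 70 percent threshold, which is a placeholder and not an agreed target.

```python
def peak_utilization(samples: list[float]) -> float:
    """Return the highest airtime utilization (0-100 percent) seen in the samples."""
    if not samples:
        raise ValueError("no samples")
    return max(samples)


def busy_radios(samples_by_radio: dict[str, list[float]], threshold: float) -> dict[str, float]:
    """Map each radio whose peak utilization exceeds the threshold to that peak."""
    peaks = {radio: peak_utilization(s) for radio, s in samples_by_radio.items() if s}
    return {radio: peak for radio, peak in peaks.items() if peak > threshold}


# Hypothetical samples (percent airtime busy) for two radios on one AP.
samples = {
    "ap-example-1/5GHz": [10.0, 42.5, 18.0],
    "ap-example-1/2.4GHz": [55.0, 81.0, 60.5],
}
PLACEHOLDER_THRESHOLD = 70.0  # not a real target; still to be determined
print(busy_radios(samples, PLACEHOLDER_THRESHOLD))
```

Unlike a client count, this treats a radio with a few heavy clients and a radio with many idle clients according to the airtime they actually consume.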

Measure: outdoor coverage

Track Wi-Fi coverage in outdoor spaces.

Target

Expand the number of outdoor wireless access points by 20%.

Progress

Funding for this is planned. We need to determine how many (in scope) outdoor APs exist now.

Administrative Objective

Increase organizational efficiency and responsiveness.

Measure: AP provisioning

Improve AP provisioning capabilities.

Target

Automate the provisioning of all standard wireless access point installations.

Progress

This one is mostly done. See enhancement task.

Complete:

  • Work with devs to create a tool that will automatically provision an AP.
  • The tool has 2 modes:
    • REST API: on demand provisioning of an AP by passing a MAC address
    • Nightly reconciler: in case something on demand is missed (or the environment changes).
  • Pulls info from the controller and LLDP to determine the AP group and name.
  • Creates an AP group if none exists.
  • A script is written to parse an SNMP trap and fire off the REST call.
  • MDs currently send traps via IPv4 to OMD (stonefly).
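The trap-to-REST connector described in the list above could be sketched roughly as follows. This is not the actual script: the trap text format, the endpoint URL, and the JSON payload are all hypothetical placeholders, and the request is only built, never sent.

```python
import json
import re
import urllib.request
from typing import Optional

# Hypothetical endpoint; the real provisioning tool's URL and payload are not documented here.
PROVISION_URL = "https://provisioner.example.invalid/api/provision"

# Six hex octets separated by colons or dashes.
MAC_RE = re.compile(r"\b([0-9A-Fa-f]{2}(?:[:-][0-9A-Fa-f]{2}){5})\b")


def extract_mac(trap_text: str) -> Optional[str]:
    """Return the first MAC address found in the trap text, lowercased with colons."""
    match = MAC_RE.search(trap_text)
    if match is None:
        return None
    return match.group(1).lower().replace("-", ":")


def build_request(mac: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking the tool to provision this AP."""
    body = json.dumps({"mac": mac}).encode()
    return urllib.request.Request(
        PROVISION_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical trap text, for illustration only.
mac = extract_mac("trap received: new AP 00-1A-1E-03-01-98 seen")
print(mac, build_request(mac).get_method())
```

Keeping the parsing and the request construction separate makes it easier to drop the connector later if the web app ends up listening to SNMP directly.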

Incomplete:

  • OMD is not correctly processing the traps it receives.
  • Moving from OMD to AKIPS now that it is purchased.
  • Rather than managing the SNMP/REST connector ourselves, we would rather have the web app listen to SNMP directly.
  • Documentation

Service Priorities

These are thoughts on what we prioritize or value for the Wi-Fi service specifically. These are not the core values set in the IT Strategic Plan, but are (in part) an extension of them as they apply specifically to the Wi-Fi service. They are closely related to the AUAs, but are more general and broadly scoped in nature.

Key insights

  • Our users set the expectations for the service.
  • Our priorities are defined by the expectations.
  • The priorities drive the features and properties of the system.
  • All of this is constrained and guided by our core values.

What this looks like for Wi-Fi

Expectations

  1. Ubiquitous and seamless coverage
  2. Reliable access
  3. High bandwidth
  4. Reasonable latency

Priorities

  1. Robust
    1. Systems stay up
    2. Systems continue to provide service when they are up
    3. Systems are fault tolerant
    4. Debuggable
  2. Flexible
    1. Well understood and solved problems should be solved out of the box
    2. The system should easily accommodate new ideas or deployments not considered by the vendor.
  3. Secure
  4. Coherent architecture

Properties

User end

  1. Frictionless access
  2. Latest standards
  3. Dual stacked now
  4. Single stack IPv6 limitations are external

Administrator end

  1. Fault tolerant
    1. Hardware should be able to fail without impacting users
    2. Replacing hardware should be low risk and (relatively) low effort
  2. API driven
    1. Complete
    2. Idempotent
  3. Single stacked IPv6
    1. IPv6 Addresses are not strings
  4. Fixing one part of the system should not depend on another part of the system
  5. Centralized (or perhaps intent based) config (within what is allowed due to above)
  6. Integrated monitoring
  7. Observable
    1. Is the system itself healthy?
    2. What is the user facing status?
    3. Do I have the tools to see what is going on under the covers?
    4. Do I have tools to identify an unknown unknown problem?
  8. Configuration that is difficult (preferably impossible) to get out of sync.
  9. (Flexible)
    1. Sane defaults
    2. Extensive options
    3. Building-block config
    4. Clear and consistent mental model to the config
  10. Split control plane and data plane
  11. OOB access
  12. Key/cert based access
  13. Idempotent API/Config
  14. Auditable config
  15. Usable config system
  16. Easily spun up
    1. Lab purposes
    2. Ransomware recovery
  17. Life cycle
    1. Replacing hardware
    2. Hardware available
    3. Can the vendor do business?
      1. Is the vendor able to ship hardware?
      2. Can the vendor tell us how much we owe in support (in a reasonable time frame)?
      3. Can we predict how much we owe and what items we need?
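One way to read the "IPv6 addresses are not strings" property in the list above: the same address can be spelled several ways, so a system should compare parsed values rather than text. A minimal sketch using Python's standard `ipaddress` module, with one documented address written two different ways:

```python
import ipaddress

spelling_a = "2607:b400:0062:1400:0:16:247:20"
spelling_b = "2607:B400:62:1400::16:247:20"

a = ipaddress.ip_address(spelling_a)
b = ipaddress.ip_address(spelling_b)

print(spelling_a == spelling_b)  # the strings differ
print(a == b)                    # the parsed 128-bit values are compared instead
print(a)                         # one normalized spelling for display
```

A system that stores or matches addresses as text would treat these two spellings as different objects.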

Other notes / questions

  1. Ask for a packet walkthrough

Services

This is a collection of the different ways a device can connect to the wireless network.

Full documentation is a work in progress, but for now, it includes high level information on the authentication used and mechanisms available to protect the network from a misbehaving device.

eduroam

eduroam is the primary wireless network at Virginia Tech.

Authentication

Virginia Tech users are authenticated with PEAP/MSCHAPv2. Because this is a thoroughly broken protocol, these credentials are used only for network authentication.

Network

All users, VT affiliates and roaming users on VT's campus, land in vlan-users.

Remediation

We can remove a user or device from the network in two ways.

  1. Disable the credentials.
    • VT accounts can have the network entitlement removed, effectively revoking their authorization.
    • By design, VT is unable to see the individual usernames for roaming users (e.g., a Radford user on VT's campus). We can, however, see what institution their account is from. Therefore, to revoke access, we need to contact the user's home institution. Since this is a process that can take some time and is not within our control, we can also block ALL authentication for that institution.
  2. Block the MAC address.
    • This must be entered on each controller.
    • The controller then denies all 802.11 authentication requests from that MAC, which prevents the device from even associating.
    • This is becoming less effective as MAC randomization becomes more common.
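Randomized MAC addresses set the locally administered bit (0x02 in the first octet), so a quick check of that bit hints at whether a MAC block is likely to stick. The sketch below is illustrative only; the bit indicates a locally assigned address, which is a strong hint of randomization but not proof.

```python
def is_locally_administered(mac: str) -> bool:
    """True if the MAC has the locally administered bit (0x02 of the first octet) set."""
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    return bool(first_octet & 0x02)


# A made-up address with the bit set, and a vendor-assigned address without it.
print(is_locally_administered("da:a1:19:12:34:56"))
print(is_locally_administered("00:1a:1e:03:01:98"))
```

If the bit is set, the device may simply return with a new address, so disabling credentials is the more durable option where it applies.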

VT Open WiFi

The VT Open WiFi SSID is an open network with no captive portal.

This network should be used by devices that cannot or should not use eduroam. The main reasons for this are:

  • The device cannot do 802.1X authentication (game consoles, Chromecasts, etc.).
  • The device belongs to a group (e.g., department) rather than an individual, and thus does not have eduroam credentials.
  • The user is a guest (and has no eduroam IdP).

Authentication

Users can connect and use the network with or without authentication. Only MAC auth is used, so no matter what, the client sees the network as an open, unauthenticated network. Currently, auth is handled by ClearPass, but it will soon move to an instance of FreeRADIUS.

Devices can be registered in the NIS Portal. Devices can be registered as a personal device or organizational device. Any registered device is put in the Authenticated network; all other devices are in the unauthenticated network.

Quarantine

This is not yet implemented, and is subject to change. Currently, if we need to prevent a device from connecting, it is blocked by MAC address on the controller.

The database backing the FreeRADIUS authentication will include a list of banned MAC addresses. Any device connecting with a banned MAC is placed in the unauthenticated network, irrespective of registration, and is put behind a captive portal.

This captive portal will be a static page without a network login. Instead, it will display a message saying the device has been blocked and that the user should contact 4Help.
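Putting the registration and (planned) quarantine rules together, the placement decision might look something like the sketch below. This is an illustration of the logic described above, not the FreeRADIUS configuration; the function name, the `Placement` enum, and the sample MAC addresses are all made up.

```python
from enum import Enum


class Placement(Enum):
    AUTHENTICATED = "authenticated network"
    UNAUTHENTICATED = "unauthenticated network"
    QUARANTINED = "unauthenticated network behind the captive portal"


def place_device(mac: str, registered: set[str], banned: set[str]) -> Placement:
    """Decide where a device lands based on its MAC address."""
    mac = mac.lower()
    if mac in banned:  # a ban wins, irrespective of registration
        return Placement.QUARANTINED
    if mac in registered:
        return Placement.AUTHENTICATED
    return Placement.UNAUTHENTICATED


# Hypothetical data for illustration.
registered = {"00:11:22:33:44:55", "66:77:88:99:aa:bb"}
banned = {"66:77:88:99:aa:bb"}
print(place_device("66:77:88:99:AA:BB", registered, banned).value)
```

Checking the ban list first is what makes a banned device land behind the portal even when it is registered.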

Networks

Authenticated

Authenticated devices land in the same network as eduroam users and have no restrictions. Some service owners restrict access to their services to on-campus networks, such as this one.

Devices get an RFC 1918 IPv4 address and a globally routed IPv6 address.

Unauthenticated

Unauthenticated devices land in the guest VRF. Devices get a CG-NAT (100.64.0.0/10) IPv4 address and a globally routed IPv6 address. This traffic is hair-pinned at the border and is effectively treated as Internet traffic.
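When troubleshooting, the client's IPv4 address is a quick hint as to which kind of address space it received. The sketch below only classifies an address as RFC 1918, CG-NAT, or other; the function name and labels are made up, and it does not know which specific subnets are assigned to which network.

```python
import ipaddress

RFC1918 = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def classify_ipv4(address: str) -> str:
    """Return 'cg-nat', 'rfc1918', or 'other' for an IPv4 address."""
    ip = ipaddress.ip_address(address)
    if ip in CGNAT:
        return "cg-nat"
    if any(ip in net for net in RFC1918):
        return "rfc1918"
    return "other"


print(classify_ipv4("100.64.12.34"))
print(classify_ipv4("172.29.0.11"))
```

The networks are listed explicitly rather than relying on a helper such as `is_private`, so the boundaries being tested are visible in the code.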

There are no network ACLs artificially limiting access. However, there are services that require being connected to an "on campus" network to use them, which the unauthenticated network is not. Some services that do not work from the unauthenticated network include:

  • Zoom rooms
  • Digital key access for physical doors

Non-standard networks

These networks are not part of the "Virginia Tech network experience". They are deployed as work-arounds on an as-needed basis. The hope is that as the wireless service grows/evolves, these one-offs will go away.

VT_TIX

This network exists to get the wireless ticket scanners online for athletics. Because we do not have the proper RF coverage in and around the stadium, these devices cannot use the standard networks, as thousands of other clients would also try to associate, choking out the scanners.

Network

vlan-users

Special considerations

  • APs with the VT_TIX network on them have only the VT_TIX network on them. This limits the use of the airspace.
  • The network is hidden to prevent devices from automatically associating.

Long term plan

We need a full deployment of APs in and around the stadium (and other athletic areas) that will support the 65,000+ people who are there for game day. Once we have this, the scanners can use the registered device service.

Authentication

None.

Remediation

  • If possible, just shut down the network (disable the virtual-ap profile).
  • If it is not a good time for that, block the MAC on the controller.

VTEvent

Overview

VTEvent is for one-off events. It can be used to get event staff or users online. It is an open network. It can be a hidden or visible network. There is also an AP group for rapid deploy units.

Scope and Purpose

Often, there are events on campus where the standard networks are not suitable. VTEvent fills this gap. Deployments have a fixed start and end date/time.

Hidden Example

For example, during Relay4Life, the support staff needs a network in the Drillfield. Adding the VT Open WiFi network would not be suitable, as a rapid deployment unit cannot handle the density of clients that would try to connect. In this example, the hidden version of VTEvent should be deployed.

Visible example

Another example would be the SANS and VT-Hacks events, where the attendees need to get all manner of hackerish and IoT devices online. Normally, the registered device service would suit, but since this is an event, we cannot guarantee everyone is a VT affiliate with access to the registered device service. In this example, the visible version of VTEvent should be deployed.

Support

This is a very simple service. There is no authentication. The VLAN/subnet is the same as is used for the other wireless services (vlan-users). As such, the most likely place for something to go wrong is communication. Here are a few cases where something is most likely to go awry.

Hidden SSID

The SSID may be hidden. If so, the customer will need to type in the SSID exactly, including case. It contains no punctuation or other unexpected characters. For reference, the SSID is: VTEvent

Limited time

Is it before the event started? Is the event over? If so, the network may not be broadcasting. It goes up and down at the times agreed upon by the customer and NEO.

Down APs

Unlike the other two examples, this one is a technical issue, not a communications issue. Usually, if the network is hidden, it will be on its own APs. As such, the problem may not be as obvious as normal. If the network is visible, it is probably broadcast from the same APs as the standard networks. This should help in determining if the APs are up.

Deployment and Cleanup

The nature of these events is that they are one-offs, so it is easy to miscommunicate or leave cruft.

Communicate with the customer

Be sure to communicate with the customer what the name of the SSID is, and if the network is hidden. If deploying the hidden version, the customer will need to type it in, so be verbose. Remember that SSIDs are case sensitive!

Don't do it manually

Doing it manually is a sure way to forget the cleanup. Use an existing tool to deploy and clean up the service. NetMRI is an excellent choice.

Create an ECO

Create an ECO for when the service is deployed and keep it open until it has been removed. Double check the tool really did clean up the service before closing the ECO.

AOS config

aaa profile

Both variations use the aaa profile aaa-open. This profile has neither layer 2 nor layer 3 authentication. The VLAN is undefined (it is set by the virtual-ap profile).

ssid-profile

There are two SSID profiles:

  • ssid-VTEvent
  • ssid-VTEvent-hidden

Both use the ESSID VTEvent, with the normal data rates used elsewhere. The only difference is that ssid-VTEvent-hidden is hidden.

virtual-ap profile

There are two virtual-ap profiles:

  • vap-VTEvent
  • vap-VTEvent-hidden

Again, they are exactly the same, except that vap-VTEvent-hidden uses ssid-VTEvent-hidden. Neither has authentication (layer 2 or layer 3), and both use the normal wireless VLAN.

ap-group

There are two AP groups for rapid deployment:

  • apg-vtevent
  • apg-vtevent-hidden

The only virtual-ap configured is the appropriate VTEvent virtual-ap. These AP groups are optimized for outdoor use (see the config below for details).

Configuration

/md/vt

{
  "aaa_prof": [
    {
      "default_user_role": {
        "role": "ur-open"
      },
      "profile-name": "aaa-open"
    }
  ],
  "ssid_prof": [
    {
      "a_basic_rates": {
        "12": "12"
      },
      "a_beacon_rate": {
        "a_phy_rate": "12"
      },
      "a_tx_rates": {
        "12": "12",
        "18": "18",
        "24": "24",
        "36": "36",
        "48": "48",
        "54": "54"
      },
      "advertise_ap_name": {},
      "essid": {
        "essid": "VTEvent"
      },
      "g_basic_rates": {
        "12": "12"
      },
      "g_beacon_rate": {
        "g_phy_rate": "12"
      },
      "g_tx_rates": {
        "12": "12",
        "18": "18",
        "24": "24",
        "36": "36",
        "48": "48",
        "54": "54"
      },
      "max_clients": {
        "max-clients": 150
      },
      "mcast_rate_opt": {},
      "profile-name": "ssid-vtevent"
    },
    {
      "a_basic_rates": {
        "12": "12"
      },
      "a_beacon_rate": {
        "a_phy_rate": "12"
      },
      "a_tx_rates": {
        "12": "12",
        "18": "18",
        "24": "24",
        "36": "36",
        "48": "48",
        "54": "54"
      },
      "advertise_ap_name": {},
      "deny_bcast": {},
      "essid": {
        "essid": "VTEvent"
      },
      "g_basic_rates": {
        "12": "12"
      },
      "g_beacon_rate": {
        "g_phy_rate": "12"
      },
      "g_tx_rates": {
        "12": "12",
        "18": "18",
        "24": "24",
        "36": "36",
        "48": "48",
        "54": "54"
      },
      "hide_ssid": {},
      "max_clients": {
        "max-clients": 150
      },
      "mcast_rate_opt": {},
      "profile-name": "ssid-vtevent-hidden"
    }
  ],
  "virtual_ap": [
    {
      "aaa_prof": {
        "profile-name": "aaa-open"
      },
      "drop_mcast": {},
      "profile-name": "vap-vtevent",
      "ssid_prof": {
        "profile-name": "ssid-vtevent"
      },
      "vlan": {
        "vlan": "vlan-user"
      }
    },
    {
      "aaa_prof": {
        "profile-name": "aaa-open"
      },
      "drop_mcast": {},
      "profile-name": "vap-vtevent-hidden",
      "ssid_prof": {
        "profile-name": "ssid-vtevent-hidden"
      },
      "vlan": {
        "vlan": "vlan-user"
      }
    }
  ]
}
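Since the hidden and visible profiles are meant to be identical apart from the hiding, a small diff is a cheap way to confirm they have not drifted. The sketch below compares two abbreviated copies of the SSID profiles from the config above (the rate settings are omitted for brevity); the helper name is made up.

```python
_MISSING = object()


def differing_keys(a: dict, b: dict) -> set[str]:
    """Return the top-level keys that are absent from one profile or hold different values."""
    return {k for k in a.keys() | b.keys() if a.get(k, _MISSING) != b.get(k, _MISSING)}


# Abbreviated copies of the two SSID profiles (rate settings left out).
visible = {
    "essid": {"essid": "VTEvent"},
    "max_clients": {"max-clients": 150},
    "profile-name": "ssid-vtevent",
}
hidden = {
    "deny_bcast": {},
    "essid": {"essid": "VTEvent"},
    "hide_ssid": {},
    "max_clients": {"max-clients": 150},
    "profile-name": "ssid-vtevent-hidden",
}

print(sorted(differing_keys(visible, hidden)))
```

Anything beyond the profile name and the two hiding-related keys showing up in the result would suggest the profiles are no longer in sync.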

/md/vt/swva

{
  "ap_a_radio_prof": [
    {
      "eirp_max": {
        "eirp-max": 127
      },
      "eirp_min": {
        "eirp-min": 127
      },
      "profile-name": "rpa-outdoor"
    }
  ],
  "ap_g_radio_prof": [
    {
      "eirp_max": {
        "eirp-max": 127
      },
      "eirp_min": {
        "eirp-min": 127
      },
      "profile-name": "rpg-outdoor"
    }
  ],
  "ap_group": [
    {
      "dot11a_prof": {
        "profile-name": "rpa-outdoor"
      },
      "dot11g_prof": {
        "profile-name": "rpg-outdoor"
      },
      "profile-name": "apg-vtevent",
      "reg_domain_prof": {
        "profile-name": "rdp-blacksburg"
      },
      "virtual_ap": [
        {
          "profile-name": "vap-vtevent"
        }
      ]
    },
    {
      "dot11a_prof": {
        "profile-name": "rpa-outdoor"
      },
      "dot11g_prof": {
        "profile-name": "rpg-outdoor"
      },
      "profile-name": "apg-vtevent-hidden",
      "reg_domain_prof": {
        "profile-name": "rdp-blacksburg"
      },
      "virtual_ap": [
        {
          "profile-name": "vap-vtevent-hidden"
        }
      ]
    }
  ]
}

Locally bridged networks

At some locations where we have deployed remote access points (RAPs), we want the traffic to stay local to where the AP is instead of coming back to campus. These virtual AP profiles use the -bridged suffix.

Currently, this is only the case at GCAPS, where the local network is managed by VTTI, not central IT.

Deployment Info

Domain

VT's deployment of the Aruba Mobility system uses the domain mobility.nis.vt.edu. All hostnames are relative to this domain. For example, the hostname foo has the FQDN foo.mobility.nis.vt.edu and the hostname foo.dev has the FQDN foo.dev.mobility.nis.vt.edu.
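For scripts that take the short hostnames used throughout this documentation, the expansion is a one-liner. A minimal sketch (the function name is made up):

```python
DOMAIN = "mobility.nis.vt.edu"


def fqdn(hostname: str, domain: str = DOMAIN) -> str:
    """Expand a hostname that is relative to the mobility domain into its FQDN."""
    return f"{hostname.rstrip('.')}.{domain}"


print(fqdn("foo"))
print(fqdn("foo.dev"))
```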

Configuration Hierarchy

Design

/
├── mm
│   └── mynode
└── md
    └── [org]
        └── [region]
            └── [cluster]
                └── [device]

  • /, /mm, /md, and /mm/mynode are created by the system and cannot be removed.
  • / and /md should never be modified.

Implementation

/
├── mm
│   ├── isb-mm-1
│   └── isb-mm-2
└── md
    └── vt
        ├── swva
        │   ├── bur
        │   │   ├── bur-md-1
        │   │   ├── bur-md-2
        │   │   ├── bur-md-3
        │   │   └── bur-md-4
        │   ├── col
        │   │   ├── col-md-1
        │   │   ├── col-md-2
        │   │   ├── col-md-3
        │   │   └── col-md-4
        │   ├── res
        │   │   ├── res-md-1
        │   │   ├── res-md-2
        │   │   ├── res-md-3
        │   │   └── res-md-4
        │   └── vtc
        │       ├── vtc-md-1
        │       └── vtc-md-2
        └── nova
            └── equinix
                ├── equinix-md-1
                └── equinix-md-2
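When looking for where a setting is defined, it can help to list every node from the root down to a given device. The sketch below does just that for a path in the tree above; treating these nodes as the places to look is our working assumption about how the hierarchy is used, and the function name is made up.

```python
def ancestors(path: str) -> list[str]:
    """Return every node from the root down to (and including) the given path."""
    parts = [p for p in path.split("/") if p]
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


for node in ancestors("/md/vt/swva/bur/bur-md-1"):
    print(node)
```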

Configuration prefixes

Configuration Item                      Prefix                  Configuration tier
aaa authentication captive-portal       cp-                     org
aaa authentication dot1x                dot1x-                  org
aaa authentication mac                  mac-                    org
aaa authentication-server radius        asr-<server>-<service>  mm/org
aaa profile                             aaa-                    org
aaa server-group                        sg-                     mm/org
ap regulatory-domain-profile            rdp-                    region
ap-group                                apg-                    region
ip access-list session (allows)         acl-allow-              org
ip access-list session (denies)         acl-deny-               org
ip access-list session (mixed/captive)  acl-control-            org
lcc-cluster group-profile               lcc-                    cluster
mgmt-server profile                     ms-                     cluster
netdestination6                         nd6-                    org
netdestination                          nd-                     org
rf arm-profile                          arm-                    region
rf dot11-6GHz-radio-profile             rp6-                    region
rf dot11a-radio-profile                 rpa-, rp5-              region
rf dot11g-radio-profile                 rpg-, rp2-              region
user-role                               ur-                     org
vlan-name                               vlan-                   org
wlan he-ssid-profile                    hessid-                 org
wlan ht-ssid-profile                    htssid-                 org
wlan ssid-profile                       ssid-                   org
wlan virtual-ap                         vap-                    org
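A naming convention like this is easy to check mechanically. The sketch below covers only a handful of the configuration items from the table above, and the function name is made up; it is meant as an illustration of the idea rather than a complete linter.

```python
# A subset of the prefixes listed above; each item maps to its allowed prefixes.
PREFIXES = {
    "aaa profile": ("aaa-",),
    "ap-group": ("apg-",),
    "rf dot11a-radio-profile": ("rpa-", "rp5-"),
    "user-role": ("ur-",),
    "wlan ssid-profile": ("ssid-",),
    "wlan virtual-ap": ("vap-",),
}


def follows_convention(item: str, name: str) -> bool:
    """True if the profile name starts with one of the prefixes for its configuration item."""
    return name.startswith(PREFIXES[item])


print(follows_convention("wlan virtual-ap", "vap-vtevent"))
print(follows_convention("ap-group", "vtevent"))
```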

Production

Mobility Conductors

The devices formerly known as "Mobility Masters" (MMs).

Physical

These are in the process of being phased out.

  • Model: hw-mm-10k
  • VLAN: 100
  • VRRP ID: 1

Hostname  Serial      MAC                IPv4              IPv6
isb-mm    -           -                  128.173.32.36     2607:b400:2:2000:0:173:32:36
isb-mm-1  TWK7K3503H  20:4c:03:8f:53:1a  128.173.32.34/27  2607:b400:2:2000:0:173:32:34/64
isb-mm-2  TWF5K350V3  20:4c:03:0e:e0:44  128.173.32.35/27  2607:b400:2:2000:0:173:32:35/64

Virtual

NOTE: The IPv4 addresses listed here are reserved, but not used.

  • Model: MM-VA-10K
  • VLAN: 115
  • VRRP ID: 20

Hostname  Product key#  IPv4               IPv6
mm        -             198.82.169.229     2001:468:c80:210f:0:175:c1d7:3214
mm-1      ?             198.82.169.230/24  2001:468:c80:210f:0:18d:616:29ba/64
mm-2      ?             198.82.169.231/24  2001:468:c80:210f:0:179:c946:7349/64

Mobility Devices (MDs)

Burruss

Management

Hostname  Serial     MAC                IPv4             IPv6
bur-md-1  DL0001328  00:1a:1e:03:01:98  172.16.1.141/25  2607:b400:66:6000:0:16:1:141/64
bur-md-2  DL0001122  00:1a:1e:02:d8:b0  172.16.1.142/25  2607:b400:66:6000:0:16:1:142/64
bur-md-3  DL0001099  00:1a:1e:02:d9:70  172.16.1.143/25  2607:b400:66:6000:0:16:1:143/64
bur-md-4  DL0001321  00:1a:1e:03:00:a8  172.16.1.144/25  2607:b400:66:6000:0:16:1:144/64

  • Model: A7240XM
  • VLAN: 399
  • AP Discovery VRRP: 172.16.1.150
  • AP Discovery VRRPv6: 2607:b400:66:6000:0:16:1:150
  • Out of Band: bur-oob-01.oob.cns.vt.edu

Cluster

Hostname  VRRP ID  IPv4 VIP      IPv6 VIP
bur-md-1  220      172.16.1.151  2607:b400:66:6000:0:16:1:151/64
bur-md-2  220      172.16.1.152  2607:b400:66:6000:0:16:1:152/64
bur-md-3  220      172.16.1.153  2607:b400:66:6000:0:16:1:153/64
bur-md-4  220      172.16.1.154  2607:b400:66:6000:0:16:1:154/64

vlan-guest

Hostname  VLAN ID  IPv4            IPv6
bur-md-1  800      172.25.8.11/22  2607:b400:a00:0:0:25:8:11/64
bur-md-2  800      172.25.8.12/22  2607:b400:a00:0:0:25:8:12/64
bur-md-3  800      172.25.8.13/22  2607:b400:a00:0:0:25:8:13/64
bur-md-4  800      172.25.8.14/22  2607:b400:a00:0:0:25:8:14/64

vlan-user

Hostname  VLAN ID  IPv4            IPv6
bur-md-1  1350     172.29.0.11/17  2607:b400:26:0:29:0:11/64
bur-md-2  1350     172.29.0.12/17  2607:b400:26:0:29:0:12/64
bur-md-3  1350     172.29.0.13/17  2607:b400:26:0:29:0:13/64
bur-md-4  1350     172.29.0.14/17  2607:b400:26:0:29:0:14/64

Coliseum

Management

Hostname  Serial     MAC                IPv4            IPv6
col-md-1  DL0001121  00:1a:1e:02:d8:90  172.16.1.11/25  2607:b400:64:4000:0:16:1:11/64
col-md-2  DL0001357  00:1a:1e:03:03:08  172.16.1.12/25  2607:b400:64:4000:0:16:1:12/64
col-md-3  DL0001106  00:1a:1e:02:d8:f0  172.16.1.13/25  2607:b400:64:4000:0:16:1:13/64
col-md-4  DL0001362  00:1a:1e:03:02:78  172.16.1.14/25  2607:b400:64:4000:0:16:1:14/64

  • Model: A7240XM
  • VLAN: 299
  • AP Discovery VRRP: 172.16.1.20
  • AP Discovery VRRPv6: 2607:b400:64:4000:0:16:1:20
  • Out of Band: col-oob-05.oob.cns.vt.edu

Cluster

Hostname  VRRP ID  IPv4 VIP     IPv6 VIP
col-md-1  220      172.16.1.21  2607:b400:64:4000:0:16:1:21/64
col-md-2  220      172.16.1.22  2607:b400:64:4000:0:16:1:22/64
col-md-3  220      172.16.1.23  2607:b400:64:4000:0:16:1:23/64
col-md-4  220      172.16.1.24  2607:b400:64:4000:0:16:1:24/64

vlan-guest

Hostname  VLAN ID  IPv4             IPv6
col-md-1  801      172.25.16.11/22  2607:b400:a00:1:0:25:16:11/64
col-md-2  801      172.25.16.12/22  2607:b400:a00:1:0:25:16:12/64
col-md-3  801      172.25.16.13/22  2607:b400:a00:1:0:25:16:13/64
col-md-4  801      172.25.16.14/22  2607:b400:a00:1:0:25:16:14/64

vlan-user

Hostname  VLAN ID  IPv4            IPv6
col-md-1  1250     172.30.0.11/17  2607:b400:24:0:0:30:0:11/64
col-md-2  1250     172.30.0.12/17  2607:b400:24:0:0:30:0:12/64
col-md-3  1250     172.30.0.13/17  2607:b400:24:0:0:30:0:13/64
col-md-4  1250     172.30.0.14/17  2607:b400:24:0:0:30:0:14/64

Residential

Management

Hostname  Serial     MAC                IPv4            IPv6
res-md-1  DL0001365  00:1a:1e:03:00:d8  172.17.1.11/24  2607:b400:64:ba00:0:17:1:11/64
res-md-2  DL0001319  00:1a:1e:03:01:90  172.17.1.12/24  2607:b400:64:ba00:0:17:1:12/64
res-md-3  DL0001387  00:1a:1e:03:11:10  172.17.1.13/24  2607:b400:64:ba00:0:17:1:13/64
res-md-4  DL0001417  00:1a:1e:03:0f:f8  172.17.1.14/24  2607:b400:64:ba00:0:17:1:14/64

  • Model: A7240XM
  • VLAN: 3199
  • AP Discovery VRRP: 172.17.1.20
  • AP Discovery VRRPv6: 2607:b400:64:ba00:0:17:1:20
  • Out of Band: col-oob-05.oob.cns.vt.edu

Cluster

Hostname  VRRP ID  IPv4 VIP     IPv6 VIP
res-md-1  220      172.17.1.21  2607:b400:64:ba00:0:17:1:21/64
res-md-2  220      172.17.1.22  2607:b400:64:ba00:0:17:1:22/64
res-md-3  220      172.17.1.23  2607:b400:64:ba00:0:17:1:23/64
res-md-4  220      172.17.1.24  2607:b400:64:ba00:0:17:1:24/64

vlan-guest

Hostname  VLAN ID  IPv4             IPv6
res-md-1  802      172.25.24.11/22  2607:b400:a00:10:0:25:28:11/64
res-md-2  802      172.25.24.12/22  2607:b400:a00:10:0:25:28:12/64
res-md-3  802      172.25.24.13/22  2607:b400:a00:10:0:25:28:13/64
res-md-4  802      172.25.24.14/22  2607:b400:a00:10:0:25:28:14/64

vlan-user

Hostname  VLAN ID  IPv4            IPv6
res-md-1  3200     172.31.0.11/17  2607:b400:b4:1800:0:31:0:11/64
res-md-2  3200     172.31.0.12/17  2607:b400:b4:1800:0:31:0:12/64
res-md-3  3200     172.31.0.13/17  2607:b400:b4:1800:0:31:0:13/64
res-md-4  3200     172.31.0.14/17  2607:b400:b4:1800:0:31:0:14/64

Equinix

Management

Hostname      Serial     MAC                IPv4           IPv6
equinix-md-1  BB0001058  00:1a:1e:00:14:30  45.3.106.2/24  2607:b400:803:0:0:3:106:2/64
equinix-md-2  BB0001964  00:1a:1e:00:99:70  45.3.106.3/24  2607:b400:803:0:0:3:106:3/64

  • Model: A7220
  • VLAN: 2701
  • AP Discovery VRRP: N/A
  • AP Discovery VRRPv6: N/A
  • Out of Band: nvc-pbx-zpe.oob.vtnis.net

Cluster

Hostname      VRRP ID  IPv4 VIP    IPv6 VIP
equinix-md-1  220      45.3.106.4  2607:b400:803:0:0:3:106:4
equinix-md-2  220      45.3.106.5  2607:b400:803:0:0:3:106:5

  • Authenticated VLAN: 2700
  • Unauthenticated VLAN: 808

VTC

Management

Hostname  Serial     MAC                IPv4              IPv6
vtc-md-1  DL0003369  00:1a:1e:04:b1:10  172.16.247.11/23  2607:b400:62:1400:0:16:247:11/64
vtc-md-2  DL0003377  00:1a:1e:04:b1:18  172.16.247.12/23  2607:b400:62:1400:0:16:247:12/64

  • Model: A7240XM
  • VLAN: 100
  • AP Discovery VRRP: 172.16.247.20
  • AP Discovery VRRPv6: 2607:b400:0062:1400:0:16:247:20/64
  • Out of Band: vtc-oob-01.oob.cns.vt.edu

Cluster

Hostname  VRRP ID  IPv4 VIP       IPv6 VIP
vtc-md-1  220      172.16.247.21  2607:b400:62:1400:0:16:247:11/64
vtc-md-2  220      172.16.247.22  2607:b400:62:1400:0:16:247:12/64

vlan-user

Hostname  VLAN ID  IPv4            IPv6
vtc-md-1  1750     172.20.24.2/22  2607:b400:2e:0:0:30:128:11/64
vtc-md-2  1750     172.20.24.3/22  2607:b400:2e:0:0:30:128:12/64

vlan-guest

Hostname  VLAN ID  IPv4             IPv6
vtc-md-1  811      172.25.48.11/23  2607:b400:a02:0:0:25:48:11/64
vtc-md-2  811      172.25.48.12/23  2607:b400:a02:0:0:25:48:12/64

Carilion networks

Hostname  Carilion-AppNet  Carilion-Wireless-WPA
VLAN ID   327              305
vtc-md-1  172.16.185.3/24  172.16.226.3/23
vtc-md-2  172.16.185.4/24  172.16.226.4/23

Dev

Mobility Conductors

NOTE: The IPv4 addresses listed here are reserved, but not used.

  • Model: MM-VA-500
  • VLAN: 115
  • VRRP ID: 239

Hostname  Product key#  IPv4               IPv6
mm.dev    -             198.82.169.232     2001:468:c80:210f:0:133:6fe8:c4ef
mm-1.dev  MM603F362     198.82.169.233/24  2001:468:c80:210f:0:15c:3ecf:1a84/64
mm-2.dev  MM2D6D975     198.82.169.234/24  2001:468:c80:210f:0:1d2:4cad:7ff7/64

Mobility Devices

Coliseum

In band Management

Hostname      Serial     MAC                Model  IPv6
col-md-5.dev  BB0002131  00:1a:1e:00:ab:38  A7220  2607:b400:64:4000:0:16:1:15/64
col-md-6.dev  BB0002505  00:1a:1e:00:be:00  A7220  2607:b400:64:4000:0:16:1:16/64

  • VLAN: 299
  • AP Discovery VRRP: 172.16.1.19
  • AP Discovery VRRPv6: 2607:b400:64:4000:0:16:1:19

OOB Management

Hostname      IPv6
col-md-5.dev  2607:b400:e1:4000:0:0:0:15/64
col-md-6.dev  2607:b400:e1:4000:0:0:0:16/64
col-md-7.dev  2607:b400:e1:4000:0:0:0:17/64

Cluster

Hostname      VRRP ID  IPv4 VIP     IPv6 VIP
col-md-5.dev  219      172.16.1.25  2607:b400:64:4000:0:16:1:25
col-md-6.dev  219      172.16.1.26  2607:b400:64:4000:0:16:1:26
col-md-7.dev  219      172.16.1.27  2607:b400:64:4000:0:16:1:27

vlan-guest

Hostname      VLAN ID  IPv4             IPv6
col-md-5.dev  801      172.25.16.15/22  2607:b400:a00:1:0:25:16:15/64
col-md-6.dev  801      172.25.16.16/22  2607:b400:a00:1:0:25:16:16/64
col-md-7.dev  801      172.25.16.17/22  2607:b400:a00:1:0:25:16:17/64

vlan-user

Hostname      VLAN ID  IPv4            IPv6
col-md-5.dev  1250     172.30.0.15/17  2607:b400:24:0:0:30:0:15/64
col-md-6.dev  1250     172.30.0.16/17  2607:b400:24:0:0:30:0:16/64
col-md-7.dev  1250     172.30.0.17/17  2607:b400:24:0:0:30:0:17/64

Lab

These are placeholder addresses, as these devices do not currently exist.

Management

Hostname      Lab host     MAC  IPv4              IPv6
lab-md-1.dev  adder        -    172.16.19.131/28  2607:b400:62:6d40:0:16:19:131/64
lab-md-2.dev  cottonmouth  -    172.16.19.132/28  2607:b400:62:6d40:0:16:19:132/64

  • Model: N/A
  • VLAN: 1499
  • AP Discovery VRRP: 172.16.19.135
  • AP Discovery VRRPv6: 2607:b400:62:6d40:0:16:19:135
  • Out of Band: N/A

APs

  • IPv4 subnet: 172.16.19.144/28
  • IPv6 subnet: 2607:b400:62:6d80::/64

Central on Prem

As with other things, the domain is mobility.nis.vt.edu. For example, the hostname central has the FQDN central.mobility.nis.vt.edu.

Hostname        Interface  IPv4
central         ens1f0     198.82.169.222/24
central-node-1  ens1f0     198.82.169.223/24
central-node-2  ens1f0     198.82.169.224/24
central-node-3  ens1f0     198.82.169.225/24
central-node-4  ens1f0     198.82.169.226/24
central-node-5  ens1f0     198.82.169.227/24

Additional VIP hostnames:

  • central-central
  • apigw-central
  • ccs-user-api-central
  • sso-central

  • POD IP Range: 10.0.0.0/16
  • Service IP Range: 10.1.0.0/16

iLO Configuration

Access credentials

  • Local credentials only
  • See password repository for details

Network

iLO Dedicated Network Port > IPv4:

  • Not posting IPs because iLO is hella insecure. They are documented in the NEO password repo.
  • DNS: 172.19.128.3
  • IPv6 is currently not configured.

iLO Dedicated Network Port > SNTP:

  • Disable DHCPv4/6 Supplied Time Settings
  • Disable Propagate NTP Time to Host
  • Primary Time Server: 172.19.131.253
  • Secondary Time Server: conehead or grub
  • Time Zone: Bogota, Lima, Quito, Eastern Time (US & Canada) (GMT-05:00:00)

NOTE: Changing SNTP values will likely require an iLO reset.

Monitoring

SNMP

Management > SNMP Settings:

  • System location: ISB 118
  • System contact: nis-wifi-g@vt.edu
  • System role: Central on Prem
  • System Role Detail: Node 1, Node 2, ...
  • Disable SNMPv1
  • SNMPv3 Users:
    • Security Name: nisnmp
    • See password repo for credentials
    • User Engine ID: blank
  • SNMP Alert Destinations:
    • akips.nis.ipv4.vt.edu
    • Trap Community: blank
    • SNMP Protocol: SNMPv3 Inform
    • SNMPv3 User: nisnmp

Syslog

Management > Remote Syslog:

  • Enable iLO Remote Syslog
  • Remote Syslog Port: 514
  • Remote Syslog Server: akips.nis.ipv4.vt.edu

Disable iLO Federation

iLO Federation > Setup:

  • Delete the default group
  • Disable multicast options:
    • iLO Federation Management
    • Multicast Discovery

IPv6

IPv6 is not supported at all. There is no way to configure an IPv6 address. Not only that, but when configuring the network settings, we see:

Created symlink /etc/systemd/system/basic.target.wants/disable-ipv6.service → /etc/systemd/system/disable-ipv6.service.

smtp

Allowlist for mailrelay.smtp.vt.edu:

198.82.169.222,central.mobility.nis.vt.edu,"Central on Prem neo-central@vt.edu","NIS"
198.82.169.223,central-node-1.mobility.nis.vt.edu,"Central on Prem neo-central@vt.edu","NIS"
198.82.169.224,central-node-2.mobility.nis.vt.edu,"Central on Prem neo-central@vt.edu","NIS"
198.82.169.225,central-node-3.mobility.nis.vt.edu,"Central on Prem neo-central@vt.edu","NIS"
198.82.169.226,central-node-4.mobility.nis.vt.edu,"Central on Prem neo-central@vt.edu","NIS"
198.82.169.227,central-node-5.mobility.nis.vt.edu,"Central on Prem neo-central@vt.edu","NIS"

Parts for redundancy

iLO Administrator and firmware password

The iLO "Administrator" account uses a password derived from the baseboard serial number. This is done by the COP installation media. The same password is used for access to the firmware interface.

NOTE: This means that the serial numbers of the nodes are sensitive information! They are stored in the NEO password vault.

The script itself derives the password with the following commands (and some unnecessary file and variable creation...):

dmidecode -t baseboard \
  | grep Serial \
  | grep -o '[^ ]\+$' \
  | md5sum \
  | grep -Eo '^[^ ]+' \
  | cut -c1-8

We can simplify this to:

dmidecode -s baseboard-serial-number | md5sum | head -c 8
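The same derivation can be reproduced away from the node when only the serial number is at hand (for example, from the password vault). This is a minimal sketch, not the installer's code: the function name derive_ilo_password and the placeholder serial are made up for illustration, and it assumes the serial contains no spaces and that dmidecode ends its output with a single newline, which md5sum includes in the hash.

```python
import hashlib


def derive_ilo_password(serial: str) -> str:
    """Mirror `dmidecode -s baseboard-serial-number | md5sum | head -c 8`."""
    # md5sum hashes the trailing newline that dmidecode prints, so the
    # newline has to be part of the hashed input here as well.
    data = (serial.strip() + "\n").encode("ascii")
    # head -c 8 keeps the first 8 hex characters of the digest.
    return hashlib.md5(data).hexdigest()[:8]


if __name__ == "__main__":
    # "EXAMPLE123" is a placeholder, not a real serial number.
    print(derive_ilo_password("EXAMPLE123"))
```

Treat the output with the same care as the serial number itself.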

Managing the RAID from a live environment

HPE has a variation of secure boot enabled, so we cannot just boot to whatever we want. However, secure boot is just looking for something signed by Canonical... so just grab Ubuntu and be off. Other distros signed with common keys may or may not work, but COP is built on Ubuntu 18.04, so that is the least likely to cause issues.

Unlike the COP ISO, the Ubuntu image can be dd'd to a USB drive to create bootable media. iLO can also be used to mount virtual media to boot from.

Add HPE repositories

The ssacli utility allows us to reconfigure the RAID setup. The best way to get it is to add the Management Component Pack (MCP) repository from the HPE Software Delivery Repository (SDR).

/etc/apt/sources.list.d/mcp.list:

 # HPE Management Component Pack
deb https://downloads.linux.hpe.com/SDR/repo/mcp bionic/current non-free

Now, install the keys:

curl https://downloads.linux.hpe.com/SDR/hpPublicKey2048.pub | sudo apt-key add -
curl https://downloads.linux.hpe.com/SDR/hpPublicKey2048_key1.pub | sudo apt-key add -
curl https://downloads.linux.hpe.com/SDR/hpePublicKey2048_key1.pub | sudo apt-key add -

Then update the repositories:

sudo apt update

Convert array to RAID 10

This will take a long time. If building a new system, create a new array instead of migrating an existing one.

# ssacli
=> ctrl slot=0 ld 1 add drives=allunassigned
=> ctrl slot=0 ld 1 show status

   logicaldrive 1 (3.49 TB, RAID 0): Transforming, 0.83%

=> ctrl slot=0 ld 1 show status

   logicaldrive 1 (3.49 TB, RAID 0): Transforming, 0.83%

=> ctrl slot=0 ld 1 modify raid=1+0
=> ctrl slot=0 ld 1 show status

   logicaldrive 1 (3.49 TB, RAID 1+0): Transforming, 0.07%

=> ctrl slot=0 ld 1 show status

   logicaldrive 1 (3.49 TB, RAID 1+0): OK

=>

Build a new RAID 10 array

This is a destructive process, but much faster than migrating an array. It is necessary to install COP from an ISO afterwards.

# ssacli
=> ctrl slot=0 ld 1 delete
[confirm]
=> ctrl slot=0 create type=ld drives=allunassigned raid=1+0
=>

Drive replacement (RAID 0)

A failed drive in a RAID 0 array is catastrophic, thus re-installing COP from the ISO afterwards is required.

  • Physically replace the bad drive with a good one
  • Reboot the system
  • Press F9 during boot to enter System Utilities, a BIOS-like environment. You may need to press F1 to continue past the warning message (telling you a drive has failed and been replaced).
  • Select "System Configuration"
  • Select "Embedded RAID 1: HPE Smart Array P408i-a SR Gen 10"
  • Select "Array Configuration"
  • Select "Manage Arrays"
  • Select "Array A"
  • Select "List Logical Drives"
  • Select "Logical Drive 1 (...)"
  • Select "Re-Enable Logical Drive"
  • Confirm that you want to Re-Enable the Logical Drive. We are not expecting the data to be recoverable.
  • Exit the menus until you can exit the system utilities. Re-enabling the array does not count as a change, so there is no need to save.

Management

This is the stuff that helps us manage the wireless network. Various tools, automation, etc.

Down APs

Tools like AirWave or AKiPS will discover what APs are on the network and let us know when something goes down. This is good, but it doesn't tell us if the AP is expected to be down, replaced, or if a new AP has never come up. That is, it doesn't capture intent.

ATLAS is the authoritative source for intent and what is expected. The controllers are the authoritative source for what is reality.

Possible discrepancies

A non-exhaustive list of things that could be wrong:

  • An AP is down
    • Listed in Atlas
    • Not listed or down on the controller
  • Different AP is present
    • MAC does not match
  • AP was not removed
    • Not listed in Atlas (at least, not in the list of what is expected)
    • Is listed on the controller
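
The three cases above reduce to comparing two inventories. The following is a minimal sketch under stated assumptions: both sides are already loaded into simple records keyed by AP name, and the AP dataclass, the compare function, and the report keys are hypothetical names. No real ATLAS or controller API is used.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AP:
    name: str
    mac: str
    up: bool = True  # only meaningful for controller-side records


def compare(expected: list[AP], actual: list[AP]) -> dict[str, list[str]]:
    """Compare intent (expected) against reality (actual), keyed by AP name."""
    exp = {ap.name: ap for ap in expected}
    act = {ap.name: ap for ap in actual}
    report: dict[str, list[str]] = {"down": [], "mac_mismatch": [], "not_removed": []}
    for name, ap in exp.items():
        seen = act.get(name)
        if seen is None or not seen.up:
            # Expected, but missing or down on the controller.
            report["down"].append(name)
        elif seen.mac.lower() != ap.mac.lower():
            # Present, but a different physical AP.
            report["mac_mismatch"].append(name)
    # On the controller, but not expected.
    report["not_removed"] = sorted(set(act) - set(exp))
    return report
```

Keying on the AP name is a design choice for the sketch; keying on MAC address instead would turn a replaced AP into one "down" entry plus one "not removed" entry.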

Script

Here is a start to a script that does this comparison. Notably, it does not yet talk to ATLAS. Without real data, it is of limited use, even as a PoC.

AP Provisioning

AP Provisioning is automated with some code written by the NIS dev team. It is triggered in two different ways: on demand and scheduled.

Core

Information gathering

The code ingests the MAC address of an AP. It queries the MM to determine the AP's:

  • name
  • group
  • AAC

It then queries the AAC to get the AP's LLDP neighbor information, where it finds:

  • The name of the switch
  • The interface description

NOTE: For the AAC, the MM returns the IP address as used by the AP. This IP address is how the code connects to the AAC.

Generating the name and group

It uses the LLDP information to determine the building abbreviation and the HLINK, if it exists. This is used to determine the expected group and name.

The AP name takes the form BLDG-HLINK, where:

  • BLDG is the uppercase building abbreviation
  • HLINK is the HLINK (or link identifier for older installations)

The AP group takes the form apg-bldg where:

  • apg is a fixed prefix
  • bldg is the lowercase building abbreviation

Edge cases

  • The HLINK may not exist yet (this is particularly common in new installations). In this case, the AP's MAC address is used in place of an HLINK.
  • The AP may already be provisioned in a custom group. If the AP's current group is of the form apg-bldg-foo, where -foo is some suffix beginning with a -, then this is considered a match, and the program will not move the AP to a different group.
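
The naming rules and both edge cases can be sketched as follows. This is an illustration, not the dev team's code: the function names are hypothetical, and the format of the MAC address when it stands in for the HLINK is an assumption (it is passed through unchanged).

```python
from typing import Optional


def expected_name(bldg: str, hlink: Optional[str], mac: str) -> str:
    """BLDG-HLINK, falling back to the MAC when no HLINK exists yet."""
    suffix = hlink if hlink else mac
    return f"{bldg.upper()}-{suffix}"


def expected_group(bldg: str) -> str:
    """apg-bldg, with the building abbreviation in lowercase."""
    return f"apg-{bldg.lower()}"


def group_matches(current_group: str, bldg: str) -> bool:
    """True for apg-bldg itself and for custom groups of the form apg-bldg-foo."""
    base = expected_group(bldg)
    return current_group == base or current_group.startswith(base + "-")
```

Requiring the "-" separator keeps a group such as apg-squires from being treated as a custom group of the building squir.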

Creating a group

When the program provisions an AP, it checks to make sure that the AP group exists. If it does not, it creates a group at the region level (see Configuration Hierarchy) which looks like:

{
  "dot11a_prof": {
    "profile-name": "rpa-default"
  },
  "profile-name": "apg-bldg",
  "reg_domain_prof": {
    "profile-name": "rdp-blacksburg"
  },
  "virtual_ap": [
    {
      "profile-name": "vap-eduroam"
    },
    {
      "profile-name": "vap-vtopenwifi"
    }
  ]
}

Regulatory domain

The regulatory domain is chosen based on the AAC prefix.

  • If the AAC starts with col, bur, or res, then the RDP is set to rdp-blacksburg.
  • If the AAC starts with vtc, then the RDP is set to rdp-roanoke.
  • If the AAC starts with nvc, then the RDP is not set.
  • If the AAC starts with anything else, the group is not created.
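
Put together, group creation might look like the sketch below. It assumes the AAC is available as a hostname (the MM itself returns an IP address, so a lookup step is implied), that only the regulatory domain varies between sites, and that "not set" means the reg_domain_prof key is omitted. build_group is a hypothetical name.

```python
from typing import Optional

BLACKSBURG_PREFIXES = ("col", "bur", "res")


def build_group(aac_name: str, bldg: str) -> Optional[dict]:
    """Return the AP group payload, or None when no group should be created."""
    group = {
        "dot11a_prof": {"profile-name": "rpa-default"},
        "profile-name": f"apg-{bldg.lower()}",
        "virtual_ap": [
            {"profile-name": "vap-eduroam"},
            {"profile-name": "vap-vtopenwifi"},
        ],
    }
    name = aac_name.lower()
    if name.startswith(BLACKSBURG_PREFIXES):
        group["reg_domain_prof"] = {"profile-name": "rdp-blacksburg"}
    elif name.startswith("vtc"):
        group["reg_domain_prof"] = {"profile-name": "rdp-roanoke"}
    elif name.startswith("nvc"):
        pass  # RDP is deliberately left unset
    else:
        return None  # unknown prefix: do not create the group
    return group
```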

On Demand

  • AP boots
  • MD sends an SNMP trap to AKiPS
  • Provisioning app periodically (every 2s) pulls trap events from AKiPS (specifically, the host akips.nis.vt.edu)
  • waits 5 minutes
  • App looks up the AAC for that MD from the MM
    • probably with the show ap detail wired-mac xx:xx:xx:xx:xx:xx command (via api)

This is how APs are provisioned when they are deployed. This also fixes APs that are moved to a new location.

  • tool checks akips every 2s
  • events are added with a 5 min delay
  • 4 attempts at a 5 minute interval before giving up
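
The timing above can be sketched as a small schedule calculation. This reflects one reading of the notes (first attempt 5 minutes after the trap, then retries every 5 minutes, 4 attempts in total); the actual scheduler may differ, and attempt_times is a hypothetical name.

```python
from datetime import datetime, timedelta


def attempt_times(
    trap_time: datetime,
    delay: timedelta = timedelta(minutes=5),
    interval: timedelta = timedelta(minutes=5),
    attempts: int = 4,
) -> list:
    """Times at which provisioning would be tried for one trap event."""
    first = trap_time + delay
    return [first + i * interval for i in range(attempts)]
```

Under this reading, an AP that is still not provisionable 20 minutes after it booted is left for the scheduled run.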

Scheduled

The reconciler runs at 06:00 ET daily. It pulls the AP database from the MM. It builds a list of APs that are incorrectly provisioned and runs the core process on them. This is how we get APs to have the correct name when the HLINK is assigned after the AP is deployed.

This process is limited to 20 APs per day.

Work order process

From earlyb:

The provisioning piece doesn't talk to Atlas at all. There is a WAP inventory job that does talk to Atlas. I don't remember exactly what that job does, but I think it generates a report of mismatches between the network and Atlas.

Limitations

Also from earlyb:

The only thing I can think of is that the provisioner is unable to talk to any controller that only has an IPv6 address. The docker swarm where it's deployed apparently has some problem reaching those addresses. This may be resolved in the future when we shift where it's deployed. Or maybe not.

Logs

  • Currently in the ELK stack
  • log_aaa-* index
  • 1s precision, look at the timestamp in the log
  • instance.name:orca-job-prod_wap-provision-*
  • fields.group: laa.nis.docker

Compromised user account

We occasionally get a request from ITSO to disable a user account and disconnect all associated network sessions. This is the procedure on how to do that for Wi-Fi sessions.

Find active sessions

Log into the mobility conductor (MC, previously called mobility master (MM)) via ssh, and use the show global-user-table command:

(isb-mm-1) [mynode] #show global-user-table list name PID

Global Users
------------
    IP                                  MAC            Name              Current switch  Role   Auth    AP name           Roaming   Essid    Bssid              Phy        Profile      Type  User Type
----------                         ------------       ------             --------------  ----   ----    -------           -------   -----    -----              ---        -------      ----  ---------
2607:b400:24:0:1234:5678:9abc:def  c6:ea:aa:11:22:33  PID@vt.edu         172.16.1.11     ur-vt  802.1x  SQUIR-238BA1077Q  Wireless  eduroam  48:2f:6b:a3:35:40  2.4GHz-HE  aaa-eduroam  N/A   WIRELESS
fe80::ab:cdef:123:4abc             c6:ea:aa:11:22:33  PID@vt.edu         172.16.1.11     ur-vt  802.1x  SQUIR-238BA1077Q  Wireless  eduroam  48:2f:6b:a3:35:40  2.4GHz-HE  aaa-eduroam  N/A   WIRELESS
2607:b400:24:0:123:4567:89ab:cdef  c6:ea:aa:11:22:33  PID@vt.edu         172.16.1.11     ur-vt  802.1x  SQUIR-238BA1077Q  Wireless  eduroam  48:2f:6b:a3:35:40  2.4GHz-HE  aaa-eduroam  N/A   WIRELESS
172.30.123.195                     c6:ea:aa:11:22:33  PID@vt.edu         172.16.1.11     ur-vt  802.1x  SQUIR-238BA1077Q  Wireless  eduroam  48:2f:6b:a3:35:40  2.4GHz-HE  aaa-eduroam  N/A   WIRELESS

Total entries = 4

Searching by the PID will return results for both PID (e.g., registered devices) and PID@vt.edu (e.g., eduroam).
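
When a user has many devices, pulling the unique MAC addresses out of that output by hand is error prone. A minimal sketch, assuming the table has been saved as text and that the column layout matches the example above (MAC in the second column, "Current switch" in the fourth); the function name sessions is hypothetical.

```python
from __future__ import annotations

import re

MAC_RE = re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}", re.IGNORECASE)


def sessions(output: str) -> dict[str, set[str]]:
    """Map each client MAC to the set of MDs ("Current switch") it appears on."""
    found: dict[str, set[str]] = {}
    for line in output.splitlines():
        fields = line.split()
        # Select by column position: the Bssid column also looks like a MAC,
        # so scanning the whole line for MAC-shaped strings would be wrong.
        if len(fields) >= 4 and MAC_RE.fullmatch(fields[1]):
            found.setdefault(fields[1].lower(), set()).add(fields[3])
    return found
```

The MD addresses are collected as well because they are useful when a delete has to be repeated directly on an MD.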

Terminate the sessions

For each unique MAC address listed in the previous step, use the aaa user delete command to end the sessions. Note that deleting by the username from the MC is not currently supported.

(isb-mm-1) [mynode] #aaa user delete name PID
This command is not currently supported

(isb-mm-1) [mynode] #aaa user delete mac c6:ea:aa:11:22:33
Users will be deleted at MDs. Please check show CLI for the status
(isb-mm-1) [mynode] #show aaa user-delete-result

Summary of user delete CLI requests !
Current user delete request timeout value: 300 seconds

aaa user delete mac c6:ea:aa:11:22:33  , Overall Status- Response pending , Total users deleted- 0
MD IP : 172.16.1.11, Status- Complete , Count- 0
MD IP : 172.16.1.12, Status- Complete , Count- 0
MD IP : 172.16.1.13, Status- Complete , Count- 0
MD IP : 172.16.1.14, Status- Complete , Count- 0
MD IP : 172.16.1.141, Status- Complete , Count- 0
MD IP : 172.16.1.142, Status- Complete , Count- 0
MD IP : 172.16.1.143, Status- Complete , Count- 0
MD IP : 172.16.1.144, Status- Complete , Count- 0
MD IP : 172.17.1.11, Status- Complete , Count- 0
MD IP : 172.17.1.12, Status- Complete , Count- 0
MD IP : 172.17.1.13, Status- Complete , Count- 0
MD IP : 172.17.1.14, Status- Complete , Count- 0
MD IP : 0.0.0.0, Status- Response pending , Count- 0
MD IP : 0.0.0.0, Status- Response pending , Count- 0
MD IP : 172.16.236.151, Status- Complete , Count- 0
MD IP : 172.16.236.152, Status- Complete , Count- 0

Notice that in this example the VTC controllers, which connect over IPv6 (shown as MD IP : 0.0.0.0), still have the response pending. This seems to be a bug. To work around it, log into the appropriate MDs (see the "Current switch" column in the global user table) and run the same command.

(col-md-1) #aaa user delete mac c6:ea:aa:11:22:33
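
To spot MDs that never report back, the show aaa user-delete-result output can be filtered for anything that is not Complete. A minimal sketch, assuming the line format shown in the example above; pending_mds is a hypothetical name.

```python
import re

RESULT_RE = re.compile(r"^MD IP : ([^,\s]+), Status- (.+?) , Count- (\d+)")


def pending_mds(output: str) -> list:
    """Return the MD IPs whose per-MD status is anything other than Complete."""
    pending = []
    for line in output.splitlines():
        match = RESULT_RE.match(line.strip())
        if match and match.group(2).strip() != "Complete":
            pending.append(match.group(1))
    return pending
```

Because of the bug described above, a pending entry shows up as 0.0.0.0, so it only says that some IPv6-connected MD still needs a manual delete, not which one.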

Enhancements

These are in no particular order. Small stuff listed below. Bigger items get their own pages (see left).

blacklist script

  • update local script with no ap ap-blacklist-time command
  • potentially work with devs to create orchestra app

Open Wireless Encryption (OWE)

  • Tested as working on AP-225
  • Not actually supported on AP-2xx
  • update clearpass to expect _owetm_ prefix and _951c89ea suffix
  • disable in VTC, due to high number of existing SSIDs

Automate AP provisioning

  • Related AUA
  • Need to set up a trap listener. Preferably, the web app would do this itself, instead of us translating SNMP to REST before sending to the web app.
  • Traps should be sent over v6.

PAPI authentication

See ArubaOS 8.7.0.0 User Guide page 783.

Split management plane and data plane

  • Make it so the controllers can only be managed from the management network.
  • Make sure we are not poking holes into the management network in the process.
  • Potentially the same with CPPM

Central on Prem

Mostly a to-do list, but also just ideas we might want to implement in the future.

System monitoring

  • Drive failure
  • PSU failure (not tested, but should work the same way as drive failures)
  • Temp alerts (not tested, but should work the same way as drive failures)
  • System resources
    • CPU load
    • Memory usage
    • Disk IO

iLO

  • snmp
  • syslog
  • add os agent? it may help with disk monitoring and such (we get what we want without this)
  • LDAP login

iLO network

  • IPv6
  • remote console? not what I thought. Would have been an extra, anyways.
  • disable iLO federation
  • document iLO config / setup

Misc

  • AKiPS groups
    • Nodes
    • Cluster
    • iLO

EAP-TLS Project

Pre-project discussion

Timeline

  • Target production date: Fall 2022
  • Transition: Summer II - Fall 2022
    • Maybe a transition period
    • Maybe a transition point
  • Onboard in service: Jan 2022

Transition options

Hard cutover

Dual profile

Dual auth

Draft of project scope

  • Stakeholders
  • Major milestones
  • Anticipated resources
  • Budget
  • What work needs to be set aside to get this done

External Resources

  • Liberty University
    • TJ Norton
    • In process of switching to EAP-TLS
    • Onboarding tool is SecureW2
  • UNC (Ryan Turner)

blockers

  • On-boarding tool
  • CA for users
  • CA for auth server

Questions:

  • Do we want on-boarding as a cloud SaaS?
  • Do we care if the pki is in the cloud?
  • Define what the cert actually asserts
    • Creating a trust relationship between a device and the entity VT
    • Associating a user/entity/org with that device
  • Define a CPS
    • Do we have a crl/ocsp? (prolly not)
  • What attributes does the root CA need?

Endpoint management

We want to be able to integrate with:

  • JAMF (macOS)
  • InTune/AD (Windows)
  • Bigfix
    • macOS
    • Windows
    • Optional

Challenges

Certificate management

Something needs to issue the client and server certs. InCommon is ill-suited for both. See the preproject page for more discussion.

Onboarding

A tool is needed that works well for both BYOD and managed devices. These may not be the same tool.

Apple CNA

Apple uses a limited browser for captive portals. This can interfere with the profile provisioning tool.

Relevant educause discussion

On-boarding tool

Objective

On-board a device to the VT wireless network. This establishes trust that a device belongs to a particular entity (user or organization).

Necessity

fn main() {
    let project = 42;
    let tool = Tool {
        works: true,
        easy: true,
    };

    if !(tool.works && tool.easy) {
        drop(project);
    } else {
        // println!("https://www.youtube.com/watch?v=ZXsQAXx_ao0");
        println!("Let's go!");
    }
}

struct Tool {
    works: bool,
    easy: bool,
}

Values

Roughly in order:

  1. Interoperable: cross-platform across all major platforms
  2. Usable: easy to use
  3. Robust: hard to get wrong
  4. Maintainable: easy to update to keep up with new demands
  5. Interoperable: integrate with other tools
  6. Supportable:

On-boarding Tool Requirements

These are the things we will be looking for in deciding on a tool. Obviously, cost is also a consideration.

MUST have

Tools that do not meet these criteria will not be considered. These are the things that we would rather not deploy EAP-TLS than compromise on.

front end

  • Platform support
    • Windows 10
    • Windows 11
    • macOS
    • iOS
    • Android
      • including Android 11, December 2020 patch
    • manual install (Linux devices)
  • Easier to use than:
    • non-sponsored guest (taking into account re-registering every day)
    • Current PEAP/MSCHAPv2 process (with unknown password)
  • SSO integration
  • remove and/or replace old profiles

back end

  • Per device certs
  • Certs issued to:
    • User
    • Organization
  • Set up correct trust of server
    • Set specific CA
    • Set leaf CommonName / subjectAltName
  • Stupidly long client cert lifetime (e.g., 50 years)
  • No cloud PKI
  • Ability to expand to external CA

SHOULD have

We would rather deploy without these than not deploy, but we aren't going to be happy about it.

front end

  • Easier to use than:
    • non-sponsored guest (not taking into account re-registering every day)
    • Current PEAP/MSCHAPv2 process (with known password)
  • vt.edu URL

back end

  • Internal CA (with an intermediate root)
  • ECC certs (P-256, or ed25519)

Low priority niceties

Extras that in particular will make future expansions of the service better.

  • Passpoint support
  • AD integration
  • Multiple root CA support
  • ed25519 support

Contenders

SecureW2

Based on feedback from peers, this is the most likely candidate. It works well and is a reasonable cost.

Link

ClearPass Onboard

Again, based on the feedback of peers, this seems to be an excellent product, possibly better than SecureW2, but it is very expensive. Even the vendor admits that it is priced too high.

Nonetheless, given we already have a CPPM instance running, it is worth taking a look at it.

Honorable mentions

eduroamCAT and geteduroam

Notably, it does not seem to support macOS, which makes it a non-starter.

Open-source, community-driven project, with all the good and bad that comes with that. It would definitely be more effort to set up, probably more than we care to do.

Links:

Ruckus XpressConnect

Notably, we used to run XpressConnect before ditching it in favor of... nothing (with eduroamCAT as a backup). It is not likely that we are going to move back to it.

Sectigo Mobile Certificate Manager

Middleware is considering this as an option for an internal CA. It appears to have a certificate provisioning component as well.

Concerns:

  • Middleware seems to be leaning toward using AWS as a CA service.
  • It seems prudent to not tie the on-boarding tool to the CA we are using.
  • It is not clear if this will work for non-mobile platforms (e.g., Windows, macOS)

Reference: pdf (Sectigo).

Authentication

Auth from a cloud service?

No. Right now our cloud exit strategy is "don't exit". The ongoing cost to maintain the eduroam authentication service is fantastically little. That makes the tradeoff between up-front engineering time and a perpetual bill from a service provider (not to mention soft vendor lock-in) an easy one.

IPv6

The goal is to be able to remove any legacy IP address from the mobility infrastructure.

The expectation is that the MMs and the APs will be able to hit this mark. MDs will need a legacy IP address on VLANs that are serving a captive portal to clients, and possibly to assist with multicast discovery.

Current status

                                      bur   col   res   nvc   vtc
dns                                   4     4     4     4     4,6
ntp                                   4     4     4     4     4,6
syslog                                4     4     4     4     4
snmp traps                            4     4     4     6     6
cppm auth                             4     4     4     4     4
cp redirect                           4     4     4     4,6   4,6
user interface (bfv)                  4     4     4,6   4,6   4,6
user interface (guest)                4,6   4,6   4,6   4,6   4,6
mgmt interface                        4,6   4,6   4,6   4,6   4,6
cluster (md-md)                       4     4     4     4     4
mm-md (masterip cmd, ipsec tunnel)    4     4     4     4     4
AP                                    4     4     4     4     4
captive portal server group           4     4     4     4     4

Post-Mortem

Motivation

The primary goal of the out-of-band (OOB) management network is that the devices are remotely manageable in the case of disaster, when the rest of the network is not functional, and thus service can be restored.

The secondary goal/benefit of the OOB management network is for security. Isolating management to only the OOB network significantly reduces the attack surface of the equipment.

Counter motivation

The first goal is irrelevant, because the Wi-Fi network is an overlay. If the equipment is not reachable it is because the underlay is not working, and we'll be fixing that first. Two notes here:

  • Administrators of the Wi-Fi network need some kind of network connectivity that isn't the VT Wi-Fi network, which is trivial. A wired adapter, home ISP, mobile hotspot, any of these will do.
  • To address the case of a device with an unusable network configuration (e.g., the out of box config), they still need some kind of non-network access (i.e., serial), though that access can be reachable through network resources. Indeed, serial connection accessed through the OOB network is already part of our standard setup.

The second goal strongly implies (though doesn't strictly require) that the management of a device is isolated to that device. This is not the case with the Wi-Fi infrastructure. The configuration is all done on the MC, which is pushed to the MDs, which is in turn pushed to the APs.

More critically, there is a need for the management to have a clear separation from the production and support network. An overlay design does not lend itself to this, and sure enough, it does not exist in the wireless controllers. In particular, the controllers do not have multiple routing tables, which makes it extremely difficult if not impossible to separate the different network planes.

In particular:

  • user traffic is carried to the MD inside a tunnel
  • MDs in a cluster build a tunnel and have a host specific route to each other
  • MDs build a tunnel and have a host specific route to the MCs

This means any wireless user can reach the management of the MC and of any MD in the cluster they are connected to. This could be stopped with a client ACL, but it must:

  • be applied to every role
  • enumerate every address (including IPv6 link local!) on every controller

This is obviously error prone and a fair bit of work, all to accomplish a secondary goal. And we still end up with a design that gives only weak assurance of this goal (e.g., have we found every path into the management plane? Probably not.)

Note: "reach" here means reaching the L4 management interface. Obviously, L7 still needs auth(z).

Out-Of-Band Management

Logical Diagram

Logical diagram of the wireless management connections

Data paths

  • MDs join clusters with the in band management address
    lc-cluster group-profile "lcc-foo"
        controller-v6 <blue> priority 255 mcast-vlan 0 vrrp-ip-v6 <blue> vrrp-vlan <blue> group <#>
    
  • APs connect to cluster on in band management
  • In band mgmt and user networks are trunked over the same port channel.
  • MD controller IP is in band mgmt
    masteripv6 ... interface-f vlan-f <blue>
    ...
    controller-ipv6 vlan <blue> address <blue>
    
  • mgmt auth (i.e., netadmin) for MDs happens on OOB mgmt
  • user auth (e.g., eduroam) happens on in band mgmt
  • MC-MD management happens inside the IPsec tunnel that gets built over the in band management.

Questions

  • How do we prevent mgmt login from non-OOB mgmt networks? If we can't do this, we haven't actually done anything.
    • Force management to ports 22 and 4343, and only allow these on OOB
      • AP-MD and MD-MC management is done through a tunnel, thus not stopped by these ACLs. This is good for the purposes of getting things to work, but kinda violates the principles we are after to begin with.
      • Captive portals use ports 80 and 443 and we can force HTTPS management to exclusively 4343. This lets us expose a L7 distinction in L4. Again, this functions, but eww.
  • How many captive portal users are legacy only? Do we need this legacy address?
  • Can we do no legacy addresses?
    • No. At the least, we need legacy addresses for RAPs.
  • Can we add members to a cluster by an IP that is not the controller IP?
    • Yes
  • Do we want to keep a legacy address on in band mgmt to give us time to migrate APs? (And to have fewer changes at once)
    • Yes. Let's make fewer changes at once.

TODO

conehead/grub

MM

Nothing?

MD

  • Wire up MDs on OOB
  • Address MDs on OOB
  • Apply static route to OOB network
  • Apply ACLs to limit port 4343 and 22 to only be allowed on the OOB side [NISNETR-398]
  • Change asr-conehead-netadmin to use the OOB v6 address on conehead [NISNETR-399]
  • Change asr-grub-netadmin to use the OOB v6 address on grub [NISNETR-399]
  • Figure out initial setup
  • Remove remaining legacy addresses

Config changes

The MM is configured exactly the same as before. The MDs have additional configuration (col-md-5.dev as an example here):

interface gigabitethernet 0/0/0
    no shutdown
!
vlan 301
    description oob-mgmt
!
interface port-channel 1
    gigabitethernet 0/0/0
    switchport access vlan 301
    switchport mode access
    trusted
    trusted vlan 1-4094
!
interface vlan 301
    operstate up
    ipv6 address 2607:b400:e1:4000:0:0:0:15/64
!
ipv6 route 2607:b400:e1:0:0:0:0:0/48 2607:b400:e1:4000:0:0:132:1

Old ideas

These are things we are currently deciding against. They are noted here in case they turn out to be a good idea or lead to other useful ideas.

MC-MD connection:

  • Static routes over OOB
  • IPsec tunnel between MC and FW

Remote APs

Overview

Also known as a RAP.

Steps:

  1. RAP IP pool on /mm
  2. Public addresses
  3. DNS
  4. Cluster

IP Pool

The RAPs use an IP address inside the IPSec tunnel. The scope of this address is limited to the AP and MD, which makes it a good candidate for link local addressing. Each RAP uses 1 address, so make sure the pool has at least as many addresses as there are RAPs.

It is configured as a lc-rap-pool at /mm. By convention, we use the prefix rapp-.

CLI

Configure (at /mm):

lc-rap-pool rapp-rap 169.254.10.10 169.254.10.50

Verify:

(isb-mm-1) [mm] #show lc-rap-pool

IP addresses used in pool rapp-rap
         169.254.10.10-169.254.10.21

IPv4 pool : Total - 12 IPs used - 29 IPs free - 41 IPs configured

IPv6 pool : Total - 0 IPs used - 0 IPs free - 0 IPs configured
LC RAP Pool Total Allocs/Deallocs/Reserves : 13/0/0
LC RAP Pool Allocs/Deallocs/Reserves(succ/fail) : 12/0/(0/0)
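
The "41 IPs configured" figure follows directly from the range, which makes it easy to check a proposed pool against the expected number of RAPs before committing it. A minimal sketch using Python's standard ipaddress module; pool_size is a hypothetical helper name.

```python
import ipaddress


def pool_size(start: str, end: str) -> int:
    """Number of addresses in an inclusive start-end range."""
    first = ipaddress.ip_address(start)
    last = ipaddress.ip_address(end)
    if first.version != last.version or last < first:
        raise ValueError(f"invalid range: {start} - {end}")
    return int(last) - int(first) + 1


if __name__ == "__main__":
    # The pool configured above.
    print(pool_size("169.254.10.10", "169.254.10.50"))
    # Both ends should also be link-local addresses.
    print(ipaddress.ip_address("169.254.10.10").is_link_local)
```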

API

Config:

{
  "lc_rap_pool":[
    {
      "pool_end_address": "169.254.10.50",
      "pool_name": "rapp-rap",
      "pool_start_address": "169.254.10.10"
    }
  ]
}

Running the show command via API does not return (meaningfully) structured data (last tested on AOS 8.7.1.2).

Public addresses

The key requirement is n public legacy (IPv4) addresses for n controllers in the cluster.

Documentation suggests that the public address could exist on a NAT device. We've opted to set it up directly on the MD. This is done just like any other vlan interface.

It doesn't make any sense to use IPv6 with the RAP service.

  1. If we knew we had IPv6 connectivity from the remote location, we could just set up the AP as a campus AP (CAP) with CPSec. Improved RAP discovery with Aruba Activate may be a compelling reason to go with a RAP anyway. We haven't yet gotten that far with the RAP setup, though.
  2. Too many ISPs still offer legacy-only connectivity.

Also, RAPs cannot use a VRRP address to connect to the cluster, so don't bother setting up an AP discovery VIP.

DNS

  1. RAPs must look for the MDs by DNS (since VRRP isn't an option)
  2. VT uses the address rap.mobility.nis.vt.edu
  3. This name must resolve to each of the public addresses of the MDs in the cluster.
  4. The MDs take care of load balancing once the RAP has connected, so any method DNS uses (round-robin, ordered list, etc.) is fine.

$ dig +short rap.mobility.nis.vt.edu
198.82.171.142
198.82.171.141

Cluster

The only extra step here is to provide the RAP external IP.

Remember to follow the usual clustering steps as well (vlan excludes, join the md to the cluster, etc)

CLI

(isb-mm-1) [rap] #show configuration committed | begin lcc-
lc-cluster group-profile "lcc-col-rap"
    controller 172.16.1.31 priority 255 mcast-vlan 299 vrrp-ip 172.16.1.41 vrrp-vlan 299 group 0 rap-public-ip 198.82.171.141
    controller 172.16.1.32 priority 128 mcast-vlan 299 vrrp-ip 172.16.1.42 vrrp-vlan 299 group 0 rap-public-ip 198.82.171.142
!

API

{
  "cluster_prof": [
    {
      "cluster_controller": [
        {
          "group_id": 0,
          "ip": "172.16.1.31",
          "mcast_vlan": 299,
          "prio": 255,
          "rap_public_ip": "198.82.171.141",
          "vrrp_ip": "172.16.1.41",
          "vrrp_vlan": 299
        },
        {
          "group_id": 0,
          "ip": "172.16.1.32",
          "mcast_vlan": 299,
          "prio": 255,
          "rap_public_ip": "198.82.171.142",
          "vrrp_ip": "172.16.1.42",
          "vrrp_vlan": 299
        }
      ],
      "profile-name": "lcc-col-rap",
      "vrrp_info": {
        "vrrp_id": 240,
        "vrrp_passphrase": ""
      }
    }
  ]
}

Monitoring

Ignore the colors. Splunk picks the colors, so red might mean accept or some other nonsense. Make sure you look at the legend.

eduroam

eduroam splunk dashboard

Row 1

  • Overall distribution of requests.
  • This is sourced from the authentication servers.
  • Time selected from the "Recent time" picker.

Row 2

  • Outcome ratios broken down by cluster
  • Sourced from the authentication servers (FreeRADIUS).
  • Time selected from the "Recent time" picker.
  • Timestamps of these logs are based on when the server has a response prepared to send, not when it is actually sent. Notably, rejects get a 1s delay (by design).

Row 3

  • Outcome ratios broken down by cluster.
  • Sourced from the controllers.
  • Time selected from the "Recent time" picker.
  • A reject log is generated from the dot1x-proc process.
  • An accept log is generated from the authmgr process.
    • log generated when an entry is added to the user table
    • log per IP address, not per authentication request.
    • Typically 3-4 times as many accepts compared to row 2.
  • A device that gets an accept, but is unable to get an IP address is not logged from the controller's perspective.

Row 4

  • Top talkers
  • Sourced from the authentication servers.
  • Time selected from the "Top time" picker.

ClearPass (CPPM)

ClearPass splunk dashboard

  • Due to MAC auth, it is normal for there to be far more rejects than accepts.
  • Extraordinarily few rejects are actually sent. Instead devices are "rejected" by not assigning a role.
  • Web auth happens after the user gets an IP address.

Left column

  • Outcome ratios broken down by cluster.
  • Sourced from the controllers.

Right column

  • Outcome ratios broken down by cluster.
  • Sourced from the authentication servers (CPPM).
  • For more details on recent events, check the access tracker in CPPM.

Export CPPM guest accounts (cppm 6.6)

This is all done from the Guest side of CPPM.

Enable viewing passwords

  • Go to Configuration > Guest Manager
  • Enable the 'Password Display' option to view guest account passwords.

Customize default export view

  • Go to Guests > Export Accounts > Customize default export view
  • Look for the field password in the list. If it is not there, click 'Add Field'.
  • In the "Field Name" drop box, select "password".
    • Optionally, set the "Rank"
  • Save Changes
  • Use this view

Export the data

  • On the export page (Guests > Export Accounts), select what kind of export you want and save the file.

Unsorted images of the process

Factory reset CPPM

Everything is from the serial console, logged in as appadmin.

Save licensing info

[appadmin@cppm]# show license
-------------------------------------------------------
Application              : ClearPassPlatform
License key              : -----BEGIN CLEARPASS PLATFORM LICENSE KEY-----
[snip!]
-----END CLEARPASS PLATFORM LICENSE KEY-----
License key type         : Permanent
License added on         : 2022-03-08 18:55:04
Validity                 : <not applicable>
Customer id              : [snip!]
Licensed features        : <not applicable>

=======================================================

The license key may look like a base64 encoding with a header/footer like above, or it might be formatted similar to a Windows license key.

Whatever the case, grab all the output and keep it somewhere safe.

Wipe the database

[appadmin@cppm]# cluster reset-database -f

The -f option (think --force) will wipe any local IP entries in the database, as well as licensing.

Reset and Reboot

[appadmin@cppm]# system factory-reset

According to TAC, this does something close to resetting the database without -f. Notably, it also reboots the box and takes you to the initial setup wizard, so it is probably a good starting place.

Note that after the reboot, the login screen may display a message about upgrading and warning not to make any config changes. Press Enter occasionally until that message no longer shows before starting. It will take several minutes.

Guidelines

This is a collection of the less technical side of things. Policies, procedures, conventions, and the like are all collected here.

Administrator Access

Credentials

Password authentication is handled through:

  • netadmin RADIUS instance
  • single local account for backup

Key authentication is handled through:

  • local accounts

Roles

  • ArubaOS:
    • There is a predefined, uneditable list of roles.
    • Local accounts cannot be created without a role.
    • RADIUS accounts set the role with Aruba-Admin-Role VSA.
      • If the VSA is missing, then a default role is applied.
        • The default role is set in the "Management Authentication Profile".
        • If no default role is configured, the default is root.
        • API: .mgmt_auth_profile.mgmt_default_role.aaa_auth_mgmt_default_role
        • CLI: aaa authentication mgmt default-role
      • If the VSA is invalid, access is denied.
  • Airwave:
    • Roles can be created and edited.
    • Local accounts cannot be created without a role.
    • A RADIUS account uses the Aruba-Admin-Role VSA (same as ArubaOS).
      • If the VSA is missing, access is denied.
      • If the VSA is invalid, access is denied.
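
The role rules above can be summarized as a small decision function. This is only a sketch for reasoning about the behavior, not vendor code: the function name, the platform labels, and passing in the set of valid roles are all assumptions made for illustration.

```python
def resolve_admin_role(platform, vsa_role, valid_roles, default_role="root"):
    """Return the role a RADIUS login would get, or None if access is denied.

    platform: "arubaos" or "airwave" (hypothetical labels).
    vsa_role: value of the Aruba-Admin-Role VSA, or None if the VSA is missing.
    valid_roles: role names that exist on the platform.
    default_role: ArubaOS management default role (root when unconfigured).
    """
    if vsa_role is None:
        # Missing VSA: ArubaOS falls back to the default role, AirWave denies.
        return default_role if platform == "arubaos" else None
    if vsa_role not in valid_roles:
        # Invalid VSA: both platforms deny access.
        return None
    return vsa_role
```

The interesting case is the first branch: on ArubaOS a missing VSA silently grants the default role, which is why the default matters so much.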

netadmin accounts

Role config:

  • The default role is set to read-only at the highest possible nodes.
    Rationale:
    • Damage control in the case of a misconfigured RADIUS account or ArubaOS behavior change.
    • All controllers are descendants of these two nodes.
  • All RADIUS accounts must have the Aruba-Admin-Role set.
    Rationale:
    • Implicit authorization is confusing and makes it easy to miss mistakes.
  • Accounts that should not have access to the wireless controllers should use the value deny, or a role that is exclusive to AirWave.
    Rationale:
    • Not all netadmin accounts should have controller access.
    • Some users need access to AirWave, but not the controllers.
    • A bogus value is the only way to deny access to a netadmin account.
    • A consistent, clear value makes for easy auditing.
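
Because every account is expected to carry an explicit Aruba-Admin-Role, an audit can be a simple pass over the account list. The sketch below assumes the accounts have already been exported into a mapping of username to VSA value (None when unset); that mapping and the function name are hypothetical, not a description of how the netadmin RADIUS instance stores its data.

```python
def audit_netadmin_roles(accounts):
    """Split accounts into (missing, denied) by their Aruba-Admin-Role value.

    accounts: hypothetical mapping of username -> VSA value (None if unset).
    """
    # Accounts with no VSA violate the "must have the role set" rule.
    missing = sorted(user for user, role in accounts.items() if not role)
    # Accounts explicitly set to "deny" have no controller access.
    denied = sorted(user for user, role in accounts.items() if role == "deny")
    return missing, denied
```

Anything in the first list would fall through to the default role on ArubaOS and should be fixed.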

Who:

  • Members of the Network Engineering and Operations (NEO) team have full access.
  • Support staff may have read-only access.
    • This is approved by the wireless team lead or Network Operations Manager. Verbal approval is fine.
  • Automation has the least access possible for its tasks.

Local accounts

Config:

To view local users via API, check:

  • .mgmt_user_cfg_int
  • .mgmt_user_ssh_pubkey
  • .mgmt_user_web_cacert

Or from the command line:

  • show mgmt-user
  • show mgmt-user ssh-pubkey
  • show mgmt-user webui-cacert

Note that these lists partition[1] the local users. That is, show mgmt-user will not show users with SSH pubkey access.

admin user

This account can be created while setting up a controller. We opt to do so, as it eases the painful process of setting up an MD.

If the account is created on setup:

  • The username is admin.
  • The password is set by the engineer.
  • ArubaOS doesn't give a choice on either of these.

The account is created at the device level of the config hierarchy, so it overrides any other config that may be set. This creates a management headache, so we opt to remove the account after the MD connects to the MM.

The account on the MMs is a special case. Aruba, in their infinite wisdom, does not allow it to be deleted, nor the role changed. We opt to set a randomly generated password and throw it away. This effectively disables the account.

nis user

  • This account is the backup in case network connectivity is lost.
    Rationale:
    • Entropy happens
  • It is configured at the highest possible nodes.
    Rationale:
    • Entropy happens
    • Centralized config makes for easy password changes
  • Role is root.
    Rationale:
    • Full access is required to make network config changes

Server settings

Telnet

Telnet is awful and is rightfully disabled by default. We leave it disabled. Unfortunately, verifying it is still disabled is a little tricky. This config is not part of the JSON that can be retrieved from the API. Instead we must either:

  • run the show configuration command from ssh (not the API) for each node
  • run show telnet on each device directly (either ssh or API).

To do the latter with the python library, do something like:

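import arubaos  # domain and creds are assumed to be defined already
mm = arubaos.MobilityMaster(f"isb-mm.{domain}", creds)
for host in [md.name for md in mm.mds()]:
    # Print the telnet status reported by each MD.
    print(host, arubaos.Controller(f"{host}.{domain}", creds).show("telnet")["_data"])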

Don't forget to check both MMs as well.

SSH

The SSH issues with actual exploits in the wild cannot be configured incorrectly. There are a few knobs, closer to theoretical weaknesses, that we opt to tighten up:

  • DSA < RSA
  • CBC < CTR
  • SHA1 < SHA256 (used in an HMAC; when used for signing, it is more serious)

In the API:

{
  "aaa_ssh_cipher": {
    "cipher_suite": "aes-cbc"
  },
  "aaa_ssh_mac": {
    "hmac-sha1": true,
    "hmac-sha1-96": true
  }
}

Again, we find that disabling DSA (.aaa_ssh_dsa) is missing from the config pulled via API.
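
A check for these settings can be a plain dictionary comparison. The sketch below assumes the JSON above is the desired state and that node_config is the config JSON for one node, already retrieved and parsed into a dict; the function and constant names are made up for illustration. Since the DSA setting is not in the pulled JSON, it is deliberately not checked here.

```python
EXPECTED_SSH = {
    "aaa_ssh_cipher": {"cipher_suite": "aes-cbc"},
    "aaa_ssh_mac": {"hmac-sha1": True, "hmac-sha1-96": True},
}


def ssh_hardening_gaps(node_config, expected=EXPECTED_SSH):
    """Return the expected objects that are absent or different in node_config."""
    return {
        name: want
        for name, want in expected.items()
        if node_config.get(name) != want
    }
```

An empty result means the node matches; anything else names the objects to look at.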

Audit trail

Commands run on the controllers, via SSH or API, can be found with the Splunk report ArubaOS command audit.

[1] Using the set theory definition of the word here.

Upgrades

Procedure

All production upgrades MUST be documented in the Engineering Change Order (ECO) app (or its replacement), and follow the normal ECO procedure.

All upgrades should be tested in the dev environment before pushing to production.

When to upgrade

ArubaOS upgrades do not occur on a regular schedule. Rather, they happen on an as-needed or as-available basis. Several things can motivate an upgrade. Roughly in order of priority:

  • Security fixes
  • Stability fixes
  • New features
  • Incremental update available

Security and stability are obviously the top priorities when providing a network service. Similarly, new features allow us to provide better or new services.

When there is a security upgrade, the system should be upgraded ASAP; the target is within the week. Of course, this depends on the severity of the vulnerability. For example, a CVSS score of 9+ may motivate an upgrade outside the normal change window to expedite the fix. A CVSS score of 3 may wait until after a maintenance restriction window (such as one due to semester startup or finals).

Stability fixes should be implemented in the next change window or two, pending testing of the code.

New features can wait the longest before rollout. It is more important that the system be stable and predictable than have the shiniest new feature.

Incremental updates should be implemented within 10 days of release, but not during a maintenance restriction.

What to upgrade to

Staying on the latest release within a code train has allowed us to be patched against security vulnerabilities before they are announced. When we lag behind, we have hit stability bugs that are already fixed in newer releases.

Aruba has two public release types: "Standard" and "Conservative". The conservative release is the more stable of the two.

At the time of this writing, we are on the 8.7 train, which is a standard release. It has a few key features:

  • AP support
    • RAP 500 series
    • AP 560 series
  • IPv6
    • MM/MD connection
    • clustering
    • AirWave

Conservative release

This is the generally preferred release. Unless there are known issues with the newer version, always go to the latest version.

Standard release

If we are already on a standard release (as we are at the time of this writing), stay within the same major/minor version (e.g., 8.7.0.1 to 8.7.1.2 is good).
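
The "same major/minor" rule is easy to check mechanically. A minimal sketch, assuming version strings are dot-separated as in the examples here (the function name is made up):

```python
def same_train(current, candidate):
    """True if both versions share the same major.minor (e.g., 8.7)."""
    return current.split(".")[:2] == candidate.split(".")[:2]
```

For example, same_train("8.7.0.1", "8.7.1.2") is true, while a move from 8.7.1.6 to 8.10.0.6 changes trains and returns false.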

Bugs

This is a way to keep track of bugs that we have come across.

Create a section in Prework when we notice it. This provides a way to start keeping track of trends and not lose info, especially for issues that are not urgent enough to address right away.

When we start working on it in earnest (open a TAC case, create a JIRA ticket, etc.), move that section to Outstanding.

When the issue is resolved, move it to Resolved.

If we work around an issue without resolving it, move it to Workaround.

Outstanding

Controller IPv6 traffic stops

  • Description: All IPv6 traffic to/from the controller itself ceases. This does not impact user traffic.
  • Detection:
    • The MD is reachable over IPv4, but not IPv6.
    • The MD is unable to ping its IPv6 gateway.
    • If the MD has established sessions (e.g., tunnels) to IPv6 addresses, those may continue to work.
    • IPv6 neighbor table is stuck. It neither adds nor removes items dynamically.
    • AKiPS availability
    • AKiPS status
  • Workaround:
    • IPv6 dependencies have been (mostly?) removed. User impact should be minimal to none at this point. See Enhancements/IPv6 for details.
    • Bounce the link to the impacted controller. This can be done from either the controller or the router side. Additionally, it seems we can take down a single link in the port channel and bring it back up. Usually the link needs to stay down for a few seconds.
    • While the above does (temporarily) restore IPv6, it seems the failover mechanisms are broken, meaning the workaround is user impacting. Current practice is to leave IPv6 broken.
    • Add a static neighbor entry:
      (isb-mm-1) [00:1a:1e:03:03:08] #show configuration committed | include neigh
      ipv6 neighbor 2607:b400:64:4000::1 vlan 299 00:31:46:17:df:f0
      
      This allows traffic to flow through the gateway, but does not allow traffic from the gateway itself.
      (col-md-2) *#show ipv6 route
      
      Thu Jan 18 12:52:18.483 2024
      
      Codes: C - connected, O - OSPF, R - RIP, S - static
             M - mgmt, U - route usable, * - candidate default
      
      Gateway of last resort is 2607:b400:64:4000::1 to network ::/128 at cost 1
      S*    ::/0 [0/1] via 2607:b400:64:4000::1*
      C    2607:b400:64:4000::/64 is directly connected, VLAN299
      C    2607:b400:a00:1::/64 is directly connected, VLAN801
      (col-md-2) *#ping ipv6 2001:468:c80:210f:0:165:9b7d:7dcb
      
      Press 'q' to abort.
      Sending 5, 92-byte ICMPv6 Echos to 2001:468:c80:210f:0:165:9b7d:7dcb, timeout is 2 seconds:
      !!!!!
      Success rate is 100 percent (5/5), round-trip min/avg/max = 0.468/0.5676/0.656 ms
      
      (col-md-2) *#ping ipv6 2607:b400:64:4000::1
      
      Press 'q' to abort.
      Sending 5, 92-byte ICMPv6 Echos to 2607:b400:64:4000::1, timeout is 2 seconds:
      .....
      Success rate is 0 percent (0/5)
      
  • Impact:
    • All communication between the controllers and CPPM is over v6. Thus, no clients can authenticate on the VirginiaTech SSID.
    • Other system services on the controller happen over v6 including NTP and DNS.
    • Whatever the impact of an out-of-date neighbor table is.
  • Unknowns / next steps:
    • What is the impact of an incorrect neighbor table on an MD? E.g., what is the impact on a client that is not in the table? Does this impact Air Group? Does it affect the efficiency/speed at which the MD can switch packets? Does this prevent the MD from short-circuiting or optimizing client ND?
    • Is the MD participating in neighbor discovery at all? Is it sending/receiving NS? Is it sending NA?
  • TAC cases:

Config out of sync

  • Description: The config on the MC device node and on the corresponding MD are different.

  • Symptoms: From the MM:

    (isb-mm-1) *[00:1a:1e:02:d8:90] #show configuration effective | include debugging
    logging user-debug 9c:b6:d0:da:1e:8f level debugging
    logging arm-user-debug 9c:b6:d0:da:1e:8f level debugging
    (isb-mm-1) *[00:1a:1e:02:d8:90] #
    

    From the MD:

    (col-md-1) *#show running-config | include debugging
    Building Configuration...
    logging security process dot1x-proc level debugging
    logging level debugging arm-user-debug 9c:b6:d0:da:1e:8f
    logging level debugging user-debug 9c:b6:d0:da:1e:8f
    
  • TAC case: 5360416723

  • Notes

    • ccm-debug full-config-sync did not resolve the issue
    • Problem went away on its own, probably from subsequent commits.
    • Currently writing a script that compares the config from API

Resolved

Connectivity failures (Aruba Support Advisory ARUBA-SA-20210901-PLVL04)

  • Description: Clients have association failures.

    This case morphed into the Linux client issue. Linux clients would occasionally just stop passing traffic. The device would still be associated, but it could not even ping the UAC. It was mostly observed on Intel AX200 and AX210 cards, but has also been seen on Intel's AC cards and the MediaTek MT7921K. The problem looked like a driver / kernel issue, but its disappearance is more closely correlated with upgrading to ArubaOS 8.10.

  • Symptoms:

    • Clients experience association failures during high bursts of client roaming events.
    • High CPU utilization by the Station Management process (stm) in the MDs.
    • show papi kernel-socket-stats | include 8345,8222,8419,Drops
      • Drops value on port 8419 (STM Low Priority) rapidly increases in 100+ increments within seconds AND sustained large values for CurRxQLen and Drops on port 8345 (STM).
    • show cpuload current
      • stm process stays over 100%
  • TAC cases:

  • Notable versions:

    • 8.7.1.4: observed
    • 8.7.1.5: observed
    • 8.7.1.6: Sanjay claims a fix
    • 8.7.1.6: observed
    • 8.10.0.6: presumed fixed
  • Debug: Logs requested by Rodger. Make sure user debug is enabled:

    logging user-debug <client-mac> level debug
    
    • Currently enabled for waldrep's laptop (46:96:f1:03:32:98)
    no paging
    show cli-timestamp
    show clock
    show ap association client-mac <client-mac>
    show station-table | include <client-mac>
    show auth-tracebuf mac <client-mac>
    show ap client trail-info <client-mac>
    show datapath session table | include <ip address of client>
    show log user-debug 50 | include <client-mac>
    show log security 50 | include <client-mac>
    show log system 50 | include <Affected_AP_Name>
    tar log tech-support
    

    Collect the following at the time of the issue, along with tech support logs:

    clock cli-timestamp
    show dot1x watermark history
    show papi kernel-socket-stats
    show ap debug client-mgmt-counters
    show ap debug sta-msg-stats
    show ap debug cluster-counters
    show ap debug gsm-counters
    show ap debug client-deauth-reason-counters
    show cpuload current
    show datapath bwm table
    show datapath utilization
    show datapath papi counters
    show datapath debug opcode
    show datapath network ingress
    show datapath maintenance counters
    show datapath debug dma counters
    show datapath message-queue counters
    show auth-tracebuf
    

Kernel panics

  • Description: MD crashes with a kernel panic
  • Symptoms
    • MD reboots
    • Kernel panic
    • TAC asked for kernel core dumps. This option has been enabled for a while, but doesn't seem to be giving what they are asking for.
    • Intent:cause:registers:
      • 12:86:b0:2
      • 12:86:b0:4
      • 12:86:e0:2
      • 12:86:e0:4
      • 12:86:e0:8
      • 78:86:50:2 (logs lost)
  • Bug IDs
    • AOS-216744
  • TAC cases:
  • JIRA tasks:
  • Notable versions:
    • 8.5.0.11:
      • Observed
        • 12:86:e0:2
    • 8.7.1.3:
      • TAC asserts fixed:
        • 12:86:e0:2
    • 8.7.1.4:
      • Observed:
        • 12:86:e0:2
    • 8.7.1.5:
      • TAC asserts fixed:
        • 12:86:e0:2
        • 12:86:e0:4
        • 12:86:b0:4
      • Observed:
        • 12:86:b0:2
        • 12:86:b0:4
        • 12:86:e0:8
    • 8.7.1.5_81619:
      • Observed:
        • 12:86:b0:4
    • 8.7.1.6:
      • TAC asserts fixed
        • 12:86:b0:2

res-md-1 refuses clients

  • Description: any client trying to use res-md-1 as a UAC cannot associate.
  • Symptoms:
    • show lc-cluster load distribution client shows 0 active and 0 standby clients for res-md-1.
    • started with res-md-1 crashing
    • persisted across a reboot and code upgrade
  • TAC cases
  • Notable version:
    • 8.7.1.4: crash that initiated the problem
    • 8.7.1.5: observed

Holy amon logs, Batman!

  • Description: A debug trace on amon_sender_proc and amon_recvr_proc is logged and cannot be disabled. Collectively, the controllers sent over 20,000 logs/s. The problem only showed up on some boots.
  • Bug IDs:
    • AOS-210452
  • TAC cases:
  • Notable versions:
    • 8.7.0.0: bug introduced
    • 8.7.1.4: fixed
  • JIRA task:

No state attribute in RADIUS request

  • Description
    • The RADIUS request packets do not contain the state attribute value and hence, clients face connectivity issues.
  • Bug IDs
    • AOS-207701
    • AOS-218006
  • Notable versions:
    • 8.4.0.0: introduced
    • 8.7.1.3: fixed

Too many pending changes

  • Description
    • If the expected output of show configuration unsaved-nodes was over 1024 characters, then it displayed nothing.
    • This also impacted API output.
  • Bug IDs
    • AOS-210404
  • Notable versions:
    • 8.5.0.10: observed broken
    • 8.5.0.12: fixed
    • 8.7.0.3: fixed

Prework

show global-user-table crashes auth module

  • Description: Running said command on the MM often returns no results and crashes the auth module 1-2 times.
  • Symptoms:
    • Running show global-user-table list mac <mac> hangs for about a minute, sometimes not returning anything.
    • When the command completely fails, it throws an error about the auth module being busy
    • show crashinfo shows that the auth module crashed 1 or 2 times
    • Happens via ssh and api.
  • Workaround:
    • Check each MD directly
    • Try again later
  • Notable versions:
    • 8.7.1.4: observed
    • 8.7.1.5: observed immediately after upgrade, but haven't been able to recreate since
    • 8.7.1.6: observed

APs crashing

  • Description: A lot of APs crashing
  • Symptoms:
    • A few APs crash repeatedly (started keeping track in 8.7.1.5):
      • VAW-152TP01B (res)
      • LIB-234BA1188L (col)

Workaround

Delegated commands to v6 controllers fail

  • Symptoms:
    • aaa user delete commands from the MC never get a response from the v6 controllers.
    • Running a second command requires waiting for the timeout (default 300 s)
  • Recreate the problem:
    • Have at least one MD connect to the MC over IPv6, and note which MDs these are. To do this, configure them with the masteripv6 or conductoripv6 command instead of the masterip or conductorip command.
      (isb-mm-1) [mynode] #cd vtc-md-1
      (isb-mm-1) [00:1a:1e:04:b1:10] #show configuration committed | include conductor
      conductoripv6 2607:b400:2:2000:0:173:32:36 ipsec-factory-cert conductor-mac-1 20:4c:03:8f:53:1a conductor-mac-2 20:4c:03:0e:e0:44 interface-f vlan-f 100
      
      This can be verified with the show switches debug command, noting which version is used in the "IP Address" column.
      (isb-mm-1) [mynode] #show switches debug
      
      All Switches
      ------------
      IP Address                     MAC                Name      Nodepath         Type       Model           Version         Status  Uptime       CrashInfo  Config Sync Time (sec)  License  Release Type
      ----------                     ---                ----      --------         ----       -----           -------         ------  ------       ---------  ----------------------  -------  ------------
      128.173.32.34                  20:4c:03:8f:53:1a  isb-mm-1  /mm/mynode       conductor  ArubaMM-HW-10K  8.10.0.9_88493  up      51d 20h 50m  no         0                       N/A      LSR
      128.173.32.35                  20:4c:03:0e:e0:44  isb-mm-2  /mm              standby    ArubaMM-HW-10K  8.10.0.9_88493  up      51d 20h 40m  no         0                       N/A      LSR
      172.16.1.11                    00:1a:1e:02:d8:90  col-md-1  /md/vt/swva/col  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 13m   no         0                       N/A      LSR
      172.16.1.12                    00:1a:1e:03:03:08  col-md-2  /md/vt/swva/col  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 13m   yes        0                       N/A      LSR
      172.16.1.13                    00:1a:1e:02:d8:f0  col-md-3  /md/vt/swva/col  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 13m   no         0                       N/A      LSR
      172.16.1.14                    00:1a:1e:03:02:78  col-md-4  /md/vt/swva/col  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 13m   yes        0                       N/A      LSR
      172.16.1.141                   00:1a:1e:03:01:98  bur-md-1  /md/vt/swva/bur  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 19m   no         0                       N/A      LSR
      172.16.1.142                   00:1a:1e:02:d8:b0  bur-md-2  /md/vt/swva/bur  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 19m   yes        0                       N/A      LSR
      172.16.1.143                   00:1a:1e:02:d9:70  bur-md-3  /md/vt/swva/bur  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 18m   no         0                       N/A      LSR
      172.16.1.144                   00:1a:1e:03:00:a8  bur-md-4  /md/vt/swva/bur  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 19m   no         0                       N/A      LSR
      172.17.1.11                    00:1a:1e:03:00:d8  res-md-1  /md/vt/swva/res  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 9m    yes        0                       N/A      LSR
      172.17.1.12                    00:1a:1e:03:01:90  res-md-2  /md/vt/swva/res  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 8m    yes        0                       N/A      LSR
      172.17.1.13                    00:1a:1e:03:11:10  res-md-3  /md/vt/swva/res  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 9m    yes        0                       N/A      LSR
      172.17.1.14                    00:1a:1e:03:0f:f8  res-md-4  /md/vt/swva/res  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 9m    yes        0                       N/A      LSR
      2607:b400:62:1400:0:16:247:11  00:1a:1e:04:b1:10  vtc-md-1  /md/vt/swva/vtc  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 3m    no         0                       N/A      LSR
      2607:b400:62:1400:0:16:247:12  00:1a:1e:04:b1:18  vtc-md-2  /md/vt/swva/vtc  MD         Aruba7240XM     8.10.0.9_88493  up      49d 7h 3m    no         0                       N/A      LSR
      172.16.236.151                 00:1a:1e:00:14:30  nvc-md-1  /md/vt/nova/nvc  MD         Aruba7220       8.10.0.9_88493  up      49d 7h 5m    no         0                       N/A      LSR
      172.16.236.152                 00:1a:1e:00:99:70  nvc-md-2  /md/vt/nova/nvc  MD         Aruba7220       8.10.0.9_88493  up      49d 7h 7m    no         0                       N/A      LSR
      
      Total Switches:18
      
    • From the MC, run an aaa user delete ... command, then check the status:
      (isb-mm-1) [mynode] #aaa user delete mac 00:11:22:33:44:55
      Users will be deleted at MDs. Please check show CLI for the status
      (isb-mm-1) [mynode] #aaa user delete mac 11:22:33:44:55:66
      The previous CLI is still in progess, please try later!
      (isb-mm-1) [mynode] #show aaa user-delete-result
      
      Summary of user delete CLI requests !
      Current user delete request timeout value: 300 seconds
      
      aaa user delete mac 00:11:22:33:44:55  , Overall Status- Response pending , Total users deleted- 0
      MD IP : 172.16.1.11, Status- Complete , Count- 0
      MD IP : 172.16.1.12, Status- Complete , Count- 0
      MD IP : 172.16.1.13, Status- Complete , Count- 0
      MD IP : 172.16.1.14, Status- Complete , Count- 0
      MD IP : 172.16.1.141, Status- Complete , Count- 0
      MD IP : 172.16.1.142, Status- Complete , Count- 0
      MD IP : 172.16.1.143, Status- Complete , Count- 0
      MD IP : 172.16.1.144, Status- Complete , Count- 0
      MD IP : 172.17.1.11, Status- Complete , Count- 0
      MD IP : 172.17.1.12, Status- Complete , Count- 0
      MD IP : 172.17.1.13, Status- Complete , Count- 0
      MD IP : 172.17.1.14, Status- Complete , Count- 0
      MD IP : 0.0.0.0, Status- Response pending , Count- 0
      MD IP : 0.0.0.0, Status- Response pending , Count- 0
      MD IP : 172.16.236.151, Status- Complete , Count- 0
      MD IP : 172.16.236.152, Status- Complete , Count- 0
      
      Note that the two MDs with IP 0.0.0.0 have a response pending. These are the two VTC MDs which are connecting to the MC over IPv6.
    • After 300 seconds from when the delete command was run:
      (isb-mm-1) [mynode] #show aaa user-delete-result
      
      Summary of user delete CLI requests !
      Current user delete request timeout value: 300 seconds
      
      aaa user delete mac 00:11:22:33:44:55  , Overall Status- Complete , Total users deleted- 0
      MD IP : 172.16.1.11, Status- Complete , Count- 0
      MD IP : 172.16.1.12, Status- Complete , Count- 0
      MD IP : 172.16.1.13, Status- Complete , Count- 0
      MD IP : 172.16.1.14, Status- Complete , Count- 0
      MD IP : 172.16.1.141, Status- Complete , Count- 0
      MD IP : 172.16.1.142, Status- Complete , Count- 0
      MD IP : 172.16.1.143, Status- Complete , Count- 0
      MD IP : 172.16.1.144, Status- Complete , Count- 0
      MD IP : 172.17.1.11, Status- Complete , Count- 0
      MD IP : 172.17.1.12, Status- Complete , Count- 0
      MD IP : 172.17.1.13, Status- Complete , Count- 0
      MD IP : 172.17.1.14, Status- Complete , Count- 0
      MD IP : 0.0.0.0, Status- Timed out , Count- 0
      MD IP : 0.0.0.0, Status- Timed out , Count- 0
      MD IP : 172.16.236.151, Status- Complete , Count- 0
      MD IP : 172.16.236.152, Status- Complete , Count- 0
      
      Note the command timed out.
  • Workaround:
    • Run the command from the appropriate MD.
  • TAC case:
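
When checking many of these results, it can help to pull out only the MDs that did not complete. The following is a rough sketch that parses the show aaa user-delete-result text shown above; the regular expression is based solely on that sample output and may need adjusting for other versions, and the function name is made up.

```python
import re

# Matches lines such as "MD IP : 0.0.0.0, Status- Timed out , Count- 0".
MD_LINE = re.compile(
    r"MD IP : (?P<ip>\S+), Status- (?P<status>.+?) , Count- (?P<count>\d+)"
)


def incomplete_mds(output):
    """Return (ip, status) for each MD line whose status is not Complete."""
    return [
        (match["ip"], match["status"])
        for match in MD_LINE.finditer(output)
        if match["status"] != "Complete"
    ]
```

Note that the affected MDs are reported as 0.0.0.0, so the result says how many MDs failed but not which ones.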

API timeouts

  • Description: API calls sometimes take a really long time.
  • Symptoms:
    • API calls time out.
    • API login process can return a 401.
    • TCP ACK to the API call is sent immediately, but the API response is still delayed.
  • Root cause:
    • The arci-cli-helper process is single threaded. Yes, really.
    • This process appears to be the shim between the HTTP interface of the API and the system.
    • This is less a "bug" and more of a "critical design failure".
  • Recreate the problem:
    • Make an API call for a command that takes a long time (e.g., show bss-table)
    • While that is still waiting on a response, make an API call for a command that should be nearly instant (e.g., show version).
    • Note that the second call will not get a response until the first one finishes.
  • TAC case:
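
The recreate steps can be turned into a small timing harness. This is a sketch under assumptions: slow_call and fast_call are whatever callables issue the two API requests (for example, separate sessions running show bss-table and show version), and the helper name and head_start parameter are invented for illustration.

```python
import threading
import time


def fast_call_latency(slow_call, fast_call, head_start=0.1):
    """Start slow_call in a thread, then time fast_call issued while it runs."""
    worker = threading.Thread(target=slow_call)
    worker.start()
    time.sleep(head_start)  # give the slow request time to be sent first
    started = time.monotonic()
    fast_call()
    elapsed = time.monotonic() - started
    worker.join()
    return elapsed
```

If the backend serializes requests, the returned latency is roughly the remaining run time of the slow call rather than the fast call's usual response time.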

aaa rfc-3576-server profiles are dumb

  • Description: An rfc-3576 message's sender is not recognized as a configured server.
  • Symptoms:


RADIUS RFC 3576 Statistics
--------------------------
Server                                 Disconnect Req  Disconnect Acc Disconnect Rej  No Secret  No Sess ID  Bad Auth  Invalid Req  Pkts Dropped Unknown service  CoA Req  CoA Acc  CoA Rej  No perm
------                                 --------------  -------------- --------------  ---------  ----------  --------  -----------  ------------ ---------------  -------  -------  -------  -------
172.28.48.84                           0               0              0               0          0           0         0            0            0                0        0        0        0
172.28.49.84                           0               0              0               0          0           0         0            0            0                0        0        0        0
2607:b400:62:9200:0:8f:ee32:b3f3       0               0              0               0          0           0         0            0            0                0        0        0        0
2607:b400:62:9200:0:95:1b5d:6dfa       0               0              0               0          0           0         0            0            0                0        0        0        0
2607:b400:92:8400:0000:0044:7dcf:5796  0               0              0               0          0           0         0            0            0                0        0        0        0
2607:b400:92:8400:0000:0046:275b:4605  0               0              0               0          0           0         0            0            0                0        0        0        0
2607:b400:92:8500:0000:0041:89db:6313  0               0              0               0          0           0         0            0            0                0        0        0        0
2607:b400:92:8500:0000:004d:be0b:1156  0               0              0               0          0           0         0            0            0                0        0        0        0

Packets received from unknown clients : 1653
Packets received with unknown request : 0
Total RFC3576 packets Received        : 1653
  • Workaround:
    • IPv6 addresses must be formatted omitting leading zeros, but also without the use of a double colon (::).
    • Different formats of the same address are recognized as different profiles.
    • Incorrect: 2607:b400:0092:8400:0000:0044:7dcf:5796
    • Incorrect: 2607:b400:92:8400::44:7dcf:5796
    • Correct: 2607:b400:92:8400:0:44:7dcf:5796
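
To avoid hand-formatting mistakes, the address can be normalized before it goes into the config. A minimal sketch using Python's standard ipaddress module (the function name is made up): expand the address fully, then strip the leading zeros from each group.

```python
import ipaddress


def rfc3576_format(addr):
    """Format an IPv6 address with no leading zeros and no '::' compression."""
    # .exploded always yields eight four-digit groups.
    groups = ipaddress.IPv6Address(addr).exploded.split(":")
    # Re-render each group as hex without leading zeros.
    return ":".join(format(int(group, 16), "x") for group in groups)
```

All three spellings of the example address above normalize to the same string, the one listed as correct.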

ERR_IKESA_EXPIRED

  • Description: Tunnel between MM and MD is broken.
  • Symptoms:
    • So far, this has only happened to col-md-r2:
      • controller MAC: 00:0b:86:b4:d3:a7
      • system serial: CR0001355
    • The problem has persisted after multiple factory resets.
    • Cluster VRRP address is down for the impacted MD.
  • Temporary workaround:
    • To restore the tunnel, on the MM run:
      process restart isakmpd
      
  • Long-term workaround:
    • We moved the RAPs to lcc-col and decommissioned col-md-r2.
    • Motivation was consolidation of controllers, not "fixing" this bug.
  • TAC cases:
  • JIRA tasks:

API issues

There are a lot of them.

Extra config

logging server

  • API endpoint: v1/configuration/object/log_lvl_syslog_ipv6_options
  • CLI config: logging <ipv6 addr> [options]

The best way to explain this one is to walk through a series of operations: POST to the endpoint, look at the resulting config, GET the endpoint, then POST the JSON blob that the GET returned. Notably, these operations should be invertible. That is, POSTing what was received from a GET should change nothing.

POST:

[{ "ipv6addr": "2001:468:c80:210f:0:177:fd2a:cb4e" }]

Sets:

logging 2001:468:c80:210f:0:177:fd2a:cb4e

GET/POST:

[{
  "ipv6addr": "2001:468:c80:210f:0:177:fd2a:cb4e",
  "fac": "local1",
  "lvl_severity": "warnings"
}]

Sets:

logging 2001:468:c80:210f:0:177:fd2a:cb4e facility local1 severity warnings

GET:

[{
  "ipv6addr": "2001:468:c80:210f:0:177:fd2a:cb4e",
  "facility": true,
  "fac": "local1",
  "severity": true,
  "lvl_severity": "warnings"
}]
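
One way to spot this kind of mismatch is to diff the keys of the blob that was POSTed against the blob a later GET returns. This is only a sketch using the two blobs above; the helper name extra_keys is made up for illustration, and it assumes both lists hold their entries in the same order.

```python
def extra_keys(posted, received):
    """For each entry, list the keys the GET returned that were never POSTed."""
    return [sorted(set(got) - set(sent)) for sent, got in zip(posted, received)]


posted = [{
    "ipv6addr": "2001:468:c80:210f:0:177:fd2a:cb4e",
    "fac": "local1",
    "lvl_severity": "warnings",
}]
received = [{
    "ipv6addr": "2001:468:c80:210f:0:177:fd2a:cb4e",
    "facility": True,
    "fac": "local1",
    "severity": True,
    "lvl_severity": "warnings",
}]

print(extra_keys(posted, received))  # [['facility', 'severity']]
```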

Missing config

  • Description: Certain config items have an API object with no definition.
  • Symptoms:
    • These configuration items do not show up in the config JSON.
    • The API endpoint can still be queried directly:
      >>> # Configuration present
      >>> md.get(arubaos.api_object("telnet_cli"))
      {'_data': {'telnet_cli': {}}}
      
      >>> # Configuration not present
      >>> md.get(arubaos.api_object("telnet_cli"))
      {'_data': {'telnet_cli': None}}
      
    • Notably, all instances of this seem to have the API definition:
      { "type": "object" }
      

ipv6_enable

  • API definition file: Controller.json
  • Full API endpoint path: v1/configuration/object/ipv6_enable
  • CLI configuration: ipv6 enable

telnet_cli

  • API definition file: Controller.json
  • Full API endpoint path: v1/configuration/object/telnet_cli
  • CLI configuration: telnet cli

telnet_soe

  • API definition file: Controller.json
  • Full API endpoint path: v1/configuration/object/telnet_soe
  • CLI configuration: telnet soe

ssh disable_dsa

  • API definition file: Authentication.json
  • Full API endpoint path: v1/configuration/object/aaa_ssh_dsa
  • CLI configuration: ssh disable_dsa

Note that when this command is present, DSA keys are disabled for ssh. Thus, when the API returns:

{"_data": {"aaa_ssh_dsa": {}}}

DSA keys are disabled, counter to the natural reading of the output.
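
A small sketch of how a script might interpret these definition-less objects, based only on the two response shapes shown above ({} when the line is configured, None when it is not). The helper names are made up for illustration.

```python
def is_present(response, name):
    """True when the CLI line behind `name` is in the running config."""
    # Configured objects come back as {}, unconfigured ones as None.
    return response["_data"][name] is not None


def dsa_keys_enabled(response):
    """Inverted on purpose: the CLI line is 'ssh disable_dsa'."""
    return not is_present(response, "aaa_ssh_dsa")


print(is_present({"_data": {"telnet_cli": {}}}, "telnet_cli"))    # True
print(is_present({"_data": {"telnet_cli": None}}, "telnet_cli"))  # False
print(dsa_keys_enabled({"_data": {"aaa_ssh_dsa": {}}}))           # False
```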

Can't upgrade via API

  • Description: Trying to execute commands used in an upgrade process via API throws permission errors.
  • Symptoms:
    • Sample interactive python session:
    >>> import arubaos.arubaos as aos
    >>> import passpy
    >>> domain = "mobility.nis.vt.edu"
    >>> creds = {"username": "waldrep", "pwpath": "waldrep@vt.edu/netadmin"}
    >>> vtc1 = aos.Controller(f"vtc-md-1.{domain}", creds)
    >>> endpoint = aos.api_object("copy_scp_system")
    >>> body = {
    ...     "scphost": "2001:468:c80:210f:0:165:9b7d:7dcb",
    ...     "username": "waldrep",
    ...     "passwd": passpy.store.Store() \
    ...         .get_key("waldrep@vt.edu/conehead") \
    ...         .split('\n', maxsplit=1)[0],
    ...     "filename": "C_ArubaOS_72xx_8.7.1.5_81619",
    ...     "partition_num": "partition1"
    ... }
    >>> vtc1.post(endpoint, body)
    {'_global_result': {'status': 1, 'status_str': 'You do not have permissions to execute the commands'}}
    
    • Including or excluding the optional partition_num makes no difference.
    • Using a v4 or v6 scphost makes no difference.
    • Same error when trying to preload APs:
      >>> endpoint = aos.api_object("ap_image_preload")
      >>> body = {'ap_info': 'all-aps' }
      
  • Workaround:
    • Upload via the web UI or the CLI.

Inconsistent errors for not being authenticated/authorized

  • Trying to do something without being logged in returns the HTML for the login page and a 401 code.
  • Trying to do something that the user's role is not allowed to do returns:
    • code 200
    • The following XML:
      <aruba>
        <status>Error</status>
        <reason>no permission to execute opcode/program.</reason>
      </aruba>
      
  • Trying to do something that should be allowed but is broken (like upgrading the OS image):
    • code 200
    • The following JSON:
       {
         '_global_result': {
           'status': 1,
           'status_str': 'You do not have permissions to execute the commands'
         }
       }
      
  • Using an invalid show command (e.g., show ap database long):
    • code 200
    • empty HTTP payload
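
Because the HTTP status code alone cannot be trusted, a client has to inspect the payload. The following is a rough sketch of sorting responses into the four cases above, using only the standard library. The function name and the labels it returns are made up for illustration, and it assumes that a _global_result status of 0 means success.

```python
import json
import xml.etree.ElementTree as ET


def classify_response(status_code, body):
    """Map an HTTP status code and payload text to one of the cases above."""
    if status_code == 401:
        return "not logged in"
    text = body.strip()
    if not text:
        # Seen with invalid show commands.
        return "empty payload"
    if text.startswith("<"):
        try:
            root = ET.fromstring(text)
        except ET.ParseError:
            return "unrecognized"
        if root.tag == "aruba" and root.findtext("status") == "Error":
            return "role not permitted"
        return "unrecognized"
    try:
        data = json.loads(text)
    except ValueError:
        return "unrecognized"
    result = data.get("_global_result") if isinstance(data, dict) else None
    # Assumption: status 0 is success, anything else is a failure.
    if isinstance(result, dict) and result.get("status") not in (None, 0):
        return "command rejected"
    return "ok"
```

A caller would pass in the status code and raw response text from whatever HTTP client is in use.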

Leaking secrets

Using a read-only account to get certain items reveals secrets, such as SNMP community secrets, RADIUS keys, etc.

At least for a cluster's VRRP secret, it seems to become obfuscated on reboot.

Ordering of unordered things

Things that have no inherent ordering (such as virtual_ap definitions) are returned in a list, which is ordered. There is also no metadata indicating which things are order sensitive and which are not.

Actual:

{
  "virtual_ap": [
    {
      "aaa_prof": {
        "profile-name": "aaa-eduroam"
      },
      "drop_mcast": {},
      "profile-name": "vap-eduroam",
      "ssid_prof": {
        "profile-name": "ssid-eduroam"
      },
      "vlan": {
        "vlan": "vlan-user"
      }
    },
    {
      "aaa_prof": {
        "profile-name": "aaa-vtopenwifi"
      },
      "drop_mcast": {},
      "profile-name": "vap-vtopenwifi",
      "ssid_prof": {
        "profile-name": "ssid-vtopenwifi"
      },
      "vlan": {
        "vlan": "vlan-user"
      }
    }
  ],
  "ap_group": [
    {
      "dot11a_prof": {
        "profile-name": "rpa-default"
      },
      "profile-name": "agp-ageng",
      "reg_domain_prof": {
        "profile-name": "rdp-blacksburg"
      },
      "virtual_ap": [
        {
          "profile-name": "vap-eduroam"
        },
        {
          "profile-name": "vap-vtopenwifi"
        }
      ]
    }
  ]
}

Better:

{
  "virtual_ap": {
    "_data": [
      {
        "aaa_prof": {
          "profile-name": "aaa-eduroam"
        },
        "drop_mcast": {},
        "profile-name": "vap-eduroam",
        "ssid_prof": {
          "profile-name": "ssid-eduroam"
        },
        "vlan": {
          "vlan": "vlan-user"
        }
      },
      {
        "aaa_prof": {
          "profile-name": "aaa-vtopenwifi"
        },
        "drop_mcast": {},
        "profile-name": "vap-vtopenwifi",
        "ssid_prof": {
          "profile-name": "ssid-vtopenwifi"
        },
        "vlan": {
          "vlan": "vlan-user"
        }
      }
    ],
    "_flags": {
      "ordered": false
    }
  },
  "ap_group": {
    "_data": [
      {
        "dot11a_prof": {
          "profile-name": "rpa-default"
        },
        "profile-name": "agp-ageng",
        "reg_domain_prof": {
          "profile-name": "rdp-blacksburg"
        },
        "virtual_ap": {
          "_data": [
            {
              "profile-name": "vap-eduroam"
            },
            {
              "profile-name": "vap-vtopenwifi"
            }
          ],
          "_flags": {
            "ordered": false
          }
        }
      }
    ],
    "_flags": {
      "ordered": false
    }
  }
}

Best:

{
  "virtual_ap": {
    "vap-eduroam": {
      "aaa_prof": "aaa-eduroam",
      "drop_mcast": {},
      "ssid_prof": "ssid-eduroam",
      "vlan": "vlan-user"
    },
    "vap-vtopenwifi": {
      "aaa_prof": "aaa-vtopenwifi",
      "drop_mcast": {},
      "ssid_prof": "ssid-vtopenwifi",
      "vlan": "vlan-user"
    }
  },
  "ap_group": {
    "agp-ageng": {
      "dot11a_prof": "rpa-default",
      "reg_domain_prof": "rdp-blacksburg",
      "virtual_ap": {
        "vap-eduroam": {},
        "vap-vtopenwifi": {}
      }
    }
  }
}
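
Until the API does something like this itself, the "Actual" shape can be reshaped client side. A minimal sketch follows; the function name is made up for illustration, and it only handles the profile-name convention (it does not collapse {"vlan": {"vlan": ...}} the way the "Best" example does).

```python
def key_by_profile_name(value):
    """Turn lists of named profiles into dicts keyed by profile name."""
    if isinstance(value, list):
        named = value and all(
            isinstance(item, dict) and "profile-name" in item for item in value
        )
        if named:
            return {
                item["profile-name"]: key_by_profile_name(
                    {k: v for k, v in item.items() if k != "profile-name"}
                )
                for item in value
            }
        return value
    if isinstance(value, dict):
        # A bare reference such as {"profile-name": "aaa-eduroam"} becomes a string.
        if set(value) == {"profile-name"}:
            return value["profile-name"]
        return {k: key_by_profile_name(v) for k, v in value.items()}
    return value


actual = {
    "virtual_ap": [
        {
            "aaa_prof": {"profile-name": "aaa-eduroam"},
            "drop_mcast": {},
            "profile-name": "vap-eduroam",
        }
    ],
    "ap_group": [
        {
            "profile-name": "agp-ageng",
            "virtual_ap": [{"profile-name": "vap-eduroam"}],
        }
    ],
}
print(key_by_profile_name(actual))
```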

Unpredictable ordering

Making the above problem worse, the ordering is only somewhat static. The order seems to change when the device reboots.
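
When comparing two config dumps (for example, before and after a reboot), one workaround is to normalize list order first. A sketch only: the function name is made up for illustration, and it treats every list as unordered, which is wrong for anything that really is order sensitive.

```python
import json


def canonical(value):
    """Recursively sort every list so that list order no longer matters."""
    if isinstance(value, dict):
        return {k: canonical(v) for k, v in value.items()}
    if isinstance(value, list):
        # Sort by a stable JSON rendering, since dicts are not directly sortable.
        return sorted(
            (canonical(v) for v in value),
            key=lambda v: json.dumps(v, sort_keys=True),
        )
    return value


before = {"virtual_ap": [{"profile-name": "vap-eduroam"}, {"profile-name": "vap-vtopenwifi"}]}
after = {"virtual_ap": [{"profile-name": "vap-vtopenwifi"}, {"profile-name": "vap-eduroam"}]}

print(before == after)                        # False
print(canonical(before) == canonical(after))  # True
```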

Initial setup

  • DNS checks fail if you do a permanent network setup without doing a temporary config first.
  • The GUI password set during setup doesn't work. Resetting the password through the CLI to the same value used during setup makes it work.

First run

  • Initial log in goes to a "not authorized" page, which then redirects to a log out page... which does nothing. Manually going to the cluster domain again redirects to the main COP page... logged in.

Uploading a certificate

  • I have yet to be successful in uploading a PEM. Things attempted:
    • Uploading a fully cat-ed chain (i.e., leaf, key, and intermediate)
    • Uploading the root and intermediate certs explicitly as a CA, then uploading the leaf/key PEM.
  • When uploading a PEM fails, the entire HTTP process dies, and the only way to recover is to rebuild COP (or probably TAC intervention).
  • SANs are not checked when uploading a certificate. A typo here can take the whole server down.
  • Uploading a second CA cert seems to override the first.
  • What does work is uploading a PKCS12 that contains the server cert, key, and intermediate cert(s), although this still throws a false error message. (Screenshot: PKCS12 false error)

Missing features

  • There is no way to upload the certificate from the cli.
  • There is no way to upload a server certificate without applying it. This makes it impossible to stage a change.