The Mobility Service
The mobility service is the portion of the Virginia Tech network experience where end devices connect wirelessly. Today, this is limited to Wi-Fi, but that may change in the future.
About this documentation
Target Audience
This documentation is primarily internal to the team managing/supporting the mobility service. It is also a convenient way to share ideas outside of the core team, and to encourage open development.
Contributing/feedback
All contributions and feedback, on the documentation and the service itself, are welcome.
Administrative Unit Assessments
NI&S is required to set measurable goals on an annual basis. These are the goals for the wireless service for the year 2021.
Service Objective
Design and build robust and resilient IT infrastructure in support of Virginia Tech's expansion and growth.
Measure: indoor coverage
Track Wi-Fi coverage in indoor programmed spaces.
Target
Provide comprehensive indoor Wi-Fi coverage with at least -65dBm signal strength for all programmed spaces.
Progress
We already design for this. However, measuring it in a meaningful way is tricky. Some options include:
- Spot checking areas manually
- Deploy end-to-end testing beacons
- Count tickets
Each of these has its drawbacks. Spot checks and testing beacons are guaranteed to miss areas, especially corner cases that were missed in the design phase. Counting tickets is not practical, as it is not possible to programmatically extract that data from tickets.
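As a rough illustration of how spot-check data could be scored against the -65dBm target, the sketch below reports which sampled locations meet the threshold and what share of them do. The data format (a location name mapped to a list of RSSI readings in dBm) and the location names are assumptions for illustration; this is not an existing tool or data source.

```python
from statistics import median

TARGET_DBM = -65  # design target from the measure above


def coverage_report(samples: dict, target: float = TARGET_DBM) -> dict:
    """Map each location to True when its median RSSI meets the target.

    `samples` maps a (hypothetical) location name to a list of RSSI
    readings in dBm. Locations without readings are skipped.
    """
    return {
        location: median(readings) >= target
        for location, readings in samples.items()
        if readings
    }


def coverage_ratio(report: dict) -> float:
    """Return the fraction of reported locations that meet the target."""
    return sum(report.values()) / len(report) if report else 0.0


# Hypothetical spot-check readings
samples = {"room-101": [-60, -62, -58], "room-102": [-70, -72, -66]}
report = coverage_report(samples)
print(report, coverage_ratio(report))
```

Using the median rather than a single reading makes one bad sample less likely to fail a room, but it does nothing about the main drawback above: locations that were never sampled do not appear in the report at all.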
Measure: capacity
Track Wi-Fi capacity in indoor programmed spaces.
Target
Track Wi-Fi capacity to ensure a wireless client to access point ratio of 25:1 or better.
Progress
This is another tricky one to measure, mostly because our monitoring tools suck. AirWave and NetInsight are both going away in the next year(ish), and will be replaced with Central On-Prem (COP). We will re-evaluate when COP is deployed.
It is worth noting that clients per radio is perhaps not the best metric:
- The 2.4GHz space is noisier than 5GHz, and as such would need a lower threshold for an equivalent experience.
- A client connecting at 12Mbps uses more resources than a client connecting at 900Mbps.
- A client streaming Netflix 4k uses more resources than an idle client.
Suggestion: let's track peak airtime utilization instead. We need to determine a good target and how to measure it.
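To make the two candidate metrics concrete, here is a minimal sketch that flags APs over the 25:1 ratio and reports peak airtime utilization per radio. The input shapes and AP names are hypothetical; neither AirWave, NetInsight, nor COP is assumed to export data in this form.

```python
MAX_CLIENTS_PER_AP = 25  # target ratio of 25:1 from the measure above


def over_capacity(client_counts: dict, limit: int = MAX_CLIENTS_PER_AP) -> list:
    """Return the names of APs whose client count exceeds the limit, busiest first."""
    busy = [(count, ap) for ap, count in client_counts.items() if count > limit]
    return [ap for count, ap in sorted(busy, reverse=True)]


def peak_airtime(samples: dict) -> dict:
    """Return the highest channel utilization (in percent) seen per radio.

    `samples` maps a (hypothetical) radio name to a list of utilization
    samples. Radios without samples are skipped.
    """
    return {radio: max(values) for radio, values in samples.items() if values}


print(over_capacity({"ap-a": 30, "ap-b": 10, "ap-c": 41}))
print(peak_airtime({"ap-a/5GHz": [12.5, 80.0, 40.0]}))
```

The client count treats every client as equal, which is exactly the weakness listed above; the airtime figure already accounts for slow or busy clients because they occupy the channel for longer.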
Measure: outdoor coverage
Track Wi-Fi coverage in outdoor spaces.
Target
Expand the number of outdoor wireless access points by 20%.
Progress
Funding for this is planned. We need to determine how many (in scope) outdoor APs exist now.
Administrative Objective
Increase organizational efficiency and responsiveness.
Measure: AP provisioning
Improve AP provisioning capabilities.
Target
Automate the provisioning of all standard wireless access point installations.
Progress
This one is mostly done. See enhancement task.
Complete:
- Work with devs to create a tool that will automatically provision an AP.
- The tool has 2 modes:
- REST API: on demand provisioning of an AP by passing a MAC address
- Nightly reconciler: in case something on demand is missed (or the environment changes).
- Pulls info from the controller and LLDP to determine the AP group and name.
- Creates an AP group if none exists.
- A script is written to parse an SNMP trap and fire off the REST call.
- MDs currently send traps via IPv4 to OMD (stonefly).
Incomplete:
- OMD is not correctly processing the traps it receives.
- Moving from OMD to AKIPS now that it is purchased.
- Rather than managing the SNMP/REST connector ourselves, we would rather have the web app listen to SNMP directly.
- Documentation
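The trap-to-REST hand-off described above could look roughly like the sketch below: pull a MAC address out of the trap text, normalize it, and build the request for the provisioning tool. The trap text, the URL, and the `mac` field name are all hypothetical, since the real tool's API is not documented here. The sketch only builds the request; it does not send anything.

```python
import json
import re
from typing import Optional

# Six hex octets separated by ":" or "-"
MAC_RE = re.compile(r"(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}")


def extract_mac(trap_text: str) -> Optional[str]:
    """Return the first MAC address in the trap text as lowercase aa:bb:cc:dd:ee:ff."""
    match = MAC_RE.search(trap_text)
    if match is None:
        return None
    return match.group(0).replace("-", ":").lower()


def build_provision_request(mac: str, base_url: str = "https://provisioner.example/api") -> dict:
    """Describe the REST call for one AP (hypothetical endpoint and body)."""
    return {
        "method": "POST",
        "url": f"{base_url}/aps",
        "body": json.dumps({"mac": mac}),
    }


mac = extract_mac("AP joined: 20-4C-03-8F-53-1A")  # made-up trap text
if mac is not None:
    print(build_provision_request(mac))
```

Because the nightly reconciler catches anything the on-demand path misses, a connector like this can afford to drop traps it cannot parse.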
Service Priorities
These are thoughts on what we prioritize or value for the Wi-Fi service specifically. These are not the core values set in the IT Strategic Plan, but are (in part) an extension of them as they apply specifically to the Wi-Fi service. They are closely related to the AUAs, but are more general and broadly scoped in nature.
Key insights
- Our users set the expectations for the service.
- Our priorities are defined by the expectations.
- The priorities drive the features and properties of the system.
- All of this is constrained and guided by our core values.
What this looks like for Wi-Fi
Expectations
- Ubiquitous and seamless coverage
- Reliable access
- High bandwidth
- Reasonable latency
Priorities
- Robust
- Systems stay up
- Systems continue to provide service when they are up
- Systems are fault tolerant
- Debuggable
- Flexible
- Well understood and solved problems should be solved out of the box
- The system should easily accommodate new ideas or deployments not considered by the vendor.
- Secure
- Coherent architecture
Properties
User end
- Frictionless access
- Latest standards
- Dual stacked now
- Single stack IPv6 limitations are external
Administrator end
- Fault tolerant
- Hardware should be able to fail without impacting users
- Replacing hardware should be low risk and (relatively) low effort
- API driven
- Complete
- Idempotent
- Single stacked IPv6
- IPv6 Addresses are not strings
- Fixing one part of the system should not depend on another part of the system
- Centralized (or perhaps intent based) config (within what is allowed due to above)
- Integrated monitoring
- Observable
- Is the system itself healthy?
- What is the user facing status?
- Do I have the tools to see what is going on under the covers?
- Do I have tools to identify an unknown unknown problem?
- Configuration that is difficult (preferably impossible) to get out of sync.
- (Flexible)
- Sane defaults
- Extensive options
- Building-block config
- Clear and consistent mental model to the config
- Split control plane and data plane
- OOB access
- Key/cert based access
- Idempotent API/Config
- Auditable config
- Usable config system
- Easily spun up
- Lab purposes
- Ransomware recovery
- Life cycle
- Replacing hardware
- Hardware available
- Can the vendor do business?
- Is the vendor able to ship hardware?
- Can the vendor tell us how much we owe in support (in a reasonable time frame)?
- Can we predict how much we owe and what items we need?
Other notes / questions
- Ask for a packet walkthrough
See also
- IT Strategic Plan
- Strategic Recommendations for Cloud Computing
- Bryan Cantrill talks about priorities of Programming Languages (he calls them values, and what we call values he calls principles).
Services
This is a collection of the different ways a device can connect to the wireless network.
Full documentation is a work in progress, but for now, it includes high level information on the authentication used and mechanisms available to protect the network from a misbehaving device.
eduroam
eduroam is the primary wireless network at Virginia Tech.
Authentication
Virginia Tech users are authenticated with PEAP/MSCHAPv2. Because this is a thoroughly broken protocol, these credentials are used only for network authentication.
Network
All users, both VT affiliates and roaming users on VT's campus, land in vlan-users.
Remediation
We can remove a user or device from the network in two ways.
- Disable the credentials
- VT accounts can have the network entitlement removed, effectively revoking their authorization.
- By design, VT is unable to see the individual usernames for roaming users (e.g., a Radford user on VT's campus). We can, however, see what institution their account is from. Therefore, to revoke access, we need to contact the user's home institution. Since this is a process that can take some time and is not within our control, we can also block ALL authentication for that institution.
- Block the MAC address.
- This must be entered on each controller.
- The controller then denies all 802.11 authentication requests from that MAC, which prevents the device from even associating.
- This is becoming less effective as MAC randomization is increasing.
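Randomized MAC addresses set the locally administered bit (0x02 in the first octet), so a blocked or offending MAC can at least be checked for whether it is likely to change. The sketch below is a minimal check; the first example address is made up.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True when the locally administered bit of a MAC address is set.

    Randomized (private) Wi-Fi addresses set this bit, while burned-in
    vendor addresses normally do not.
    """
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    return bool(first_octet & 0x02)


print(is_locally_administered("da:a1:19:00:00:01"))  # made-up randomized address
print(is_locally_administered("00:1a:1e:03:01:98"))  # vendor-assigned address
```

A block on a locally administered address is likely to stop working once the device rotates its address, so disabling the credentials is the more durable option in that case.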
VT Open WiFi
The VT Open WiFi SSID is an open network with no captive portal.
This network should be used by devices that cannot or should not use eduroam. The main reasons for this are:
- The device cannot do 802.1X authentication (game consoles, Chromecasts, etc.).
- The device belongs to a group (e.g., department) rather than an individual, and thus does not have eduroam credentials.
- The user is a guest (and has no eduroam IdP).
Authentication
Users can connect and use the network with or without authentication. Only MAC auth is used, so no matter what, the client sees the network as an open, unauthenticated network. Currently, auth is handled by ClearPass, but it will soon be handled by an instance of FreeRADIUS.
Devices can be registered in the NIS Portal. Devices can be registered as a personal device or organizational device. Any registered device is put in the Authenticated network; all other devices are in the unauthenticated network.
Quarantine
This is not yet implemented, and is subject to change. Currently, if we need to prevent a device from connecting, it is blocked by MAC address on the controller.
The database backing the FreeRADIUS authentication will include a list of banned MAC addresses. Any device connecting with a banned MAC is placed in the unauthenticated network, irrespective of registration, and is put behind a captive portal.
This captive portal will be a static page without a network login. Instead, it will display a message saying the device has been blocked and that the user should contact 4Help.
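The planned placement logic can be summarized in a few lines. This is a sketch of the decision described above, not FreeRADIUS configuration; the function name and the use of plain sets for the registration and ban lists are assumptions for illustration.

```python
from enum import Enum


class Network(Enum):
    AUTHENTICATED = "authenticated"
    UNAUTHENTICATED = "unauthenticated"


def place_device(mac: str, registered: set, banned: set) -> tuple:
    """Return (network, captive_portal) for a connecting device.

    A banned MAC always lands in the unauthenticated network behind the
    captive portal, irrespective of registration.
    """
    mac = mac.lower()
    if mac in banned:
        return (Network.UNAUTHENTICATED, True)
    if mac in registered:
        return (Network.AUTHENTICATED, False)
    return (Network.UNAUTHENTICATED, False)


registered = {"00:11:22:33:44:55"}  # made-up addresses
banned = {"00:11:22:33:44:55"}
print(place_device("00:11:22:33:44:55", registered, banned))
```

The ban check comes first on purpose, so that registering a device can never lift a ban.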
Networks
Authenticated
Authenticated devices land in the same network as eduroam users and have no restrictions. Some service owners restrict access to on-campus networks, such as this one.
Devices get an RFC 1918 IPv4 address and a globally routed IPv6 address.
Unauthenticated
Unauthenticated devices land in the guest VRF.
Devices get a CG-NAT (100.64.0.0/10) IPv4 address and a globally routed IPv6 address.
This traffic is hair-pinned at the border and is effectively treated as Internet traffic.
There are no network ACLs artificially limiting access. However, there are services that require being connected to an "on campus" network to use them, which the unauthenticated network is not. Some services that do not work from the unauthenticated network include:
- Zoom rooms
- Digital key access for physical doors
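When troubleshooting a service that only works from on-campus networks, the client's IPv4 address gives a quick hint about which network it landed in. The sketch below only classifies the address space (CG-NAT versus RFC 1918) with Python's standard `ipaddress` module; it does not know about any particular VT subnet, and the labels it returns are made up for this example.

```python
import ipaddress

CGNAT = ipaddress.ip_network("100.64.0.0/10")
RFC1918 = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def classify_ipv4(address: str) -> str:
    """Return "cgnat", "rfc1918", or "other" for an IPv4 address."""
    ip = ipaddress.ip_address(address)
    if ip in CGNAT:
        return "cgnat"
    if any(ip in net for net in RFC1918):
        return "rfc1918"
    return "other"


print(classify_ipv4("100.64.1.5"))   # unauthenticated devices get CG-NAT space
print(classify_ipv4("172.29.0.11"))  # authenticated devices get RFC 1918 space
```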
Non-standard networks
These networks are not part of the "Virginia Tech network experience". They are deployed as work-arounds on an as-needed basis. The hope is that as the wireless service grows/evolves, these one-offs will go away.
VT_TIX
This network exists to get the wireless ticket scanners online for athletics. Because we do not have the proper RF coverage in and around the stadium, these devices cannot use the standard networks, as thousands of other clients would also try to associate, choking out the scanners.
Network
vlan-users
Special considerations
- APs with the VT_TIX network on them have only the VT_TIX network on them. This limits the use of the airspace.
- The network is hidden to prevent devices from automatically associating.
Long term plan
We need a full deployment of APs in and around the stadium (and other athletic areas) that will support the 65,000+ people who are there for game day. Once we have this, the scanners can use the registered device service.
Authentication
None.
Remediation
- If possible, just shut down the network (disable the virtual-ap profile).
- If it is not a good time for that, block the MAC on the controller.
VTEvent
Overview
VTEvent is for one-off events.
It can be used to get event staff or users online.
It is an open network.
It can be a hidden or visible network.
There is also an AP group for rapid deploy units.
Scope and Purpose
Often, there are events on campus where the standard networks are not suitable.
VTEvent fills this gap.
Deployments have a fixed start and end date/time.
Hidden Example
For example, during Relay4Life, the support staff needs a network in the
Drillfield.
Adding the VT Open WiFi network would not be suitable, as a rapid deployment unit could not handle the density of clients.
In this example, the hidden version of VTEvent should be deployed.
Visible example
Another example would be the SANS and VT-Hacks events, where the attendees need
to get all manner of hackerish and IoT devices online.
Normally, the registered device service would suit, but since this is an event,
we cannot guarantee everyone is a VT affiliate with access to the registered
device service.
In this example, the visible version of VTEvent should be deployed.
Support
This is a very simple service.
There is no authentication.
The VLAN/subnet is the same as is used for the other wireless services (vlan-users).
As such, the most likely place for something to go wrong is communication.
Here are a few cases where something is most likely to go awry.
Hidden SSID
The SSID may be hidden.
If so, the customer will need to type in the SSID exactly, including case.
There is no punctuation and there are no unexpected characters.
For reference, the SSID is VTEvent.
Limited time
Is it before the event started? Is the event over? If so, the network may not be broadcasting. It goes up and down at the times agreed upon by the customer and NEO.
Down APs
Unlike the other two examples, this one is a technical issue, not a communications issue. Usually, if the network is hidden, it will be on its own APs. As such, the problem may not be as obvious as normal. If the network is visible, it is probably broadcast from the same APs as the standard networks. This should help in determining if the APs are up.
Deployment and Cleanup
The nature of these events is that they are one-offs, so it is easy to miscommunicate or leave cruft.
Communicate with the customer
Be sure to communicate with the customer what the name of the SSID is, and if the network is hidden. If deploying the hidden version, the customer will need to type it in, so be verbose. Remember that SSIDs are case sensitive!
Don't do it manually
Doing it manually is a sure way to forget the cleanup. Use an existing tool to deploy and clean up the service. NetMRI is an excellent choice.
Create an ECO
Create an ECO for when the service is deployed and keep it open until it has been removed. Double check the tool really did clean up the service before closing the ECO.
AOS config
aaa profile
Both variations use the aaa profile aaa-open.
This has neither layer 2 nor layer 3 authentication.
The VLAN is undefined (it is set by the virtual-ap profile).
ssid-profile
There are two SSID profiles:
- ssid-VTEvent
- ssid-VTEvent-hidden
Both use the ESSID VTEvent, with the normal data rates used elsewhere. The only difference is that ssid-VTEvent-hidden is hidden.
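A quick way to confirm that two profiles differ only where intended is to diff their JSON objects. The sketch below uses abbreviated copies of the two SSID profiles from the Configuration section (the rate settings, which are identical, are omitted for brevity); the helper name and the `<absent>` marker are made up for this example.

```python
ABSENT = "<absent>"  # marker for a key that exists in only one profile


def profile_diff(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    return {
        key: (a.get(key, ABSENT), b.get(key, ABSENT))
        for key in sorted(a.keys() | b.keys())
        if a.get(key, ABSENT) != b.get(key, ABSENT)
    }


visible = {
    "essid": {"essid": "VTEvent"},
    "max_clients": {"max-clients": 150},
    "profile-name": "ssid-vtevent",
}
hidden = {
    "deny_bcast": {},
    "essid": {"essid": "VTEvent"},
    "hide_ssid": {},
    "max_clients": {"max-clients": 150},
    "profile-name": "ssid-vtevent-hidden",
}

for key, (left, right) in profile_diff(visible, hidden).items():
    print(key, left, right)
```

Besides the profile name, the differing keys are `hide_ssid` and `deny_bcast`, so in the JSON the hidden variant is expressed through two settings rather than one.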
virtual-ap profile
There are two virtual-ap profiles:
- vap-VTEvent
- vap-VTEvent-hidden
Again, they are exactly the same, except vap-VTEvent-hidden uses ssid-VTEvent-hidden. Neither has authentication (layer 2 or layer 3), and both use the normal wireless VLAN.
ap-group
There are two AP groups for rapid deployment:
- apg-vtevent
- apg-vtevent-hidden
The only virtual-ap configured is the appropriate VTEvent virtual-ap.
These AP groups are optimized for outdoor use (see the config below for details).
Configuration
/md/vt
{
"aaa_prof": [
{
"default_user_role": {
"role": "ur-open"
},
"profile-name": "aaa-open"
}
],
"ssid_prof": [
{
"a_basic_rates": {
"12": "12"
},
"a_beacon_rate": {
"a_phy_rate": "12"
},
"a_tx_rates": {
"12": "12",
"18": "18",
"24": "24",
"36": "36",
"48": "48",
"54": "54"
},
"advertise_ap_name": {},
"essid": {
"essid": "VTEvent"
},
"g_basic_rates": {
"12": "12"
},
"g_beacon_rate": {
"g_phy_rate": "12"
},
"g_tx_rates": {
"12": "12",
"18": "18",
"24": "24",
"36": "36",
"48": "48",
"54": "54"
},
"max_clients": {
"max-clients": 150
},
"mcast_rate_opt": {},
"profile-name": "ssid-vtevent"
},
{
"a_basic_rates": {
"12": "12"
},
"a_beacon_rate": {
"a_phy_rate": "12"
},
"a_tx_rates": {
"12": "12",
"18": "18",
"24": "24",
"36": "36",
"48": "48",
"54": "54"
},
"advertise_ap_name": {},
"deny_bcast": {},
"essid": {
"essid": "VTEvent"
},
"g_basic_rates": {
"12": "12"
},
"g_beacon_rate": {
"g_phy_rate": "12"
},
"g_tx_rates": {
"12": "12",
"18": "18",
"24": "24",
"36": "36",
"48": "48",
"54": "54"
},
"hide_ssid": {},
"max_clients": {
"max-clients": 150
},
"mcast_rate_opt": {},
"profile-name": "ssid-vtevent-hidden"
}
],
"virtual_ap": [
{
"aaa_prof": {
"profile-name": "aaa-open"
},
"drop_mcast": {},
"profile-name": "vap-vtevent",
"ssid_prof": {
"profile-name": "ssid-vtevent"
},
"vlan": {
"vlan": "vlan-user"
}
},
{
"aaa_prof": {
"profile-name": "aaa-open"
},
"drop_mcast": {},
"profile-name": "vap-vtevent-hidden",
"ssid_prof": {
"profile-name": "ssid-vtevent-hidden"
},
"vlan": {
"vlan": "vlan-user"
}
}
]
}
/md/vt/swva
{
"ap_a_radio_prof": [
{
"eirp_max": {
"eirp-max": 127
},
"eirp_min": {
"eirp-min": 127
},
"profile-name": "rpa-outdoor"
}
],
"ap_g_radio_prof": [
{
"eirp_max": {
"eirp-max": 127
},
"eirp_min": {
"eirp-min": 127
},
"profile-name": "rpg-outdoor"
}
],
"ap_group": [
{
"dot11a_prof": {
"profile-name": "rpa-outdoor"
},
"dot11g_prof": {
"profile-name": "rpg-outdoor"
},
"profile-name": "apg-vtevent",
"reg_domain_prof": {
"profile-name": "rdp-blacksburg"
},
"virtual_ap": [
{
"profile-name": "vap-vtevent"
}
]
},
{
"dot11a_prof": {
"profile-name": "rpa-outdoor"
},
"dot11g_prof": {
"profile-name": "rpg-outdoor"
},
"profile-name": "apg-vtevent-hidden",
"reg_domain_prof": {
"profile-name": "rdp-blacksburg"
},
"virtual_ap": [
{
"profile-name": "vap-vtevent-hidden"
}
]
}
]
}
Locally bridged networks
At some locations where we have deployed remote access points (RAPs), we want the traffic to stay local to where the AP is instead of coming back to campus.
These virtual AP profiles use the -bridged suffix.
Currently, this is only the case at GCAPS, where the local network is managed by VTTI, not central IT.
Deployment Info
Domain
VT's deployment of the Aruba Mobility system uses the domain mobility.nis.vt.edu.
All hostnames are relative to this domain.
For example, the hostname foo has the FQDN foo.mobility.nis.vt.edu and the hostname foo.dev has the FQDN foo.dev.mobility.nis.vt.edu.
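For scripts that consume the short hostnames used in this documentation, the expansion rule can be written as a small helper. This is only a sketch; the function name is made up, and leaving already-qualified names untouched is an added convenience rather than something stated above.

```python
DOMAIN = "mobility.nis.vt.edu"


def fqdn(hostname: str, domain: str = DOMAIN) -> str:
    """Expand a hostname relative to the mobility domain into an FQDN.

    Names that already end in the domain are returned unchanged.
    """
    hostname = hostname.rstrip(".")
    if hostname == domain or hostname.endswith("." + domain):
        return hostname
    return f"{hostname}.{domain}"


print(fqdn("foo"))
print(fqdn("foo.dev"))
```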
Configuration Hierarchy
Design
/
├── mm
│   └── mynode
└── md
    └── [org]
        └── [region]
            └── [cluster]
                └── [device]
- /, /mm, /md, and /mm/mynode are created by the system and cannot be removed
- / and /md should never be modified
Implementation
/
├── mm
│   ├── isb-mm-1
│   └── isb-mm-2
└── md
    └── vt
        ├── swva
        │   ├── bur
        │   │   ├── bur-md-1
        │   │   ├── bur-md-2
        │   │   ├── bur-md-3
        │   │   └── bur-md-4
        │   ├── col
        │   │   ├── col-md-1
        │   │   ├── col-md-2
        │   │   ├── col-md-3
        │   │   └── col-md-4
        │   ├── res
        │   │   ├── res-md-1
        │   │   ├── res-md-2
        │   │   ├── res-md-3
        │   │   └── res-md-4
        │   └── vtc
        │       ├── vtc-md-1
        │       └── vtc-md-2
        └── nova
            └── equinix
                ├── equinix-md-1
                └── equinix-md-2
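Since the hierarchy follows the fixed /md/[org]/[region]/[cluster]/[device] shape, the configuration path for any MD can be looked up mechanically. The sketch below hard-codes the /md branch of the tree above; the data structure and function name are assumptions for illustration and are not pulled from the conductor.

```python
# /md branch of the implementation tree: org -> region -> cluster -> devices
HIERARCHY = {
    "vt": {
        "swva": {
            "bur": [f"bur-md-{i}" for i in range(1, 5)],
            "col": [f"col-md-{i}" for i in range(1, 5)],
            "res": [f"res-md-{i}" for i in range(1, 5)],
            "vtc": [f"vtc-md-{i}" for i in range(1, 3)],
        },
        "nova": {
            "equinix": [f"equinix-md-{i}" for i in range(1, 3)],
        },
    },
}


def config_path(device: str) -> str:
    """Return the configuration node path for a mobility device."""
    for org, regions in HIERARCHY.items():
        for region, clusters in regions.items():
            for cluster, devices in clusters.items():
                if device in devices:
                    return f"/md/{org}/{region}/{cluster}/{device}"
    raise KeyError(device)


print(config_path("bur-md-3"))
```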
Configuration prefixes
Configuration Item | Prefix | Configuration tier |
---|---|---|
aaa authentication captive-portal | cp- | org |
aaa authentication dot1x | dot1x- | org |
aaa authentication mac | mac- | org |
aaa authentication-server radius | asr-<server>-<service> | mm/org |
aaa profile | aaa- | org |
aaa server-group | sg- | mm/org |
ap regulatory-domain-profile | rdp- | region |
ap-group | apg- | region |
ip access-list session (allows) | acl-allow- | org |
ip access-list session (denies) | acl-deny- | org |
ip access-list session (mixed/captive) | acl-control- | org |
lcc-cluster group-profile | lcc- | cluster |
mgmt-server profile | ms- | cluster |
netdestination6 | nd6- | org |
netdestination | nd- | org |
rf arm-profile | arm- | region |
rf dot11-6GHz-radio-profile | rp6- | region |
rf dot11a-radio-profile | rpa- , rp5- | region |
rf dot11g-radio-profile | rpg- , rp2- | region |
user-role | ur- | org |
vlan-name | vlan- | org |
wlan he-ssid-profile | hessid- | org |
wlan ht-ssid-profile | htssid- | org |
wlan ssid-profile | ssid- | org |
wlan virtual-ap | vap- | org |
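A naming convention is easiest to keep when it is checked automatically. The sketch below encodes a few rows of the table above and tests whether a profile name carries an allowed prefix; the dictionary is a partial, hand-copied subset and the function name is made up.

```python
# Partial copy of the prefix table: configuration item -> allowed prefixes
PREFIXES = {
    "aaa profile": ("aaa-",),
    "ap-group": ("apg-",),
    "rf dot11a-radio-profile": ("rpa-", "rp5-"),
    "rf dot11g-radio-profile": ("rpg-", "rp2-"),
    "user-role": ("ur-",),
    "vlan-name": ("vlan-",),
    "wlan ssid-profile": ("ssid-",),
    "wlan virtual-ap": ("vap-",),
}


def follows_convention(item: str, name: str) -> bool:
    """Return True when `name` starts with an allowed prefix for `item`."""
    return name.startswith(PREFIXES[item])


print(follows_convention("ap-group", "apg-vtevent"))
print(follows_convention("wlan ssid-profile", "vap-vtevent"))
```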
Production
Mobility Conductors
The devices formerly known as "Mobility Masters" (MMs).
Physical
These are in the process of being phased out.
- Model: hw-mm-10k
- VLAN: 100
- VRRP ID: 1
Hostname | Serial | MAC | IPv4 | IPv6 |
---|---|---|---|---|
isb-mm | | | 128.173.32.36 | 2607:b400:2:2000:0:173:32:36 |
isb-mm-1 | TWK7K3503H | 20:4c:03:8f:53:1a | 128.173.32.34/27 | 2607:b400:2:2000:0:173:32:34/64 |
isb-mm-2 | TWF5K350V3 | 20:4c:03:0e:e0:44 | 128.173.32.35/27 | 2607:b400:2:2000:0:173:32:35/64 |
Virtual
NOTE: The IPv4 addresses listed here are reserved, but not used.
- Model: MM-VA-10K
- VLAN: 115
- VRRP ID: 20
Hostname | Product key# | IPv4 | IPv6 |
---|---|---|---|
mm | | 198.82.169.229 | 2001:468:c80:210f:0:175:c1d7:3214 |
mm-1 | ? | 198.82.169.230/24 | 2001:468:c80:210f:0:18d:616:29ba/64 |
mm-2 | ? | 198.82.169.231/24 | 2001:468:c80:210f:0:179:c946:7349/64 |
Mobility Devices (MDs)
Burruss
Management
Hostname | Serial | MAC | IPv4 | IPv6 |
---|---|---|---|---|
bur-md-1 | DL0001328 | 00:1a:1e:03:01:98 | 172.16.1.141/25 | 2607:b400:66:6000:0:16:1:141/64 |
bur-md-2 | DL0001122 | 00:1a:1e:02:d8:b0 | 172.16.1.142/25 | 2607:b400:66:6000:0:16:1:142/64 |
bur-md-3 | DL0001099 | 00:1a:1e:02:d9:70 | 172.16.1.143/25 | 2607:b400:66:6000:0:16:1:143/64 |
bur-md-4 | DL0001321 | 00:1a:1e:03:00:a8 | 172.16.1.144/25 | 2607:b400:66:6000:0:16:1:144/64 |
- Model: A7240XM
- VLAN: 399
- AP Discovery VRRP: 172.16.1.150
- AP Discovery VRRPv6: 2607:b400:66:6000:0:16:1:150
- Out of Band: bur-oob-01.oob.cns.vt.edu
Cluster
Hostname | VRRP ID | IPv4 VIP | IPv6 VIP |
---|---|---|---|
bur-md-1 | 220 | 172.16.1.151 | 2607:b400:66:6000:0:16:1:151/64 |
bur-md-2 | 220 | 172.16.1.152 | 2607:b400:66:6000:0:16:1:152/64 |
bur-md-3 | 220 | 172.16.1.153 | 2607:b400:66:6000:0:16:1:153/64 |
bur-md-4 | 220 | 172.16.1.154 | 2607:b400:66:6000:0:16:1:154/64 |
vlan-guest
Hostname | VLAN ID | IPv4 | IPv6 |
---|---|---|---|
bur-md-1 | 800 | 172.25.8.11/22 | 2607:b400:a00:0:0:25:8:11/64 |
bur-md-2 | 800 | 172.25.8.12/22 | 2607:b400:a00:0:0:25:8:12/64 |
bur-md-3 | 800 | 172.25.8.13/22 | 2607:b400:a00:0:0:25:8:13/64 |
bur-md-4 | 800 | 172.25.8.14/22 | 2607:b400:a00:0:0:25:8:14/64 |
vlan-user
Hostname | VLAN ID | IPv4 | IPv6 |
---|---|---|---|
bur-md-1 | 1350 | 172.29.0.11/17 | 2607:b400:26:0:0:29:0:11/64 |
bur-md-2 | 1350 | 172.29.0.12/17 | 2607:b400:26:0:0:29:0:12/64 |
bur-md-3 | 1350 | 172.29.0.13/17 | 2607:b400:26:0:0:29:0:13/64 |
bur-md-4 | 1350 | 172.29.0.14/17 | 2607:b400:26:0:0:29:0:14/64 |
Coliseum
Management
Hostname | Serial | MAC | IPv4 | IPv6 |
---|---|---|---|---|
col-md-1 | DL0001121 | 00:1a:1e:02:d8:90 | 172.16.1.11/25 | 2607:b400:64:4000:0:16:1:11/64 |
col-md-2 | DL0001357 | 00:1a:1e:03:03:08 | 172.16.1.12/25 | 2607:b400:64:4000:0:16:1:12/64 |
col-md-3 | DL0001106 | 00:1a:1e:02:d8:f0 | 172.16.1.13/25 | 2607:b400:64:4000:0:16:1:13/64 |
col-md-4 | DL0001362 | 00:1a:1e:03:02:78 | 172.16.1.14/25 | 2607:b400:64:4000:0:16:1:14/64 |
- Model: A7240XM
- VLAN: 299
- AP Discovery VRRP: 172.16.1.20
- AP Discovery VRRPv6: 2607:b400:64:4000:0:16:1:20
- Out of Band: col-oob-05.oob.cns.vt.edu
Cluster
Hostname | VRRP ID | IPv4 VIP | IPv6 VIP |
---|---|---|---|
col-md-1 | 220 | 172.16.1.21 | 2607:b400:64:4000:0:16:1:21/64 |
col-md-2 | 220 | 172.16.1.22 | 2607:b400:64:4000:0:16:1:22/64 |
col-md-3 | 220 | 172.16.1.23 | 2607:b400:64:4000:0:16:1:23/64 |
col-md-4 | 220 | 172.16.1.24 | 2607:b400:64:4000:0:16:1:24/64 |
vlan-guest
Hostname | VLAN ID | IPv4 | IPv6 |
---|---|---|---|
col-md-1 | 801 | 172.25.16.11/22 | 2607:b400:a00:1:0:25:16:11/64 |
col-md-2 | 801 | 172.25.16.12/22 | 2607:b400:a00:1:0:25:16:12/64 |
col-md-3 | 801 | 172.25.16.13/22 | 2607:b400:a00:1:0:25:16:13/64 |
col-md-4 | 801 | 172.25.16.14/22 | 2607:b400:a00:1:0:25:16:14/64 |
vlan-user
Hostname | VLAN ID | IPv4 | IPv6 |
---|---|---|---|
col-md-1 | 1250 | 172.30.0.11/17 | 2607:b400:24:0:0:30:0:11/64 |
col-md-2 | 1250 | 172.30.0.12/17 | 2607:b400:24:0:0:30:0:12/64 |
col-md-3 | 1250 | 172.30.0.13/17 | 2607:b400:24:0:0:30:0:13/64 |
col-md-4 | 1250 | 172.30.0.14/17 | 2607:b400:24:0:0:30:0:14/64 |
Residential
Management
Hostname | Serial | MAC | IPv4 | IPv6 |
---|---|---|---|---|
res-md-1 | DL0001365 | 00:1a:1e:03:00:d8 | 172.17.1.11/24 | 2607:b400:64:ba00:0:17:1:11/64 |
res-md-2 | DL0001319 | 00:1a:1e:03:01:90 | 172.17.1.12/24 | 2607:b400:64:ba00:0:17:1:12/64 |
res-md-3 | DL0001387 | 00:1a:1e:03:11:10 | 172.17.1.13/24 | 2607:b400:64:ba00:0:17:1:13/64 |
res-md-4 | DL0001417 | 00:1a:1e:03:0f:f8 | 172.17.1.14/24 | 2607:b400:64:ba00:0:17:1:14/64 |
- Model: A7240XM
- VLAN: 3199
- AP Discovery VRRP: 172.17.1.20
- AP Discovery VRRPv6: 2607:b400:64:ba00:0:17:1:20
- Out of Band: col-oob-05.oob.cns.vt.edu
Cluster
Hostname | VRRP ID | IPv4 VIP | IPv6 VIP |
---|---|---|---|
res-md-1 | 220 | 172.17.1.21 | 2607:b400:64:ba00:0:17:1:21/64 |
res-md-2 | 220 | 172.17.1.22 | 2607:b400:64:ba00:0:17:1:22/64 |
res-md-3 | 220 | 172.17.1.23 | 2607:b400:64:ba00:0:17:1:23/64 |
res-md-4 | 220 | 172.17.1.24 | 2607:b400:64:ba00:0:17:1:24/64 |
vlan-guest
Hostname | VLAN ID | IPv4 | IPv6 |
---|---|---|---|
res-md-1 | 802 | 172.25.24.11/22 | 2607:b400:a00:10:0:25:28:11/64 |
res-md-2 | 802 | 172.25.24.12/22 | 2607:b400:a00:10:0:25:28:12/64 |
res-md-3 | 802 | 172.25.24.13/22 | 2607:b400:a00:10:0:25:28:13/64 |
res-md-4 | 802 | 172.25.24.14/22 | 2607:b400:a00:10:0:25:28:14/64 |
vlan-user
Hostname | VLAN ID | IPv4 | IPv6 |
---|---|---|---|
res-md-1 | 3200 | 172.31.0.11/17 | 2607:b400:b4:1800:0:31:0:11/64 |
res-md-2 | 3200 | 172.31.0.12/17 | 2607:b400:b4:1800:0:31:0:12/64 |
res-md-3 | 3200 | 172.31.0.13/17 | 2607:b400:b4:1800:0:31:0:13/64 |
res-md-4 | 3200 | 172.31.0.14/17 | 2607:b400:b4:1800:0:31:0:14/64 |
Equinix
Management
Hostname | Serial | MAC | IPv4 | IPv6 |
---|---|---|---|---|
equinix-md-1 | BB0001058 | 00:1a:1e:00:14:30 | 45.3.106.2/24 | 2607:b400:803:0:0:3:106:2/64 |
equinix-md-2 | BB0001964 | 00:1a:1e:00:99:70 | 45.3.106.3/24 | 2607:b400:803:0:0:3:106:3/64 |
- Model: A7220
- VLAN: 2701
- AP Discovery VRRP: N/A
- AP Discovery VRRPv6: N/A
- Out of Band: nvc-pbx-zpe.oob.vtnis.net
Cluster
Hostname | VRRP ID | IPv4 VIP | IPv6 VIP |
---|---|---|---|
equinix-md-1 | 220 | 45.3.106.4 | 2607:b400:803:0:0:3:106:4 |
equinix-md-2 | 220 | 45.3.106.5 | 2607:b400:803:0:0:3:106:5 |
- Authenticated vlan: 2700
- Unauthenticated vlan: 808
VTC
Management
Hostname | Serial | MAC | IPv4 | IPv6 |
---|---|---|---|---|
vtc-md-1 | DL0003369 | 00:1a:1e:04:b1:10 | 172.16.247.11/23 | 2607:b400:62:1400:0:16:247:11/64 |
vtc-md-2 | DL0003377 | 00:1a:1e:04:b1:18 | 172.16.247.12/23 | 2607:b400:62:1400:0:16:247:12/64 |
- Model: A7240XM
- VLAN: 100
- AP Discovery VRRP: 172.16.247.20
- AP Discovery VRRPv6: 2607:b400:0062:1400:0:16:247:20/64
- Out of Band: vtc-oob-01.oob.cns.vt.edu
Cluster
Hostname | VRRP ID | IPv4 VIP | IPv6 VIP |
---|---|---|---|
vtc-md-1 | 220 | 172.16.247.21 | 2607:b400:62:1400:0:16:247:11/64 |
vtc-md-2 | 220 | 172.16.247.22 | 2607:b400:62:1400:0:16:247:12/64 |
vlan-user
Hostname | VLAN ID | IPv4 | IPv6 |
---|---|---|---|
vtc-md-1 | 1750 | 172.20.24.2/22 | 2607:b400:2e:0:0:30:128:11/64 |
vtc-md-2 | 1750 | 172.20.24.3/22 | 2607:b400:2e:0:0:30:128:12/64 |
vlan-guest
Hostname | VLAN ID | IPv4 | IPv6 |
---|---|---|---|
vtc-md-1 | 811 | 172.25.48.11/23 | 2607:b400:a02:0:0:25:48:11/64 |
vtc-md-2 | 811 | 172.25.48.12/23 | 2607:b400:a02:0:0:25:48:12/64 |
Carilion networks
Hostname | Carilion-AppNet | Carilion-Wireless-WPA |
---|---|---|
VLAN ID | 327 | 305 |
vtc-md-1 | 172.16.185.3/24 | 172.16.226.3/23 |
vtc-md-2 | 172.16.185.4/24 | 172.16.226.4/23 |
Dev
Mobility Conductors
NOTE: The IPv4 addresses listed here are reserved, but not used.
- Model: MM-VA-500
- VLAN: 115
- VRRP ID: 239
Hostname | Product key# | IPv4 | IPv6 |
---|---|---|---|
mm.dev | | 198.82.169.232 | 2001:468:c80:210f:0:133:6fe8:c4ef |
mm-1.dev | MM603F362 | 198.82.169.233/24 | 2001:468:c80:210f:0:15c:3ecf:1a84/64 |
mm-2.dev | MM2D6D975 | 198.82.169.234/24 | 2001:468:c80:210f:0:1d2:4cad:7ff7/64 |
Mobility Devices
Coliseum
In band Management
Hostname | Serial | MAC | Model | IPv6 |
---|---|---|---|---|
col-md-5.dev | BB0002131 | 00:1a:1e:00:ab:38 | A7220 | 2607:b400:64:4000:0:16:1:15/64 |
col-md-6.dev | BB0002505 | 00:1a:1e:00:be:00 | A7220 | 2607:b400:64:4000:0:16:1:16/64 |
- VLAN: 299
- AP Discovery VRRP: 172.16.1.19
- AP Discovery VRRPv6: 2607:b400:64:4000:0:16:1:19
OOB Management
Hostname | IPv6 |
---|---|
col-md-5.dev | 2607:b400:e1:4000:0:0:0:15/64 |
col-md-6.dev | 2607:b400:e1:4000:0:0:0:16/64 |
col-md-7.dev | 2607:b400:e1:4000:0:0:0:17/64 |
Cluster
Hostname | VRRP ID | IPv4 VIP | IPv6 VIP |
---|---|---|---|
col-md-5.dev | 219 | 172.16.1.25 | 2607:b400:64:4000:0:16:1:25 |
col-md-6.dev | 219 | 172.16.1.26 | 2607:b400:64:4000:0:16:1:26 |
col-md-7.dev | 219 | 172.16.1.27 | 2607:b400:64:4000:0:16:1:27 |
vlan-guest
Hostname | VLAN ID | IPv4 | IPv6 |
---|---|---|---|
col-md-5.dev | 801 | 172.25.16.15/22 | 2607:b400:a00:1:0:25:16:15/64 |
col-md-6.dev | 801 | 172.25.16.16/22 | 2607:b400:a00:1:0:25:16:16/64 |
col-md-7.dev | 801 | 172.25.16.17/22 | 2607:b400:a00:1:0:25:16:17/64 |
vlan-user
Hostname | VLAN ID | IPv4 | IPv6 |
---|---|---|---|
col-md-5.dev | 1250 | 172.30.0.15/17 | 2607:b400:24:0:0:30:0:15/64 |
col-md-6.dev | 1250 | 172.30.0.16/17 | 2607:b400:24:0:0:30:0:16/64 |
col-md-7.dev | 1250 | 172.30.0.17/17 | 2607:b400:24:0:0:30:0:17/64 |
Lab
These are placeholder addresses, as these devices do not currently exist.
Management
Hostname | lab host | MAC | IPv4 | IPv6 |
---|---|---|---|---|
lab-md-1.dev | adder | | 172.16.19.131/28 | 2607:b400:62:6d40:0:16:19:131/64 |
lab-md-2.dev | cottonmouth | | 172.16.19.132/28 | 2607:b400:62:6d40:0:16:19:132/64 |
- Model: N/A
- VLAN: 1499
- AP Discovery VRRP: 172.16.19.135
- AP Discovery VRRPv6: 2607:b400:62:6d40:0:16:19:135
- Out of band: N/A
APs
- IPv4 subnet: 172.16.19.144/28
- IPv6 subnet: 2607:b400:62:6d80::/64
Central on Prem
As with other things, the domain is mobility.nis.vt.edu.
For example, the hostname central has the FQDN central.mobility.nis.vt.edu.
Hostname | Interface | IPv4 |
---|---|---|
central | ens1f0 | 198.82.169.222/24 |
central-node-1 | ens1f0 | 198.82.169.223/24 |
central-node-2 | ens1f0 | 198.82.169.224/24 |
central-node-3 | ens1f0 | 198.82.169.225/24 |
central-node-4 | ens1f0 | 198.82.169.226/24 |
central-node-5 | ens1f0 | 198.82.169.227/24 |
Additional VIP hostnames:
- central-central
- apigw-central
- ccs-user-api-central
- sso-central
POD IP Range: 10.0.0.0/16
Service IP Range: 10.1.0.0/16
iLO Configuration
Access credentials
- Local credentials only
- See password repository for details
Network
iLO Dedicated Network Port > IPv4:
- Not posting IPs because iLO is hella insecure. They are documented in the NEO password repo.
- DNS: 172.19.128.3
- IPv6 is currently not configured.
iLO Dedicated Network Port > SNTP:
- Disable DHCPv4/6 Supplied Time Settings
- Disable Propagate NTP Time to Host
- Primary Time Server: 172.19.131.253
- Secondary Time Server: conehead or grub
- Time Zone: Bogota, Lima, Quito, Eastern Time (US & Canada) (GMT-05:00:00)
NOTE: Changing SNTP values will likely require an iLO reset.
Monitoring
SNMP
Management > SNMP Settings:
- System location: ISB 118
- System contact: nis-wifi-g@vt.edu
- System role: Central on Prem
- System Role Detail: Node 1, Node 2, ...
- Disable SNMPv1
- SNMPv3 Users:
- Security Name: nisnmp
- See password repo for credentials
- User Engine ID: blank
- SNMP Alert Destinations:
- akips.nis.ipv4.vt.edu
- Trap Community: blank
- SNMP Protocol: SNMPv3 Inform
- SNMPv3 User: nisnmp
Syslog
Management > Remote Syslog:
- Enable iLO Remote Syslog
- Remote Syslog Port: 514
- Remote Syslog Server: akips.nis.ipv4.vt.edu
Disable iLO Federation
iLO Federation > Setup:
- Delete the default group
- Disable multicast options:
- iLO Federation Management
- Multicast Discovery
IPv6
IPv6 is not supported at all. There is no way to configure an IPv6 address. Not only that, but when configuring the network settings, we see:
Created symlink /etc/systemd/system/basic.target.wants/disable-ipv6.service → /etc/systemd/system/disable-ipv6.service.
SMTP
Allowlist for mailrelay.smtp.vt.edu:
198.82.169.222,central.mobility.nis.vt.edu,"Central on Prem neo-central@vt.edu","NIS"
198.82.169.223,central-node-1.mobility.nis.vt.edu,"Central on Prem neo-central@vt.edu","NIS"
198.82.169.224,central-node-2.mobility.nis.vt.edu,"Central on Prem neo-central@vt.edu","NIS"
198.82.169.225,central-node-3.mobility.nis.vt.edu,"Central on Prem neo-central@vt.edu","NIS"
198.82.169.226,central-node-4.mobility.nis.vt.edu,"Central on Prem neo-central@vt.edu","NIS"
198.82.169.227,central-node-5.mobility.nis.vt.edu,"Central on Prem neo-central@vt.edu","NIS"
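The allowlist rows follow a single pattern, so they can be generated instead of typed. This is a minimal sketch, assuming the cluster VIP comes first and the nodes follow on consecutive addresses, as in the rows above. The function name and its parameters are hypothetical.

```python
import ipaddress

CONTACT = "Central on Prem neo-central@vt.edu"
OWNER = "NIS"


def allowlist_lines(first_ip: str, node_count: int,
                    domain: str = "mobility.nis.vt.edu") -> list[str]:
    """Return allowlist rows: the cluster VIP first, then one row per node."""
    base = ipaddress.ip_address(first_ip)
    hosts = ["central"] + [f"central-node-{n}" for n in range(1, node_count + 1)]
    # Each host gets the next consecutive address after the VIP.
    return [
        f'{base + offset},{host}.{domain},"{CONTACT}","{OWNER}"'
        for offset, host in enumerate(hosts)
    ]


for line in allowlist_lines("198.82.169.222", 5):
    print(line)
```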
Parts for redundancy
iLO Administrator and firmware password
The iLO "Administrator" account uses a password derived from the baseboard serial number. This is done by the COP installation media. The same password is used for access to the firmware interface.
NOTE: This means that the serial numbers of the nodes are sensitive information! They are stored in the NEO password vault.
The script itself derives the password with the following commands (and some unnecessary file and variable creation...):
dmidecode -t baseboard \
| grep Serial \
| grep -o '[^ ]\+$' \
| md5sum \
| grep -Eo '^[^ ]+' \
| cut -c1-8
We can simplify this to:
dmidecode -s baseboard-serial-number | md5sum | head -c 8
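For reference when only the serial number is at hand (for example, from the password vault) and not the node itself, the same derivation can be sketched in Python. The function name is hypothetical. The important detail is that the shell pipeline hashes the serial number together with its trailing newline.

```python
import hashlib


def derive_ilo_password(baseboard_serial: str) -> str:
    """First 8 hex characters of the MD5 of the serial number plus a newline.

    The newline is deliberate: md5sum in the shell pipeline hashes the whole
    line printed by dmidecode (or grep), including the line ending.
    """
    line = baseboard_serial.strip() + "\n"
    return hashlib.md5(line.encode()).hexdigest()[:8]
```

As with the serial numbers themselves, treat any output of this function as sensitive.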
Managing the RAID from a live environment
HPE has a variation of secure boot enabled, so we cannot just boot to whatever we want. However, secure boot is just looking for something signed by Canonical... so just grab Ubuntu and be off. Other distros signed with common keys may or may not work, but COP is built on Ubuntu 18.04, so that is the least likely to cause issues.
Unlike the COP ISO, the Ubuntu image can be dd'd to a USB drive to create bootable media.
iLO can also be used to mount virtual media to boot from.
Add HPE repositories
The ssacli utility allows us to reconfigure the RAID setup.
The best way to get this is by adding the Management Component Pack from the HPE Software Delivery Repository.
/etc/apt/sources.list.d/mcp.list:
# HPE Management Component Pack
deb https://downloads.linux.hpe.com/SDR/repo/mcp bionic/current non-free
Now, install the keys:
curl https://downloads.linux.hpe.com/SDR/hpPublicKey2048.pub | sudo apt-key add -
curl https://downloads.linux.hpe.com/SDR/hpPublicKey2048_key1.pub | sudo apt-key add -
curl https://downloads.linux.hpe.com/SDR/hpePublicKey2048_key1.pub | sudo apt-key add -
Then update the repositories:
sudo apt update
Convert array to RAID 10
This will take a long time. If building a new system, create a new array instead of migrating an existing one.
# ssacli
=> ctrl slot=0 ld 1 add drives=allunassigned
=> ctrl slot=0 ld 1 show status
logicaldrive 1 (3.49 TB, RAID 0): Transforming, 0.83%
=> ctrl slot=0 ld 1 show status
logicaldrive 1 (3.49 TB, RAID 0): Transforming, 0.83%
=> ctrl slot=0 ld 1 modify raid=1+0
=> ctrl slot=0 ld 1 show status
logicaldrive 1 (3.49 TB, RAID 1+0): Transforming, 0.07%
=> ctrl slot=0 ld 1 show status
logicaldrive 1 (3.49 TB, RAID 1+0): OK
=>
Build a new RAID 10 array
This is a destructive process, but much faster than migrating an array. It is necessary to install COP from an ISO afterwards.
# ssacli
=> ctrl slot=0 ld 1 delete
[confirm]
=> ctrl slot=0 create type=ld drives=allunassigned raid=1+0
=>
Drive replacement (RAID 0)
A failed drive in a RAID 0 array is catastrophic, thus re-installing COP from the ISO afterwards is required.
- Physically replace the bad drive with a good one
- Reboot the system
- Press F9 during the boot to enter System Utilities, a BIOS-like environment. You may need to press F1 to continue past the warning message (telling you a drive has failed and been replaced).
- Select "System Configuration"
- Select "Embedded RAID 1: HPE Smart Array P408i-a SR Gen 10"
- Select "Array Configuration"
- Select "Manage Arrays"
- Select "Array A"
- Select "List Logical Drives"
- Select "Logical Drive 1 (...)"
- Select "Re-Enable Logical Drive"
- Confirm that you want to Re-Enable the Logical Drive. We are not expecting the data to be recoverable.
- Exit the menus until you can exit the system utilities. Re-enabling the array does not count as a change, so there is no need to save.
Management
This is the stuff that helps us manage the wireless network. Various tools, automation, etc.
Down APs
Tools like AirWave or AKiPS will discover what APs are on the network and let us know when something goes down. This is good, but it doesn't tell us if the AP is expected to be down, replaced, or if a new AP has never come up. That is, it doesn't capture intent.
ATLAS is the authoritative source for intent and what is expected. The controllers are the authoritative source for what is reality.
Possible discrepancies
A non-exhaustive list of things that could be wrong:
- An AP is down
- Listed in Atlas
- Not listed or down on the controller
- Different AP is present
- MAC does not match
- AP was not removed
- Not listed in Atlas (at least, not in the list of what is expected)
- Is listed on the controller
Script
Here is a start to a script that does this comparison. Notably, it does not yet talk to ATLAS. Without real data, it is of limited use, even as a PoC.
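Independent of the script mentioned above, the comparison itself can be sketched as follows. This is a minimal illustration, not the team's script. It assumes both sources can be reduced to dictionaries keyed by AP name, and the function name and data shapes are hypothetical.

```python
def compare_aps(expected: dict[str, str],
                observed: dict[str, tuple[str, bool]]) -> dict[str, list[str]]:
    """Compare intent (name -> MAC) with reality (name -> (MAC, is_up)).

    expected: what ATLAS says should exist.
    observed: what the controllers report.
    """
    report: dict[str, list[str]] = {"down": [], "mac_mismatch": [], "not_removed": []}
    for name, mac in expected.items():
        seen = observed.get(name)
        if seen is None or not seen[1]:
            # Listed in ATLAS, but not listed or down on the controller.
            report["down"].append(name)
        elif seen[0].lower() != mac.lower():
            # A different AP is present under the expected name.
            report["mac_mismatch"].append(name)
    for name in observed:
        if name not in expected:
            # On the controller, but not in the list of what is expected.
            report["not_removed"].append(name)
    return {kind: sorted(names) for kind, names in report.items()}
```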
AP Provisioning
AP Provisioning is automated with some code written by the NIS dev team. It is triggered two different ways: on demand and scheduled.
Core
Information gathering
The code ingests the MAC address of an AP. It queries the MM to determine the AP's:
- name
- group
- AAC
It then queries the AAC to get the AP's LLDP neighbor information, where it finds:
- The name of the switch
- The interface description
NOTE: For the AAC, the MM returns the IP address as used by the AP. This IP address is how the code connects to the AAC.
Generating the name and group
It uses the LLDP information to determine the building abbreviation and the HLINK, if it exists. This is used to determine the expected group and name.
The AP name takes the form BLDG-HLINK, where:
- BLDG is the uppercase building abbreviation
- HLINK is the HLINK (or link identifier for older installations)
The AP group takes the form apg-bldg, where:
- apg is a fixed prefix
- bldg is the lowercase building abbreviation
Edge cases
- The HLINK may not exist yet (this is particularly common in new installations). In this case, the AP's MAC address is used in place of an HLINK.
- The AP may already be provisioned in a custom group.
If the AP's current group is of the form apg-bldg-foo, where -foo is some suffix beginning with a -, then this is considered a match, and the program will not move the AP to a different group.
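The naming rules and both edge cases can be sketched as below. This is an illustration of the rules as written here, not the dev team's code. The function names are hypothetical, and the exact formatting of the MAC address when it stands in for the HLINK is an assumption.

```python
from typing import Optional


def expected_name(bldg: str, hlink: Optional[str], mac: str) -> str:
    """BLDG-HLINK, falling back to the MAC when no HLINK exists yet."""
    ident = hlink if hlink else mac  # assumption: MAC is used as given
    return f"{bldg.upper()}-{ident}"


def expected_group(bldg: str) -> str:
    """apg-bldg, with the building abbreviation in lowercase."""
    return f"apg-{bldg.lower()}"


def group_matches(current_group: str, bldg: str) -> bool:
    """True for apg-bldg itself and for custom groups such as apg-bldg-foo."""
    base = expected_group(bldg)
    return current_group == base or current_group.startswith(base + "-")
```

Requiring the `-` after the base name keeps a group for a different building with a longer abbreviation from being treated as a match.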
Creating a group
When the program provisions an AP, it checks to make sure that the AP group exists. If it does not, it creates a group at the region level (see Configuration Hierarchy) which looks like:
{
"dot11a_prof": {
"profile-name": "rpa-default"
},
"profile-name": "apg-bldg",
"reg_domain_prof": {
"profile-name": "rdp-blacksburg"
},
"virtual_ap": [
{
"profile-name": "vap-eduroam"
},
{
"profile-name": "vap-vtopenwifi"
}
]
}
Regulatory domain
The regulatory domain is chosen based on the AAC prefix.
- If the AAC starts with col, bur, or res, then the RDP is set to rdp-blacksburg.
- If the AAC starts with vtc, then the RDP is set to rdp-roanoke.
- If the AAC starts with nvc, then the RDP is not set.
- If the AAC starts with anything else, the group is not created.
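The prefix rules combine with the group template from the previous section roughly as follows. This is a sketch under assumptions: the AAC is identified by a hostname such as col-md-1, "not set" means the reg_domain_prof key is omitted, and the function names are hypothetical.

```python
from typing import Optional

# AAC hostname prefix -> regulatory domain profile (None: create group, no RDP)
RDP_BY_PREFIX = {
    "col": "rdp-blacksburg",
    "bur": "rdp-blacksburg",
    "res": "rdp-blacksburg",
    "vtc": "rdp-roanoke",
    "nvc": None,
}


def regulatory_domain(aac_name: str) -> Optional[str]:
    """Return the RDP name, or None when no RDP should be set.

    Raises ValueError for an unknown prefix, where no group is created.
    """
    for prefix, rdp in RDP_BY_PREFIX.items():
        if aac_name.lower().startswith(prefix):
            return rdp
    raise ValueError(f"unknown AAC prefix, group not created: {aac_name}")


def group_payload(bldg: str, aac_name: str) -> dict:
    """Build the AP group body shown above for one building."""
    rdp = regulatory_domain(aac_name)
    payload = {
        "dot11a_prof": {"profile-name": "rpa-default"},
        "profile-name": f"apg-{bldg.lower()}",
        "virtual_ap": [
            {"profile-name": "vap-eduroam"},
            {"profile-name": "vap-vtopenwifi"},
        ],
    }
    if rdp is not None:
        payload["reg_domain_prof"] = {"profile-name": rdp}
    return payload
```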
On Demand
- AP boots
- MD sends an SNMP trap to AKiPS
- Provisioning app periodically (every 2s) pulls trap events from AKiPS (specifically, the host akips.nis.vt.edu)
- waits 5 minutes
- App looks up the AAC for that MD from the MM
- probably with the show ap detail wired-mac xx:xx:xx:xx:xx:xx command (via api)
This is how APs are provisioned when they are deployed. This also fixes APs that are moved to a new location.
- tool checks akips every 2s
- events are added with a 5 min delay
- 4 attempts at a 5 minute interval before giving up
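The timing in the list above works out as sketched below. This assumes the first attempt happens when the 5 minute delay expires and that the remaining attempts follow at 5 minute intervals. The constant and function names are hypothetical.

```python
from datetime import datetime, timedelta

INITIAL_DELAY = timedelta(minutes=5)   # events are added with a 5 min delay
RETRY_INTERVAL = timedelta(minutes=5)  # spacing between attempts
MAX_ATTEMPTS = 4                       # give up after this many attempts


def attempt_times(trap_seen: datetime) -> list[datetime]:
    """Times at which provisioning is attempted for one trap event."""
    first = trap_seen + INITIAL_DELAY
    return [first + n * RETRY_INTERVAL for n in range(MAX_ATTEMPTS)]
```

Under these assumptions, an AP that is still not provisionable 20 minutes after its trap is left for the scheduled run.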
Scheduled
The reconciler runs at 06:00 ET daily. It pulls the AP database from the MM. It builds a list of APs that are incorrectly provisioned and runs the core process on them. This is how we get APs to have the correct name when the HLINK is assigned after the AP is deployed.
This process is limited to 20 APs per day.
Work order process
From earlyb:
The provisioning piece doesn't talk to Atlas at all. There is a WAP inventory job that does talk to Atlas. I don't remember exactly what that job does, but I think it generates a report of mismatches between the network and Atlas.
Limitations
Also from earlyb:
The only thing I can think of is that the provisioner is unable to talk to any controller that only has an IPv6 address. The docker swarm where it's deployed apparently has some problem reaching those addresses. This may be resolved in the future when we shift where it's deployed. Or maybe not.
Logs
- Currently in the ELK stack
- log_aaa-* index
- 1s precision, look at the timestamp in the log
- instance.name:orca-job-prod_wap-provision-*
- fields.group: laa.nis.docker
Compromised user account
We occasionally get a request from ITSO to disable a user account and disconnect all associated network sessions. This is the procedure on how to do that for Wi-Fi sessions.
Find active sessions
Log into the mobility conductor (MC, previously called mobility master (MM)) via ssh, and use the show global-user-table command:
(isb-mm-1) [mynode] #show global-user-table list name PID
Global Users
------------
IP MAC Name Current switch Role Auth AP name Roaming Essid Bssid Phy Profile Type User Type
---------- ------------ ------ -------------- ---- ---- ------- ------- ----- ----- --- ------- ---- ---------
2607:b400:24:0:1234:5678:9abc:def c6:ea:aa:11:22:33 PID@vt.edu 172.16.1.11 ur-vt 802.1x SQUIR-238BA1077Q Wireless eduroam 48:2f:6b:a3:35:40 2.4GHz-HE aaa-eduroam N/A WIRELESS
fe80::ab:cdef:123:4abc c6:ea:aa:11:22:33 PID@vt.edu 172.16.1.11 ur-vt 802.1x SQUIR-238BA1077Q Wireless eduroam 48:2f:6b:a3:35:40 2.4GHz-HE aaa-eduroam N/A WIRELESS
2607:b400:24:0:123:4567:89ab:cdef c6:ea:aa:11:22:33 PID@vt.edu 172.16.1.11 ur-vt 802.1x SQUIR-238BA1077Q Wireless eduroam 48:2f:6b:a3:35:40 2.4GHz-HE aaa-eduroam N/A WIRELESS
172.30.123.195 c6:ea:aa:11:22:33 PID@vt.edu 172.16.1.11 ur-vt 802.1x SQUIR-238BA1077Q Wireless eduroam 48:2f:6b:a3:35:40 2.4GHz-HE aaa-eduroam N/A WIRELESS
Total entries = 4
Searching by the PID will return results for both PID (e.g., registered devices) and PID@vt.edu (e.g., eduroam).
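The table has one row per IP address, so a single device usually appears several times. A small sketch for reducing pasted output to unique MAC addresses and the MDs they are on, assuming the column order shown above (IP, MAC, Name, Current switch); the function name is hypothetical.

```python
import re

MAC_RE = re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}", re.IGNORECASE)


def sessions_by_mac(table_text: str) -> dict[str, set[str]]:
    """Map each client MAC to the set of 'Current switch' values it appears on."""
    result: dict[str, set[str]] = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows start with: IP, MAC, Name, Current switch.
        # Headers, separators, and the prompt fail the MAC check on field 2.
        if len(fields) >= 4 and MAC_RE.fullmatch(fields[1]):
            result.setdefault(fields[1].lower(), set()).add(fields[3])
    return result
```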
Terminate the sessions
For each unique MAC address listed in the previous step, use the aaa user delete command to end the sessions.
Note that deleting by the username from the MC is not currently supported.
(isb-mm-1) [mynode] #aaa user delete name PID
This command is not currently supported
(isb-mm-1) [mynode] #aaa user delete mac c6:ea:aa:11:22:33
Users will be deleted at MDs. Please check show CLI for the status
(isb-mm-1) [mynode] #show aaa user-delete-result
Summary of user delete CLI requests !
Current user delete request timeout value: 300 seconds
aaa user delete mac c6:ea:aa:11:22:33 , Overall Status- Response pending , Total users deleted- 0
MD IP : 172.16.1.11, Status- Complete , Count- 0
MD IP : 172.16.1.12, Status- Complete , Count- 0
MD IP : 172.16.1.13, Status- Complete , Count- 0
MD IP : 172.16.1.14, Status- Complete , Count- 0
MD IP : 172.16.1.141, Status- Complete , Count- 0
MD IP : 172.16.1.142, Status- Complete , Count- 0
MD IP : 172.16.1.143, Status- Complete , Count- 0
MD IP : 172.16.1.144, Status- Complete , Count- 0
MD IP : 172.17.1.11, Status- Complete , Count- 0
MD IP : 172.17.1.12, Status- Complete , Count- 0
MD IP : 172.17.1.13, Status- Complete , Count- 0
MD IP : 172.17.1.14, Status- Complete , Count- 0
MD IP : 0.0.0.0, Status- Response pending , Count- 0
MD IP : 0.0.0.0, Status- Response pending , Count- 0
MD IP : 172.16.236.151, Status- Complete , Count- 0
MD IP : 172.16.236.152, Status- Complete , Count- 0
You may notice in that example that the VTC controllers, which connect over IPv6 (shown as MD IP : 0.0.0.0), still have the response pending. This seems to be a bug.
To work around this bug, log into the appropriate MDs (reference the "Current switch" column in the global users table) and run the same command.
(col-md-1) #aaa user delete mac c6:ea:aa:11:22:33
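When checking the result for several MAC addresses, it can help to summarize the show aaa user-delete-result output instead of reading every MD line. This is a sketch that parses pasted output in the format shown above; the function name is hypothetical.

```python
import re

# Matches lines such as: "MD IP : 172.16.1.11, Status- Complete , Count- 0"
MD_LINE = re.compile(
    r"MD IP\s*:\s*([^,\s]+)\s*,\s*Status-\s*(.*?)\s*,\s*Count-\s*(\d+)"
)


def summarize_delete_result(output: str) -> tuple[int, int]:
    """Return (total users deleted, number of MDs that are not yet Complete)."""
    deleted = 0
    pending = 0
    for _ip, status, count in MD_LINE.findall(output):
        deleted += int(count)
        if status != "Complete":
            pending += 1
    return deleted, pending
```

A non-zero pending count that never clears is the cue to repeat the command on the MDs directly, as described above.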
Enhancements
These are in no particular order. Small stuff listed below. Bigger items get their own pages (see left).
blacklist script
- update local script with no ap ap-blacklist-time command
- potentially work with devs to create orchestra app
Opportunistic Wireless Encryption (OWE)
- Tested as working on AP-225
- Not actually supported on AP-2xx
- update clearpass to expect _owetm_ prefix and _951c89ea suffix
- disable in VTC, due to high number of existing SSIDs
Automate AP provisioning
- Related AUA
- Need to set up a trap listener. Preferably, the web app would do this itself, instead of us translating SNMP to REST before sending to the web app.
- Traps should be sent over v6.
PAPI authentication
See ArubaOS 8.7.0.0 User Guide page 783.
Split management plane and data plane
- Make it so the controllers can only be managed from the management network.
- Make sure we are not poking holes into the management network in the process.
- Potentially the same with CPPM
Central on Prem
Mostly a to-do list, but also just ideas we might want to implement in the future.
System monitoring
- Drive failure
- PSU failure (not tested, but should work the same way as drive failures)
- Temp alerts (not tested, but should work the same way as drive failures)
- System resources
- CPU load
- Memory usage
- Disk IO
iLO
- snmp
- syslog
- add os agent? it may help with disk monitoring and such (we get what we want without this)
- LDAP login
iLO network
- IPv6
- remote console? not what I thought. Would have been an extra, anyways.
- disable iLO federation
- document iLO config / setup
Misc
- AKiPS groups
- Nodes
- Cluster
- iLO
EAP-TLS Project
Pre-project discussion
Timeline
- Target production date: Fall 2022
- Transition: Summer II - Fall 2022
- Maybe a transition period
- Maybe a transition point
- Onboard in service: Jan 2022
Transition options
Hard cutover
Dual profile
Dual auth
Draft of project scope
- Stake holders
- Major milestones
- Anticipated resources
- Budget
- What work needs to be set aside to get this done
External Resources
- Liberty University
- TJ Norton
- In process of switching to EAP-TLS
- Onboarding tool is SecureW2
- UNC (Ryan Turner)
blockers
- On boarding tool
- CA for users
- CA for auth server
Questions:
- Do we want on-boarding as a cloud SaaS?
- Do we care if the pki is in the cloud?
- Define what the cert actually asserts
- Creating a trust relationship between a device and the entity VT
- Associating a user/entity/org with that device
- Define a CPS
- Do we have a crl/ocsp? (prolly not)
- What attributes does the root CA need?
Endpoint management
We want to be able to integrate with:
- JAMF (macOS)
- InTune/AD (Windows)
- Bigfix
- macOS
- Windows
- Optional
Challenges
Certificate management
Something needs to issue the client and server certs. InCommon is ill suited for both. See the preproject page for more discussion.
Onboarding
A tool is needed to work well for BYOD and managed devices. These may not be the same tool.
Apple CNA
Apple uses a limited browser for captive portals. This can interfere with the profile provisioning tool.
Relevant educause discussion
On-boarding tool
Objective
On-board a device to the VT wireless network. This establishes trust that a device belongs to a particular entity (user or organization).
Necessity
fn main() {
    let project = 42;
    let tool = Tool {
        works: true,
        easy: true,
    };
    if !(tool.works && tool.easy) {
        drop(project);
    } else {
        // println!("https://www.youtube.com/watch?v=ZXsQAXx_ao0");
        println!("Let's go!");
    }
}

struct Tool {
    works: bool,
    easy: bool,
}
Values
Roughly in order:
- Interoperable: cross-platform across all major platforms
- Usable: easy to use
- Robust: hard to get wrong
- Maintainable: easy to update to keep up with new demands
- Interoperable: integrate with other tools
- Supportable:
On-boarding Tool Requirements
These are the things we will be looking for in deciding on a tool. Obviously, cost is also a consideration.
MUST have
Tools that do not meet these criteria will not be considered. These are the things that we would rather not deploy EAP-TLS than compromise on.
front end
- Platform support
- Windows 10
- Windows 11
- macOS
- iOS
- Android
- including Android 11, December 2020 patch
- manual install (Linux devices)
- Easier to use than:
- non-sponsored guest (taking into account re-registering every day)
- Current PEAP/MSCHAPv2 process (with unknown password)
- SSO integration
- remove and/or replace old profiles
back end
- Per device certs
- Certs issued to:
- User
- Organization
- Setup correct trust of server
- Set specific CA
- Set leaf CommonName / subjectAltName
- Stupidly long client cert lifetime (e.g., 50 years)
- No cloud PKI
- Ability to expand to external CA
SHOULD have
We would rather deploy without these than not deploy, but we aren't going to be happy about it.
front end
- Easier to use than:
- non-sponsored guest (not taking into account re-registering every day)
- Current PEAP/MSCHAPv2 process (with known password)
- vt.edu URL
back end
- Internal CA (with an intermediate root)
- ECC certs (P-256, or ed25519)
Low priority niceties
Extras that in particular will make future expansions of the service better.
- Passpoint support
- AD integration
- Multiple root CA support
- ed25519 support
Contenders
SecureW2
Based on feedback from peers, this is the most likely candidate. It works well and is a reasonable cost.
ClearPass Onboard
Again, based on the feedback of peers, this seems to be an excellent product, possibly better than SecureW2, but is very expensive. Even the vendor admits that it is priced too high.
Nonetheless, given we already have a CPPM instance running, it is worth taking a look at it.
Honorable mentions
eduroamCAT and geteduroam
Notably, it does not seem to support macOS, which makes it a non-starter.
Open-source, community-driven project, with all the good and bad that comes with that. It would definitely be more effort to setup, probably more than we care to do.
Links:
Ruckus XpressConnect
Notably, we used to run XpressConnect before ditching it in favor of... nothing (with eduroamCAT as a backup). It is not likely that we are going to move back to it.
Sectigo Mobile Certificate Manager
Middleware is considering this as an option for an internal CA. It appears to have a certificate provisioning component as well.
Concerns:
- Middleware seems to be leaning toward using AWS as CA service.
- It seems prudent to not tie the on-boarding tool to the CA we are using.
- It is not clear if this will work for non-mobile platforms (e.g., Windows, macOS)
Reference [pdf][secitgo].
Authentication
Auth from a cloud service?
No. Right now our cloud exit strategy is "don't exit". The ongoing cost to maintain the eduroam authentication service is fantastically little. This makes for an easy tradeoff between up-front engineer time and a perpetual bill from a service provider (not to mention a soft vendor lock-in).
IPv6
The goal is to be able to remove any legacy IP address from the mobility infrastructure.
The expectation is that the MMs and the APs will be able to hit this mark. MDs will need a legacy IP address on VLANs that are serving a captive portal to clients, and possibly to assist with multicast discovery.
Current status
service | bur | col | res | nvc | vtc |
---|---|---|---|---|---|
dns | 4 | 4 | 4 | 4 | 4,6 |
ntp | 4 | 4 | 4 | 4 | 4,6 |
syslog | 4 | 4 | 4 | 4 | 4 |
snmp traps | 4 | 4 | 4 | 6 | 6 |
cppm auth | 4 | 4 | 4 | 4 | 4 |
cp redirect | 4 | 4 | 4 | 4,6 | 4,6 |
user interface (bfv) | 4 | 4 | 4,6 | 4,6 | 4,6 |
user interface (guest) | 4,6 | 4,6 | 4,6 | 4,6 | 4,6 |
mgmt interface | 4,6 | 4,6 | 4,6 | 4,6 | 4,6 |
cluster (md-md) | 4 | 4 | 4 | 4 | 4 |
mm-md (masterip cmd, ipsec tunnel) | 4 | 4 | 4 | 4 | 4 |
AP | 4 | 4 | 4 | 4 | 4 |
captive portal server group | 4 | 4 | 4 | 4 | 4 |
Post-Mortem
Motivation
The primary goal of the out-of-band (OOB) management network is that the devices are remotely manageable in the case of disaster, when the rest of the network is not functional, and thus service can be restored.
The secondary goal/benefit of the OOB management network is for security. Isolating management to only the OOB network significantly reduces the attack surface of the equipment.
Counter motivation
The first goal is irrelevant, because the Wi-Fi network is an overlay. If the equipment is not reachable it is because the underlay is not working, and we'll be fixing that first. Two notes here:
- Administrators of the Wi-Fi network need some kind of network connectivity that isn't the VT Wi-Fi network, which is trivial. A wired adapter, home ISP, mobile hotspot, any of these will do.
- To address the case of a device with an unusable network configuration (e.g., the out of box config), they still need some kind of non-network access (i.e., serial), though that access can be reachable through network resources. Indeed, serial connection accessed through the OOB network is already part of our standard setup.
The second goal strongly implies (though doesn't strictly require) that the management of a device is isolated to that device. This is not the case with the Wi-Fi infrastructure. The configuration is all done on the MC, which is pushed to the MDs, which is in turn pushed to the APs.
More critically, there is a need for the management to have a clear separation from the production and support network. An overlay design does not lend itself to this, and sure enough, it does not exist in the wireless controllers. In particular, the controllers do not have multiple routing tables, which makes it extremely difficult if not impossible to separate the different network planes.
In particular:
- user traffic is carried to the MD inside a tunnel
- MDs in a cluster build a tunnel and have a host specific route to each other
- MDs build a tunnel and have a host specific route to the MCs
This means any wireless user can reach† the management of the MC and any MD in the cluster they are connected to. This could be stopped with a client ACL, but it must:
- be applied to every role
- enumerate every address (including IPv6 link local!) on every controller
This is obviously error prone and a fair bit of work, all to accomplish a secondary goal. And we still end up with a design that gives only a weak assurance of this goal (e.g., have we found every path into the management plane? Probably not.)
† Can reach the L4 management interface that is. Obviously, L7 still needs auth(z).
Out-Of-Band Management
Logical Diagram
Data paths
- MDs join clusters with the in band management address
lc-cluster group-profile "lcc-foo"
controller-v6 <blue> priority 255 mcast-vlan 0 vrrp-ip-v6 <blue> vrrp-vlan <blue> group <#>
- APs connect to cluster on in band management
- In band mgmt and user networks are trunked over the same port channel.
- MD controller IP is in band mgmt
masteripv6 ... interface-f vlan-f <blue> ...
controller-ipv6 vlan <blue> address <blue>
- mgmt auth (i.e., netadmin) for MDs happens on OOB mgmt
- user auth (e.g., eduroam) happens on in band mgmt
- MC-MD management happens inside the IPsec tunnel that gets built over the in band management.
Questions
- How do we prevent mgmt login from non-OOB mgmt networks? If we can't do this, we haven't actually done anything.
- Force management to ports 22 and 4343, and only allow these on OOB
- AP-MD and MD-MC management is done through a tunnel, thus not stopped by these ACLs. This is good for the purposes of getting things to work, but kinda violates the principles we are after to begin with.
- Captive portals use ports 80 and 443 and we can force HTTPS management to exclusively 4343. This lets us expose a L7 distinction in L4. Again, this functions, but eww.
- How many captive portal users are legacy only? Do we need this legacy address?
- Can we do no legacy addresses?
- No. At the least, we need legacy addresses for RAPs.
- Can we add members to a cluster by an IP that is not the controller IP?
- Yes
- Do we want to keep a legacy address on in band mgmt to give us time to migrate APs? (And to have fewer changes at once)
- Yes. Let's make fewer changes at once.
TODO
conehead/grub
- Add v6 addresses on the OOB mgmt [NISNETR-396]
- Accept netadmin auth from the MDs' oob mgmt [NISNETR-399]
MM
Nothing?
MD
- Wire up MDs on OOB
- Address MDs on OOB
- Apply static route to OOB network
- Apply ACLs to limit port 4343 and 22 to only be allowed on the OOB side [NISNETR-398]
- Change asr-conehead-netadmin to use the OOB v6 address on conehead [NISNETR-399]
- Change asr-grub-netadmin to use the OOB v6 address on grub [NISNETR-399]
- Figure out initial setup
- Remove remaining legacy addresses
Config changes
The MM is configured exactly the same as before. The MDs have additional configuration (col-md-5.dev as an example here):
interface gigabitethernet 0/0/0
no shutdown
!
vlan 301
description oob-mgmt
!
interface port-channel 1
gigabitethernet 0/0/0
switchport access vlan 301
switchport mode access
trusted
trusted vlan 1-4094
!
interface vlan 301
operstate up
ipv6 address 2607:b400:e1:4000:0:0:0:15/64
!
ipv6 route 2607:b400:e1:0:0:0:0:0/48 2607:b400:e1:4000:0:0:132:1
Old ideas
These are things we are currently deciding against. They are noted here in case they turn out to be a good idea or lead to other useful ideas.
MC-MD connection:
- Static routes over OOB
- IPsec tunnel between MC and FW
Remote APs
Overview
Also known as a RAP.
Steps:
- RAP IP pool on /mm
- Public addresses
- DNS
- Cluster
IP Pool
The RAPs use an IP address inside the IPSec tunnel. The scope of this address is limited to the AP and MD, which makes it a good candidate for link local addressing. Each RAP uses 1 address, so make sure the pool has at least as many addresses as there are RAPs.
It is configured as a lc-rap-pool at /mm.
By convention, we use the prefix rapp-.
CLI
Configure (at /mm):
lc-rap-pool rapp-rap 169.254.10.10 169.254.10.50
Verify:
(isb-mm-1) [mm] #show lc-rap-pool
IP addresses used in pool rapp-rap
169.254.10.10-169.254.10.21
IPv4 pool : Total - 12 IPs used - 29 IPs free - 41 IPs configured
IPv6 pool : Total - 0 IPs used - 0 IPs free - 0 IPs configured
LC RAP Pool Total Allocs/Deallocs/Reserves : 13/0/0
LC RAP Pool Allocs/Deallocs/Reserves(succ/fail) : 12/0/(0/0)
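Since each RAP consumes one address, the pool size is easy to check before deploying more RAPs. A minimal sketch using Python's ipaddress module; the function names are hypothetical.

```python
import ipaddress


def pool_size(start: str, end: str) -> int:
    """Number of addresses in an inclusive start-end range."""
    first = ipaddress.ip_address(start)
    last = ipaddress.ip_address(end)
    if last < first:
        raise ValueError("pool end is before pool start")
    return int(last) - int(first) + 1


def pool_is_large_enough(start: str, end: str, rap_count: int) -> bool:
    """True when the pool can give every RAP its own address."""
    return pool_size(start, end) >= rap_count
```

For the pool configured above, `pool_size("169.254.10.10", "169.254.10.50")` gives 41, which agrees with the "41 IPs configured" line in the show output.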
API
Config:
{
"lc_rap_pool":[
{
"pool_end_address": "169.254.10.50",
"pool_name": "rapp-rap",
"pool_start_address": "169.254.10.10"
}
]
}
Running the show command via API does not return (meaningfully) structured data (last tested on AOS 8.7.1.2).
Public addresses
The key requirement is n public legacy (IPv4) addresses for n controllers in the cluster.
Documentation suggests that the public address could exist on a NAT device.
We've opted to set it up directly on the MD.
This is done just like any other vlan interface.
It doesn't make any sense to use IPv6 with the RAP service.
- If we knew we had IPv6 connectivity from the remote location, we could just setup the AP as a campus AP (CAP) with CPSec. Improved RAP discovery with Aruba Activate may be a compelling reason to go with a RAP anyways. We haven't yet gotten that far with the RAP setup, though.
- Too many ISPs still offer legacy-only connectivity.
Also, RAPs cannot use a VRRP address to connect to the cluster, so don't bother setting up an AP discovery VIP.
DNS
- RAPs must look for the MDs by DNS (since VRRP isn't an option)
- VT uses the address rap.mobility.nis.vt.edu
- This name must resolve to each of the public addresses of the MDs in the cluster.
- The MDs take care of load balancing once the RAP has connected, so any method DNS uses (round-robin, ordered list, etc) is fine.
$ dig +short rap.mobility.nis.vt.edu
198.82.171.142
198.82.171.141
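The requirement that the name resolves to every MD's public address can be checked with a short script. This is a sketch using only the standard library; the function names are hypothetical, and the expected set has to come from the cluster profile (the rap-public-ip values).

```python
import socket


def resolve_ipv4(name: str) -> set[str]:
    """Return the set of IPv4 addresses that a name resolves to."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def missing_addresses(expected: set[str], resolved: set[str]) -> set[str]:
    """Public addresses that are configured on an MD but absent from DNS."""
    return expected - resolved
```

Usage would look like `missing_addresses({"198.82.171.141", "198.82.171.142"}, resolve_ipv4("rap.mobility.nis.vt.edu"))`, where an empty set means DNS covers every MD in the cluster.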
Cluster
The only extra step here is to provide the RAP external IP.
Remember to follow the usual clustering steps as well (vlan excludes, join the md to the cluster, etc)
CLI
(isb-mm-1) [rap] #show configuration committed | begin lcc-
lc-cluster group-profile "lcc-col-rap"
controller 172.16.1.31 priority 255 mcast-vlan 299 vrrp-ip 172.16.1.41 vrrp-vlan 299 group 0 rap-public-ip 198.82.171.141
controller 172.16.1.32 priority 128 mcast-vlan 299 vrrp-ip 172.16.1.42 vrrp-vlan 299 group 0 rap-public-ip 198.82.171.142
!
API
{
"cluster_prof": [
{
"cluster_controller": [
{
"group_id": 0,
"ip": "172.16.1.31",
"mcast_vlan": 299,
"prio": 255,
"rap_public_ip": "198.82.171.141",
"vrrp_ip": "172.16.1.41",
"vrrp_vlan": 299
},
{
"group_id": 0,
"ip": "172.16.1.32",
"mcast_vlan": 299,
"prio": 255,
"rap_public_ip": "198.82.171.142",
"vrrp_ip": "172.16.1.42",
"vrrp_vlan": 299
}
],
"profile-name": "lcc-col-rap",
"vrrp_info": {
"vrrp_id": 240,
"vrrp_passphrase": ""
}
}
]
}
Monitoring
Ignore the colors. Splunk picks the colors, so red might mean accept or some other nonsense. Make sure you look at the legend.
eduroam
Row 1
- Overall distribution of requests.
- This is sourced from the authentication servers.
- Time selected from the "Recent time" picker.
Row 2
- Outcome ratios broken down by cluster
- Sourced from the authentication servers (FreeRADIUS).
- Time selected from the "Recent time" picker.
- Timestamps of these logs are based on when the server has a response prepared to send, not when it is actually sent. Notably, rejects get a 1s delay (by design).
Row 3
- Outcome ratios broken down by cluster.
- Sourced from the controllers.
- Time selected from the "Recent time" picker.
- A reject log is generated from the dot1x-proc process.
- An accept log is generated from the authmgr process.
- log generated when an entry is added to the user table
- log per IP address, not per authentication request.
- Typically 3-4 times as many accepts compared to row 2.
- A device that gets an accept, but is unable to get an IP address is not logged from the controller's perspective.
Row 4
- Top talkers
- Sourced from the authentication servers.
- Time selected from the "Top time" picker.
ClearPass (CPPM)
- Due to MAC auth, it is normal for there to be far more rejects than accepts.
- Extraordinarily few rejects are actually sent. Instead devices are "rejected" by not assigning a role.
- Web auth happens after the user gets an IP address.
Left column
- Outcome ratios broken down by cluster.
- Sourced from the controllers.
Right column
- Outcome ratios broken down by cluster.
- Sourced from the authentication servers (CPPM).
- For more details on recent events, check the access tracker in CPPM.
Export CPPM guest accounts (cppm 6.6)
This is all done from the Guest side of CPPM.
Enable viewing passwords
- Go to Configuration > Guest Manager
- Enable the 'Password Display' option to view guest account passwords.
Customize default export view
- Go to Guests > Export Accounts > Customize default export view
- Look for the field password in the list. If it is not there, click 'Add Field'.
- In the "Field Name" drop box, select "password".
- Optionally, set the "Rank"
- Save Changes
- Use this view
Export the data
- On export page (Guests > Export Accounts), select what kind of export you want and save the file.
Unsorted images of the process
Factory reset CPPM
Everything is from the serial console, logged in as appadmin.
Save licensing info
[appadmin@cppm]# show license
-------------------------------------------------------
Application : ClearPassPlatform
License key : -----BEGIN CLEARPASS PLATFORM LICENSE KEY-----
[snip!]
-----END CLEARPASS PLATFORM LICENSE KEY-----
License key type : Permanent
License added on : 2022-03-08 18:55:04
Validity : <not applicable>
Customer id : [snip!]
Licensed features : <not applicable>
=======================================================
The license key may look like a base64 encoding with a header/footer like above, or it might be formatted similar to a Windows license key.
Whatever the case, grab all the output and keep it somewhere safe.
Wipe the database
[appadmin@cppm]# cluster reset-database -f
The -f option (think --force) will wipe any local IP entries in the database, as well as licensing.
Reset and Reboot
[appadmin@cppm]# system factory-reset
According to TAC, this does something close to resetting the database without -f.
Notably, it also reboots the box and takes you to the initial setup wizard, so it is probably a good starting place.
Note that after the reboot, the login screen may display a message about upgrading and to not make any config changes. Press Enter occasionally until that message no longer shows before starting. It will take several minutes.
Guidelines
This is a collection of the less technical side of things. Policies, procedures, conventions, and the like are all collected here.
Administrator Access
Credentials
Password authentication is handled through:
- netadmin RADIUS instance
- single local account for backup
Key authentication is handled through:
- local accounts
Roles
- ArubaOS:
  - There is a predefined, uneditable list of roles.
  - Local accounts cannot be created without a role.
  - RADIUS accounts set the role with the Aruba-Admin-Role VSA.
    - If the VSA is missing, then a default role is applied.
      - The default role is set in the "Management Authentication Profile".
      - Absence of config default is root
      - API: .mgmt_auth_profile.mgmt_default_role.aaa_auth_mgmt_default_role
      - CLI: aaa authentication mgmt default-role
    - If the VSA is invalid, access is denied.
- AirWave:
  - Roles can be created and edited.
  - Local accounts cannot be created without a role.
  - A RADIUS account uses the Aruba-Admin-Role VSA (same as ArubaOS).
    - If the VSA is missing, access is denied.
    - If the VSA is invalid, access is denied.
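The role rules above can be summarized in a few lines of Python. This is only a sketch of the behavior described in the list, not code taken from either product: the function name, the platform labels "arubaos" and "airwave", and the idea of passing in the set of known roles are all assumptions made for illustration.

```python
from typing import Optional

# Returned when the login should be refused.
DENIED = None


def resolve_admin_role(
    platform: str,
    vsa_role: Optional[str],
    known_roles: set[str],
    default_role: str = "root",
) -> Optional[str]:
    """Return the role a RADIUS admin login ends up with, or DENIED.

    platform: "arubaos" or "airwave" (labels chosen for this sketch).
    vsa_role: value of the Aruba-Admin-Role VSA, or None if it was not sent.
    known_roles: roles that exist on the platform.
    default_role: ArubaOS management default role; "root" when unconfigured.
    """
    if vsa_role is None:
        # Only ArubaOS falls back to a default role; AirWave denies access.
        return default_role if platform == "arubaos" else DENIED
    if vsa_role not in known_roles:
        # An invalid VSA value is denied on both platforms.
        return DENIED
    return vsa_role
```

For example, a bogus value such as deny is refused on both platforms because it never matches a known role, while a missing VSA on ArubaOS silently picks up whatever the default role is.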
netadmin accounts
Role config:
- The default role is set to read-only at the highest possible nodes.
  Rationale:
  - Damage control in the case of a misconfigured RADIUS account or ArubaOS behavior change.
  - All controllers are descendants of these two nodes.
- All RADIUS accounts must have the Aruba-Admin-Role set.
  Rationale:
  - Implicit authorization is confusing and makes it easy to miss mistakes.
- Accounts that should not have access to the wireless controllers should use the value deny, or a role that is exclusive to AirWave.
  Rationale:
  - Not all netadmin accounts should have controller access.
  - Some users need access to AirWave, but not the controllers.
  - A bogus value is the only way to deny access to a netadmin account.
  - A consistent, clear value makes for easy auditing.
Who:
- Members of the Network Engineering and Operations (NEO) team have full access.
- Support staff may have read-only access.
- This is approved by the wireless team lead or Network Operations Manager. Verbal approval is fine.
- Automation has the least access possible for its tasks.
Local accounts
Config:
To view local users via API, check:
.mgmt_user_cfg_int
.mgmt_user_ssh_pubkey
.mgmt_user_web_cacert
Or from the command line:
show mgmt-user
show mgmt-user ssh-pubkey
show mgmt-user webui-cacert
Note that these lists partition[1] the local users.
That is, show mgmt-user will not show users with ssh pubkey access.
admin user
This account can be created while setting up a controller. We opt to do so, as it eases the painful process of setting up an MD.
If the account is created on setup:
- The username is admin.
- The password is set by the engineer.
- ArubaOS doesn't give a choice on either of these.
The account is created at the device level of the config hierarchy, so it overrides any other config that may be set. This creates a management headache, so we opt to remove the account after the MD connects to the MM.
The account on the MMs is a special case. Aruba, in their infinite wisdom, does not allow it to be deleted, nor the role changed. We opt to set a randomly generated password and throw it away. This effectively disables the account.
nis user
- This account is the backup in case network connectivity is lost.
  Rationale:
  - Entropy happens
- It is configured at the highest possible nodes.
  Rationale:
  - Entropy happens
  - Centralized config makes for easy password changes
- Role is root.
  Rationale:
  - Full access is required to make network config changes
Server settings
Telnet
Telnet is awful and is rightfully disabled by default. We leave it disabled. Unfortunately, verifying it is still disabled is a little tricky. This config is not part of the JSON that can be retrieved from the API. Instead we must either:
- run the show configuration command from ssh (not the API) for each node
- run show telnet on each device directly (either ssh or API).
To do the latter with the python library, do something like:
mm = arubaos.MobilityMaster(f"isb-mm.{domain}", creds)
# Ask every MD known to the MM for its telnet status
for host in [md.name for md in mm.mds()]:
    arubaos.Controller(f"{host}.{domain}", creds).show("telnet")["_data"]
Don't forget to check both MMs as well.
SSH
The settings behind SSH exploits seen in the wild cannot be configured incorrectly. There are a few knobs, closer to theoretical weaknesses, that we opt to tighten up:
- DSA < RSA
- CBC < CTR
- SHA1 < SHA256 (used in an HMAC; when used for signing, it is more serious)
In the API:
{
"aaa_ssh_cipher": {
"cipher_suite": "aes-cbc"
},
"aaa_ssh_mac": {
"hmac-sha1": true,
"hmac-sha1-96": true
}
}
Again, we find that disabling DSA (.aaa_ssh_dsa) is missing from the config pulled via API.
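A check for the two settings that do show up in the API config could look roughly like the sketch below. It assumes the JSON shown above is the desired state and that config is the already-parsed configuration dict; the names EXPECTED and ssh_hardening_gaps are made up for this example. It does not cover DSA, since that item is missing from the pulled config.

```python
# Desired values, copied from the JSON above (assumed to be the hardened state).
EXPECTED = {
    "aaa_ssh_cipher": {"cipher_suite": "aes-cbc"},
    "aaa_ssh_mac": {"hmac-sha1": True, "hmac-sha1-96": True},
}


def ssh_hardening_gaps(config: dict) -> list[str]:
    """List the "object.key" paths whose value differs from EXPECTED."""
    gaps = []
    for obj, settings in EXPECTED.items():
        # A missing or null object counts as having no settings at all.
        actual = config.get(obj) or {}
        for key, want in settings.items():
            if actual.get(key) != want:
                gaps.append(f"{obj}.{key}")
    return gaps
```

An empty list means both objects match; anything else names the setting that needs a second look.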
Audit trail
Commands run on the controllers, via SSH or API, can be found with the Splunk report ArubaOS command audit.
[1] Using the set theory definition of the word "partition" here.
Upgrades
Procedure
All production upgrades MUST be documented in the Engineering Change Order (ECO) app (or its replacement), and follow the normal ECO procedure.
All upgrades should be tested in the dev environment before pushing to production.
When to upgrade
ArubaOS upgrades do not occur on a regular schedule. Rather, they happen on an as-needed or as-available basis. Several things can motivate an upgrade. Roughly in order of priority:
- Security fixes
- Stability fixes
- New features
- Incremental update available
Security and stability are obviously the top priorities when providing a network service. Similarly, new features allow us to provide better or new services.
When there is a security upgrade, the system should be upgraded ASAP; target within the week. Of course, this depends on the severity of the vulnerability. For example, a CVE score of 9+ may motivate an upgrade outside the normal change window to expedite the fix. A CVE score of 3 may wait until after a maintenance restriction window (such as due to semester startup or finals).
Stability fixes should be implemented in the next change window or two, pending testing of the code.
New features can wait the longest before rollout. It is more important that the system be stable and predictable than have the shiniest new feature.
Incremental updates should be implemented within 10 days of release, but not during a maintenance restriction.
What to upgrade to
Staying on the latest release within a code train has allowed us to be patched against security vulnerabilities before they are announced. When we lag behind, we have hit stability bugs that are already fixed in newer releases.
Aruba has two public release types: "Standard" and "Conservative". The conservative release is the more stable of the two.
At the time of this writing, we are on the 8.7 train, which is a standard release. It has a few key features:
- AP support
  - RAP 500 series
  - AP 560 series
- IPv6
  - MM/MD connection
  - clustering
  - AirWave
Conservative release
This is the generally preferred release. Unless there are known issues with the newer version, always go to the latest version.
Standard release
If we are already on a standard release (such as at the time of this writing), stay within the same major/minor version (e.g., 8.7.0.1 to 8.7.1.2 is good).
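The "same major/minor" rule is easy to check mechanically. The following is a minimal sketch, assuming version strings look like the ones used in this document (dotted numbers, optionally followed by a build suffix such as _81619); the function names are hypothetical.

```python
def release_train(version: str) -> tuple[int, int]:
    """Return the (major, minor) pair of an ArubaOS version string."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def same_train(current: str, candidate: str) -> bool:
    """True when both versions belong to the same code train."""
    return release_train(current) == release_train(candidate)
```

With this, same_train("8.7.0.1", "8.7.1.2") is true, while a move from 8.7.1.6 to 8.10.0.6 is flagged as a train change.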
Bugs
This is a way to keep track of bugs that we have come across.
Create a section in prework when we notice it. This provides a way to start keeping track of trends and not lose info, especially for issues that are not urgent enough to address right away.
When we start working on it in earnest (open a TAC case, create a JIRA ticket, etc), move that section to outstanding.
When the issue is resolved, move it to resolved.
If we work around an issue without resolving it, move it to workaround.
Outstanding
Controller IPv6 traffic stops
- Description: All IPv6 traffic to/from the controller itself ceases. This does not impact user traffic.
- Detection:
- The MD is reachable over IPv4, but not IPv6.
- The MD is unable to ping its IPv6 gateway.
- If the MD has established sessions (e.g., tunnels) to IPv6 addresses, those may continue to work.
- IPv6 neighbor table is stuck. It neither adds nor removes items dynamically.
- AKiPS availability
- AKiPS status
- Workaround:
  - IPv6 dependencies have been (mostly?) removed. User impact should be minimal to none at this point. See Enhancements/IPv6 for details.
  - Bounce the link to the impacted controller. This can be done from either the controller or the router side. Additionally, it seems we can take down a single link in the port channel and bring it back up. Usually the link needs to stay down for a few seconds.
    - While the above does (temporarily) restore IPv6, it seems the failover mechanisms are broken, meaning the workaround is user impacting. Current practice is to leave IPv6 broken.
  - Add a static neighbor entry:
    (isb-mm-1) [00:1a:1e:03:03:08] #show configuration committed | include neigh
    ipv6 neighbor 2607:b400:64:4000::1 vlan 299 00:31:46:17:df:f0
    This allows traffic to flow through the gateway, but does not allow traffic from the gateway itself.
    (col-md-2) *#show ipv6 route
    Thu Jan 18 12:52:18.483 2024
    Codes: C - connected, O - OSPF, R - RIP, S - static
    M - mgmt, U - route usable, * - candidate default
    Gateway of last resort is 2607:b400:64:4000::1 to network ::/128 at cost 1
    S* ::/0 [0/1] via 2607:b400:64:4000::1*
    C 2607:b400:64:4000::/64 is directly connected, VLAN299
    C 2607:b400:a00:1::/64 is directly connected, VLAN801
    (col-md-2) *#ping ipv6 2001:468:c80:210f:0:165:9b7d:7dcb
    Press 'q' to abort.
    Sending 5, 92-byte ICMPv6 Echos to 2001:468:c80:210f:0:165:9b7d:7dcb, timeout is 2 seconds:
    !!!!!
    Success rate is 100 percent (5/5), round-trip min/avg/max = 0.468/0.5676/0.656 ms
    (col-md-2) *#ping ipv6 2607:b400:64:4000::1
    Press 'q' to abort.
    Sending 5, 92-byte ICMPv6 Echos to 2607:b400:64:4000::1, timeout is 2 seconds:
    .....
    Success rate is 0 percent (0/5)
- Impact:
  - All communication between the controllers and CPPM is over v6. Thus, no clients can authenticate on the VirginiaTech SSID.
  - Other system services on the controller happen over v6, including NTP and DNS.
  - Whatever the impact of an out-of-date neighbor table is.
- Unknowns / next steps:
- What is the impact of an incorrect neighbor table on an MD? E.g., what is the impact on a client that is not in the table? Does this impact Air Group? Efficiency/speed the MD can switch packets? Does this prevent the MD from short-circuiting or optimizing client ND?
- Is the MD participating in neighbor discovery at all? Is it sending/receiving NS? Is it sending NA?
- TAC cases:
Config out of sync
- Description: The config on the MC device node and on the corresponding MD are different.
- Symptoms:
  From the MM:
  (isb-mm-1) *[00:1a:1e:02:d8:90] #show configuration effective | include debugging
  logging user-debug 9c:b6:d0:da:1e:8f level debugging
  logging arm-user-debug 9c:b6:d0:da:1e:8f level debugging
  (isb-mm-1) *[00:1a:1e:02:d8:90] #
  From the MD:
  (col-md-1) *#show running-config | include debugging
  Building Configuration...
  logging security process dot1x-proc level debugging
  logging level debugging arm-user-debug 9c:b6:d0:da:1e:8f
  logging level debugging user-debug 9c:b6:d0:da:1e:8f
- TAC case: 5360416723
- Notes:
  - ccm-debug full-config-sync did not resolve the issue
  - Problem went away on its own, probably from subsequent commits.
  - Currently writing a script that compares the config from API
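As a rough idea of what such a comparison could look like for CLI output (this is not the script mentioned in the notes), the sketch below diffs two lists of config lines. It assumes that lines made of the same words in a different order, as in the MM and MD output above, describe the same setting; that assumption and the names normalize and config_diff are illustrative only.

```python
def normalize(line: str) -> tuple[str, ...]:
    """Sort a config line's words so that word order does not matter."""
    return tuple(sorted(line.split()))


def config_diff(mm_lines: list[str], md_lines: list[str]) -> dict[str, list[str]]:
    """Return the config lines that appear on only one side."""
    mm = {normalize(line): line for line in mm_lines}
    md = {normalize(line): line for line in md_lines}
    return {
        "only_on_mm": sorted(mm[key] for key in mm.keys() - md.keys()),
        "only_on_md": sorted(md[key] for key in md.keys() - mm.keys()),
    }
```

Fed the two outputs above, this reports only the dot1x-proc line as present on the MD but not on the MM.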
Resolved
Connectivity failures (Aruba Support Advisory ARUBA-SA-20210901-PLVL04)
- Description: Clients have association failures.
  This case morphed into the Linux client issue. Linux clients would occasionally just stop passing traffic. The device would still be associated, but it could not even ping the UAC. It was mostly observed on Intel AX200 and AX210 cards, but has also been seen on Intel's AC cards and the MediaTek MT7921K. The problem looked like a driver / kernel issue, but its disappearance is more closely correlated to upgrading to ArubaOS 8.10.
- Symptoms:
  - Clients experience association failures during high bursts of client roaming events.
  - High CPU utilization by the Station Management process (stm) in the MDs.
  - show papi kernel-socket-stats | include 8345,8222,8419,Drops
    Drops value on port 8419 (STM Low Priority) rapidly increases in 100+ increments within seconds AND sustained large values for CurRxQLen and Drops on port 8435 (STM)
  - show cpuload current
    stm process stays over 100%
- TAC cases:
- Notable versions:
  - 8.7.1.4: observed
  - 8.7.1.5: observed
  - 8.7.1.6: Sanjay claims a fix
  - 8.7.1.6: observed
  - 8.10.0.6: presumed fixed
- Debug: Logs requested by Rodger:
  Make sure user debug is enabled:
  logging user-debug <client-mac> level debug
  Currently enabled for waldrep's laptop (46:96:f1:03:32:98)
  no paging
  show cli-timestamp
  show clock
  show ap association client-mac <client-mac>
  show station-table | include <client-mac>
  show auth-tracebuf mac <client-mac>
  show ap client trail-info <client-mac>
  show datapath session table | include <ip address of client>
  show log user-debug 50 | include <client-mac>
  show log security 50 | include <client-mac>
  show log system 50 | include <Affected_AP_Name>
  tar log tech-support
  Collect the following at the time of the issue, along with tech support logs:
  clock
  cli-timestamp
  show dot1x watermark history
  show papi kernel-socket-stats
  show ap debug client-mgmt-counters
  show ap debug sta-msg-stats
  show ap debug cluster-counters
  show ap debug gsm-counters
  show ap debug client-deauth-reason-counters
  show cpuload current
  show datapath bwm table
  show datapath utilization
  show datapath papi counters
  show datapath debug opcode
  show datapath network ingress
  show datapath maintenance counters
  show datapath debug dma counters
  show datapath message-queue counters
  show auth-tracebuf
Kernel panics
- Description: MD crashes with a kernel panic
- Symptoms
  - MD reboots
  - Kernel panic
  - TAC asked for kernel core dumps. This option has been enabled for a while, but doesn't seem to be giving what they are asking for.
  - Intent:cause:registers:
    - 12:86:b0:2
    - 12:86:b0:4
    - 12:86:e0:2
    - 12:86:e0:4
    - 12:86:e0:8
    - 78:86:50:2 (logs lost)
- Bug IDs
  - AOS-216744
- TAC cases:
  - 5353024418
  - 5357725459
  - 5358877836
    - 12:86:b0:4
- JIRA tasks:
- Notable versions:
  - 8.5.0.11:
    - Observed:
      - 12:86:e0:2
  - 8.7.1.3:
    - TAC asserts fixed:
      - 12:86:e0:2
  - 8.7.1.4:
    - Observed:
      - 12:86:e0:2
  - 8.7.1.5:
    - TAC asserts fixed:
      - 12:86:e0:2
      - 12:86:e0:4
      - 12:86:b0:4
    - Observed:
      - 12:86:b0:2
      - 12:86:b0:4
      - 12:86:e0:8
  - 8.7.1.5_81619:
    - Observed:
      - 12:86:b0:4
  - 8.7.1.6:
    - TAC asserts fixed:
      - 12:86:b0:2
res-md-1 refuses clients
- Description: any client trying to use res-md-1 as a UAC cannot associate.
- Symptoms:
  - show lc-cluster load distribution client shows 0 active and 0 standby clients for res-md-1.
  - started with res-md-1 crashing
- persisted across a reboot and code upgrade
- TAC cases
- Notable version:
- 8.7.1.4: crash that initiated the problem
- 8.7.1.5: observed
Holy amon logs, Batman!
- Description: A debug trace on amon_sender_proc and amon_recvr_proc is logged and cannot be disabled. Collectively, the controllers sent over 20,000 logs/s. The problem only showed up on some boots.
- Bug IDs:
- AOS-210452
- TAC cases:
- Notable versions:
- 8.7.0.0: bug introduced
- 8.7.1.4: fixed
- JIRA task:
No state attribute in RADIUS request
- Description
  - The RADIUS request packets do not contain the state attribute value and hence, clients face connectivity issues.
- Bug IDs
- AOS-207701
- AOS-218006
- Notable versions:
- 8.4.0.0: introduced
- 8.7.1.3: fixed
Too many pending changes
- Description
  - If the expected output of show configuration unsaved-nodes was over 1024 characters, then it displayed nothing.
  - This also impacted API output.
- Bug IDs
- AOS-210404
- Notable versions:
- 8.5.0.10: observed broken
- 8.5.0.12: fixed
- 8.7.0.3: fixed
Prework
show global-user-table crashes auth module
- Description: Running said command on the MM often returns no results and crashes the auth module 1-2 times.
- Symptoms:
  - Running show global-user-table list mac <mac> hangs for about a minute, sometimes not returning anything.
  - When the command completely fails, it throws an error about the auth module being busy
  - show crashinfo shows that the auth module crashed 1 or 2 times
  - Happens via ssh and api.
- Workaround:
- Check each MD directly
- Try again later
- Notable versions:
- 8.7.1.4: observed
- 8.7.1.5: observed immediately after upgrade, but haven't been able to recreate since
- 8.7.1.6: observed
APs crashing
- Description: A lot of APs crashing
- Symptoms:
  - A few APs crash repeatedly (started keeping track in 8.7.1.5):
    - VAW-152TP01B (res)
    - LIB-234BA1188L (col)
Workaround
Delegated commands to v6 controllers fail
- Symptoms:
  - aaa user delete commands from the MC do not ever get a response from the v6 controllers.
  - Running a second command requires waiting for the timeout (default 300 s)
- Recreate the problem:
  - Have at least one MD connect to the MC over IPv6, and note which MDs these are.
    To do this, configure them with the masteripv6 or conductoripv6 command instead of the masterip or conductorip command.
    (isb-mm-1) [mynode] #cd vtc-md-1
    (isb-mm-1) [00:1a:1e:04:b1:10] #show configuration committed | include conductor
    conductoripv6 2607:b400:2:2000:0:173:32:36 ipsec-factory-cert conductor-mac-1 20:4c:03:8f:53:1a conductor-mac-2 20:4c:03:0e:e0:44 interface-f vlan-f 100
    This can be verified with the show switches debug command, noting which version is used in the "IP Address" column.
    (isb-mm-1) [mynode] #show switches debug
    All Switches
    ------------
    IP Address MAC Name Nodepath Type Model Version Status Uptime CrashInfo Config Sync Time (sec) License Release Type
    ---------- --- ---- -------- ---- ----- ------- ------ ------ --------- ---------------------- ------- ------------
    128.173.32.34 20:4c:03:8f:53:1a isb-mm-1 /mm/mynode conductor ArubaMM-HW-10K 8.10.0.9_88493 up 51d 20h 50m no 0 N/A LSR
    128.173.32.35 20:4c:03:0e:e0:44 isb-mm-2 /mm standby ArubaMM-HW-10K 8.10.0.9_88493 up 51d 20h 40m no 0 N/A LSR
    172.16.1.11 00:1a:1e:02:d8:90 col-md-1 /md/vt/swva/col MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 13m no 0 N/A LSR
    172.16.1.12 00:1a:1e:03:03:08 col-md-2 /md/vt/swva/col MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 13m yes 0 N/A LSR
    172.16.1.13 00:1a:1e:02:d8:f0 col-md-3 /md/vt/swva/col MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 13m no 0 N/A LSR
    172.16.1.14 00:1a:1e:03:02:78 col-md-4 /md/vt/swva/col MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 13m yes 0 N/A LSR
    172.16.1.141 00:1a:1e:03:01:98 bur-md-1 /md/vt/swva/bur MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 19m no 0 N/A LSR
    172.16.1.142 00:1a:1e:02:d8:b0 bur-md-2 /md/vt/swva/bur MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 19m yes 0 N/A LSR
    172.16.1.143 00:1a:1e:02:d9:70 bur-md-3 /md/vt/swva/bur MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 18m no 0 N/A LSR
    172.16.1.144 00:1a:1e:03:00:a8 bur-md-4 /md/vt/swva/bur MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 19m no 0 N/A LSR
    172.17.1.11 00:1a:1e:03:00:d8 res-md-1 /md/vt/swva/res MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 9m yes 0 N/A LSR
    172.17.1.12 00:1a:1e:03:01:90 res-md-2 /md/vt/swva/res MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 8m yes 0 N/A LSR
    172.17.1.13 00:1a:1e:03:11:10 res-md-3 /md/vt/swva/res MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 9m yes 0 N/A LSR
    172.17.1.14 00:1a:1e:03:0f:f8 res-md-4 /md/vt/swva/res MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 9m yes 0 N/A LSR
    2607:b400:62:1400:0:16:247:11 00:1a:1e:04:b1:10 vtc-md-1 /md/vt/swva/vtc MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 3m no 0 N/A LSR
    2607:b400:62:1400:0:16:247:12 00:1a:1e:04:b1:18 vtc-md-2 /md/vt/swva/vtc MD Aruba7240XM 8.10.0.9_88493 up 49d 7h 3m no 0 N/A LSR
    172.16.236.151 00:1a:1e:00:14:30 nvc-md-1 /md/vt/nova/nvc MD Aruba7220 8.10.0.9_88493 up 49d 7h 5m no 0 N/A LSR
    172.16.236.152 00:1a:1e:00:99:70 nvc-md-2 /md/vt/nova/nvc MD Aruba7220 8.10.0.9_88493 up 49d 7h 7m no 0 N/A LSR
    Total Switches:18
  - From the MC, run an aaa user delete ... command, then check the status:
    (isb-mm-1) [mynode] #aaa user delete mac 00:11:22:33:44:55
    Users will be deleted at MDs. Please check show CLI for the status
    (isb-mm-1) [mynode] #aaa user delete mac 11:22:33:44:55:66
    The previous CLI is still in progess, please try later!
    (isb-mm-1) [mynode] #show aaa user-delete-result
    Summary of user delete CLI requests !
    Current user delete request timeout value: 300 seconds
    aaa user delete mac 00:11:22:33:44:55 , Overall Status- Response pending , Total users deleted- 0
    MD IP : 172.16.1.11, Status- Complete , Count- 0
    MD IP : 172.16.1.12, Status- Complete , Count- 0
    MD IP : 172.16.1.13, Status- Complete , Count- 0
    MD IP : 172.16.1.14, Status- Complete , Count- 0
    MD IP : 172.16.1.141, Status- Complete , Count- 0
    MD IP : 172.16.1.142, Status- Complete , Count- 0
    MD IP : 172.16.1.143, Status- Complete , Count- 0
    MD IP : 172.16.1.144, Status- Complete , Count- 0
    MD IP : 172.17.1.11, Status- Complete , Count- 0
    MD IP : 172.17.1.12, Status- Complete , Count- 0
    MD IP : 172.17.1.13, Status- Complete , Count- 0
    MD IP : 172.17.1.14, Status- Complete , Count- 0
    MD IP : 0.0.0.0, Status- Response pending , Count- 0
    MD IP : 0.0.0.0, Status- Response pending , Count- 0
    MD IP : 172.16.236.151, Status- Complete , Count- 0
    MD IP : 172.16.236.152, Status- Complete , Count- 0
    Note that the two MDs with IP 0.0.0.0 have a response pending. These are the two VTC MDs which are connecting to the MC over IPv6.
  - After 300 seconds from when the delete command was run:
    (isb-mm-1) [mynode] #show aaa user-delete-result
    Summary of user delete CLI requests !
    Current user delete request timeout value: 300 seconds
    aaa user delete mac 00:11:22:33:44:55 , Overall Status- Complete , Total users deleted- 0
    MD IP : 172.16.1.11, Status- Complete , Count- 0
    MD IP : 172.16.1.12, Status- Complete , Count- 0
    MD IP : 172.16.1.13, Status- Complete , Count- 0
    MD IP : 172.16.1.14, Status- Complete , Count- 0
    MD IP : 172.16.1.141, Status- Complete , Count- 0
    MD IP : 172.16.1.142, Status- Complete , Count- 0
    MD IP : 172.16.1.143, Status- Complete , Count- 0
    MD IP : 172.16.1.144, Status- Complete , Count- 0
    MD IP : 172.17.1.11, Status- Complete , Count- 0
    MD IP : 172.17.1.12, Status- Complete , Count- 0
    MD IP : 172.17.1.13, Status- Complete , Count- 0
    MD IP : 172.17.1.14, Status- Complete , Count- 0
    MD IP : 0.0.0.0, Status- Timed out , Count- 0
    MD IP : 0.0.0.0, Status- Timed out , Count- 0
    MD IP : 172.16.236.151, Status- Complete , Count- 0
    MD IP : 172.16.236.152, Status- Complete , Count- 0
    Note the command timed out.
- Workaround:
- Run the command from the appropriate MD.
- TAC case:
API timeouts
- Description: API calls sometimes take a really long time.
- Symptoms:
- API calls time out.
- API login process can return a 401.
- TCP ACK to the API call is sent immediately, but the API response is still delayed.
- Root cause:
  - The arci-cli-helper process is single threaded. Yes, really.
  - This process appears to be the shim between the HTTP interface of the API and the system.
  - This is less a "bug" and more of a "critical design failure".
- Recreate the problem:
  - Make an API call for a command that takes a long time (e.g., show bss-table)
  - While that is still waiting on a response, make an API call for a command that should be nearly instant (e.g., show version).
  - Note that the second call will not get a response until the first one finishes.
- TAC case:
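The reproduction steps above can be timed with a small harness. The sketch below is an illustration, not a tool we run: time_overlapping_calls and fake_call are hypothetical names, and fake_call only imitates a single-threaded backend with a lock so the example can run without a controller.

```python
import threading
import time
from typing import Callable


def time_overlapping_calls(
    slow: Callable[[], object],
    fast: Callable[[], object],
    head_start: float = 0.05,
) -> tuple[float, float]:
    """Start `slow`, then time `fast` while `slow` is still running.

    Returns (slow_seconds, fast_seconds).
    """
    durations: dict[str, float] = {}

    def timed(name: str, call: Callable[[], object]) -> None:
        start = time.monotonic()
        call()
        durations[name] = time.monotonic() - start

    slow_thread = threading.Thread(target=timed, args=("slow", slow))
    slow_thread.start()
    time.sleep(head_start)  # let the slow call reach the backend first
    timed("fast", fast)
    slow_thread.join()
    return durations["slow"], durations["fast"]


# Stand-in for a single-threaded backend: one lock serializes every call.
_backend = threading.Lock()


def fake_call(seconds: float) -> Callable[[], None]:
    """Build a call that holds the fake backend for `seconds`."""
    def call() -> None:
        with _backend:
            time.sleep(seconds)
    return call
```

Against a real controller, the two callables would wrap the slow and fast show commands (for instance something like lambda: md.show("bss-table") and lambda: md.show("version") with our python library). If the fast call takes about as long as the slow one, the calls are being serialized.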
aaa rfc-3576-server profiles are dumb
- Description: An rfc-3576 message's sender is not recognized as a configured server.
- Symptoms:
RADIUS RFC 3576 Statistics
--------------------------
Server Disconnect Req Disconnect Acc Disconnect Rej No Secret No Sess ID Bad Auth Invalid Req Pkts Dropped Unknown service CoA Req CoA Acc CoA Rej No perm
------ -------------- -------------- -------------- --------- ---------- -------- ----------- ------------ --------------- ------- ------- ------- -------
172.28.48.84 0 0 0 0 0 0 0 0 0 0 0 0 0
172.28.49.84 0 0 0 0 0 0 0 0 0 0 0 0 0
2607:b400:62:9200:0:8f:ee32:b3f3 0 0 0 0 0 0 0 0 0 0 0 0 0
2607:b400:62:9200:0:95:1b5d:6dfa 0 0 0 0 0 0 0 0 0 0 0 0 0
2607:b400:92:8400:0000:0044:7dcf:5796 0 0 0 0 0 0 0 0 0 0 0 0 0
2607:b400:92:8400:0000:0046:275b:4605 0 0 0 0 0 0 0 0 0 0 0 0 0
2607:b400:92:8500:0000:0041:89db:6313 0 0 0 0 0 0 0 0 0 0 0 0 0
2607:b400:92:8500:0000:004d:be0b:1156 0 0 0 0 0 0 0 0 0 0 0 0 0
Packets received from unknown clients : 1653
Packets received with unknown request : 0
Total RFC3576 packets Received : 1653
- Workaround:
  - IPv6 addresses must be formatted omitting leading zeros, but also without the use of a double colon (::).
  - Different formats of the same address are recognized as different profiles.
  - Incorrect: 2607:b400:0092:8400:0000:0044:7dcf:5796
  - Incorrect: 2607:b400:92:8400::44:7dcf:5796
  - Correct: 2607:b400:92:8400:0:44:7dcf:5796
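Producing the accepted format by hand is error prone, so a helper along these lines may be useful. This is a minimal sketch using only Python's standard ipaddress module; the function name rfc3576_server_name is made up for this example.

```python
import ipaddress


def rfc3576_server_name(address: str) -> str:
    """Format an IPv6 address with no leading zeros and no "::" compression."""
    # .exploded always yields all eight groups, each padded to four digits.
    groups = ipaddress.IPv6Address(address).exploded.split(":")
    # Re-render each group as plain hex, which drops the padding.
    return ":".join(format(int(group, 16), "x") for group in groups)
```

All three spellings in the list above come out as 2607:b400:92:8400:0:44:7dcf:5796.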
ERR_IKESA_EXPIRED
- Description: Tunnel between MM and MD is broken.
- Symptoms:
  - So far, this has only happened to col-md-r2:
    - controller MAC: 00:0b:86:b4:d3:a7
    - system serial: CR0001355
  - The problem has persisted after multiple factory resets.
  - Cluster VRRP address is down for the impacted MD.
- Temporary workaround:
  - To restore the tunnel, on the MM run: process restart isakmpd
- Long-term workaround:
- We moved the RAPs to lcc-col and decommissioned col-md-r2.
- Motivation was consolidation of controllers, not "fixing" this bug.
- TAC cases:
- JIRA tasks:
API issues
There are a lot of them.
Extra config
logging server
- API endpoint: v1/configuration/object/log_lvl_syslog_ipv6_options
- CLI config: logging <ipv6 addr> [options]
The best way to explain this one is to walk through a sequence: POST to the endpoint, look at the resulting config, GET the endpoint, then POST the JSON blob that came back. Notably, these operations should be invertible. That is, POSTing what was received from the GET should do nothing.
POST:
[{ "ipv6addr": "2001:468:c80:210f:0:177:fd2a:cb4e" }]
Sets:
logging 2001:468:c80:210f:0:177:fd2a:cb4e
GET/POST:
[{
"ipv6addr": "2001:468:c80:210f:0:177:fd2a:cb4e",
"fac": "local1",
"lvl_severity": "warnings"
}]
Sets:
logging 2001:468:c80:210f:0:177:fd2a:cb4e facility local1 severity warnings
GET:
[{
"ipv6addr": "2001:468:c80:210f:0:177:fd2a:cb4e",
"facility": true,
"fac": "local1",
"severity": true,
"lvl_severity": "warnings"
}]
Missing config
- Description: Certain config items have an API object with no definition.
- Symptoms:
- These configuration items do not show up in the config JSON.
  - The API endpoint can still be queried directly:
    >>> # Configuration present
    >>> md.get(arubaos.api_object("telnet_cli"))
    {'_data': {'telnet_cli': {}}}
    >>> # Configuration not present
    >>> md.get(arubaos.api_object("telnet_cli"))
    {'_data': {'telnet_cli': None}}
  - Notably, all instances of this seem to have the API definition: { "type": "object" }
ipv6_enable
- API definition file: Controller.json
- Full API endpoint path: v1/configuration/object/ipv6_enable
- CLI configuration: ipv6 enable
telnet_cli
- API definition file: Controller.json
- Full API endpoint path: v1/configuration/object/telnet_cli
- CLI configuration: telnet cli
telnet_soe
- API definition file: Controller.json
- Full API endpoint path: v1/configuration/object/telnet_soe
- CLI configuration: telnet soe
ssh disable_dsa
- API definition file: Authentication.json
- Full API endpoint path: v1/configuration/object/aaa_ssh_dsa
- CLI configuration: ssh disable_dsa
Note that when this command is present, DSA keys are disabled for ssh. Thus, when API returns:
{"_data": {"aaa_ssh_dsa": {}}}
DSA keys are disabled, counter to the natural reading of the output.
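Because an empty object means "present" and None means "absent", it helps to wrap the check so the DSA case cannot be misread. A minimal sketch, assuming responses shaped like the examples above; the function names are hypothetical.

```python
def is_configured(response: dict, name: str) -> bool:
    """True when the object is present ({}), False when it is absent (None)."""
    return response["_data"].get(name) is not None


def dsa_keys_enabled(response: dict) -> bool:
    """DSA keys are enabled only when `ssh disable_dsa` is NOT configured."""
    return not is_configured(response, "aaa_ssh_dsa")
```

So {"_data": {"aaa_ssh_dsa": {}}} yields dsa_keys_enabled(...) == False, matching the note above.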
Can't upgrade via API
- Description: Trying to execute commands used in an upgrade process via API throws permission errors.
- Symptoms:
  - Sample interactive python session:
    >>> import arubaos.arubaos as aos
    >>> import passpy
    >>> domain = "mobility.nis.vt.edu"
    >>> creds = {"username": "waldrep", "pwpath": "waldrep@vt.edu/netadmin"}
    >>> vtc1 = aos.Controller(f"vtc-md-1.{domain}", creds)
    >>> endpoint = aos.api_object("copy_scp_system")
    >>> body = {
    ...     "scphost": "2001:468:c80:210f:0:165:9b7d:7dcb",
    ...     "username": "waldrep",
    ...     "passwd": passpy.store.Store() \
    ...         .get_key("waldrep@vt.edu/conehead") \
    ...         .split('\n', maxsplit=1)[0],
    ...     "filename": "C_ArubaOS_72xx_8.7.1.5_81619",
    ...     "partition_num": "partition1"
    ... }
    >>> vtc1.post(endpoint, body)
    {'_global_result': {'status': 1, 'status_str': 'You do not have permissions to execute the commands'}}
  - Including or excluding the optional partition_num makes no difference.
  - Using a v4 or v6 scphost makes no difference.
  - Same error when trying to preload APs:
    >>> endpoint = aos.api_object("ap_image_preload")
    >>> body = {'ap_info': 'all-aps' }
- Workaround:
- Upload via webui or cli
Inconsistent errors for not being authenticated/authorized
- Trying to do something without being logged in returns the HTML for the login page and a 401 code.
- Trying to do something that the user's role is not allowed to do returns:
  - code 200
  - The following XML:
    <aruba>
    <status>Error</status>
    <reason>no permission to execute opcode/program.</reason>
    </aruba>
- Trying to do something that should be allowed but is broken (like upgrading the OS image):
  - code 200
  - The following JSON:
    { '_global_result': { 'status': 1, 'status_str': 'You do not have permissions to execute the commands' } }
- Using an invalid show command (e.g., show ap database long}):
  - code 200
  - empty HTTP payload
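Since the status code alone cannot distinguish these cases, client code has to look at the body as well. The sketch below sorts a response into the four cases listed above. It is an illustration only: the function name and the returned labels are made up, and treating a nonzero _global_result status as failure is an assumption based on the single example shown.

```python
import json


def classify_response(status_code: int, body: str) -> str:
    """Sort an API response into one of the failure modes described above."""
    text = body.strip()
    if status_code == 401:
        return "not logged in"
    if status_code != 200:
        return "unexpected status"
    if not text:
        return "invalid show command"
    if text.startswith("<aruba>") and "<status>Error</status>" in text:
        return "role not permitted"
    try:
        result = json.loads(text).get("_global_result", {})
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return "unknown"
    if result.get("status", 0) != 0:
        return "command failed: " + result.get("status_str", "")
    return "ok"
```

A caller can then raise or retry based on the label instead of trusting the 200.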
Leaking secrets
Using a read-only account to get certain items reveals secrets, such as snmp community secrets, radius keys, etc.
At least for a cluster's vrrp secret, it seems to become obfuscated on reboot.
Ordering of unordered things
Things that have no inherent ordering (such as virtual_ap definitions) are returned in a list, which is ordered.
Nor is there any metadata which indicates which things are order sensitive and which are not.
Actual:
{
"virtual_ap": [
{
"aaa_prof": {
"profile-name": "aaa-eduroam"
},
"drop_mcast": {},
"profile-name": "vap-eduroam",
"ssid_prof": {
"profile-name": "ssid-eduroam"
},
"vlan": {
"vlan": "vlan-user"
}
},
{
"aaa_prof": {
"profile-name": "aaa-vtopenwifi"
},
"drop_mcast": {},
"profile-name": "vap-vtopenwifi",
"ssid_prof": {
"profile-name": "ssid-vtopenwifi"
},
"vlan": {
"vlan": "vlan-user"
}
}
],
"ap_group": [
{
"dot11a_prof": {
"profile-name": "rpa-default"
},
"profile-name": "agp-ageng",
"reg_domain_prof": {
"profile-name": "rdp-blacksburg"
},
"virtual_ap": [
{
"profile-name": "vap-eduroam"
},
{
"profile-name": "vap-vtopenwifi"
}
]
}
]
}
Better:
{
"virtual_ap": {
"_data": [
{
"aaa_prof": {
"profile-name": "aaa-eduroam"
},
"drop_mcast": {},
"profile-name": "vap-eduroam",
"ssid_prof": {
"profile-name": "ssid-eduroam"
},
"vlan": {
"vlan": "vlan-user"
}
},
{
"aaa_prof": {
"profile-name": "aaa-vtopenwifi"
},
"drop_mcast": {},
"profile-name": "vap-vtopenwifi",
"ssid_prof": {
"profile-name": "ssid-vtopenwifi"
},
"vlan": {
"vlan": "vlan-user"
}
}
],
"_flags": {
"ordered": false
}
},
"ap_group": {
"_data": [
{
"dot11a_prof": {
"profile-name": "rpa-default"
},
"profile-name": "agp-ageng",
"reg_domain_prof": {
"profile-name": "rdp-blacksburg"
},
"virtual_ap": {
"_data": [
{
"profile-name": "vap-eduroam"
},
{
"profile-name": "vap-vtopenwifi"
}
],
"_flags": {
"ordered": false
}
}
}
],
"_flags": {
"ordered": false
}
}
}
Best:
{
"virtual_ap": {
"vap-eduroam": {
"aaa_prof": "aaa-eduroam",
"drop_mcast": {},
"ssid_prof": "ssid-eduroam",
"vlan": "vlan-user"
},
"vap-vtopenwifi": {
"aaa_prof": "aaa-vtopenwifi",
"drop_mcast": {},
"ssid_prof": "ssid-vtopenwifi",
"vlan": "vlan-user"
}
},
"ap_group": {
"agp-ageng": {
"dot11a_prof": "rpa-default",
"profile-name": "agp-ageng",
"reg_domain_prof": "rdp-blacksburg",
"virtual_ap": {
"vap-eduroam": {},
"vap-vtopenwifi": {}
}
}
}
}
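Until the API offers something like this, a client can approximate the "Best" shape itself. The sketch below is one possible approach under stated assumptions, not the API's behavior: `normalize` and `_collapse` are hypothetical helpers, every list whose items all carry a "profile-name" is assumed to be unordered, and the name is dropped from each item because it becomes the key.

```python
def _collapse(key, value):
    """Turn {"profile-name": "x"} or {key: "x"} wrappers into the bare string."""
    if isinstance(value, dict) and len(value) == 1:
        (inner_key, inner_value), = value.items()
        if inner_key in ("profile-name", key) and isinstance(inner_value, str):
            return inner_value
    return normalize(value)


def normalize(node):
    """Recursively re-key lists of named profiles by their profile-name."""
    if isinstance(node, list):
        if node and all(isinstance(i, dict) and "profile-name" in i for i in node):
            # Assumption: named profiles are unordered, so a mapping is safe.
            # Duplicate names would silently overwrite each other.
            return {
                i["profile-name"]: normalize(
                    {k: v for k, v in i.items() if k != "profile-name"}
                )
                for i in node
            }
        return [normalize(i) for i in node]
    if isinstance(node, dict):
        return {k: _collapse(k, v) for k, v in node.items()}
    return node
```

Two normalized configurations can then be compared with plain `==`, since dictionary equality ignores the order in which the profiles were returned.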
Unpredictable ordering
Making the above problem worse, the ordering is only somewhat static: it seems to change when the device reboots.
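One way to confirm this is to take a configuration snapshot before and after a reboot and report the lists that hold the same items in a different order. The following is a rough sketch; `reordered_paths` is a hypothetical helper, and it only inspects lists reachable through dictionary keys (it does not descend into list items).

```python
import json


def _key(item):
    # Stable, order-independent fingerprint of one list item.
    return json.dumps(item, sort_keys=True)


def reordered_paths(before, after, path=""):
    """List the paths whose lists are permutations of each other but not equal."""
    found = []
    if isinstance(before, dict) and isinstance(after, dict):
        for k in before.keys() & after.keys():
            found += reordered_paths(before[k], after[k], f"{path}/{k}")
    elif isinstance(before, list) and isinstance(after, list):
        same_items = sorted(map(_key, before)) == sorted(map(_key, after))
        if before != after and same_items:
            found.append(path or "/")
    return sorted(found)
```

A non-empty result means only the order moved. Whether that matters cannot be decided from the data alone, since nothing marks a list as order-sensitive.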
Initial setup
- DNS checks fail if you do a permanent network setup without doing a temporary config first.
- The GUI password set during setup doesn't work. Resetting the password through the CLI to the same value used during setup makes it work.
First run
- The initial login lands on a "not authorized" page, which then redirects to a logout page that does nothing. Manually going to the cluster domain again redirects to the main COP page, now logged in.
Uploading a certificate
- I have yet to be successful in uploading a PEM. Things attempted:
  - Uploading a fully concatenated chain (i.e., leaf, key, and intermediate).
  - Uploading the root and intermediate certs explicitly as a CA, then uploading the leaf/key PEM.
- When uploading a PEM fails, the entire HTTP process dies, and the only way to recover is to rebuild COP (or probably TAC intervention).
- SANs are not checked when uploading a certificate. A typo here can take the whole server down.
- Uploading a second CA cert seems to override the first.
- What does work is uploading a PKCS12 that contains the server cert, key, and intermediate cert(s), although this still throws an error message.
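For reference, a PKCS12 bundle of that shape can be built from the PEM files with the openssl command-line tool. The sketch below wraps that call; the helper names and all file names are hypothetical, and it assumes openssl is installed and that the intermediates sit in a single PEM file.

```python
import subprocess


def pkcs12_export_cmd(leaf, key, chain, out):
    """Build the openssl argv that bundles cert, key, and intermediates."""
    return [
        "openssl", "pkcs12", "-export",
        "-in", leaf,         # server (leaf) certificate, PEM
        "-inkey", key,       # matching private key, PEM
        "-certfile", chain,  # intermediate certificate(s), PEM
        "-out", out,         # resulting PKCS12 bundle
    ]


def build_bundle(leaf, key, chain, out):
    # openssl prompts for an export password unless -passout is supplied.
    subprocess.run(pkcs12_export_cmd(leaf, key, chain, out), check=True)
```

A call such as `build_bundle("leaf.pem", "key.pem", "chain.pem", "cop.p12")` would write the bundle to `cop.p12`. Since SANs are not checked on upload, it is worth inspecting the leaf certificate by hand before bundling it.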
Missing features
- There is no way to upload the certificate from the CLI.
- There is no way to upload a server certificate without applying it. This makes it impossible to stage a change.