Key Takeaways
- Aegis creates a clean operational handoff: IVI owns the platform infrastructure (hardware, firmware, hypervisor), you own what runs on it (guest OS, applications, data).
- Purpose-built monitoring combines LogicMonitor's native platform modules with custom datasources for AIM environments, providing signal quality beyond generic infrastructure monitoring.
- 24x7x365 tiered response model: Tier 1 alert triage, Tier 2 IVI engineers for root cause and remediation, Tier 3 vendor TAC escalation managed by IVI on your behalf.
- Proactive lifecycle governance tracks firmware currency, CVE monitoring, and end-of-life planning across Nutanix AOS, Cisco UCS bundles, Pure Purity//OS, and Arista EOS with 18-month advance EOL notice.
What Aegis manages — and what it does not
This scope distinction is a key qualifier. Aegis creates a clean handoff: IVI owns the platform, you own what runs on it.
This co-managed model keeps your team in control of architecture and vendor relationships while IVI handles continuous monitoring, patching, and 24x7 incident response.
What Aegis owns operationally
Physical hardware health: UCS blades/rack servers, Pure FlashArray hardware, Arista switches — monitoring, alerting, first-call response
Firmware and OS currency: UCS firmware bundles (BIOS, CIMC, VIC/NIC adapters), Nutanix AOS (cluster OS), Pure Purity//OS, Arista EOS — CVE tracking, upgrade planning, coordinated execution
Nutanix AHV hypervisor layer: Host health, AHV service status, live migration monitoring, AHV issue first call with escalation path to Nutanix TAC for product-level software defects
Nutanix storage layer: CVM health, storage pool utilization, protection domain status, replication health, disk/SSD health, cluster balance
Pure FlashArray: Capacity, IOPS, latency, throughput, volume health, replication status, drive health — first call, coordinating with Pure on Evergreen hardware events and Purity//OS upgrade scheduling
Cisco UCS: Fabric interconnect health, blade/rack server health, service profile compliance, Intersight alarm management
Arista fabric: Leaf switch health, port utilization on FI uplinks, error rates, EOS currency, BGP/MLAG state if applicable
What Aegis explicitly does not own
Guest OS: Windows, Linux, or any operating system running inside a VM remains your responsibility. This layer is outside IVI's scope.
Application layer: Anything running in a guest VM — databases, middleware, custom applications — is out of scope entirely
Nutanix product-level software bugs: IVI is first call on hypervisor and cluster issues, but defects in Nutanix software that require TAC engineering are escalated to Nutanix with IVI managing the ticket and coordinating resolution on your behalf
Pure hardware physical replacement: Pure's Evergreen model handles proactive hardware swap. IVI coordinates scheduling, manages change windows, and validates post-swap return to health. IVI does not physically handle hardware.
What we watch — and how we watch it
Aegis monitoring combines LogicMonitor's native platform modules with custom datasources purpose-built for AIM environments. This approach provides signal quality beyond generic infrastructure monitoring.
Nutanix monitoring (via LogicMonitor + Nutanix Prism REST API v3)
LM Native Module (standard deployment):
Cluster health status and alert aggregation
Per-node CPU utilization (all cores, ready state)
Per-node memory utilization and balloon driver state
CVM (Controller VM) health, CPU, and memory per node
Storage pool utilization: used, free, reclaimable
Storage container utilization
Disk/SSD health status per node (SMART passthrough via Prism API)
Protection domain health and replication status
Cluster replication factor compliance (RF2/RF3 state)
Nutanix cluster-level alert ingestion (pulls Prism alerts as LM events)
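The alert-ingestion item above can be sketched as a small collector helper. This is a minimal sketch, not IVI's actual datasource: the `/api/nutanix/v3/alerts/list` path and payload fields follow Nutanix v3 API conventions and should be verified against your Prism Central version; the HTTP call itself is omitted.

```python
import json

# Sketch of pulling Prism Central alerts for LM event mapping.
# Endpoint path and field names follow Nutanix v3 API conventions
# (POST with a "kind"/"length"/"offset" body); verify against your
# AOS / Prism Central release before relying on them.

def build_alert_list_request(page_size=50, offset=0):
    """Build the path and POST body for a v3 alert list call."""
    path = "/api/nutanix/v3/alerts/list"
    payload = {"kind": "alert", "length": page_size, "offset": offset}
    return path, json.dumps(payload)

def count_by_severity(alert_entities):
    """Group returned alert entities by severity for event routing."""
    counts = {}
    for ent in alert_entities:
        sev = ent.get("status", {}).get("resources", {}).get("severity", "unknown")
        counts[sev] = counts.get(sev, 0) + 1
    return counts
```

A collector would POST the body, then feed the parsed `entities` array into `count_by_severity` to map severities onto LM alert levels.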
Custom Module Requirements:
AHV Service Health Monitor: Collects ahv_service_status per host (libvirtd, acropolis, ovs services) via Python/Groovy against Prism Central /hosts endpoint. Why custom: LM's Nutanix module monitors cluster health, not individual AHV daemon status at the host level.
VM Density and vCPU Overcommit Ratio: Collects powered-on VM count per host, vCPU:pCPU ratio per host via Prism Central /vms with host affinity grouping. Why needed: Overcommit creep is a common performance degradation path in managed environments — proactive alerting prevents reactive incidents.
Live Migration Event Monitoring: Collects in-flight AHV live migrations, migration failure events via Prism Central /tasks filtered by entity_type = live_migrate. Why needed: Migration failures often indicate storage or network path degradation before a more severe event surfaces.
Protection Domain RPO Breach Detection: Collects last_replication_time per protection domain vs. configured RPO via Prism Element API (per cluster) /protection_domains endpoint. Why needed: LM's native module shows PD health boolean, not RPO delta. Alert threshold: configurable per PD, default recommendation 20% over RPO.
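The RPO-breach logic described above reduces to a small comparison. A minimal sketch, assuming the last-replication timestamp has already been pulled from the `/protection_domains` response; the default margin matches the 20%-over-RPO recommendation:

```python
# Sketch of the RPO-breach check: compare time since the last
# successful replication against the configured RPO, alerting once
# the lag exceeds the RPO by a configurable margin (default 20%,
# per the recommendation above). Timestamps are epoch seconds.

def rpo_breached(last_replication_epoch, now_epoch, rpo_seconds, margin_pct=20):
    """Return True when replication lag exceeds RPO by more than margin_pct."""
    lag = now_epoch - last_replication_epoch
    return lag > rpo_seconds * (1 + margin_pct / 100.0)
```

Because the native module only exposes a health boolean, this delta is what turns a silent RPO drift into an actionable alert.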
Cisco UCS monitoring (via LogicMonitor + Cisco UCS / Intersight)
LM Native Module (standard deployment):
UCS Manager XML API or Intersight REST API integration
Blade/rack server health and fault aggregation
Fabric Interconnect health (uptime, port status, temperature, PSU)
Service profile compliance status
Server-level CPU, memory, and adapter health
UCS system-level faults (critical/major/minor classification)
Custom Module Requirements:
FI Uplink Port Utilization (Storage VLAN paths): Collects per-port utilization on FI uplinks specifically carrying NVMe-oF and management VLANs, with separate thresholds from general uplink traffic. Why custom: LM's UCS module aggregates FI health but doesn't provide per-port utilization on specific VLAN-tagged trunk ports at the granularity needed for storage path monitoring. Source: UCS Manager ethportStats XML API object.
vNIC/vHBA Operational State: Collects operational state (up/down/degraded) of vNICs defined in service profile templates, with mapping to physical port path. Why needed: vNIC path failures can cause VM connectivity loss without a corresponding server fault alarm. Source: UCS Manager vnicEther MO.
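The vNIC state check above can be sketched against the UCS Manager XML API. This is an illustrative sketch only: the `aaaLogin`/`configResolveClass` exchange follows the documented UCSM XML API pattern (posted to the `/nuova` endpoint), but the `operState` attribute values should be verified against your UCSM version, and the transport layer is omitted.

```python
import xml.etree.ElementTree as ET

# Sketch of the vNIC operational-state check via the UCS Manager
# XML API. Requests are XML documents POSTed to /nuova; aaaLogin
# returns a cookie used by subsequent queries. Attribute names are
# assumptions to verify against your UCSM release.

def login_request(user, password):
    """Build the aaaLogin request body (credentials are placeholders)."""
    return '<aaaLogin inName="%s" inPassword="%s"/>' % (user, password)

def vnic_query(cookie):
    """Build a class query for all vnicEther managed objects."""
    return ('<configResolveClass cookie="%s" classId="vnicEther" '
            'inHierarchical="false"/>' % cookie)

def degraded_vnics(response_xml):
    """Extract vnicEther MOs whose operState is not 'up'."""
    root = ET.fromstring(response_xml)
    bad = []
    for mo in root.iter("vnicEther"):
        if mo.get("operState") != "up":
            bad.append((mo.get("dn"), mo.get("operState")))
    return bad
```

Mapping each returned `dn` back to its physical port path is what lets the module alert on a path failure that raises no server fault.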
Pure Storage monitoring (via LogicMonitor Pure FlashArray Module)
LM Native Module (standard deployment — LM has official Pure datasource):
Array-level capacity (used, provisioned, data reduction ratio)
Read/write IOPS
Read/write latency (microsecond resolution)
Read/write throughput (GB/s)
Per-volume IOPS and latency (can identify noisy-neighbor VMs)
Drive health status
Controller health (active/passive)
Replication session status and lag
Pure FlashArray alert ingestion
Custom Module Requirements:
NVMe-oF Path Health (End-to-End): This is NOT a Pure API metric; it requires correlating Pure (which sees initiator connections) with UCS/Arista (which see the Ethernet path). Approach: a custom module that checks (a) the Pure REST API for connected hosts and active NVMe-oF target sessions and (b) Arista eAPI for interface error rates on ports known to carry storage VLANs, and alerts if Pure shows fewer active sessions than the expected host count.
Evergreen Purity//OS Version Currency: Collects current Purity//OS version vs. IVI-maintained recommended version baseline (updated when Pure releases new stable builds). Why custom: LM doesn't track firmware currency against a baseline.
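The correlation logic behind the NVMe-oF path check can be sketched as a pure function. This assumes the Pure session count and Arista per-port error rates have already been collected by their respective datasources; the function only implements the alert conditions described above, and the error-rate threshold is an illustrative default.

```python
# Sketch of the cross-platform NVMe-oF path correlation: compare
# active Pure sessions against the expected host count, and flag
# storage-VLAN ports whose error rate exceeds a threshold. Inputs
# are assumed to be pre-collected; the threshold is illustrative.

def nvmeof_path_alerts(expected_hosts, active_sessions, port_errors,
                       error_rate_threshold=0.001):
    """Return alert strings for missing sessions or errored storage ports."""
    alerts = []
    if active_sessions < expected_hosts:
        alerts.append("pure: %d NVMe-oF sessions active, expected %d"
                      % (active_sessions, expected_hosts))
    for port, rate in port_errors.items():
        if rate > error_rate_threshold:
            alerts.append("arista: %s error rate %.4f over threshold"
                          % (port, rate))
    return alerts
```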
Arista fabric monitoring (via LogicMonitor + SNMP / eAPI)
LM Native Module:
Interface utilization (in/out, error rates, discards)
BGP peering state (if the spine-leaf fabric uses BGP)
MLAG peer state (if dual-attached FI uplinks use MLAG)
LLDP neighbor state (confirms physical topology matches expected)
CPU and memory utilization per switch
EOS hardware health (power supply, fan, temperature)
Custom Module Requirements:
Deep Buffer Utilization (critical for NVMe-oF storage paths): Arista switches used in AIM environments (7050X, 7060X series) have deep buffers specifically for storage traffic. Default SNMP doesn't expose queue depth or buffer utilization. Source: Arista eAPI "show hardware capacity" and "show interfaces" output.
Storage VLAN Path Continuity: Confirms that the storage VLAN (NVMe-oF VLAN, typically dedicated) is present and active on all expected trunk ports between FIs and leaves. Source: Arista eAPI "show vlan" filtered to storage VLANs. Alert if: storage VLAN missing from any expected port.
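The VLAN continuity check above can be sketched as a pure function over parsed eAPI output. The response shape shown ("vlans" keyed by VLAN ID, each with an "interfaces" map) mirrors eAPI's JSON output for "show vlan" but should be verified against your EOS version; the VLAN ID and port names are hypothetical.

```python
# Sketch of the storage-VLAN continuity check: given parsed eAPI
# "show vlan" JSON, return any expected trunk port where the
# storage VLAN is absent. Response shape is an assumption to
# verify against your EOS release.

def missing_storage_vlan_ports(show_vlan_json, vlan_id, expected_ports):
    """Return expected ports where the storage VLAN is not present."""
    vlan = show_vlan_json.get("vlans", {}).get(str(vlan_id), {})
    present = set(vlan.get("interfaces", {}))
    return sorted(set(expected_ports) - present)
```

Alert if the returned list is non-empty: the storage VLAN is missing from an expected port.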
How we respond when something breaks
Aegis provides 24x7x365 monitoring with paged response. Tiered escalation: Tier 1 (alert triage, isolation) → Tier 2 (IVI engineers, root cause, remediation) → Tier 3 (vendor TAC escalation managed by IVI).
Incident ownership by category
Physical Hardware Failure:
UCS blade/rack server failure: IVI Tier 2 owns triage; Cisco TAC for RMA.
UCS Fabric Interconnect failure: IVI Tier 2 owns; Cisco TAC for RMA.
Pure FlashArray drive failure: IVI coordinates; Pure Evergreen handles proactive replacement under the support contract.
Pure controller failure: IVI first call; Pure TAC plus the Evergreen hardware replacement path.
Arista switch hardware failure: IVI Tier 2 owns; Arista TAC for RMA.
All RMA logistics and vendor ticket ownership remain with IVI.
Nutanix Platform Issues:
CVM failure or CVM service failure on a node: IVI Tier 2 first call. Recovery: CVM restart, node isolation, cluster rebalance as needed. Escalate to Nutanix TAC if the CVM fails to recover or the issue is in AOS code.
AOS cluster health degradation (RF failure, metadata ring issues): IVI first call; Nutanix TAC escalation for software defects.
AHV hypervisor service failure (libvirtd, acropolis agent): IVI first call. Recovery: service restart, host evacuation, host maintenance mode as needed. Escalate to Nutanix if AHV itself is the defect source.
VM live migration failure: IVI owns. Diagnose storage path, network path, and destination host capacity; remediate. Escalate only if an AHV defect is confirmed.
Protection domain replication failure: IVI owns. Diagnose and remediate connectivity; reseed if required. Escalate to Nutanix for software defects.
Storage performance degradation: IVI owns the full investigation across Pure (array side), Arista (path), and UCS (host adapter) to isolate the cause.
What IVI explicitly does not respond to
IVI will NOT respond to:
Guest OS failures (Windows BSOD, Linux kernel panic, OS corruption)
Application failures inside VMs (database crashes, middleware failures)
Application performance issues (slow query, high application CPU)
User account and access issues within the guest OS
Backup/restore of guest data (unless separately contracted)
IVI will confirm the VM is running and that the hypervisor/storage/network layers are healthy — any issue above the hypervisor is referred to your team.
Escalation path and SLA
You must maintain active Nutanix, Pure, and Cisco support contracts. IVI manages and owns these vendor TAC relationships during incidents and provides a single point of contact; you never need to call vendors directly.
Staying current without the operational burden
Platform currency requires disciplined tracking across multiple vendor release cycles. Aegis manages this operational overhead while keeping you in control of change timing.
What IVI tracks and manages
Nutanix AOS:
Current production AOS version tracked against the Nutanix release calendar
LTS (Long Term Support) vs. STS (Short Term Support) branch awareness
CVE tracking: IVI monitors Nutanix security advisories and PSIRT notices
Upgrade planning: pre-upgrade compatibility checks (AHV version, NCC version, LCM catalog requirements), scheduling, execution, and post-upgrade validation
Foundation and NCC (Nutanix Cluster Check) currency also managed
Cisco UCS:
UCS firmware bundle currency (covers BIOS, CIMC, VIC/NIC firmware, FI NX-OS)
Managed via Intersight in Intersight Managed Mode (IMM)
Intersight's firmware policy framework used to track compliance and stage upgrades across the UCS domain
CVE tracking for UCS platform advisories
Pure Storage (Purity//OS):
Purity//OS version tracking against Pure's recommended release channel
Pure's Evergreen model provides non-disruptive upgrades; IVI coordinates scheduling and manages change windows around business requirements
Pure1 (Pure's cloud management) monitors array health proactively; IVI uses this alongside LM for dual-layer coverage
Drive and controller proactive replacement coordination (Pure Evergreen)
Arista EOS:
EOS train tracking (specific train vs. maintenance releases)
CVE monitoring via Arista security advisories
Upgrade coordination with existing change control processes
eAPI and OpenConfig compatibility maintained through upgrades
End-of-life tracking
IVI maintains an EOL register for all hardware in the AIM environment. 18-month advance notice for hardware approaching EOS/EOL feeds into IVI's lifecycle-driven refresh planning conversations.
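The 18-month notice rule above is a simple date comparison. A minimal sketch, with a hypothetical register structure (device name mapped to vendor EOL date); IVI's actual register will carry more detail:

```python
from datetime import date

# Sketch of the EOL-register check: flag devices whose vendor EOL
# date falls within the 18-month advance-notice window described
# above. The register structure here is hypothetical.

def within_notice_window(eol_date, today, notice_months=18):
    """True when eol_date is notice_months or fewer months away."""
    months_out = (eol_date.year - today.year) * 12 + (eol_date.month - today.month)
    return months_out <= notice_months

def devices_needing_notice(register, today, notice_months=18):
    """Return devices that should enter refresh-planning conversations."""
    return [name for name, eol in register.items()
            if within_notice_window(eol, today, notice_months)]
```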
Who this is for
This co-managed services model is designed for organizations that need operational discipline without operational ownership transfer.
This model fits when you have:
Completed or are completing an AIM modernization engagement and need ongoing operational support for the modernized environment.
Strategic skills to own architecture and vendor relationships, but not the bandwidth for continuous monitoring, patching, and 24x7 incident response.
Experienced infrastructure incidents caused by firmware drift, missed patching cycles, or unmonitored hardware degradation, and want a structured operational model to prevent recurrence.
Mid-market scale (typically 200-5000 employees), where a dedicated infrastructure operations team is not cost-justified but the infrastructure complexity demands one.
Compliance requirements (SOC 2, HIPAA, PCI) that demand documented change control, firmware currency, and incident response records; Aegis provides the documentation and audit trail.
This model does not fit when you need:
A fully outsourced IT department: IVI does not manage guest OS, helpdesk, or end-user computing.
Support for generic x86 hardware without UCS, Pure, or Nutanix in the stack: IVI's managed practice is purpose-built around these specific platforms.
What buyers evaluate when considering Aegis
Can I maintain control of my environment while getting operational support? Yes. Co-managed model — IVI operates within your change control process. You approve all changes. IVI executes and documents.
What happens when something breaks at 2am? IVI's NOC monitors 24x7 and is paged on critical alerts. IVI Tier 2 engineers respond, not a helpdesk. You are notified per the defined escalation procedure once the situation is assessed.
How does this interact with our existing vendor support contracts? You maintain Nutanix, Pure, and Cisco contracts. IVI manages those relationships as the operator — IVI opens and owns TAC cases, coordinates RMA logistics, and manages resolution timeline.
How is this different from just buying monitoring software ourselves? LogicMonitor alone shows you what's happening. Aegis adds the engineers who respond, the operational process to act on it, the lifecycle discipline to prevent it, and the vendor relationships to resolve it. Tools without process are just a more expensive way to get paged at 2am.
What does onboarding look like? IVI conducts an environment baseline (typically 2-3 weeks for AIM environments), deploys LM collectors, configures all datasources and custom modules, establishes alert thresholds, integrates with your ITSM, and conducts a runbook workshop. Target: fully operational Aegis coverage within 4-6 weeks for a standard AIM environment.