# My CVE-Consumption User Story

> **`Author`**: ***Keerthana Purushotham***
> 
> As a software developer on the Threat Management team for a **leading global cloud provider supporting mission-critical workloads across industries**, I struggle with inconsistent and malformed vulnerability data. Records from upstream sources often violate expected formats, omit critical fields, or contain extreme outliers that disrupt automated processing and validation workflows.
> 


## Problem Description

The Threat Management team operates as the **first node in the provider’s vulnerability-intelligence pipeline**, aggregating data from multiple upstream sources—including the [National Vulnerability Database (NVD)](https://nvd.nist.gov/), [Red Hat Security Data API](https://access.redhat.com/labs/securitydataapi/), [Debian](https://security-tracker.debian.org/tracker/), [SUSE](https://www.suse.com/security/cve/index.html), and other major ecosystems—before distribution to internal detection, response, and customer-facing systems.  
The internal data service runs continuously across three core phases—**Ingest**, **Variance**, and **Remediation**—to validate, normalize, and enrich vulnerability data before it propagates downstream.

However, the inconsistent quality and structure of upstream CVE data frequently create operational and engineering challenges. Records often arrive with missing or malformed fields such as unset identifiers, non-standard JSON structure, or text-based descriptions that are either empty or excessively long—sometimes embedding entire commit logs. These anomalies corrupt ingestion workflows, requiring engineers to maintain extensive defensive logic just to sanitize and reconcile invalid entries.
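
The kind of defensive logic described above can be illustrated with a minimal sketch. It is not the team's actual implementation: the flat record layout (`id` and `description` keys), the anomaly labels, and the 4,000-character cutoff are all assumptions made for illustration.

```python
import re
from typing import Any

# Assumed shape of a CVE identifier: "CVE-", a 4-digit year, 4+ digit sequence.
CVE_ID_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")
# Hypothetical cutoff for an "excessively long" description.
MAX_DESCRIPTION_CHARS = 4000


def find_anomalies(record: Any) -> list[str]:
    """Return anomaly labels for one raw upstream record (empty list if clean)."""
    if not isinstance(record, dict):
        return ["not_a_json_object"]

    anomalies = []

    # Unset or malformed identifier.
    cve_id = record.get("id")
    if not isinstance(cve_id, str) or not CVE_ID_PATTERN.fullmatch(cve_id.strip()):
        anomalies.append("missing_or_malformed_id")

    # Empty or excessively long description.
    description = record.get("description")
    if not isinstance(description, str) or not description.strip():
        anomalies.append("empty_description")
    elif len(description) > MAX_DESCRIPTION_CHARS:
        anomalies.append("overlong_description")

    return anomalies
```

Returning labels instead of raising lets an ingest phase quarantine a bad record and keep processing the rest of the feed.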

Fuzzy matching and manual verification are routinely necessary to identify affected packages accurately, as upstream feeds frequently lack consistent ecosystem tags, vendor attribution, or version boundaries. Descriptions may be overly terse or verbose, while description text duplicated across CVE IDs causes confusion in tracking and remediation. Divergent CVSS metrics between CNA and ADP sources further distort severity scoring and downstream prioritization, forcing human analysts to arbitrate inconsistencies that could be validated automatically at the source.
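
As a rough sketch of how CNA and ADP score divergence could be flagged automatically, the example below compares two CVSS v3.x base scores. The severity bands follow the CVSS v3.x qualitative rating scale. The function names and the 1.0-point tolerance are hypothetical choices, not part of any upstream tooling.

```python
from typing import Optional

# Hypothetical tolerance: score gaps larger than this are flagged for review.
MAX_SCORE_GAP = 1.0


def severity_band(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score out of range: {score}")
    if score == 0.0:
        return "NONE"
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    if score < 9.0:
        return "HIGH"
    return "CRITICAL"


def scores_diverge(
    cna_score: Optional[float],
    adp_score: Optional[float],
    max_gap: float = MAX_SCORE_GAP,
) -> bool:
    """Return True when the two scores disagree enough to need arbitration."""
    if cna_score is None or adp_score is None:
        # Only one source scored the CVE, so there is nothing to compare.
        return False
    if severity_band(cna_score) != severity_band(adp_score):
        return True
    return abs(cna_score - adp_score) > max_gap
```

Checking the band as well as the numeric gap catches small differences, such as 6.9 versus 7.0, that still change how a CVE is prioritized.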

Given that a **small team maintains three services spanning more than 20,000 lines of code**, responsible for **ingestion, parsing, variance detection, validation, remediation status, and SLA/SLO tracking** for CVEs impacting a large number of global services, these recurring anomalies impose a disproportionate operational burden and reduce confidence in automation. Every major vulnerability consumer, from cloud-scale providers to open-source maintainers, duplicates similar normalization logic simply to process malformed data.

---

### References

1. Purushotham, K. (2025, September 19). *Accuracy is not enough: Confusion matrix metrics that actually work in CVE impact prediction.* Zenodo. DOI: [10.5281/zenodo.17438182](https://doi.org/10.5281/zenodo.17438182).  
    - Also available via [Substack](https://keerthanapurushotham.substack.com/p/accuracy-is-not-enough-confusion), [Medium](https://medium.com/@keerthanapurushotham/accuracy-is-not-enough-confusion-matrix-metrics-that-actually-work-in-cve-impact-prediction-d4bafd9cec1b), and [GitHub](https://github.com/keerthanap8898/Accuracy-is-Not-Enough-in-Cybersecurity).
2. [Purushotham, K. (2025). *CVE User Story Description.*](https://github.com/keerthanap8898/CveToad/blob/main/CVE-user-story_Description.md)
3. [**CveToad**](https://github.com/keerthanap8898/CveToad) - I will publish more in-depth follow-up work in this domain in this repository.
    - See especially [***CVE_Metrics_Framework - Description***](https://github.com/keerthanap8898/CveToad/blob/main/CVE-user-story_Description.md)


**`Author`**: ***Keerthana Purushotham***.
>   - **`GitHub`**: *[github.com/keerthanap8898](https://github.com/keerthanap8898)*
>   - **`LinkedIn`**: *[linkedin.com/in/keerthanapurushotham](https://linkedin.com/in/keerthanapurushotham)*
> ---
> 
### **`Appendix`**
> 
> #### **Proposed Solution & Impact**:
> 
> Systemic improvement must occur **upstream**, at the **CVE publication and registry level**.  
> Enforcing **schema validation**, **type safety**, and **field-length constraints** at CNA submission time would prevent malformed data from reaching global consumers. Integrating **canonical identifiers** (e.g., *package URLs*) and enforcing **semantic versioning** across CNAs would eliminate substring collisions and misattribution of affected components.
> 
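
The submission-time checks proposed above could look roughly like the following sketch. Both regular expressions are simplified shape checks written for this example, not complete implementations of the Package URL or Semantic Versioning specifications. The function name and the 512-character limit are also assumptions.

```python
import re

# Rough shape checks only; neither regex is a complete implementation
# of the Package URL or Semantic Versioning specifications.
PURL_SHAPE = re.compile(
    r"pkg:[A-Za-z][A-Za-z0-9.+-]*/"  # scheme and package type
    r"[^@?#]+"                       # optional namespace plus name
    r"(?:@[^?#]+)?"                  # optional version
    r"(?:\?[^#]*)?"                  # optional qualifiers
    r"(?:#.*)?"                      # optional subpath
)
SEMVER_SHAPE = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"  # MAJOR.MINOR.PATCH
    r"(?:-[0-9A-Za-z.-]+)?"                      # optional pre-release
    r"(?:\+[0-9A-Za-z.-]+)?"                     # optional build metadata
)
MAX_PURL_CHARS = 512  # hypothetical field-length constraint


def validate_affected(purl: object, version: object) -> list[str]:
    """Return validation errors for one affected-component entry."""
    errors = []

    # Type safety, then length, then shape.
    if not isinstance(purl, str):
        errors.append("purl must be a string")
    elif len(purl) > MAX_PURL_CHARS:
        errors.append("purl exceeds the length limit")
    elif not PURL_SHAPE.fullmatch(purl):
        errors.append("purl is not shaped like pkg:type/name")

    if not isinstance(version, str):
        errors.append("version must be a string")
    elif not SEMVER_SHAPE.fullmatch(version):
        errors.append("version is not shaped like MAJOR.MINOR.PATCH")

    return errors
```

With a canonical identifier such as `pkg:pypi/requests`, consumers can compare components by exact equality instead of searching free-text product names.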
> Enhanced **cross-source variance checking**—comparing CNA vs. NVD vs. ADP records in near-real time—could preempt divergence before distribution. Similarly, automated linting of text fields could remove HTML fragments, excessive changelogs, or invalid Unicode.  
> Persistent `CWE-Other` and `NVD-CWE-noinfo` entries should trigger SLA-based feedback loops back to the originating CNA for refinement.
> 
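
A minimal sketch of the text linting and CWE feedback ideas follows, using only the Python standard library. The placeholder CWE strings are taken from the text above. The function names and the 30-day SLA window are assumptions, and a production linter would use a real HTML parser instead of a regular expression.

```python
import html
import re
import unicodedata

TAG_PATTERN = re.compile(r"<[^>]+>")
# Placeholder classifications named in the text above.
PLACEHOLDER_CWES = {"CWE-Other", "NVD-CWE-noinfo"}
SLA_DAYS = 30  # hypothetical window before feedback is sent to the CNA


def lint_description(text: str) -> str:
    """Strip HTML fragments and invalid characters, then collapse whitespace."""
    # Drop code points that cannot be encoded as UTF-8 (e.g. lone surrogates).
    text = text.encode("utf-8", errors="ignore").decode("utf-8")
    # Replace tags with a space so adjacent words do not merge.
    text = TAG_PATTERN.sub(" ", text)
    # Decode entities such as &amp; and drop replacement characters.
    text = html.unescape(text).replace("\ufffd", "")
    # Remove control characters, keeping whitespace for the collapse step.
    text = "".join(
        ch for ch in text if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    return " ".join(text.split())


def needs_cwe_feedback(
    cwe_ids: list[str], days_since_publication: int, sla_days: int = SLA_DAYS
) -> bool:
    """Return True when only placeholder CWEs remain after the SLA window."""
    only_placeholders = bool(cwe_ids) and all(
        cwe in PLACEHOLDER_CWES for cwe in cwe_ids
    )
    return only_placeholders and days_since_publication > sla_days
```

Detecting excessive changelogs is left out here because it needs a length or structure heuristic that depends on the feed.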
> These measures would significantly improve reliability across the CVE data supply chain, reducing redundant engineering and increasing analytical trust in vulnerability-intelligence pipelines.
> By addressing these foundational quality issues within the global CVE ecosystem, the security community can transform today’s fragmented ingestion practices into a unified, trustworthy data pipeline—one capable of powering reliable automation, accelerating remediation, and ultimately strengthening cyber resilience across the world’s most critical infrastructures.
> 
> ---
>
> ### **CVE Meta-data Framework Table**:
> ![CVE_Meta-data_Framework_Table](https://github.com/keerthanap8898/CveToad/blob/main/Resources/Images/CVE_Meta-data_Framework_Table.jpg)
>
> ---


