Published August 22, 2025 | Version 1.0
Dataset Open

Replication data for: Gender Disparity in U.S. Patenting

  • 1. Stony Brook University
  • 2. Arizona State University

Description

This dataset provides inventor-, organization-, and patent-level information on U.S. utility patents (1976–2021). It has been curated to enable research on gender disparities in patenting, inventor team composition, organizational characteristics, and innovation outcomes. The dataset is based on disambiguated inventor, assignee, and patent information, enriched with bibliometric, geographic, and citation indicators.

The dataset consists of three CSV files:

1. 01_distinct_inventor_information.csv

Unit of analysis: Unique inventors.

Description: Contains demographic, geographic, and innovation-related characteristics for distinct inventors in the dataset.

Key variables include:

  • inventor_id: Unique inventor identifier.

  • inventor_first_name, inventor_last_name: Disambiguated inventor names.

  • patent_count: Number of patents linked to the inventor.

  • gender_code, gender_evaluation_method: Assigned gender and method of inference.

  • first_filing_year, last_filing_year: Patent activity period.

  • first_author_patent_count: Number of patents with inventor listed first.

  • Technological scope: Counts of CPC subclasses and subgroups per inventor.

  • Backward and forward citations: Sums and means across patents.

  • Bibliometric indicators: Originality, generality, combinatorial novelty, and disruption (cd_5, cd_2017y).

  • Science linkage: Number of cited scientific papers.

  • Geographic information: City, state, country, county, latitude/longitude, and FIPS codes.
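As an illustration of working with the inventor file, the following is a minimal sketch that tallies inventors by gender_code. It assumes the file parses as a comma-delimited CSV with a header row. The column names come from the list above, but the category values in the sample (F, M, and blank) are invented for illustration and are not confirmed by this description. The helper name tally_gender is hypothetical.

```python
import csv
import io
from collections import Counter


def tally_gender(rows):
    """Count inventors per gender_code value, mapping blanks to 'missing'."""
    counts = Counter()
    for row in rows:
        code = (row.get("gender_code") or "").strip()
        counts[code or "missing"] += 1
    return counts


# Tiny in-memory stand-in for 01_distinct_inventor_information.csv.
# The values are made up; only the column names come from the description.
sample = io.StringIO(
    "inventor_id,gender_code,patent_count\n"
    "a1,F,3\n"
    "a2,M,1\n"
    "a3,,2\n"
    "a4,M,5\n"
)
gender_counts = tally_gender(csv.DictReader(sample))
print(dict(gender_counts))
```

For the downloaded file, the same function can be fed csv.DictReader(open(path, newline="")), which reads one row at a time instead of loading the whole file into memory. The file encoding is not stated here, so it may need to be set explicitly.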

2. 02_distinct_organizational_assignee_information.csv

Unit of analysis: Unique organizational assignees.

Description: Summarizes the characteristics of distinct organizational assignees, including patenting activity, gender composition of inventor teams, and bibliometric indicators.

Key variables include:

  • assignee_id, disambig_assignee_organization: Unique ID and disambiguated organization name.

  • patent_count: Number of assigned patents.

  • assignee_type, assignee_type_name, assignee_type_name_adj: Organization type (e.g., firm, university, government).

  • first_filing_year, last_filing_year: Patent activity period.

  • Inventor gender composition: Male, female, undefined counts; all-male, all-female, and gender-collaboration team measures.

  • Technological scope: Mean counts of CPC section, subclass, and group.

  • Citation measures: Backward and forward citations, scientific publication citations.

  • Bibliometric indicators: Originality, generality, combinatorial novelty.

  • Gender ratios: Fraction of patents with women inventors, team gender ratios.
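A quick way to get oriented in the assignee file is to rank organizations by patent_count. The sketch below assumes a comma-delimited CSV with a header row and uses the two column names listed above. The organization names and counts in the sample are invented, and top_assignees is a hypothetical helper.

```python
import csv
import heapq
import io


def top_assignees(rows, n=3):
    """Return the n (patent_count, organization) pairs with the most patents."""
    def parsed(source):
        for row in source:
            try:
                count = int(float(row["patent_count"]))
                name = row["disambig_assignee_organization"]
            except (KeyError, TypeError, ValueError):
                continue  # skip rows with a missing or non-numeric count
            yield count, name

    # Rank on the count only, so ties never compare the names.
    return heapq.nlargest(n, parsed(rows), key=lambda pair: pair[0])


# Invented stand-in for 02_distinct_organizational_assignee_information.csv.
sample = io.StringIO(
    "assignee_id,disambig_assignee_organization,patent_count\n"
    "x1,Org A,4\n"
    "x2,Org B,1\n"
    "x3,Org C,9\n"
    "x4,Org D,\n"
)
top_two = top_assignees(csv.DictReader(sample), n=2)
print(top_two)
```

heapq.nlargest keeps only n candidates in memory at a time, so the ranking can be computed in a single streaming pass over the file.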

3. 03_utility_patent_information.csv

Unit of analysis: Individual utility patents.

Description: Provides patent-level information, including bibliometric measures, team composition, organizational assignment, and government funding reliance.

Key variables include:

  • patent_id: Patent identifier.

  • num_claims, filing_year, grant year/date: Patent characteristics.

  • team_size: Inventor team size.

  • Technological scope: CPC section, subclass, and group counts.

  • Citations: Backward citations, forward citations (5/7/10 years), originality, generality.

  • Disruption and novelty indicators: cd_5, cd_10, novelty upon granting.

  • Assignee information: IDs, names, type, and counts.

  • Inventor gender composition: Counts of male, female, undefined inventors; women participation indicators.

  • Government reliance: Categorization of patents by reliance on government funding (two-type and three-type).

  • WIPO categories: Sector and field identifiers and titles.

  • Impact metrics: Percentile rankings, top 10% indicators for citation and disruption.

  • Science linkage: Number of cited scientific papers and per-inventor measures.
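The patent-level file lends itself to simple time-series summaries. As a minimal sketch, the code below computes the mean team_size per filing_year in one pass. It assumes both columns are stored as plain numbers in a comma-delimited CSV with a header row. The sample rows are invented, and mean_team_size_by_year is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def mean_team_size_by_year(rows):
    """Return {filing_year: mean team_size}, skipping rows with unusable values."""
    totals = defaultdict(lambda: [0.0, 0])  # year -> [sum of team_size, row count]
    for row in rows:
        try:
            year = int(float(row["filing_year"]))
            size = float(row["team_size"])
        except (KeyError, TypeError, ValueError):
            continue  # missing column, blank cell, or non-numeric value
        totals[year][0] += size
        totals[year][1] += 1
    return {year: total / n for year, (total, n) in sorted(totals.items())}


# Invented stand-in for 03_utility_patent_information.csv.
sample = io.StringIO(
    "patent_id,filing_year,team_size\n"
    "p1,1999,2\n"
    "p2,1999,4\n"
    "p3,2005,1\n"
    "p4,,3\n"
)
means = mean_team_size_by_year(csv.DictReader(sample))
print(means)
```

Because only running sums are kept, the same function works on the full file when given csv.DictReader(open(path, newline="")), without holding all rows in memory.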

Data Sources and Construction

The dataset integrates information from multiple sources:

  • PatentsView open data platform: Core source of patent, inventor (including original gender code), assignee, and location data.

  • Merged external datasets:

    1. Funk, R. J., Park, M., & Leahey, E. (2022). Papers and patents are becoming less disruptive over time (1.0). Zenodo. https://doi.org/10.5281/zenodo.7258379

    2. Fleming, L., Green, H., Li, G.-C., Marx, M., & Yao, D. (2019). Replication Data for: Government-funded research increasingly fuels innovation. Harvard Dataverse. https://doi.org/10.7910/DVN/DKESRC

    3. Marx, M., & Fuegi, A. (2020). Reliance on science: Worldwide front-page patent citations to scientific articles. Strategic Management Journal, 41(9), 1572–1594. https://doi.org/10.1002/smj.3145

    4. Marx, M., & Fuegi, A. (2022). Reliance on science by inventors: Hybrid extraction of in-text patent-to-article citations. Journal of Economics & Management Strategy, 31(2), 369–392. https://doi.org/10.1111/jems.12455

Enhancements and derived variables:

  • Final gender code: Created using an LLM-assisted approach, as described in our associated research paper.

  • Patent indicators computed: Originality, generality, combinatorial novelty, etc.

Notes

  • Only utility patents are included; design and plant patents are excluded.

  • Gender inference is probabilistic and based on name-based algorithms plus LLM-assisted refinement. Results should be interpreted with care.

  • Some location and gender data may remain incomplete or missing.

  • Bibliometric indicators follow standard measures in the patent analytics literature.

  • This dataset description was created with the assistance of ChatGPT (GPT-5).

Files

Files (1.8 GB)

Size and checksum of each file:

  • 492.2 MB (md5:e85d0f3dd2a9980e2a5e9973d4897122)
  • 47.8 MB (md5:4f0de8d2eaacf1a091f1e2799863de98)
  • 1.2 GB (md5:35e66c9fcd62e331eae437c86ff170c8)
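The md5 values listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's hashlib. The helper names md5_of_file and matches_checksum are hypothetical, and the local file path is whatever the file was saved as.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or bare hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Call matches_checksum with the path of a downloaded file and the checksum string listed for that file. A result of False suggests a truncated or corrupted download. Reading in chunks keeps memory use flat even for the largest file.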

Additional details

Dates

Updated
2025-08-22