Replication data for: Gender Disparity in U.S. Patenting
Authors/Creators
Description
This dataset provides inventor-, organization-, and patent-level information on U.S. utility patents (1976–2021). It has been curated to enable research on gender disparities in patenting, inventor team composition, organizational characteristics, and innovation outcomes. The dataset is based on disambiguated inventor, assignee, and patent information, enriched with bibliometric, geographic, and citation indicators.
The dataset consists of three CSV files:
1. 01_distinct_inventor_information.csv
Unit of analysis: Unique inventors.
Description: Contains demographic, geographic, and innovation-related characteristics for distinct inventors in the dataset.
Key variables include:
- inventor_id: Unique inventor identifier.
- inventor_first_name, inventor_last_name: Disambiguated inventor names.
- patent_count: Number of patents linked to the inventor.
- gender_code, gender_evaluation_method: Assigned gender and method of inference.
- first_filing_year, last_filing_year: Patent activity period.
- first_author_patent_count: Number of patents with inventor listed first.
- Technological scope: Counts of CPC subclasses and subgroups per inventor.
- Backward and forward citations: Sums and means across patents.
- Bibliometric indicators: Originality, generality, combinatorial novelty (cd_5, cd_2017y).
- Science linkage: Number of cited scientific papers.
- Geographic information: City, state, country, county, latitude/longitude, and FIPS codes.
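As a rough illustration of how the inventor file might be read, the sketch below tallies gender codes and computes each inventor's active span from first_filing_year and last_filing_year. The sample rows are invented, and the gender_code values ("M", "F", empty for missing) are assumptions; only the column names come from the variable list above.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; only the column names come from the description.
# The gender_code values ("M", "F", "") are assumptions for illustration.
SAMPLE = """inventor_id,patent_count,gender_code,first_filing_year,last_filing_year
inv-1,12,M,1990,2005
inv-2,3,F,2001,2004
inv-3,1,,2010,2010
"""


def summarize_inventors(handle):
    """Tally gender codes and compute each inventor's active span in years."""
    gender_counts = Counter()
    spans = {}
    for row in csv.DictReader(handle):
        # Treat an empty gender_code as missing rather than as a category.
        gender_counts[row["gender_code"] or "missing"] += 1
        spans[row["inventor_id"]] = (
            int(row["last_filing_year"]) - int(row["first_filing_year"])
        )
    return gender_counts, spans


gender_counts, spans = summarize_inventors(io.StringIO(SAMPLE))
print(dict(gender_counts))
print(spans)
```

To run this against the real file, pass an open file handle instead of the in-memory sample, for example `open("01_distinct_inventor_information.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption and should be checked against the actual file.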
2. 02_distinct_organizational_assignee_information.csv
Unit of analysis: Unique organizational assignees.
Description: Summarizes the characteristics of distinct organizational assignees, including patenting activity, gender composition of inventor teams, and bibliometric indicators.
Key variables include:
- assignee_id, disambig_assignee_organization: Unique ID and disambiguated organization name.
- patent_count: Number of assigned patents.
- assignee_type, assignee_type_name, assignee_type_name_adj: Organization type (e.g., firm, university, government).
- first_filing_year, last_filing_year: Patent activity period.
- Inventor gender composition: Male, female, undefined counts; all-male, all-female, and gender-collaboration team measures.
- Technological scope: Mean counts of CPC section, subclass, and group.
- Citation measures: Backward and forward citations, scientific publication citations.
- Bibliometric indicators: Originality, generality, combinatorial novelty.
- Gender ratios: Fraction of patents with women inventors, team gender ratios.
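A minimal sketch of one way to aggregate the assignee file: summing patent_count by assignee_type_name. The organization names and type labels in the sample are invented for illustration; the actual labels in the file may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the type labels ("firm", "university") are assumed.
SAMPLE = """assignee_id,disambig_assignee_organization,patent_count,assignee_type_name
a-1,Example Corp,120,firm
a-2,Example University,45,university
a-3,Another Corp,30,firm
"""


def patents_by_type(handle):
    """Sum patent_count over assignees within each organization type."""
    totals = defaultdict(int)
    for row in csv.DictReader(handle):
        totals[row["assignee_type_name"]] += int(row["patent_count"])
    return dict(totals)


totals = patents_by_type(io.StringIO(SAMPLE))
print(totals)
```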
3. 03_utility_patent_information.csv
Unit of analysis: Individual utility patents.
Description: Provides patent-level information, including bibliometric measures, team composition, organizational assignment, and government funding reliance.
Key variables include:
- patent_id: Patent identifier.
- num_claims, filing_year, grant year/date: Patent characteristics.
- team_size: Inventor team size.
- Technological scope: CPC section, subclass, and group counts.
- Citations: Backward citations, forward citations (5/7/10 years), originality, generality.
- Disruption and novelty indicators: cd_5, cd_10, novelty upon granting.
- Assignee information: IDs, names, type, and counts.
- Inventor gender composition: Counts of male, female, undefined inventors; women participation indicators.
- Government reliance: Categorization of patents by reliance on government funding (two-type and three-type).
- WIPO categories: Sector and field identifiers and titles.
- Impact metrics: Percentile rankings, top 10% indicators for citation and disruption.
- Science linkage: Number of cited scientific papers and per-inventor measures.
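The patent-level file lends itself to simple yearly summaries. The sketch below computes, per filing year, the share of patents with at least one woman inventor. The column name female_inventor_count is hypothetical (the description mentions such counts but does not name the column), and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows. patent_id, filing_year and team_size are named in
# the description; female_inventor_count is an assumed column name.
SAMPLE = """patent_id,filing_year,team_size,female_inventor_count
p1,2000,3,0
p2,2000,2,1
p3,2001,1,1
p4,2001,4,0
p5,2001,2,2
"""


def women_participation_by_year(handle):
    """Return {filing_year: share of patents with >= 1 woman inventor}."""
    totals = defaultdict(int)
    with_women = defaultdict(int)
    for row in csv.DictReader(handle):
        year = int(row["filing_year"])
        totals[year] += 1
        if int(row["female_inventor_count"]) > 0:
            with_women[year] += 1
    return {year: with_women[year] / totals[year] for year in sorted(totals)}


shares = women_participation_by_year(io.StringIO(SAMPLE))
print(shares)
```

If the file already provides a women participation indicator, that column can replace the count comparison directly.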
Data Sources and Construction
The dataset integrates information from multiple sources:
- PatentsView open data platform: Core source of patent, inventor (including original gender code), assignee, and location data.
- Merged external datasets:
  - Funk, R. J., Park, M., & Leahey, E. (2022). Papers and patents are becoming less disruptive over time (1.0). Zenodo. https://doi.org/10.5281/zenodo.7258379
  - Fleming, L., Green, H., Li, G.-C., Marx, M., & Yao, D. (2019). Replication Data for: Government-funded research increasingly fuels innovation. Harvard Dataverse. https://doi.org/10.7910/DVN/DKESRC
  - Marx, M., & Fuegi, A. (2020). Reliance on science: Worldwide front-page patent citations to scientific articles. Strategic Management Journal, 41(9), 1572–1594. https://doi.org/10.1002/smj.3145
  - Marx, M., & Fuegi, A. (2022). Reliance on science by inventors: Hybrid extraction of in-text patent-to-article citations. Journal of Economics & Management Strategy, 31(2), 369–392. https://doi.org/10.1111/jems.12455
- Enhancements and derived variables:
  - Final gender code: Created using an LLM-assisted approach, as described in our associated research paper.
  - Patent indicators computed: Originality, generality, combinatorial novelty, etc.
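For readers unfamiliar with the originality and generality indicators, the sketch below shows the Herfindahl-based definition that is common in the patent literature: one minus the sum of squared shares of citations across technology classes. Originality applies it to the classes of backward citations, generality to the classes of forward citations. This is an illustration only; the description does not state which variant or class level was used to build the dataset, and the function name and sample classes are hypothetical.

```python
from collections import Counter


def herfindahl_dispersion(classes):
    """Return 1 - sum of squared class shares, or None for an empty list."""
    if not classes:
        # No citations: the measure is undefined rather than zero.
        return None
    counts = Counter(classes)
    total = len(classes)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical CPC subclasses of the patents that one patent cites.
cited_classes = ["H04L", "H04L", "G06F", "A61K"]
originality = herfindahl_dispersion(cited_classes)
print(originality)
```

A value of 0 means all citations fall in a single class; values closer to 1 indicate citations spread across many classes.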
Notes
- Only utility patents are included; design and plant patents are excluded.
- Gender inference is probabilistic and based on name-based algorithms plus LLM-assisted refinement. Results should be interpreted with care.
- Some location and gender data may remain incomplete or missing.
- Bibliometric indicators follow standard measures in patent analytics literature.
- This dataset description was created with the assistance of ChatGPT (GPT-5).
Additional details
Dates
- Updated: 2025-08-22