Ensemble Retrieval Strategies for an Improved NAICS Search Engine in the U.S. Census Bureau

Milne, Cameron; Lee, Yezzi Angi; Wilson, Taylor; Ferronato, Hector

doi:10.5281/zenodo.13307742

Published August 12, 2024 | Version v1

Publication Open

Large Language Models (LLMs) have achieved significant performance gains over classical approaches in Information Retrieval (IR) systems. In recent academic literature, semantic search, which relies on a contextual understanding of the query, significantly outperforms its keyword-based predecessors. This paper examines five research questions critical to an effective multi-stage retrieval pipeline: [1] How much better do dense embeddings perform than sparse embeddings? [2] Which combination of context windows in an ensemble approach maximizes first-candidate generation? [3] Among the top-performing ensemble approaches, which yields the highest accuracy as k increases? [4] Can a fast reranking algorithm complement first-candidate generation to further boost accuracy? [5] Can a cache of previously seen queries boost performance where the corpus fails? This project designs a novel search engine for the North American Industry Classification System (NAICS) using semantic retrieval models and achieves an accuracy of 86.87% on synthetic data, a 45.1% improvement over keyword search. The strategies used in this project for bootstrapping data, selecting language models, and applying open-source technologies can serve as reference material for future LLM applications in the U.S. Census Bureau.
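To make the pipeline shape concrete, the following is a minimal sketch of a cache lookup followed by an ensemble of two first-stage retrievers. It is not the paper's implementation. Everything in it is an assumption for illustration: the three-entry toy corpus, the function names, a character-trigram cosine score standing in for a dense embedding model, and reciprocal rank fusion (RRF) as the way the rankings are combined. The abstract does not say which fusion method, models, or cache design the authors used, and the reranking stage is omitted here.

```python
from collections import Counter
import math

# Hypothetical toy corpus (code -> title); illustrative only, not the paper's data.
CORPUS = {
    "722511": "full-service restaurants",
    "541511": "custom computer programming services",
    "111110": "soybean farming",
}

def keyword_rank(query, corpus):
    """Sparse stand-in: rank entries by the number of whole words shared with the query."""
    words = set(query.lower().split())
    scores = {code: len(words & set(title.split())) for code, title in corpus.items()}
    return sorted(corpus, key=lambda code: -scores[code])

def trigrams(text):
    """Bag of character trigrams, padded so word boundaries count."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def trigram_rank(query, corpus):
    """Assumed stand-in for a dense retriever; a real system would use model embeddings."""
    query_vec = trigrams(query)
    scores = {code: cosine(query_vec, trigrams(title)) for code, title in corpus.items()}
    return sorted(corpus, key=lambda code: -scores[code])

def rrf(rankings, k=60):
    """Reciprocal rank fusion: each ranking contributes 1 / (k + position) per item."""
    fused = Counter()
    for ranking in rankings:
        for position, code in enumerate(ranking, start=1):
            fused[code] += 1.0 / (k + position)
    return [code for code, _ in fused.most_common()]

def search(query, corpus, cache, top_k=2):
    """Return a cached answer for a previously seen query, else the fused top_k."""
    key = query.strip().lower()
    if key in cache:
        return [cache[key]]
    fused = rrf([keyword_rank(key, corpus), trigram_rank(key, corpus)])
    return fused[:top_k]

if __name__ == "__main__":
    print(search("restaurant", CORPUS, cache={}))
    print(search("coder", CORPUS, cache={"coder": "541511"}))
```

In this toy setup the query "restaurant" shares no whole word with any title, so the keyword ranker cannot separate the entries, while the trigram ranker still places the restaurant entry first. The query "coder" has no useful overlap with the corpus at all and is answered only because the hypothetical cache already holds it, which is the situation research question [5] describes.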

Files

NAICS_JSM_Proceedings_Submission_Aug12.pdf (616.9 kB, md5:63df599f489f48898e7e3a42ff9f54a8)
