Published August 12, 2024 | Version v1
Publication Open

Ensemble Retrieval Strategies for an Improved NAICS Search Engine in the U.S. Census Bureau

Description

Large Language Models (LLMs) have achieved significant performance gains over classical approaches in Information Retrieval (IR) systems. In recent academic literature, semantic search, which relies on a contextual understanding of the query, significantly outperforms its keyword-based predecessors. This paper examines five research questions critical to an effective multi-stage retrieval pipeline: [1] How much better do dense embeddings perform than sparse embeddings? [2] Which combination of context windows in an ensemble approach maximizes first-candidate generation? [3] Among the top-performing ensemble approaches, which achieves the highest accuracy as k increases? [4] Can a fast reranking algorithm complement first-candidate generation to further boost accuracy? [5] Can a cache of previously seen queries boost performance where the corpus fails? This project designs a novel search engine for the North American Industry Classification System (NAICS) using semantic retrieval models and achieves an accuracy of 86.87% on synthetic data, a 45.1% improvement over keyword search. The strategies used in this project for bootstrapping data, selecting language models, and using open-source technologies can serve as reference material for future LLM applications in the U.S. Census Bureau.
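As a rough illustration of the pipeline shape the description refers to (several first-stage retrievers combined in an ensemble, a cache of previously seen queries, and accuracy measured over the top k candidates), the following is a minimal sketch only. It is not the paper's implementation: reciprocal rank fusion is swapped in as one common way to combine ranked lists, and the function names, the constant k=60, the toy retrievers, and the placeholder labels such as "code_a" are all assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked candidate lists into one ranking.

    Assumption: reciprocal rank fusion stands in for whatever ensemble
    rule the paper actually uses; k=60 is a conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, code in enumerate(ranking, start=1):
            scores[code] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by label for determinism.
    return sorted(scores, key=lambda code: (-scores[code], code))


def search(query, retrievers, cache, top_k=5):
    """Return up to top_k candidates for a query.

    A cache hit short-circuits retrieval; otherwise every retriever is
    queried and the ranked lists are fused.
    """
    key = query.strip().lower()
    if key in cache:
        return [cache[key]]
    rankings = [retrieve(query) for retrieve in retrievers]
    return reciprocal_rank_fusion(rankings)[:top_k]


def accuracy_at_k(examples, retrievers, cache, top_k):
    """Fraction of (query, label) pairs whose label is in the top_k results."""
    hits = sum(
        label in search(query, retrievers, cache, top_k)
        for query, label in examples
    )
    return hits / len(examples)


# Hypothetical stand-ins for a sparse (keyword) and a dense (embedding)
# retriever. Each returns a fixed ranking of placeholder labels.
def toy_sparse(query):
    return ["code_b", "code_a", "code_c"]


def toy_dense(query):
    return ["code_a", "code_c", "code_b"]


if __name__ == "__main__":
    examples = [("bakery", "code_a"), ("farm", "code_c")]
    retrievers = [toy_sparse, toy_dense]
    for k in (1, 3):
        print(k, accuracy_at_k(examples, retrievers, cache={}, top_k=k))
```

In this toy setup, raising top_k or adding a cache entry for a query the retrievers rank poorly both raise the measured accuracy, which loosely mirrors research questions [3] and [5]. A reranking stage, as in question [4], would sit between the fusion step and the final top_k cut.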

Files

NAICS_JSM_Proceedings_Submission_Aug12.pdf

Files (616.9 kB)

Name: NAICS_JSM_Proceedings_Submission_Aug12.pdf
Size: 616.9 kB
md5:63df599f489f48898e7e3a42ff9f54a8