Leveraging Retrieval Augmented Generation (RAG) to Analyze Crash Report Narratives
Project Description
Crash reports are a vital source of information for understanding road crashes, devising prevention strategies, and informing policy. However, the coded fields on these reports often lack the detailed characteristics needed for comprehensive analysis of pedestrian and bicyclist crashes. The structured data may omit nuanced details about the circumstances of a crash that appear only in the narrative section. Information such as a pedestrian’s unhoused status, a detailed account of the vehicle’s movement before it struck a pedestrian, a witness’s description of a speeding vehicle’s pre-crash behavior, and the conditions of a hit-and-run crash may be embedded in the narrative but remain unrecorded in the structured fields of the report form. Extracting this implicit data poses a significant challenge for traditional analysis methods. Retrieval Augmented Generation (RAG) employs an embedding model to scan extensive text for segments similar to a query, which here is the presence of a vulnerability factor or demographic context. Once relevant segments are identified, a Large Language Model (LLM) analyzes the query together with the retrieved context. In this instance, the LLM validates the presence of the factor and extracts the pertinent information. This study will explore the ability of RAG to identify crash characteristics found only in crash report narratives, using crash reports from California.
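The retrieve-then-generate flow described above can be illustrated with a minimal sketch. It is not the project's implementation: the source does not name an embedding model or LLM, so a toy bag-of-words vector stands in for the embedding model, the LLM call is left out entirely, and the function names (embed, cosine, retrieve, build_prompt) and the narrative segments are hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector.
    Assumption: a real pipeline would use a learned embedding model here."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, segments, top_k=2):
    """Return the top_k narrative segments most similar to the query."""
    q = embed(query)
    ranked = sorted(segments, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context):
    """Combine the query and retrieved context into a prompt.
    The prompt would be passed to an LLM, which is not called in this sketch."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer only from the context."


# Invented narrative segments, for illustration only (not real crash data).
segments = [
    "Vehicle 1 was traveling northbound and made a left turn at the intersection.",
    "Witness stated the driver fled the scene without stopping after striking the pedestrian.",
    "The pedestrian was crossing outside the marked crosswalk at night.",
]
query = "Did the driver flee the scene after striking the pedestrian?"

context = retrieve(query, segments, top_k=1)
prompt = build_prompt(query, context)
print(prompt)
```

In this sketch the retrieval step narrows the narrative to the segment most relevant to the hit-and-run question, so that only that segment and the question would reach the LLM. Word-count vectors match only on shared vocabulary; a learned embedding model would also be expected to capture paraphrases.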
Outputs
The project will produce a policy brief narrative for distribution by CPBS and a general public outreach article to be distributed by SafeTREC. The project will also produce a final report and a public GitHub repository containing the Python code for the proof-of-concept software that analyzes crash report narratives. The team will also produce an academic research paper to be submitted for presentation or for publication in a transportation journal such as Transportation Research Record.
Outcomes / Impacts
This project will provide a proof of concept for a method that could greatly improve the ability of transportation safety engineers, planners, and researchers to efficiently review crash narratives and glean information that is not captured in the coded fields. Because the transportation system changes faster than crash report forms can be updated, this method will allow those addressing the most pressing safety issues to make informed decisions and respond quickly to new safety challenges.
Dates
06/01/2024 to 05/31/2025
Universities
University of California at Berkeley
Principal Investigator
Julia Griswold
University of California at Berkeley
ORCID: 0000-0002-1125-3316
Research Project Funding
Federal: $115,020
Contract Number
69A3552348336
Project Number
24UCB03
Research Priority
Promoting Safety