LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources
arXiv:2604.06571v1 Announce Type: new Abstract: Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling. The proposed system integrates (i) multi-engine PDF text extraction with Optical Character Recognition (OCR) fallback, (ii) rule-based source identification with source-specific parsers, (iii) schema-first harmonization and validation, and (iv) an optional Large Language Model (LLM)-assisted extraction pathway incorporating validator-guided repair and shared geocoding services. We present the system architecture, key implementation decisions, and output design, and evaluate performance using both gold-aligned extraction metrics and corpus-level operational indicators. On a manually aligned subset of 75 cases, the LLM-assisted pathway achieved substantially higher extraction quality than the deterministic comparator (F1 = 0.8664 vs. 0.2578), while across 517 parsed records per pathway it also improved aggregate key-field completeness (96.97% vs. 93.23%). The deterministic pathway remained much faster (mean runtime 0.03 s/record vs. 3.95 s/record for the LLM pathway). In the evaluated run, all LLM outputs passed initial schema validation, so validator-guided repair functioned as a built-in safeguard rather than a contributor to the observed gains. These results support controlled use of probabilistic AI within a schema-first, auditable pipeline for high-stakes investigative settings.
Executive Summary
The Guardian Parser Pack introduces an AI-driven pipeline for extracting and normalizing missing-person intelligence from heterogeneous data sources. By integrating multi-engine text extraction, rule-based parsing, schema-first harmonization, and an optional LLM-assisted pathway with validator-guided repair, the system aims to enhance rapid triage and analysis in high-stakes investigations. While the LLM-assisted pathway significantly boosts extraction quality (F1 = 0.8664) and key-field completeness (96.97%) compared to a deterministic approach, it incurs a substantial runtime cost. The study underscores the potential of probabilistic AI in sensitive contexts when integrated within a robust, auditable, and schema-compliant framework, offering a promising tool for improving investigative efficiency and data utility.
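The multi-engine text extraction with OCR fallback described above can be sketched as a simple engine cascade. This is a minimal illustration under stated assumptions, not the paper's implementation: the engines are injected as callables, and the minimum-text threshold used to trigger the OCR fallback is an assumption.

```python
from typing import Callable, List

# An "engine" is any callable that takes a PDF path and returns extracted
# text; it may raise or return an empty string on failure.
Engine = Callable[[str], str]

def extract_text(pdf_path: str, engines: List[Engine],
                 ocr: Engine, min_chars: int = 50) -> str:
    """Try each text-layer engine in order; fall back to OCR when none
    yields a usable amount of text (the threshold is an assumption)."""
    for engine in engines:
        try:
            text = engine(pdf_path)
        except Exception:
            continue  # this engine failed on the file; try the next one
        if text and len(text.strip()) >= min_chars:
            return text
    # No engine produced enough text: likely a scanned/image-only PDF.
    return ocr(pdf_path)
```

In practice the text-layer engines would wrap PDF libraries and the OCR callable a Tesseract-style wrapper; injecting them keeps the cascade logic itself testable.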
Key Points
- ▸ The Guardian Parser Pack unifies heterogeneous missing-person data into a schema-compliant representation.
- ▸ It employs a multi-stage pipeline: text extraction, source identification, schema harmonization, and optional LLM-assisted extraction.
- ▸ The LLM-assisted pathway significantly outperforms the deterministic pathway in extraction quality (F1: 0.8664 vs. 0.2578) and key-field completeness (96.97% vs. 93.23%).
- ▸ The deterministic pathway is considerably faster (0.03 s/record vs. 3.95 s/record for LLM).
- ▸ Schema validation acts as a critical safeguard, ensuring all LLM outputs in the evaluation met initial schema requirements.
- ▸ The research advocates for controlled, auditable use of probabilistic AI in high-stakes investigative settings.
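The gold-aligned F1 figures cited above can be computed at the field level with micro-averaged precision and recall. This sketch assumes records are flat dicts of key fields scored by exact value match; the scoring convention (a wrong value counts as both a spurious and a missed field) is an assumption, not necessarily the paper's protocol.

```python
from typing import Dict, List

def micro_f1(gold: List[Dict[str, str]], pred: List[Dict[str, str]]) -> float:
    """Micro-averaged F1 over (record, field, value) triples: a predicted
    field is a true positive only if its value exactly matches the gold
    value for that record."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        for field, gold_val in g.items():
            pred_val = p.get(field)
            if pred_val is None:
                fn += 1                       # field missed entirely
            elif pred_val == gold_val:
                tp += 1
            else:
                fp += 1
                fn += 1                       # wrong value: spurious + missed
        fp += sum(1 for f in p if f not in g)  # hallucinated extra fields
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```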
Merits
Enhanced Data Unification and Normalization
Successfully addresses the critical challenge of integrating disparate data formats (structured forms, posters, web profiles) into a single, usable schema, which is vital for comprehensive analysis in investigations.
Superior Extraction Quality with LLMs
Demonstrates a substantial improvement in the accuracy and completeness of extracted information when leveraging LLMs, directly addressing the limitations of rule-based systems in handling variability and ambiguity.
Robust Schema-First Design
The 'schema-first' approach with built-in validation ensures data integrity and consistency, which is paramount in high-stakes legal and investigative contexts where errors can have severe consequences.
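A schema-first design means a record is only emitted once it satisfies an explicit, machine-checkable schema. A minimal sketch follows; the required fields and regex constraints here are illustrative assumptions, not the Guardian Parser Pack's actual schema.

```python
import re
from typing import Dict, List

# Illustrative schema: required keys, each with a regex constraint.
SCHEMA = {
    "case_id": r"^[A-Z]{2}-\d{4,}$",        # e.g. "TX-20491" (assumed format)
    "name": r"^\S.*$",                       # non-empty, no leading whitespace
    "last_seen_date": r"^\d{4}-\d{2}-\d{2}$",  # ISO 8601 date
}

def validate(record: Dict[str, str]) -> List[str]:
    """Return a list of human-readable violations; empty means valid."""
    errors = []
    for field, pattern in SCHEMA.items():
        value = record.get(field)
        if value is None:
            errors.append(f"missing required field: {field}")
        elif not re.match(pattern, value):
            errors.append(f"field {field!r} fails {pattern!r}: {value!r}")
    return errors
```

Because the validator returns structured error messages rather than a bare pass/fail, the same messages can later be fed back to an LLM as repair feedback.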
Auditable Pipeline for Probabilistic AI
Proposes a framework that allows for the controlled and auditable use of probabilistic AI, mitigating concerns about 'black box' decision-making by integrating validation and potential repair mechanisms.
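Validator-guided repair, as described, feeds schema violations back to the model for another attempt, and refuses to emit anything that never validates. A minimal sketch of such a loop, with the LLM call injected as a callable; the retry budget and feedback shape are assumptions.

```python
from typing import Callable, Dict, List, Optional

Validator = Callable[[Dict[str, str]], List[str]]        # record -> violations
Extractor = Callable[[str, List[str]], Dict[str, str]]   # text, feedback -> record

def extract_with_repair(text: str, extract: Extractor,
                        validate: Validator,
                        max_repairs: int = 2) -> Optional[Dict[str, str]]:
    """Run extraction, then up to `max_repairs` repair rounds in which the
    validator's error messages are passed back as feedback. Returns None if
    no attempt validates, so callers never receive a non-conformant record."""
    feedback: List[str] = []
    for _ in range(max_repairs + 1):
        record = extract(text, feedback)
        feedback = validate(record)
        if not feedback:
            return record  # schema-compliant on this attempt
    return None
```

In the paper's evaluated run every first attempt validated, so this loop would have returned on the first iteration for every record, which is exactly the "safeguard rather than contributor" behavior the abstract reports.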
Operational Relevance
Directly targets a pressing operational need in missing-person and child-safety investigations, promising to accelerate triage, improve analytical capabilities, and inform search planning.
Demerits
Significant Runtime Disparity
The stark difference in processing speed between the LLM-assisted (3.95 s/record) and deterministic (0.03 s/record) pathways presents a practical bottleneck for real-time or high-volume processing.
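Projecting the reported per-record means onto the 517-record corpus makes the batch-level gap concrete (linear scaling assumed, ignoring parallelism and API latency variance):

```python
RECORDS = 517
DET_S_PER_REC, LLM_S_PER_REC = 0.03, 3.95  # reported mean runtimes

det_total = RECORDS * DET_S_PER_REC   # whole corpus in roughly 15.5 s
llm_total = RECORDS * LLM_S_PER_REC   # roughly 2042 s, i.e. about 34 minutes
print(f"deterministic: {det_total:.1f} s, LLM: {llm_total / 60:.1f} min")
```

A roughly 130x slowdown is tolerable for overnight batch normalization but not for interactive triage, which suggests routing: deterministic parsing by default, LLM assistance reserved for records the rules handle poorly.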
Limited Scope of Validation Repair Evaluation
The finding that all LLM outputs passed initial schema validation, while positive, means the corrective capability of the 'validator-guided repair' mechanism was never actually exercised: the evaluation demonstrates it only as a safeguard, not as a repair tool.
Generalizability of Gold-Aligned Dataset
The evaluation on a 'manually aligned subset of 75 cases' for gold-aligned metrics, while rigorous, may not fully capture the vast heterogeneity and complexity of real-world data at scale.
Dependency on LLM Availability and Cost
Reliance on LLMs introduces operational dependencies: API availability, per-call cost, and data-privacy risks if proprietary, externally hosted models process sensitive case data.
Potential for Algorithmic Bias
While not explicitly addressed, LLMs can inherit biases from their training data, which could subtly influence extraction or interpretation in ways that might disproportionately affect certain demographics in investigations.
Expert Commentary
This paper presents a compelling argument for the judicious integration of Large Language Models into critical investigative workflows. The 'schema-first, auditable pipeline' is a particularly astute design choice, mitigating many of the inherent risks associated with probabilistic AI in high-stakes environments. While the F1 score improvement is impressive, the significant runtime difference mandates a careful cost-benefit analysis for real-world deployment. Agencies must weigh the gains in data quality against operational throughput requirements. The paper's strength lies in its practical architecture, which acknowledges both the power and the pitfalls of LLMs. Future research should delve deeper into the validator-guided repair mechanism, perhaps by intentionally corrupting data to stress-test its corrective capabilities. Furthermore, addressing the potential for algorithmic bias and the explainability of LLM outputs will be crucial for broader adoption and legal defensibility. Overall, this is a highly valuable contribution to the field, setting a benchmark for responsible AI integration in public safety.
Recommendations
- ✓ Conduct further research to thoroughly evaluate the 'validator-guided repair' mechanism by introducing controlled errors or incomplete data to assess its corrective efficacy.
- ✓ Explore strategies for optimizing the runtime performance of the LLM-assisted pathway, potentially through model distillation, fine-tuning smaller models, or parallel processing techniques.
- ✓ Implement and rigorously test mechanisms for detecting and mitigating algorithmic bias within the LLM-driven extraction process, particularly concerning demographic information.
- ✓ Develop a comprehensive framework for the explainability of LLM outputs within the pipeline, providing insights into *why* specific extractions or interpretations were made, crucial for legal and ethical oversight.
- ✓ Engage with law enforcement agencies and legal experts to develop standardized protocols for data provenance, auditing, and human-in-the-loop validation of AI-generated intelligence in investigative contexts.
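The first recommendation above (stress-testing repair with controlled errors) can be operationalized by corrupting known-good records and measuring how often the repair step restores validity. A minimal sketch under stated assumptions: the corruption (blanking one random field) is one illustrative perturbation among many, and the repair and validation hooks are injected placeholders.

```python
import random
from typing import Callable, Dict, List

Record = Dict[str, str]

def corrupt(record: Record, rng: random.Random) -> Record:
    """Inject one controlled error: blank out a randomly chosen field."""
    damaged = dict(record)
    field = rng.choice(sorted(damaged))
    damaged[field] = ""
    return damaged

def repair_recovery_rate(gold: List[Record],
                         repair: Callable[[Record], Record],
                         validate: Callable[[Record], List[str]],
                         seed: int = 0) -> float:
    """Fraction of corrupted records the repair step restores to validity."""
    rng = random.Random(seed)  # fixed seed keeps the stress test reproducible
    recovered = 0
    for record in gold:
        fixed = repair(corrupt(record, rng))
        if not validate(fixed):  # empty violation list means valid
            recovered += 1
    return recovered / len(gold)
```

Sweeping the corruption type (blanked fields, swapped dates, malformed identifiers) and plotting recovery rate per type would give exactly the corrective-efficacy evidence the evaluated run could not provide.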
Sources
Original: arXiv - cs.CL