Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory

arXiv:2603.03294v1 Announce Type: cross Abstract: Large Language Models show promise for agricultural advisory, yet vanilla models exhibit unsupported recommendations, generic advice lacking specific, actionable detail, and communication styles misaligned with smallholder farmer needs. In high-stakes agricultural contexts, where recommendation accuracy has direct consequences for farmer outcomes, these limitations pose challenges for responsible deployment. We present a hybrid LLM architecture that decouples factual retrieval from conversational delivery: supervised fine-tuning with LoRA on expert-curated GOLDEN FACTS (atomic, verified units of agricultural knowledge) optimizes fact recall, while a separate stitching layer transforms retrieved facts into culturally appropriate, safety-aware responses. Our evaluation framework, DG-EVAL, performs atomic fact verification (measuring recall, precision, and contradiction detection) against expert-curated ground truth rather than Wikipedia or retrieved documents. Experiments across multiple model configurations on crops and queries from Bihar, India show that fine-tuning on curated data substantially improves fact recall and F1 while maintaining high relevance. A fine-tuned smaller model achieves comparable or better factual quality at a fraction of the cost of frontier models, and a stitching layer further improves safety subscores while maintaining high conversational quality. We release the farmerchat-prompts library to enable reproducible development of domain-specific agricultural AI.
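The atomic fact verification the abstract describes can be illustrated with a small sketch. This is not the DG-EVAL implementation; it only shows how recall, precision, and F1 fall out once a response has been decomposed into atomic facts and matched against expert-curated golden facts (the matching and decomposition steps, done by an LLM judge in practice, are assumed upstream, and the fact strings below are invented examples).

```python
def fact_scores(predicted: set[str], golden: set[str]) -> dict[str, float]:
    """Score the atomic facts extracted from a model response
    against an expert-curated golden-fact set."""
    true_positives = len(predicted & golden)
    recall = true_positives / len(golden) if golden else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1}

# Invented illustrative facts (not from the paper's dataset).
golden = {
    "transplant rice seedlings at 21-25 days",
    "apply urea in split doses at tillering",
    "maintain 2-3 cm standing water after transplanting",
}
predicted = {
    "transplant rice seedlings at 21-25 days",
    "maintain 2-3 cm standing water after transplanting",
    "spray any available pesticide weekly",  # an unsupported recommendation
}
scores = fact_scores(predicted, golden)
```

Here two of three golden facts are recovered and one of three predictions is unsupported, so recall, precision, and F1 all come out to 2/3; the unsupported claim is exactly the failure mode the paper's contradiction detection targets.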

Executive Summary

This article presents a novel approach to fine-tuning and evaluating conversational AI for agricultural advisory, specifically addressing limitations in Large Language Model (LLM) performance. The proposed hybrid LLM architecture decouples factual retrieval from conversational delivery, leveraging supervised fine-tuning with LoRA on curated data to optimize fact recall. Experimental results demonstrate significant improvements in fact recall, F1, and safety subscores, while maintaining high conversational quality. The release of the farmerchat-prompts library enables reproducible development of domain-specific agricultural AI. This work has significant implications for the responsible deployment of AI in high-stakes agricultural contexts, where recommendation accuracy directly impacts farmer outcomes.

Key Points

  • Decoupling factual retrieval from conversational delivery improves fact recall and F1
  • Supervised fine-tuning with LoRA on curated data optimizes fact recall
  • The farmerchat-prompts library enables reproducible development of domain-specific agricultural AI
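The decoupling named in the first key point can be sketched as a two-stage pipeline. Both stages below are stand-ins: retrieval is naive keyword overlap where the paper uses a fine-tuned model, and stitching is a template where the paper uses a separate LLM layer; all facts and names are illustrative assumptions, not the paper's data or API.

```python
def retrieve_facts(query: str, golden_facts: list[str], top_k: int = 2) -> list[str]:
    """Stage 1: select golden facts relevant to the query.
    Keyword-overlap ranking stands in for the fine-tuned retrieval model."""
    query_terms = set(query.lower().split())
    ranked = sorted(
        golden_facts,
        key=lambda fact: len(query_terms & set(fact.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def stitch(facts: list[str]) -> str:
    """Stage 2: turn verified facts into a conversational answer.
    A template stands in for the safety-aware LLM stitching layer."""
    body = " ".join(f"- {fact}" for fact in facts)
    return f"Based on verified guidance: {body}"

# Invented illustrative golden facts.
golden_facts = [
    "transplant rice seedlings at 21-25 days after sowing",
    "maintain 2-3 cm standing water after transplanting rice",
    "store wheat seed in airtight containers to prevent pest damage",
]
answer = stitch(retrieve_facts("when should I transplant my rice seedlings", golden_facts))
```

The design point the sketch makes concrete: because the answer is assembled only from retrieved golden facts, factual accuracy is controlled by the retrieval stage, while tone, cultural fit, and safety framing can be tuned in the stitching stage without risking new unsupported claims.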

Merits

Strength in Methodology

The article presents a well-structured approach to addressing limitations in LLM performance, leveraging a novel hybrid architecture and evaluation framework. The use of expert-curated data and ground truth for evaluation adds credibility to the results.

Improved Factual Quality

The experimental results demonstrate significant improvements in fact recall, F1, and safety subscores, indicating that the proposed approach can improve the factual quality of conversational AI for agricultural advisory.

Demerits

Limited Generalizability

The experimental results are based on a single dataset from Bihar, India, which may limit the generalizability of the findings to other contexts and regions.

Dependence on Curated Data

The proposed approach relies on the availability of curated data, which may not be feasible or practical in all settings, particularly in resource-constrained environments.

Expert Commentary

This article presents a significant contribution to the field of conversational AI, particularly in the context of agricultural advisory. The proposed hybrid architecture and evaluation framework offer a promising approach to addressing the limitations of LLM performance. However, the article's reliance on curated data and limited generalizability of the findings may limit the scalability of the approach. Nevertheless, the release of the farmerchat-prompts library provides a valuable resource for reproducible development of domain-specific agricultural AI. As the field of AI continues to evolve, it is essential to prioritize the development of domain-specific solutions that address the unique needs and challenges of smallholder farmers.

Recommendations

  • Future research should focus on developing approaches that can leverage uncurated data sources, such as online forums and social media, to improve the scalability of the proposed approach.
  • Policymakers should prioritize investments in the development of domain-specific AI solutions for agriculture, particularly in resource-constrained environments.
