Retrieval Augmented Generation – Part 1

Charles F. Vardeman II

Center for Research Computing, University of Notre Dame

2024-01-19

Hypothesis: Retrieval Augmented Generation Requires Curation

Knowledge Engineering Using Large Language Models

Allen, Bradley P, Lise Stork, and Paul Groth. 2023. “Knowledge Engineering Using Large Language Models.” arXiv.Org. October 1, 2023. https://arxiv.org/abs/2310.00637

Prompt Engineering as Knowledge Engineering

Allen, Bradley P, Lise Stork, and Paul Groth. 2023. “Knowledge Engineering Using Large Language Models.” arXiv.Org. October 1, 2023. https://arxiv.org/abs/2310.00637

Knowledge Engineering Practice

Allen, Bradley P, Lise Stork, and Paul Groth. 2023. “Knowledge Engineering Using Large Language Models.” arXiv.Org. October 1, 2023. https://arxiv.org/abs/2310.00637

Trusted AI, LLMs and KE

Allen, Bradley P, Lise Stork, and Paul Groth. 2023. “Knowledge Engineering Using Large Language Models.” arXiv.Org. October 1, 2023. https://arxiv.org/abs/2310.00637

Retrieval-Augmented Generation for Large Language Models: A Survey

Gao, Yunfan, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, et al. 2024. “Retrieval-Augmented Generation for Large Language Models: A Survey.” arXiv. https://doi.org/10.48550/arXiv.2312.10997.

Sparse and Dense Representations

In a Post-Moore's Law world, how do data science and data engineering need to change? This talk presents design patterns for idiomatic programming in Python so that hardware can optimize machine learning workflows. We'll look at ways of handling data that are either "sparse" or "dense" depending on the stage of ML workflow – plus, how to leverage profiling tools in Python to understand how to take advantage of hardware. We'll also consider four key abstractions which are outside of most programming languages, but vital in data science work.

Paco Nathan, 2021, “Thinking Sparse and Dense”

Retrieval Augmented Generation – The Idea

Gao, Yunfan, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, et al. 2024. “Retrieval-Augmented Generation for Large Language Models: A Survey.” arXiv. https://doi.org/10.48550/arXiv.2312.10997.

Naive RAG

  • Indexing

    • Data indexing: cleaning and extracting data from PDF, HTML, Word, Markdown, and image sources

    • Chunking: dividing text into smaller chunks that fit within the LLM's limited context window

    • Embedding and index creation: encoding the text/image chunks into vectors with a language model

  • Retrieval: given a user query, retrieve the most relevant chunks from the index

  • Generation: the user query and the retrieved documents are combined into a new prompt; the LLM generates a response grounded in this augmented context window.
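The three stages above can be sketched end-to-end in plain Python. Everything here is a toy stand-in: the bag-of-words `embed` replaces a real embedding model, the chunks are hand-picked fragments of the DoD example text used later, and the final prompt would be sent to an LLM rather than printed.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding': a word-count vector.
    Stands in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Indexing: chunk the corpus and embed each chunk.
chunks = [
    "DoD must accelerate its progress towards becoming a data-centric organization.",
    "Systems must be designed with data interoperability as a key requirement.",
    "The Department must recruit new data experts and retain its developing force.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Retrieval: rank indexed chunks by similarity to the user query.
query = "What are the requirements for data interoperability?"
q_vec = embed(query)
best_chunk, _ = max(index, key=lambda pair: cosine(q_vec, pair[1]))

# Generation: query + retrieved context become a new prompt for the LLM.
prompt = f"Context:\n{best_chunk}\n\nQuestion: {query}\nAnswer:"
print(best_chunk)
```

A production pipeline replaces `embed` with a dense embedding model and the `max` scan with an approximate nearest-neighbor index, but the data flow is the same.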

Naive RAG Architecture

Langchain Q&A with RAG

Example Text to Chunk

DoD must accelerate its progress towards becoming a data-centric organization. DoD has lacked the enterprise data management to ensure that trusted, critical data is widely available to or accessible by mission commanders, warfighters, decision-makers, and mission partners in a real-time, useable, secure, and linked manner. This limits data-driven decisions and insights, which hinders the execution of swift and appropriate action.

Additionally, DoD software and hardware systems must be designed, procured, tested, upgraded, operated, and sustained with data interoperability as a key requirement. All too often these gaps are bridged with unnecessary human-machine interfaces that introduce complexity, delay, and increased risk of error. This constrains the Department’s ability to operate against threats at machine speed across all domains.

DoD also must improve skills in data fields necessary for effective data management. The Department must broaden efforts to assess our current talent, recruit new data experts, and retain our developing force while establishing policies to ensure that data talent is cultivated. We must also spend the time to increase the data acumen resident across the workforce and find optimal ways to promote a culture of data awareness.

“Chunking”

“Chunkviz”

“Chunking” with Overlap

“Chunkviz”
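A minimal fixed-size splitter with overlap can be written in a few lines; the `chunk_size` and `overlap` values here are arbitrary illustrations, and a real pipeline would typically use a library splitter rather than this sketch.

```python
def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into fixed-size character chunks; consecutive chunks
    share `overlap` characters, so content cut at a boundary also
    appears whole at the start of the next chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = ("DoD must accelerate its progress towards "
        "becoming a data-centric organization.")
for c in chunk_text(text):
    print(repr(c))
```

The overlap is what keeps a sentence cut at position 40 from being lost: its tail reappears at the head of the next chunk.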

Smarter “Chunking”

LangChain - Recursively Split by Character

“Chunking” recursive character splitter

“Chunkviz”
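The recursive strategy can be sketched as follows. This is a simplified reimplementation of the idea behind LangChain's RecursiveCharacterTextSplitter, not the library's actual code: split on the coarsest separator first, merge pieces back up to the size limit, and only recurse to finer separators for pieces that are still too long.

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", " ", "")):
    """Split on paragraph breaks first, then lines, then words, and
    finally raw characters, greedily merging pieces up to chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:] or separators
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate            # keep merging into this chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:        # still too big: recurse finer
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

doc = ("First paragraph about DoD data.\n\n"
       "A second, noticeably longer paragraph about interoperability.\n"
       "And a final short line.")
for c in recursive_split(doc, chunk_size=40):
    print(repr(c))
```

Because paragraph boundaries are tried first, chunks tend to end at natural breaks instead of mid-sentence, which is the whole advantage over the fixed-size splitter.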

“Chunking” with larger segment size

“Chunkviz”

Vector Indexing of the “Chunks”

from langchain_community.embeddings import FakeEmbeddings

# dod_text holds the DoD passage shown on the earlier slides
embeddings = FakeEmbeddings(size=1352)           # random vectors, illustration only
query_result = embeddings.embed_query(dod_text)  # encode the text as a vector
print(dod_text[:5])
query_result[:5]

Output:

DoD m
[0.28925496400357076,
 0.42954295410387294,
 -0.75042013219397,
 -0.21105104953004536,
 -0.655199848252018]

Figure 1: Vector representation of the text
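To show how such vectors are actually used, here is a self-contained stand-in for FakeEmbeddings plus a brute-force nearest-neighbor lookup (the `fake_embed` helper and the example chunks are inventions for this sketch). The hash-seeded vectors are deterministic but carry no semantics, which is exactly why FakeEmbeddings is useful only for testing pipeline plumbing, never for judging retrieval quality.

```python
import math
import random
import zlib

def fake_embed(text, size=8):
    """Stand-in for FakeEmbeddings: a vector seeded by a stable hash
    of the text. Deterministic, but the numbers carry no meaning."""
    rng = random.Random(zlib.crc32(text.encode("utf-8")))
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

chunks = [
    "DoD must accelerate towards a data-centric organization.",
    "Systems need data interoperability as a key requirement.",
    "The Department must recruit and retain data talent.",
]
index = [(c, fake_embed(c)) for c in chunks]

# A query identical to an indexed chunk maps to the identical vector,
# so it is its own nearest neighbor (cosine similarity of 1.0).
query_vec = fake_embed("Systems need data interoperability as a key requirement.")
best, _ = max(index, key=lambda pair: cosine(query_vec, pair[1]))
print(best)
```

With a trained embedding model, a *paraphrase* of an indexed chunk would also land nearby; with hash-seeded vectors it would not, because nothing about the vector reflects meaning.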

Trusted AI Point of View…

Failure points for RAG

Barnett, Scott, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, and Mohamed Abdelrazek. 2024. “Seven Failure Points When Engineering a Retrieval Augmented Generation System.”

Problem: a single global chunk-size constant ignores the semantic structure of a document; fixed boundaries can split a sentence, paragraph, or argument mid-thought.

“Agentic” Chunking

LangChain on X: Proposition-Based Retrieval

Agentic Example: Proposition Based Dense Retrieval

Chen, Tong, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu. 2023. “Dense X Retrieval: What Retrieval Granularity Should We Use?” arXiv.Org. December 11, 2023. https://arxiv.org/abs/2312.06648v2.

RAG Complexity Overview

Gao, Yunfan, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, et al. 2024. “Retrieval-Augmented Generation for Large Language Models: A Survey.” arXiv. https://doi.org/10.48550/arXiv.2312.10997.

Comparison with other optimization methods

Gao, Yunfan, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, et al. 2024. “Retrieval-Augmented Generation for Large Language Models: A Survey.” arXiv. https://doi.org/10.48550/arXiv.2312.10997.

LLMs and Trusted AI

Sun, Lichao, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, et al. 2024. “TrustLLM: Trustworthiness in Large Language Models.” arXiv. http://arxiv.org/abs/2401.05561.

Graph Based Vector Retrieval