In the rapidly evolving landscape of Generative AI, Large Language Models (LLMs) like GPT-4 are powerful, but they have a significant limitation: their knowledge is frozen at training time, so they know nothing about your private documents or anything published after their training cutoff. This is where Retrieval-Augmented Generation (RAG) comes in.
In this guide, we will build a production-ready RAG application from scratch using Python, OpenAI's embedding models, and Pinecone as our vector database. We will cover everything from data ingestion to the final generation loop.
What is RAG and Why Does It Matter?
RAG is an architectural pattern that lets an LLM draw on external, up-to-date data without expensive fine-tuning. By retrieving relevant document snippets and "stuffing" them into the prompt, we ground the model's answers in that data, making them accurate, up-to-date, and context-aware.
This approach is particularly useful for building institutional knowledge bases, automated customer support, and research assistants that require access to private or very recent information.
High-Level Workflow (sketched in code right after this list):
- Ingest: Convert text documents into numerical vectors (embeddings).
- Store: Save these vectors in a specialized Vector Database (Pinecone).
- Retrieve: When a user asks a question, find the most similar vectors in the database.
- Generate: Send the retrieved context + the user query to the LLM to get an answer.
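Before touching any APIs, here is a toy, in-memory sketch of how those four stages fit together. The embed() placeholder and the list-based "database" are stand-ins used purely for illustration; the real OpenAI and Pinecone versions are built step by step below.
from typing import List, Tuple

def embed(text: str) -> List[float]:
    # Placeholder "embedding": just a single number derived from the text.
    return [float(sum(map(ord, text)))]

database: List[Tuple[List[float], str]] = []  # (vector, original text) pairs

def ingest(text: str) -> None:
    database.append((embed(text), text))  # Ingest + Store

def retrieve(query: str) -> str:
    query_vector = embed(query)
    # Return the stored text whose toy vector is closest to the query's.
    return min(database, key=lambda item: abs(item[0][0] - query_vector[0]))[1]

def generate(query: str) -> str:
    context = retrieve(query)  # Retrieve
    return f"Context: {context}\nQuestion: {query}"  # Generate: the augmented prompt we would send to an LLM

ingest("RAG pairs a retriever with a generator.")
print(generate("What does RAG do?"))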
Prerequisites
Before we dive into the code, ensure you have the following setup ready on your local machine:
- Python 3.9+ installed.
- An OpenAI API Key (with available credits).
- A Pinecone API Key (The free 'Starter' plan works perfectly).
- A code editor like VS Code or PyCharm.
Install Required Libraries
Open your terminal and run the following command to install the necessary packages:
pip install openai pinecone-client python-dotenv langchain
Step 1: Initializing the Environment
First, we need to set up our API keys securely. Create a .env file in your project directory to store your sensitive credentials:
OPENAI_API_KEY=your_openai_key_here
PINECONE_API_KEY=your_pinecone_key_here
These two keys are all the code in this guide reads; with the serverless Pinecone client, the cloud and region are specified in code when we create the index in Step 3.
Step 2: Preparing and Embedding the Data
We will convert a simple text document into embeddings. An embedding is a vector representation of text that captures its semantic meaning, allowing the computer to "understand" the relationship between words.
import os
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines with spaces for better embedding quality
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding
# Example usage
text_data = "LabsGenAI specializes in Generative AI development and LLM orchestration."
vector = get_embedding(text_data)
print(f"Vector Length: {len(vector)}")
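A single sentence embeds cleanly, but real documents are usually too long to embed in one piece, so they are typically split into overlapping chunks first. Below is a minimal character-based splitter as a sketch; the 500-character chunk size and 50-character overlap are arbitrary values assumed for illustration, and a dedicated splitter (for example from LangChain) is a better choice in production.
def chunk_text(text, chunk_size=500, overlap=50):
    # Naive character-based splitter; tune chunk_size/overlap for your data.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Each chunk is then embedded individually:
# chunk_vectors = [get_embedding(chunk) for chunk in chunk_text(long_document)]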
Step 3: Setting Up Pinecone Vector Database
Now, we need to initialize Pinecone and create an index. This index will act as the "brain" where our vectorized documents are stored and searched.
from pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index_name = "labsgenai-rag-index"
# Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # Dimension for text-embedding-3-small
        metric='cosine',
        spec=ServerlessSpec(cloud='aws', region='us-east-1')
    )
index = pc.Index(index_name)
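Before anything can be retrieved, the embeddings have to be written into the index together with the original text as metadata, since the generation step below reads the text back from each match's metadata. Here is a minimal sketch of that upsert; the single-item documents list and the doc-{i} ID scheme are just illustrative, and in practice you would loop over all of your chunked documents.
documents = [
    "LabsGenAI specializes in Generative AI development and LLM orchestration.",
]

vectors_to_upsert = []
for i, doc in enumerate(documents):
    vectors_to_upsert.append({
        "id": f"doc-{i}",
        "values": get_embedding(doc),
        "metadata": {"text": doc},  # stored so we can rebuild the context at query time
    })

index.upsert(vectors=vectors_to_upsert)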
Step 4: The Retrieval and Generation Loop
This is the core of the RAG application. We query the index to find the most relevant information and use it to augment our prompt to the LLM.
def generate_answer(query):
    # 1. Embed the user query
    query_vector = get_embedding(query)

    # 2. Retrieve the top 2 matching chunks from Pinecone
    results = index.query(vector=query_vector, top_k=2, include_metadata=True)
    context = ""
    for match in results['matches']:
        context += match['metadata']['text'] + "\n"

    # 3. Augment the prompt and generate
    prompt = f"""
You are a technical assistant. Answer the question based ONLY on the context provided below.
Context:
{context}
Question: {query}
Answer:"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
# Test the RAG flow
# print(generate_answer("What does LabsGenAI specialize in?"))
Conclusion
Building a RAG pipeline is the first step toward creating truly useful AI agents. By combining the reasoning power of OpenAI's LLMs with the scalable storage of Pinecone, you can build applications that "know" your private data.
In our next article on LabsGenAI.net, we will explore how to scale this architecture using LangChain for more complex document parsing and handling larger datasets.
Stay tuned for more deep dives into the world of Generative AI at LabsGenAI.net.