PDF Knowledge Extraction System with RAG and CAG Models

Project Code: TCPGPY2090

Objective

  • Primary Goals:
    • Build a scalable system for extracting deep knowledge from large PDF sets.
    • Use RAG for dynamic retrieval and CAG for caching to improve speed, accuracy, and efficiency.
    • Overcome current limitations such as slow response, poor context handling, and high computation costs.
  • Specific Aims:
    • PDF Parsing & Chunking: Develop methods to extract and structure PDF content for embedding and semantic search.
    • RAG + CAG Integration: Combine retrieval and caching to boost query speed and reduce resource usage.
  • Expected Outcome:
    • A high-performance, low-overhead framework for knowledge extraction from unstructured PDFs.

Abstract

Rapid digital transformation has produced a large and growing body of PDF documents through which knowledge is shared. This project presents a robust PDF Knowledge Extraction System that integrates Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation (CAG) models for intelligent, scalable document querying. Users upload PDF files, which are then automatically parsed and segmented into content chunks. Two parallel embedding pipelines are used: one calls the Google Gemini 1.5 Flash API to generate high-quality embeddings for the RAG model, and the other uses HuggingFace models to generate embeddings for the cache in the CAG framework.
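The segmentation step can be illustrated with a minimal sketch. It assumes the page text has already been extracted (in this project that would come from PyMuPDF) and splits it into overlapping character windows. The function name `chunk_text` and the `chunk_size` and `overlap` defaults are illustrative assumptions, not values taken from the project.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows.

    Hypothetical helper: the sizes are illustrative, not the project's settings.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    # Collapse runs of whitespace left over from PDF layout extraction.
    normalized = " ".join(text.split())

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(normalized), step):
        chunks.append(normalized[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(normalized):
            break
    return chunks


if __name__ == "__main__":
    sample = "PDF knowledge extraction splits documents into chunks. " * 30
    for i, chunk in enumerate(chunk_text(sample, chunk_size=200, overlap=20)):
        print(i, len(chunk))
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, which tends to help semantic search; splitting on sentences or tokens instead of characters is an equally reasonable choice.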


Embeddings from the two pipelines are stored in two separate ChromaDB vector stores, which supports fast retrieval and response generation. When a query arrives, the system first checks the cache for an existing result. On a cache hit, the stored answer is returned immediately, with only a few milliseconds of latency. On a cache miss, the query is processed through the RAG pipeline and the result is cached for later requests. This hybrid design targets performance: RAG contributes context awareness, while CAG contributes efficiency, making the system suitable for knowledge-agent applications across a wide range of domains.
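The cache-first query flow described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: `toy_embed` is a bag-of-words stand-in for the HuggingFace embedding model, an in-memory list stands in for the ChromaDB cache store, `rag_answer` is a hypothetical callable representing the RAG pipeline, and the similarity threshold of 0.9 is arbitrary.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class CachedAnswerer:
    """Check the cache first; fall back to RAG on a miss and store the result."""

    def __init__(self, rag_answer, embed=toy_embed, threshold=0.9):
        self.rag_answer = rag_answer  # hypothetical RAG pipeline callable
        self.embed = embed
        self.threshold = threshold    # assumed similarity cut-off for a hit
        self.entries = []             # (query embedding, cached answer) pairs

    def ask(self, query):
        q = self.embed(query)
        best_score, best_answer = 0.0, None
        for emb, answer in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            return best_answer, "cache"      # cache hit: no RAG call
        answer = self.rag_answer(query)      # cache miss: run RAG
        self.entries.append((q, answer))     # store for later requests
        return answer, "rag"


if __name__ == "__main__":
    qa = CachedAnswerer(lambda q: "answer to: " + q)
    print(qa.ask("What is RAG?"))
    print(qa.ask("what is rag?"))
```

The second element of the returned tuple reports which path served the request, which makes hit rates easy to log. In a real deployment the linear scan would be replaced by a nearest-neighbour query against the cache vector store.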

NOTE: Please do not submit this to the college without the consent of our team. This abstract varies based on student requirements.

Block Diagram

Specifications

  • Hardware:
    • CPU: Intel i7
    • RAM: 16 GB
    • GPU: Nvidia GTX 1660 (for embedding generation and model inference)
    • Disk Storage: 1 TB SSD for storing PDFs and generated embeddings
    • Network: 1 Gbps internet connection (for API calls to Google’s Gemini 1.5 Flash API and HuggingFace models)
  • Software:
    • Operating System: Ubuntu 20.04
    • Framework: Flask (Python)
    • Embedding Generation: Google Gemini 1.5 Flash API, HuggingFace models (for contextual embeddings)
    • Database: ChromaDB for storing embeddings
    • Libraries: PyMuPDF for PDF text extraction, TensorFlow and PyTorch for model inference
    • Testing Tools: Postman for API testing, Jupyter notebooks for result analysis

Demo Video