To develop a robust PDF Knowledge Extraction System using RAG and CAG models with dual embedding pipelines and ChromaDB, enabling efficient, intelligent querying and rapid response generation for diverse documents.
Rapid digital transformation has produced a large accumulation of PDF documents through which knowledge flows. This project presents a robust PDF Knowledge Extraction System that integrates Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation (CAG) models for intelligent and scalable document querying. Users upload PDF files, which are automatically parsed and segmented into content chunks. The system runs two parallel embedding pipelines: one uses the Google Gemini 1.5 Flash API to generate high-quality embeddings for the RAG model, and the other uses HuggingFace models to produce embeddings for caching in the CAG framework.
Embeddings from the two pipelines are kept in separate ChromaDB vector stores,
which supports rapid retrieval and response generation. When a query arrives,
the system first checks the cache for a matching result. On a cache hit, the
stored answer is returned immediately, with only a few milliseconds of latency.
On a cache miss, the query is processed through the RAG pipeline and the result
is cached for later requests. This hybrid design targets performance: RAG
contributes content awareness while CAG contributes efficiency, which makes the
system suitable for knowledge-agent applications across a wide range of
domains.
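
The cache-first query flow described above can be illustrated with a minimal sketch. It is not the project's implementation: the bag-of-words `embed` function stands in for the HuggingFace embedding model, the in-memory `AnswerCache` stands in for the ChromaDB cache store, and `rag_answer` is a placeholder for the RAG pipeline. All of these names, and the 0.9 similarity threshold, are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class AnswerCache:
    """In-memory stand-in for the cache vector store (hypothetical class)."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # assumed similarity cutoff for a hit
        self.entries: List[Tuple[Counter, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        """Return the cached answer of the most similar query, or None."""
        query_vec = embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def store(self, query: str, answer: str) -> None:
        """Cache an answer under the embedding of its query."""
        self.entries.append((embed(query), answer))


def answer_query(
    query: str,
    cache: AnswerCache,
    rag_answer: Callable[[str], str],
) -> Tuple[str, bool]:
    """Check the cache first; on a miss, run RAG and cache the result.

    Returns the answer and a flag that is True on a cache hit.
    """
    cached = cache.lookup(query)
    if cached is not None:
        return cached, True
    answer = rag_answer(query)  # placeholder for retrieval plus generation
    cache.store(query, answer)
    return answer, False
```

In this sketch, a repeated or near-identical query is answered from the cache without calling `rag_answer` again, while an unrelated query falls through to the RAG path and is cached afterwards. In a real deployment the linear scan would be replaced by a nearest-neighbour query against the vector store, and the threshold would need tuning for the embedding model in use.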
