This GenAI Text Summarization project showcases both abstractive and extractive summarization built on transformer models. Written in Python with Hugging Face Transformers, it condenses lengthy documents into concise summaries suited to news articles, legal texts, academic papers, and content previews. The project pairs BART for natural language generation with MiniLM sentence embeddings ranked by cosine similarity for extraction.
Key Insights
The dual summarization approach lets users choose between abstractive and extractive output to suit their needs.
BART's generative model produces fluent and human-like summaries, ideal for content creation and compression.
MiniLM-based extractive summarization ensures semantic relevance by selecting top-ranking sentences via cosine similarity.
Supports document-based summarization from PDFs and .txt files, enhancing usability for professionals.
Technical Implementation
Model Architecture:
Implemented abstractive summarization using facebook/bart-large-cnn from Hugging Face Transformers (a minimal sketch follows this list).
Used sentence-transformers/all-MiniLM-L6-v2 for extractive summarization via sentence embeddings and cosine similarity (see the extractive sketch below).
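A minimal sketch of the abstractive path via the Transformers pipeline API; the sample text and generation-length bounds are illustrative, not the project's tuned settings.

```python
from transformers import pipeline

# Load the abstractive summarizer (downloads the model weights on first use)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Hugging Face Transformers provides thousands of pretrained models for "
    "tasks such as summarization, translation, and question answering. "
    "BART is a sequence-to-sequence model fine-tuned here on CNN/DailyMail."
)

# Length bounds are illustrative defaults, not the project's tuned settings
result = summarizer(article, max_length=60, min_length=20, do_sample=False)
print(result[0]["summary_text"])
```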
Natural Language Processing (NLP):
Used NLTK for text tokenization and preprocessing.
Employed Scikit-learn for calculating cosine similarity to rank sentence relevance.
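A sketch of the extractive path under these assumptions: sentences are split with NLTK, embedded with MiniLM, and scored by cosine similarity against the document centroid. Ranking against the centroid is one common criterion; the project's exact scoring may differ.

```python
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK releases

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def extractive_summary(text: str, top_k: int = 3) -> str:
    sentences = sent_tokenize(text)
    embeddings = model.encode(sentences)                    # (n_sentences, 384)
    doc_embedding = embeddings.mean(axis=0, keepdims=True)  # document centroid
    scores = cosine_similarity(embeddings, doc_embedding).ravel()
    # Keep the top-k highest-scoring sentences, restored to document order
    top_idx = sorted(np.argsort(scores)[::-1][:top_k])
    return " ".join(sentences[i] for i in top_idx)
```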
Interface & Deployment:
Built an interactive web interface with Streamlit and Flask for on-demand summarization.
Enabled support for PDF and .txt file uploads to allow document-based summarization.
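A minimal sketch of the Streamlit upload flow. PyPDF2 is an assumption here for PDF text extraction, since the README does not name the library, and the character cap is a crude stand-in for proper input-length handling.

```python
import streamlit as st
from PyPDF2 import PdfReader  # assumed PDF library; the project may use another
from transformers import pipeline

@st.cache_resource  # load the model once per server process
def load_summarizer():
    return pipeline("summarization", model="facebook/bart-large-cnn")

st.title("GenAI Text Summarization")
uploaded = st.file_uploader("Upload a document", type=["pdf", "txt"])

if uploaded is not None:
    if uploaded.name.lower().endswith(".pdf"):
        reader = PdfReader(uploaded)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
    else:
        text = uploaded.read().decode("utf-8", errors="ignore")
    # Crude character cap to stay under BART's token limit; see the chunking
    # sketch under Key Learnings for a token-aware split
    result = load_summarizer()(text[:3000], max_length=130, min_length=30)
    st.subheader("Summary")
    st.write(result[0]["summary_text"])
```

Launch with `streamlit run app.py` to try the upload flow locally.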
Key Learnings
Abstractive models like BART can generate natural, human-like summaries by understanding contextual semantics.
Extractive summarization using MiniLM and cosine similarity is effective for shorter texts and preserves factual accuracy.
Transformer-based models impose hard input-length limits (facebook/bart-large-cnn accepts at most 1024 tokens), so long documents must be truncated or split into chunks; see the chunking sketch below.
Sentence embeddings are crucial for semantic similarity-based extraction and ranking.
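A sketch of one straightforward workaround for the 1024-token window: tokenize once, split on token boundaries, summarize each chunk, and join the partial summaries. The chunk size and generation lengths are illustrative; the project's handling may differ.

```python
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_long(text: str, chunk_tokens: int = 900) -> str:
    # Tokenize once, then split the ids into chunks that fit BART's window,
    # leaving headroom for the special tokens the pipeline re-adds
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [
        tokenizer.decode(ids[i:i + chunk_tokens])
        for i in range(0, len(ids), chunk_tokens)
    ]
    # Summarize each chunk independently and join the partial summaries
    partials = [
        summarizer(chunk, max_length=120, min_length=30)[0]["summary_text"]
        for chunk in chunks
    ]
    return " ".join(partials)
```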