A T5 model fine-tuned on a lecture dataset and CNN/Daily Mail.
Students often struggle to process the large volume of content they encounter over the course of their studies, which reduces their productivity. With an effective summarization tool, students can better prepare for exams, improve their own summarization skills by comparing against the tool's output, and avoid wasting time on low-quality content by skimming a summary first. Nub 1.0 is a text summarizer that leverages deep learning and specializes in educational content. When evaluated on the CNN/Daily Mail dataset, it shows superior performance compared to similar tools.
You'll need Python 3.7+ and pip. Simply clone the repo and install the requirements:

```
pip install -r requirements.txt
```

Then start the web app from the command line:

```
streamlit run app.py
```

This will print the local URL or open the web app in a browser.
Install Transformers:

```
pip install transformers
```

Load the model from the Hugging Face model hub:
```python
from transformers import pipeline, T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained('soroush/t5-finetuned-lesson-summarizer')
tokenizer = T5Tokenizer.from_pretrained('soroush/t5-finetuned-lesson-summarizer')
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
```
Now enter your text and pass it to the pipeline:
```python
input_text = '''Up until this point, we've focused on learners that provide forecast price changes. We then buy or sell the stocks with the most significant predicted price change. This approach ignores some important issues, such as the certainty of the price change. It also doesn't help us know when to exit the position either. In this lesson, we'll look at reinforcement learning. Reinforcement learners create policies that provide specific direction on which action to take. It's important to point out that when we say reinforcement learning, we're really describing a problem, not a solution. In the same way that linear regression is one solution to the supervised regression problem, there are many algorithms that solve the RL problem. Because I started out as a roboticist, I'm going to first explain this in terms of a problem for a robot. '''

output = summarizer(input_text)
output_text = output[0]['summary_text']
print(output_text)
```

The summarizer can take a long time on longer documents. Splitting the document into a few chunks and running them separately reduces the running time.
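The chunking step above can be sketched with a simple word-count splitter. This is an illustrative helper, not part of the repo; the 400-word window is an assumed default, and you should pick a size that keeps each chunk under the model's input limit.

```python
def chunk_text(text, max_words=400):
    """Split a long document into chunks of at most max_words words.

    max_words=400 is an illustrative choice, not a tuned value.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Summarize each chunk separately, then join the partial summaries, e.g.:
# summary = " ".join(s[0]['summary_text']
#                    for s in map(summarizer, chunk_text(long_text)))
```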
When running run_qa.py, you may need to decrease max_length (default 4096) to match your hardware capacity.
If installing the PyTorch version is too heavy for the hardware in use, try downgrading to a CPU-only build:

```
pip install -r requirements_cpu.txt
```
To process the lectures from ML4T:

```
python process_srt_files.py \
    --output_dir './data/processed_lessons/' \
    --lessons_dir 'data/raw_lessons'
```
Each lesson is a collection of videos in a directory that lives in `raw_lessons`. This script takes video transcripts in `.srt` format and outputs a single-line text document per lesson.
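Collapsing an `.srt` transcript amounts to dropping cue indices, timestamps, and blank lines, then joining the caption text. The helper below is a hypothetical sketch of that step, not the script's actual implementation:

```python
import re

# Matches SRT timestamp lines such as "00:00:01,000 --> 00:00:03,000".
TIMESTAMP_RE = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> ")

def srt_to_text(srt_content):
    """Collapse an .srt transcript into a single line of caption text."""
    captions = []
    for line in srt_content.splitlines():
        line = line.strip()
        if not line:
            continue            # blank separator between cues
        if line.isdigit():
            continue            # cue index
        if TIMESTAMP_RE.match(line):
            continue            # timing line
        captions.append(line)
    return " ".join(captions)
```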
To run the Question Answering module, which autogenerates unsupervised summaries for fine-tuning, run:

```
python run_qa.py \
    --output_dir './data/qa_generated_summaries/' \
    --lessons_dir './data/processed_lessons' \
    --questions_dir './data/lesson_questions'
```
- `lesson_questions`: directory containing question files (names match the lesson files)
- `processed_lessons`: the output of `process_srt_files.py`
- `qa_generated_summaries`: the directory containing the QA-generated summaries
- `raw_lessons`: each directory inside here has one or more `.srt` files
To fine-tune the model on a downstream task, follow the steps in `nub-training-evaluation/fine-tuning T-5 on CNN+daily mail + ML4T.ipynb`.
The step-by-step evaluation process is documented in `nub-training-evaluation/Run Evaluations on Fine-tuned T-5 Summarizer.ipynb`.
Comparative analyses are documented in `nub-training-evaluation/Results Analysis and Comparison.ipynb`.
- Analysis results can be found in the corresponding CSV files under `/nub-training-evaluation/result`
```
@inproceedings{...,
  year={2020}
}
```