Rare words in text summarization

Morozovskii, Danila

dc.contributor.author	Morozovskii, Danila
dc.date.accessioned	2023-01-30T22:08:38Z
dc.date.available	2023-01-30T22:08:38Z
dc.date.issued	2023-01-16
dc.identifier.citation	Morozovskii, Danila. Rare words in text summarization; A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in the Department of Applied Computer Science. Winnipeg, Manitoba, Canada: University of Winnipeg, 2022. DOI: 10.36939/ir.202201301602.	en_US
dc.identifier.uri	https://hdl.handle.net/10680/2030
dc.description.abstract	Automatic text summarization is a difficult task that requires a good understanding of an input text to produce a fluent, brief, and comprehensive summary. Applications of text summarization models range from legal document summarization to news summarization. The model should be able to understand where important information is located to produce a good summary. However, infrequently used or rare words might limit the model's understanding of an input text, as the model might ignore such words or pay less attention to them. Another issue is that the model accepts only a limited number of tokens (words) from an input text, so the accepted portion might contain redundant information or omit important information located further into the text. To address the problem of rare words, we propose a modification to the attention mechanism of a transformer model with a pointer-generator layer, in which the attention mechanism receives frequency information for each word, helping to boost rare words. Additionally, our proposed supervised learning model uses a hybrid approach that incorporates both extractive and abstractive elements, so that more of the important information reaches the abstractive model in a news summarization task. We designed experiments involving a combination of six different hybrid models with varying input text sizes (measured in tokens) to test our proposed model. Four well-known datasets specific to news articles were used in this work: CNN/DM, XSum, Gigaword, and DUC 2004 Task 1. Our results were compared using the well-known ROUGE metric. Our best model achieved an R-1 score of 38.22, an R-2 score of 15.07, and an R-L score of 35.79, outperforming three existing models by several ROUGE points.	en_US
dc.language.iso	en	en_US
dc.publisher	University of Winnipeg	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Abstractive Summarization	en_US
dc.subject	Extractive Summarization	en_US
dc.subject	Natural Language Processing	en_US
dc.subject	Pointer-generator	en_US
dc.subject	Transformer	en_US
dc.subject	Rare Words	en_US
dc.subject	Text Summarization	en_US
dc.title	Rare words in text summarization	en_US
dc.type	Thesis	en_US
dc.description.degree	Master of Science in Applied Computer Science	en_US
dc.publisher.grantor	University of Winnipeg	en_US
dc.identifier.doi	10.36939/ir.202201301602	en_US
thesis.degree.discipline	Applied Computer Science
thesis.degree.level	masters
thesis.degree.name	Master of Science in Applied Computer Science
thesis.degree.grantor	University of Winnipeg

Files in this item

Name: Morozovskii_Danila_Thesis_Fina ...
Size: 2.251 MB
Format: PDF

This item appears in the following Collection(s)

Graduate Theses and Dissertations