
dc.contributor.author: Morozovskii, Danila
dc.date.accessioned: 2023-01-30T22:08:38Z
dc.date.available: 2023-01-30T22:08:38Z
dc.date.issued: 2023-01-16
dc.identifier.citation: Morozovskii, Danila. Rare words in text summarization; A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in the Department of Applied Computer Science. Winnipeg, Manitoba, Canada: University of Winnipeg, 2022. DOI: 10.36939/ir.202201301602. [en_US]
dc.identifier.uri: https://hdl.handle.net/10680/2030
dc.description.abstract: Automatic text summarization is a difficult task that requires a good understanding of the input text to produce a fluent, brief, and comprehensive summary. Text summarization models have uses ranging from legal document summarization to news summarization. The model should be able to identify where important information is located in order to produce a good summary. However, infrequently used, or rare, words might limit the model's understanding of an input text, as the model might ignore such words or pay less attention to them. Another issue is that the model accepts only a limited number of tokens (words) of an input text, so the accepted portion might contain redundant information or omit important information located further in the text. To address the problem of rare words, we propose a modification to the attention mechanism of the transformer model with a pointer-generator layer, in which the attention mechanism receives frequency information for each word, which helps to boost rare words. Additionally, our proposed supervised learning model uses a hybrid approach that incorporates both extractive and abstractive elements, so that more important information reaches the abstractive model in a news summarization task. We designed experiments involving a combination of six different hybrid models with varying input text sizes (measured in tokens) to test our proposed model. Four well-known datasets specific to news articles were used in this work: CNN/DM, XSum, Gigaword and DUC 2004 Task 1. Our results were compared using the well-known ROUGE metric. Our best model achieved an R-1 score of 38.22, an R-2 score of 15.07 and an R-L score of 35.79, outperforming three existing models by several ROUGE points. [en_US]
dc.language.iso: en [en_US]
dc.publisher: University of Winnipeg [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Abstractive Summarization [en_US]
dc.subject: Extractive Summarization [en_US]
dc.subject: Natural Language Processing [en_US]
dc.subject: Pointer-generator [en_US]
dc.subject: Transformer [en_US]
dc.subject: Rare Words [en_US]
dc.subject: Text Summarization [en_US]
dc.title: Rare words in text summarization [en_US]
dc.type: Thesis [en_US]
dc.description.degree: Master of Science in Applied Computer Science [en_US]
dc.publisher.grantor: University of Winnipeg [en_US]
dc.identifier.doi: 10.36939/ir.202201301602 [en_US]
thesis.degree.discipline: Applied Computer Science
thesis.degree.level: masters
thesis.degree.name: Master of Science in Applied Computer Science
thesis.degree.grantor: University of Winnipeg

