Summarize text document using transformers and BERT

What is Transformers?

Transformers provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with more than 32 pretrained models in over 100 languages and deep interoperability between TensorFlow 2.0 and PyTorch.

For this tutorial I am using the bert-extractive-summarizer Python package.

It wraps the Transformers package by Hugging Face and can use any Hugging Face transformer model to extract summaries from text.
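"Extractive" means the summary is assembled from sentences copied out of the document rather than from newly written ones. The sketch below is only a toy illustration of that idea, not the package's implementation: it ranks sentences by cosine similarity between plain word-count vectors, whereas bert-extractive-summarizer works from transformer embeddings. The helper names (_vector, _cosine, toy_extractive_summary) are made up for this example.

```python
import math
import re
from collections import Counter


def _vector(text):
    """Lower-cased word counts; a crude stand-in for a transformer embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_extractive_summary(sentences, k=1):
    """Return the k sentences most similar to the whole document, in original order."""
    doc_vec = _vector(" ".join(sentences))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: _cosine(_vector(sentences[i]), doc_vec),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]


sentences = [
    "Depp is an American actor.",
    "Depp is an actor and a musician.",
    "The weather was mild.",
]
summary = toy_extractive_summary(sentences, k=1)
print(summary)
```

Every sentence in the result is taken verbatim from the input, which is the defining property of an extractive summary.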

Let's install bert-extractive-summarizer in Google Colab.

!pip install git+https://github.com/dmmiller612/bert-extractive-summarizer.git@small-updates

To install it on your own system instead, run the same command without the leading exclamation mark:

pip install git+https://github.com/dmmiller612/bert-extractive-summarizer.git@small-updates

1. The document I want to summarize:

Here is the opening of the Wikipedia article about Johnny Depp. The document is stored in the text variable.

text = '''
John Christopher Depp II (born June 9, 1963) is an American actor, producer, and musician. He has been nominated for ten Golden Globe Awards, winning one for Best Actor for his performance of the title role in Sweeney Todd: The Demon Barber of Fleet Street (2007), and has been nominated for three Academy Awards for Best Actor, among other accolades. He is regarded as one of the world's biggest film stars.[1][2] Depp made his film debut in the 1984 film A Nightmare on Elm Street, before rising to prominence as a teen idol on the television series 21 Jump Street (1987–1990). He had a supporting role in Oliver Stone's 1986 war film Platoon and played the title character in the 1990 romantic fantasy Edward Scissorhands.

Depp has gained critical praise for his portrayals of inept screenwriter-director Ed Wood in the film of the same name (1994), undercover FBI agent Joseph D. Pistone in Donnie Brasco (1997), author J. M. Barrie in Finding Neverland (2004) and Boston gangster Whitey Bulger in Black Mass (2015). He has starred in a number of successful films, including Cry-Baby (1990), Dead Man (1995), Sleepy Hollow (1999), Charlie and the Chocolate Factory (2005), Corpse Bride (2005), Public Enemies (2009), Alice in Wonderland (2010) and its 2016 sequel, The Tourist (2010), Rango (2011), Dark Shadows (2012), Into the Woods (2014), and Fantastic Beasts: The Crimes of Grindelwald (2018). Depp also plays Jack Sparrow in the swashbuckler film series Pirates of the Caribbean (2003–present).

Depp is the tenth highest-grossing actor worldwide, as films featuring Depp have grossed over US$3.7 billion at the United States box office and over US$10 billion worldwide.[3] He has been listed in the 2012 Guinness World Records as the world's highest-paid actor, with earnings of US$75 million.[4][5] Depp has collaborated on eight films with director, producer, and friend Tim Burton. He was inducted as a Disney Legend in 2015.[6] In addition to acting, Depp has also worked as a musician. He has performed in numerous musical groups, including forming the rock supergroup Hollywood Vampires along with Alice Cooper and Joe Perry.
'''
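Text copied from Wikipedia carries footnote markers such as [1][2], and an extractive summarizer passes them through unchanged. If you would rather not see them in the output, one optional cleaning step is to strip them first. This is a minimal sketch: strip_citation_markers is a hypothetical helper, not part of bert-extractive-summarizer, and the summaries shown in this post were produced from the uncleaned text.

```python
import re


def strip_citation_markers(raw):
    """Remove bracketed numeric markers such as [1] or [12]."""
    return re.sub(r"\[\d+\]", "", raw)


sample = (
    "He is regarded as one of the world's biggest film stars.[1][2] "
    "Depp made his film debut in 1984."
)
cleaned = strip_citation_markers(sample)
print(cleaned)
```

To use it, you would call text = strip_citation_markers(text) before passing text to a summarizer. Cleaning may also change where sentences are split, so the selected sentences could differ from the ones shown here.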

2. Use the default model to summarize

By default, bert-extractive-summarizer uses the 'bert-large-uncased' pretrained model.

Now let's see the code to get a summary:

from summarizer import Summarizer

# Create the default summarizer model
model = Summarizer()

# Extract a summary from "text"
# min_length = minimum length, in characters, of a sentence to be considered
# ratio = fraction of the sentences to keep (0.01 = 1%)
model(text, min_length=60, ratio=0.01)

The default model gives the summary below:

John Christopher Depp II (born June 9, 1963) is an American actor, producer, and musician. He is regarded as one of the world's biggest film stars.[1][2] Depp made his film debut in the 1984 film A Nightmare on Elm Street, before rising to prominence as a teen idol on the television series 21 Jump Street (1987–1990).
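A ratio of 0.01 applied to a document with only a dozen or so sentences works out to less than one sentence, so the summarizer has to fall back to some minimum. The arithmetic can be sketched as follows, assuming the count is rounded down with a floor of one sentence. The package's exact rounding rule is not shown in this post, and sentences_to_keep is a hypothetical name.

```python
def sentences_to_keep(total_sentences, ratio, minimum=1):
    """Hypothetical rule: floor(total * ratio), but never fewer than `minimum`."""
    return max(int(total_sentences * ratio), minimum)


small = sentences_to_keep(13, 0.01)   # 13 * 0.01 = 0.13, so the floor of one applies
larger = sentences_to_keep(13, 0.5)   # 13 * 0.5 = 6.5, rounded down
print(small, larger)
```

In practice the number of sentences you see can also depend on how the text is split into sentences, so treat this as a rough guide when picking a ratio.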

3. Use custom model – ALBERT

Let's see what the summary from the albert-base-v2 model looks like.

from transformers import AlbertTokenizer, AlbertModel

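# output_hidden_states=True makes the model return its hidden layers for the summarizer to use
albert_model = AlbertModel.from_pretrained('albert-base-v2', output_hidden_states=True)
albert_tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')

# Wrap the model in a Summarizer; the name albert_model is reused for the summarizer
albert_model = Summarizer(custom_model=albert_model, custom_tokenizer=albert_tokenizer, random_state=7)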

albert_model(text, min_length=60, ratio=0.01)

This produces the summary below:

John Christopher Depp II (born June 9, 1963) is an American actor, producer, and musician. He is regarded as one of the world's biggest film stars.[1][2] Depp made his film debut in the 1984 film A Nightmare on Elm Street, before rising to prominence as a teen idol on the television series 21 Jump Street (1987–1990).

4. Use DistilBERT model

Let's see what the DistilBERT summary looks like.

from transformers import DistilBertModel, DistilBertTokenizer

distilbert_model = DistilBertModel.from_pretrained('distilbert-base-uncased', output_hidden_states=True)
distilbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

distilbert_model = Summarizer(custom_model=distilbert_model, custom_tokenizer=distilbert_tokenizer, random_state=7)

distilbert_model(text)

This produces the summary below:

John Christopher Depp II (born June 9, 1963) is an American actor, producer, and musician. Depp has gained critical praise for his portrayals of inept screenwriter-director Ed Wood in the film of the same name (1994), undercover FBI agent Joseph D. Pistone in Donnie Brasco (1997), author J. M. Barrie in Finding Neverland (2004) and Boston gangster Whitey Bulger in Black Mass (2015). He was inducted as a Disney Legend in 2015.[6] In addition to acting, Depp has also worked as a musician.

5. Use any custom huggingface model

Let's use a tiny transformer model called bert-tiny-finetuned-squadv2.

from transformers import AutoConfig, AutoModel, AutoTokenizer

# Load model, model config and tokenizer via Transformers
custom_config = AutoConfig.from_pretrained('mrm8488/bert-tiny-finetuned-squadv2')
custom_config.output_hidden_states = True
custom_tokenizer = AutoTokenizer.from_pretrained('mrm8488/bert-tiny-finetuned-squadv2')
custom_model = AutoModel.from_pretrained('mrm8488/bert-tiny-finetuned-squadv2', config=custom_config)

from summarizer import Summarizer

bert_tiny_model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)

bert_tiny_model(text)

Which gives the summary below:

John Christopher Depp II (born June 9, 1963) is an American actor, producer, and musician. Depp has gained critical praise for his portrayals of inept screenwriter-director Ed Wood in the film of the same name (1994), undercover FBI agent Joseph D. Pistone in Donnie Brasco (1997), author J. M. Barrie in Finding Neverland (2004) and Boston gangster Whitey Bulger in Black Mass (2015). Depp also plays Jack Sparrow in the swashbuckler film series Pirates of the Caribbean (2003–present).

6. Let's see the performance of all the models

You can see that all the models give pretty decent summaries.

Have you noticed anything different?

The inference time differs from model to model. Let's measure it using %timeit:

# default model / bert-large-uncased
%timeit model(text, min_length=60, ratio=0.01)
# 1 loop, best of 3: 8.21 s per loop

%timeit albert_model(text, min_length=60, ratio=0.01)
# 1 loop, best of 3: 2.57 s per loop

%timeit distilbert_model(text, min_length=60, ratio=0.01)
# 1 loop, best of 3: 1.16 s per loop

%timeit bert_tiny_model(text)
# 10 loops, best of 3: 56.7 ms per loop

As you can see, bert_tiny_model is the fastest of all the models and still gives a pretty good summary.

The timeit figures shown here are for the example text I chose; they will vary with the length of the text.
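%timeit is an IPython magic, so it only works in notebooks such as Colab. In a plain Python script, roughly the same "best of 3" measurement can be taken with the standard timeit module. This is a sketch: best_time is a hypothetical helper, and fake_summarizer is a stand-in so the example runs without downloading a model.

```python
import timeit


def best_time(fn, *args, repeat=3, number=1, **kwargs):
    """Return the fastest of `repeat` measurements, in seconds per call."""
    timer = timeit.Timer(lambda: fn(*args, **kwargs))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def fake_summarizer(text, ratio=0.2):
    """Stand-in for a real summarizer: just returns a prefix of the text."""
    return text[: int(len(text) * ratio)]


elapsed = best_time(fake_summarizer, "some text " * 1000, ratio=0.1)
print(f"best of 3: {elapsed:.6f} s per call")
```

With the models from this post loaded, the call would look like best_time(bert_tiny_model, text) or best_time(model, text, min_length=60, ratio=0.01).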

bert_tiny gave good results with the fastest inference time. Try different models and compare their summaries.
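One rough way to compare two summaries is to measure how many sentences they share. The sketch below computes the Jaccard overlap of the two sentence sets. It is only an illustration: the helper names are hypothetical, and the splitter is deliberately naive, so abbreviations such as "J. M. Barrie" will be split in the wrong place.

```python
import re


def split_sentences(summary):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return {s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()}


def sentence_overlap(summary_a, summary_b):
    """Jaccard overlap of the two sentence sets: 1.0 means identical, 0.0 means disjoint."""
    a, b = split_sentences(summary_a), split_sentences(summary_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


first = "Depp is an actor. He plays guitar."
second = "Depp is an actor. He was born in 1963."
score = sentence_overlap(first, second)
print(round(score, 2))
```

A high score means two models picked mostly the same sentences, as the default and ALBERT models did above; a low score tells you the summaries are worth reading side by side.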

Here is the colab notebook for this blog post.

Let me know in the comments section if you run into any issues.

Happy learning 🙂
