QA system using Haystack and InMemoryDocumentStore

Haystack is a scalable QA system to search in large collections of documents. See how to build a QA system using Haystack and InMemoryDocumentStore.

1. What is Haystack?

Haystack is a large-scale question answering (QA) framework for searching large collections of documents.

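2. How is it different from modern question answering models like BERT, ALBERT …?

BERT/DistilBERT models fine-tuned on QA tasks are designed to find answers within small text documents.
Running such a model over a large set of documents is expensive: model size, compute cost, and latency are all high.
Haystack handles this by first narrowing the collection down to a few candidate passages and only then running the model on them.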

3. What are the core features of the Haystack?

i. It can use the latest SOTA transformer models (like BERT, RoBERTa, ALBERT).
ii. Because of its modular design, you can easily swap in a new SOTA model.
iii. It is simple to use and to modify for your use case.
iv. It is production-ready: it works with Elasticsearch and can be integrated via REST APIs.
v. You can fine-tune the QA pipeline on your own domain, and there is an option to improve it with user feedback.

4. What are the components of Haystack?

There are two major components of Haystack: DocumentStore and Finder.
Finder is a pipeline that combines Retriever and Reader.

5. What is a DocumentStore in Haystack?

DocumentStore is the database where documents are stored for search.
For proof-of-concept (POC) purposes, an SQLite or in-memory document store is available.
For production, however, Elasticsearch is recommended.

6. What is a Retriever in Haystack?

The Retriever quickly narrows a large collection of documents down to a set of candidate passages.
For example, a Retriever might return 100 passages out of a million documents.
The algorithm can be BM25, TF-IDF, Elasticsearch queries, or an embedding-based approach.
In short, the Retriever leaves only a small number of passages in which the answer then has to be found.
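
To make the idea of narrowing down concrete, here is a minimal sketch of TF-IDF-style retrieval in plain Python. It is not Haystack's implementation; the tokenize and retrieve helpers, the smoothed IDF formula, and the sample passages are assumptions made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,!?;:\"'()").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def retrieve(query, passages, top_k=2):
    """Return the top_k (score, passage) pairs ranked by a simple TF-IDF score."""
    tokenized = [tokenize(p) for p in passages]
    n = len(passages)
    # Document frequency: in how many passages does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scored = []
    for passage, toks in zip(passages, tokenized):
        tf = Counter(toks)
        # Term frequency times a smoothed inverse document frequency.
        score = sum(
            (tf[term] / len(toks)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term in query_terms
            if toks
        )
        scored.append((score, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical toy corpus, not taken from the tutorial data.
passages = [
    "Arya Stark is the younger daughter of Eddard Stark.",
    "Renly Baratheon is the brother of Stannis.",
    "Jon Snow joins the Night's Watch.",
]
results = retrieve("Who is the father of Arya Stark?", passages, top_k=2)
for score, passage in results:
    print(round(score, 3), passage)
```

With these three toy passages, the passage about Arya Stark ranks first because it contains the rarest query terms.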

7. What is a Reader in Haystack?

Readers use models like BERT/RoBERTa that are fine-tuned on SQuAD-like tasks.
A Reader takes the passages returned by the Retriever and returns the top n answers along with their confidence scores.

8. Can you fine-tune for your own domain data?

Yes, you can fine-tune the transformer models on your own domain data or use any Hugging Face transformers model.

Now, let’s start building a QA system using Haystack and InMemoryDocumentStore/SQLite.

  1. Install Haystack.
# Install latest release of Haystack
!pip install git+

2. Import the required libraries.

# Module paths follow the Haystack 0.x package layout
from haystack import Finder
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

3. Create a document store.

There are two ways to store documents here: in memory or in SQLite.
Use either one of the two snippets below.

# In-Memory Document Store
from haystack.document_store.memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
# SQLite Document Store
# from haystack.document_store.sql import SQLDocumentStore
# document_store = SQLDocumentStore(url="sqlite:///qa.db")

4. Haystack provides a customizable pipeline for:

  • Converting files into texts
  • Cleaning the texts
  • Splitting the texts
  • Writing them to the document store

In this tutorial, we download Wikipedia articles about Game of Thrones, apply a simple cleaning function, and store them in memory/SQLite.

5. We will download Wikipedia data for Game of Thrones. There are around 517 articles.

doc_dir = "data/article_txt_got"
s3_url = ""
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

6. Convert files to dictionaries.

Convert the files to dictionaries holding the records that can be indexed in our document store.
Optionally, you can pass a cleaning function that is applied to each document (for example, to remove footers).
The cleaning function must take a str as input and return a str.

dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
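
As an illustration of the str-in, str-out contract for a cleaning function, here is a minimal sketch. The clean_text function and the footer marker it looks for are hypothetical; this is not how clean_wiki_text is implemented.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical cleaning function: takes a str and returns a str."""
    # Drop lines that look like a footer (marker assumed for this example).
    lines = [ln for ln in text.splitlines() if not ln.startswith("Retrieved from")]
    text = "\n".join(lines)
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "Arya Stark is a character.\n\n\n\nShe is from Winterfell.\nRetrieved from example-wiki"
cleaned = clean_text(raw)
print(repr(cleaned))
```

Any function with this shape could presumably be passed as clean_func in place of clean_wiki_text.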

We now have a list of dictionaries that we can write to our document store.
If your texts come from another source (e.g. a DB), you can of course skip converting files to dicts and build the dictionaries yourself.
In that case, the default format is: {"name": "<some-document-name>", "text": "<actual-text>"}
(Note that the entries printed below carry the name under a 'meta' key.)
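
If your texts live in a database, building the dictionaries yourself could look like the following sketch. The articles table, its columns, and the sample rows are hypothetical, and the dictionary layout mirrors the entries printed in this post ('text' plus a 'meta' dict with a 'name'), which is an assumption about what your Haystack version expects.

```python
import sqlite3

# Hypothetical source database with two made-up articles.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("arya.txt", "Arya Stark is the younger daughter of Eddard Stark."),
        ("renly.txt", "Renly Baratheon is the brother of Stannis."),
    ],
)

# Build one dictionary per row instead of calling convert_files_to_dicts.
rows = conn.execute("SELECT title, body FROM articles ORDER BY title").fetchall()
custom_dicts = [{"text": body, "meta": {"name": title}} for title, body in rows]
conn.close()

print(custom_dicts[0])
```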

Let’s have a look at the first three entries.

[{'text': "'''''Catch the Throne''''' is a two-volume mixtape. The first volume was released digitally on June 10, 2014, and on CD on July 1, 2014 as a free mix tape that features various rap artists to help promote the HBO series ''Game of Thrones''. The albums feature hip hop artists including Snoop Dogg, Ty Dolla $ign, Common, Wale, Daddy Yankee, as well as music by Ramin Djawadi from the show and some voices from the show.", 'meta': {'name': '52_Catch_the_Throne.txt'}}, {'text': '\n==Reception==\nThe album received mostly mixed reviews from critics and fans alike.', 'meta': {'name': '52_Catch_the_Throne.txt'}}, {'text': "\n=== ''Volume I'' ===\nTo help promote the series to a broader audience including multicultural urban youth, HBO commissioned an album of rap songs dedicated to ''Game of Thrones''.  Entitled ''Catch the Throne'', it was published for free on SoundCloud on March 7, 2014.", 'meta': {'name': '52_Catch_the_Throne.txt'}}]

7. Let’s write the docs to our db/in-memory store.
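
The code for this step is not shown in the post. In Haystack releases of this era, document stores expose a write_documents method, so the step is presumably document_store.write_documents(dicts); treat that as an assumption and check it against your installed version. The toy class below is a hypothetical stand-in (not Haystack code) that sketches what writing to an in-memory store amounts to.

```python
class ToyInMemoryStore:
    """Hypothetical stand-in for an in-memory document store."""

    def __init__(self):
        self._docs = {}
        self._next_id = 0

    def write_documents(self, dicts):
        """Copy each dictionary into the store under a fresh integer id."""
        for d in dicts:
            self._docs[self._next_id] = {
                "text": d["text"],
                "meta": dict(d.get("meta", {})),
            }
            self._next_id += 1

    def get_all_documents(self):
        """Return the stored documents in insertion order."""
        return list(self._docs.values())


# Made-up sample in the same shape as the entries printed above.
sample_dicts = [
    {"text": "Arya Stark is the younger daughter of Eddard Stark.", "meta": {"name": "arya.txt"}},
    {"text": "Renly Baratheon is the brother of Stannis.", "meta": {"name": "renly.txt"}},
]
toy_store = ToyInMemoryStore()
toy_store.write_documents(sample_dicts)
# With the real store, the assumed equivalent is: document_store.write_documents(dicts)
print(len(toy_store.get_all_documents()))
```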


Initialize Retriever, Reader, & Finder

  1. Retriever:
  • With InMemoryDocumentStore or SQLDocumentStore, we’ll use TfidfRetriever.
  • TfidfRetriever is an in-memory retriever based on Pandas data frames.
from haystack.retriever.sparse import TfidfRetriever
retriever = TfidfRetriever(document_store=document_store)

2. Reader

A Reader scans the texts returned by the Retriever in depth and extracts the k best answers.
Readers are based on deep learning models: they give the best results but are slower.

Haystack currently provides Readers built on the FARM and Transformers frameworks. You can load either a local model or one from Hugging Face’s model hub.

Here we use a medium-sized RoBERTa QA model with a FARM-based Reader.

Note: With no_ans_boost, you can tune how readily the model returns “no answer possible”. Higher values mean that the model prefers “no answer possible”.
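
The effect of such a boost can be sketched with a toy decision rule. This is only an illustration of the idea, not FARM's actual scoring; the pick_answer function and all scores are made up.

```python
def pick_answer(span_scores, no_answer_score, no_ans_boost=0.0):
    """Return the best-scoring span, or None if 'no answer' wins after boosting."""
    best_answer, best_score = max(span_scores.items(), key=lambda kv: kv[1])
    # The boost is added to the 'no answer' score before the comparison,
    # so a higher boost makes "no answer possible" more likely.
    if no_answer_score + no_ans_boost > best_score:
        return None
    return best_answer


# Hypothetical scores for two candidate spans and for 'no answer'.
span_scores = {"Eddard Stark": 7.5, "Robert Baratheon": 3.1}
without_boost = pick_answer(span_scores, no_answer_score=5.0, no_ans_boost=0.0)
with_boost = pick_answer(span_scores, no_answer_score=5.0, no_ans_boost=4.0)
print(without_boost, with_boost)
```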

If you want to use FARMReader:

# Load a  local model or any of the QA models on
# Hugging Face's model hub (

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

If you want to use TransformersReader:

# Alternative:
# reader = TransformersReader(model="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)

3. Finder

The Finder combines the Reader and the Retriever in a pipeline to answer our actual questions.

finder = Finder(reader, retriever)
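
Conceptually, such a pipeline chains two stages: a cheap ranking step followed by a more careful extraction step. The sketch below shows this with toy stand-ins. The toy_retriever, toy_reader, and get_answers functions are hypothetical; in particular, the toy reader scores whole sentences by word overlap, whereas a real Reader extracts answer spans with a neural model.

```python
def words(text):
    """Lowercased set of words with basic punctuation removed."""
    for ch in "?.,":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def toy_retriever(question, passages, top_k):
    """Stage 1: keep the top_k passages sharing the most words with the question."""
    q = words(question)
    return sorted(passages, key=lambda p: len(q & words(p)), reverse=True)[:top_k]


def toy_reader(question, passages, top_k):
    """Stage 2: score every sentence of the shortlisted passages."""
    q = words(question)
    candidates = []
    for passage in passages:
        for sentence in passage.split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                score = len(q & words(sentence)) / len(q)
                candidates.append({"answer": sentence, "score": score})
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:top_k]


def get_answers(question, passages, top_k_retriever=2, top_k_reader=1):
    """Run the retriever first, then the reader on the shortlisted passages only."""
    shortlisted = toy_retriever(question, passages, top_k_retriever)
    return toy_reader(question, shortlisted, top_k_reader)


# Made-up toy corpus.
corpus = [
    "Arya Stark is the daughter of Eddard Stark. She was born in Winterfell.",
    "Renly Baratheon is the brother of Stannis. He claims the throne.",
    "Jon Snow joins the Night's Watch. He travels north.",
]
answers = get_answers("Who is the father of Arya Stark?", corpus)
print(answers[0]["answer"])
```

Because the second stage only sees what the first stage lets through, the number of passages shortlisted by the retriever bounds both the quality and the cost of the final answers.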

You can specify how many candidates the Reader and the Retriever return.
A higher top_k_retriever generally gives better answers, because the Reader sees more passages, but it also makes the query slower.

prediction = finder.get_answers(question="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=5)

It will show logs like this:

10/17/2020 05:04:58 - INFO - haystack.finder -   Reader is looking for a detailed answer in 12569 chars ...

Now let’s see the answer:

print_answers(prediction, details="minimal")
[   {   'answer': 'Eddard and Catelyn Stark',
        'context': 'tark ===\n'
                   'Arya Stark is the third child and younger daughter of '
                   'Eddard and Catelyn Stark. She serves as a POV character '
                   "for 33 chapters throughout ''A "},
    {   'answer': 'Lord Eddard Stark',
        'context': 'ark daughters.\n'
                   'During the Tourney of the Hand to honour her father Lord '
                   'Eddard Stark, Sansa Stark is enchanted by the knights '
                   'performing in the event.'},
    {   'answer': 'Lord Eddard and Catelyn Stark',
        'context': 'rk of House Stark is the younger daughter and third child '
                   'of Lord Eddard and Catelyn Stark of Winterfell. Ever the '
                   'tomboy, Arya would rather be traini'},
    {   'answer': 'Robert Baratheon',
        'context': 'hen Gendry gives it to Arya, he tells her he is the '
                   'bastard son of Robert Baratheon. Aware of their chances of '
                   'dying in the upcoming battle and Arya w'},
    {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'}]

It shows the answers along with the passage each answer was extracted from.

Next question:

prediction = finder.get_answers(question="Who is the sister of Jon snow?", top_k_reader=5)
print_answers(prediction, details="minimal")
[   {   'answer': 'Arya Stark',
        'context': 'at a Northern girl is in trouble, who Jon assumes is his '
                   'half-sister, Arya Stark. Mance is revealed to be alive '
                   "thanks to Melisandre's magical tricker"},
    {   'answer': 'Arya',
        'context': ' the support of the Glovers and the Mormonts. Jon learns '
                   'that his sister Arya is being married to Ramsay Bolton so '
                   'that the Boltons may claim Winterfe'},
    {   'answer': 'Daenerys',
        'context': ', Sansa and many Northern lords are livid over Jon bending '
                   'the knee to Daenerys, with Sansa accusing him of being in '
                   "love with her. Jon's bond with Da"},
    {   'answer': 'Alys Karstark',
        'context': "scue Arya. However, the girl in Melisandre's visions turns "
                   'out to be Alys Karstark, a young noblewoman fleeing to the '
                   'Wall to escape her treacherous u'},
    {   'answer': 'Ygritte',
        'context': 'ic to their cause and becoming romantically involved with '
                   'the tenacious Ygritte.  However he ultimately betrays them '
                   'to defend The Wall. Later, as the'}]

Another question:

prediction = finder.get_answers(question="Who killed Renly Baratheon?", top_k_retriever=10, top_k_reader=5)
print_answers(prediction, details="minimal")
[   {   'answer': 'Melisandre',
        'context': 'Before Catelyn can offer a real negotiation, Renly is '
                   'assassinated by Melisandre, who gives birth to a shadow '
                   'demon and sends it to kill Renly in orde'},
    {   'answer': 'Melisandre',
        'context': "ecome Stannis' heir. Before the battle he is assassinated "
                   'by a shadow conjured by Melisandre, though it is unclear '
                   'if Stannis is aware of this or not.'},
    {   'answer': 'shadow creature',
        'context': "t Stannis, and then Renly's subsequent murder later that "
                   'night by a shadow creature. Afterwards, Catelyn flees with '
                   "Brienne of Tarth, one of Renly's k"},
    {   'answer': 'Stannis',
        'context': "tly killed by a shadow demon with the face of Renly's "
                   'brother and rival Stannis, who has discovered he is the '
                   'rightful heir to House Baratheon, but wh'},
    {   'answer': 'Melisandre',
        'context': ' next day. Renly is subsequently assassinated by a shadow '
                   "conjured by Melisandre using Stannis' life force, and many "
                   "of Renly's bannermen immediately "}]
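
Since the same answer can be extracted from several passages, one simple way to summarize such a result is to count how often each answer string appears. The sketch below does this for the five answers printed above, copied into a plain list of dictionaries with an 'answer' key; the vote counting is an illustration, not a Haystack feature.

```python
from collections import Counter

# Answers copied from the printed output for "Who killed Renly Baratheon?".
printed_answers = [
    {"answer": "Melisandre"},
    {"answer": "Melisandre"},
    {"answer": "shadow creature"},
    {"answer": "Stannis"},
    {"answer": "Melisandre"},
]

# Count identical answer strings and pick the most frequent one.
counts = Counter(entry["answer"] for entry in printed_answers)
most_common_answer, votes = counts.most_common(1)[0]
print(most_common_answer, votes)
```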

Here is the Colab link for this blog post.

Let me know in the comment section if you are facing any issues.

My other blog posts about question answering systems:

Question answering using transformers and BERT.

Text2TextGeneration pipeline by Huggingface transformers.

Faster transformer NLP pipeline using ONNX.

By Satyanarayan Bhanja

Machine learning engineer
