
Load CSV into Elasticsearch using Python

How do you load a CSV file into Elasticsearch using Python? Well, it's very simple using the Elasticsearch Python package.


1. Install Elasticsearch if not already installed.

Please follow this blog post for step-by-step instructions on how to install and configure Elasticsearch – Install elasticsearch

Read What is Elasticsearch? if you want to understand Elasticsearch and how it works.

2. Acquire CSV dataset.

Google Dataset Search is a great tool for finding datasets from around the world, and it is what we will use here.

In addition to filtering by "usage rights" and "topics", it lets you filter by "last updated" and "download format", which makes narrowing down to a suitable dataset much easier.

The dataset we are going to index in Elasticsearch comes from Kaggle.

It contains tweets from verified users concerning stocks traded on the NYSE, NASDAQ, and S&P 500.

Here is the link to the CSV file. (You may need to register on Kaggle to download the dataset.) I found this dataset by searching for "tweets" in Google Dataset Search.

3. What is the dataset about?

This stockerbot-export.csv file contains 28k+ tweets about publicly traded companies (and a few cryptocurrencies) that are tagged with the company they are tweeting about.

Here are the columns and their descriptions:

id: Tweet id.

text: The tweet.

timestamp: Timestamp of the tweet.

source: The account that posted the tweet.

symbols: The stock symbol the tweet is about.

company_names: The company name behind the "symbols" ticker.

url: The url of the tweet.

verified: Whether the tweeting user is verified.
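Before indexing, it can help to peek at a row with csv.DictReader to confirm these columns. Here is a minimal sketch that uses an inline two-line sample in place of the real stockerbot-export.csv (the row values are made up for illustration):

```python
import csv
import io

# Inline stand-in for the header and first row of stockerbot-export.csv;
# the values below are illustrative, not taken from the real file.
sample = io.StringIO(
    "id,text,timestamp,source,symbols,company_names,url,verified\n"
    "1,Hello $GS,Wed Jul 18 21:33:26 +0000 2018,GoldmanSachs,GS,"
    "The Goldman Sachs,https://example.com/1,True\n"
)

# DictReader maps each row to a dict keyed by the header line.
row = next(csv.DictReader(sample))
print(sorted(row))  # the column names listed above, alphabetically
```

With the real file, replace the io.StringIO sample with open('stockerbot-export.csv').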

4. Pre-indexing checkup:

Now that you have the dataset, check whether Elasticsearch is up and running.

The simplest way to do this is to open your browser and enter this URL http://localhost:9200/

If you see a response similar to the one below, then you are good to go,

{
  "name" : "injustice-Lenovo-Z50-70",
  "cluster_name" : "graylog",
  "cluster_uuid" : "3e4_MpPtQE2wMXOC0FHBOQ",
  "version" : {
    "number" : "7.9.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "a479a2a7fce0389512d6a9361301708b92dff667",
    "build_date" : "2020-08-11T21:36:48.204330Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

Next, install the elasticsearch Python package.

pip install elasticsearch

5. Python code to load the CSV into Elasticsearch.

Import the Elasticsearch client and the helpers functions from the elasticsearch package. Also, import the csv module.

from elasticsearch import Elasticsearch, helpers
import csv

Create the Elasticsearch client, which will connect to Elasticsearch.

Two parameters are passed to the Elasticsearch constructor:

host: the host address of Elasticsearch

port: the port of Elasticsearch

es = Elasticsearch(host="localhost", port=9200)

Then, open the CSV file with the csv module's DictReader and bulk-upload the rows to Elasticsearch.

with open('stockerbot-export.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index='tweets')

Note:

Three parameters are passed to the helpers.bulk method,

es: The Elasticsearch client.

reader: The csv reader object.

index: The name of the Elasticsearch index. If this index doesn't exist in Elasticsearch yet, it will be created automatically.

Putting it all together,

from elasticsearch import Elasticsearch, helpers
import csv

# Create the elasticsearch client.
es = Elasticsearch(host="localhost", port=9200)

# Open csv file and bulk upload
with open('stockerbot-export.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index='tweets')
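One caveat with the script above: helpers.bulk lets Elasticsearch auto-generate document ids, so re-running it would duplicate every tweet. A sketch of one way around that, reusing the CSV's "id" column as the document _id (generate_actions is my own helper name, not part of the library, and the inline sample stands in for the real file):

```python
import csv
import io

def generate_actions(rows, index_name):
    # Wrap each CSV row in a bulk action whose _id is the tweet's own
    # "id" column, so re-running the load overwrites instead of duplicating.
    for row in rows:
        yield {
            "_index": index_name,
            "_id": row["id"],
            "_source": row,
        }

# Inline stand-in for stockerbot-export.csv (values are illustrative).
sample = io.StringIO(
    "id,text,timestamp\n"
    "1019696670777503745,Hello $GS,Wed Jul 18 21:33:26 +0000 2018\n"
)
actions = list(generate_actions(csv.DictReader(sample), "tweets"))
print(actions[0]["_id"])  # 1019696670777503745
```

With a live cluster, you would pass the generator straight to helpers.bulk in place of the plain reader: helpers.bulk(es, generate_actions(reader, 'tweets')).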

6. Verify in Elasticsearch

Check that the "tweets" index was created with the following curl command,

curl -XGET "http://localhost:9200/_cat/indices"

You will see a list of all indices along with their document counts and disk sizes.

yellow open tweets                         HXmH78GtTaaN4_Q******g 1 1  28276   0  15.9mb  15.9mb

Here you can see the total document count is 28,276 and the index takes 15.9 MB on disk.

Now let's see whether the documents indexed in the "tweets" index are correct.

Run this curl command,

curl -XGET "http://localhost:9200/tweets/_search" -H 'Content-Type: application/json' -d'{ "size": 2, "query": { "match_all": {} } }'

I am returning only two documents here.

It should show all the columns of the CSV, as below,

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "tweets",
        "_type" : "_doc",
        "_id" : "c-lgnXQBhEpcX7VCh2b4",
        "_score" : 1.0,
        "_source" : {
          "id" : "1019696670777503700",
          "text" : "VIDEO: “I was in my office. I was minding my own business...” –David Solomon tells $GS interns how he learned he wa… https://t.co/QClAITywXV",
          "timestamp" : "Wed Jul 18 21:33:26 +0000 2018",
          "source" : "GoldmanSachs",
          "symbols" : "GS",
          "company_names" : "The Goldman Sachs",
          "url" : "https://twitter.com/i/web/status/1019696670777503745",
          "verified" : "True"
        }
      },
      {
        "_index" : "tweets",
        "_type" : "_doc",
        "_id" : "dOlgnXQBhEpcX7VCh2b4",
        "_score" : 1.0,
        "_source" : {
          "id" : "1019709091038548000",
          "text" : "The price of lumber $LB_F is down 22% since hitting its YTD highs. The Macy's $M turnaround is still happening.… https://t.co/XnKsV4De39",
          "timestamp" : "Wed Jul 18 22:22:47 +0000 2018",
          "source" : "StockTwits",
          "symbols" : "M",
          "company_names" : "Macy's",
          "url" : "https://twitter.com/i/web/status/1019709091038547968",
          "verified" : "True"
        }
      }
    ]
  }
}

As you can see, all the columns and data are showing properly.

So, the data is now ready for analysis.
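If you prefer the Python client over curl, the same match_all query can be issued with es.search. A sketch, assuming the cluster from the earlier steps is still reachable on localhost:9200 (the live calls are commented out since they need a running cluster):

```python
# The query body mirrors the curl request above: two documents, match all.
query = {"size": 2, "query": {"match_all": {}}}

# With a live cluster, uncomment to run:
# from elasticsearch import Elasticsearch
# es = Elasticsearch(host="localhost", port=9200)
# resp = es.search(index="tweets", body=query)
# for hit in resp["hits"]["hits"]:
#     print(hit["_source"]["text"])
```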

Let me know in comments section if you are stuck somewhere.

Happy learning 🙂

By Satyanarayan Bhanja

Machine learning engineer
