LOIT

This dataset contains 79,45,718 (about 7.9 million) Telugu tweets and 1,76,54,722 (about 17.7 million) Hindi tweets. Counts are written with Indian digit grouping.

Telugu: LOIT contains most (>85%) of the Telugu tweets posted between 2010-01-01 and 2019-06-25. Of these, 73,78,582 have unique text.

Hindi: LOIT contains most of the Hindi tweets posted between 2017-01-13 and 2018-12-31. Of these, 1,58,07,840 have unique text.
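
The counts above use Indian digit grouping, which can be awkward to compare at a glance. The sketch below converts them to plain integers and computes the share of tweets with unique text for each language. The helper names `parse_indian_number` and `unique_share` are illustrative and are not part of the `loit` package.

```python
def parse_indian_number(text: str) -> int:
    """Convert a count with Indian digit grouping (e.g. '79,45,718') to an int."""
    return int(text.replace(",", ""))


# Counts as quoted in this README.
counts = {
    "telugu": {"total": "79,45,718", "unique": "73,78,582"},
    "hindi": {"total": "1,76,54,722", "unique": "1,58,07,840"},
}


def unique_share(language: str) -> float:
    """Fraction of tweets in the given language whose text is unique."""
    total = parse_indian_number(counts[language]["total"])
    unique = parse_indian_number(counts[language]["unique"])
    return unique / total


for language in counts:
    print(f"{language}: {unique_share(language):.1%} of tweets have unique text")
```

By this calculation, roughly 93% of the Telugu tweets and roughly 90% of the Hindi tweets have unique text.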

Installation

# Install tensorflow or tensorflow-gpu separately
pip install loit

Usage

import loit

# download data
# hindi and telugu are available as of now
loit.download('hindi', 'data')

# download fasttext cbow vectors and read them 
loit.load_vectors('hindi', 'cbow')

# download fasttext skipgram vectors and read them
loit.load_vectors('hindi', 'skipgram')

# read the downloaded JSON records
# returns an iterator that yields one JSON object per tweet
it = loit.read_data('telugu')

for tweet_json in it:
    print(tweet_json['tweet'])
    input()  # wait for Enter before printing the next tweet
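
Since the dataset statistics distinguish tweets with unique text, it can be useful to skip repeated texts while streaming. The following is a minimal sketch, assuming each record is a dict-like object with a 'tweet' key, as in the loop above. The helper `unique_tweets` is hypothetical and not part of `loit`; the sample records are made up for illustration.

```python
from typing import Dict, Iterable, Iterator


def unique_tweets(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield only the first record seen for each distinct tweet text."""
    seen = set()
    for record in records:
        text = record["tweet"]
        if text in seen:
            continue  # duplicate text, skip it
        seen.add(text)
        yield record


# Made-up sample records standing in for the iterator from loit.read_data
sample = [{"tweet": "నమస్తే"}, {"tweet": "hello"}, {"tweet": "నమస్తే"}]
deduped = list(unique_tweets(sample))
print(len(deduped))
```

With the real data, the same helper could presumably be wrapped around the iterator, for example `unique_tweets(loit.read_data('telugu'))`. Note that the `seen` set keeps every distinct text in memory, which may be significant for millions of tweets.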

Learn More