This dataset contains 79,45,718 (about 7.9 million) Telugu tweets and 1,76,54,722 (about 17.7 million) Hindi tweets.
Telugu: LOIT contains most (>85%) of the Telugu tweets posted between 2010-01-01 and 2019-06-25. Of these, 73,78,582 have unique text.
Hindi: LOIT contains most of the Hindi tweets posted between 2017-01-13 and 2018-12-31. Of these, 1,58,07,840 have unique text.
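The counts above use Indian digit grouping (lakh and crore), so 79,45,718 is the same number as 7,945,718. As a quick sanity check, the sketch below strips the separators and computes the share of tweets with unique text for each language. The helper name `parse_indian_number` is made up for illustration and is not part of loit.

```python
def parse_indian_number(text: str) -> int:
    """Convert a number written with Indian digit grouping to an int."""
    return int(text.replace(",", ""))


# Figures quoted in the dataset description above.
counts = {
    "telugu": {"total": "79,45,718", "unique": "73,78,582"},
    "hindi": {"total": "1,76,54,722", "unique": "1,58,07,840"},
}

unique_share = {}
for language, figures in counts.items():
    total = parse_indian_number(figures["total"])
    unique = parse_indian_number(figures["unique"])
    unique_share[language] = unique / total
    print(f"{language}: {unique:,} of {total:,} tweets have unique text "
          f"({unique_share[language]:.1%})")
```

By these figures, roughly 93% of the Telugu tweets and roughly 90% of the Hindi tweets have unique text.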
    # Install tensorflow or tensorflow-gpu separately
    pip install loit

    import loit

    # download data
    # hindi and telugu are available as of now
    loit.download('hindi', 'data')

    # download fasttext cbow vectors and read them
    loit.load_vectors('hindi', 'cbow')

    # download fasttext skipgram vectors and read them
    loit.load_vectors('hindi', 'skipgram')

    # read the jsons from data
    # returns iterator that yields jsons
    it = loit.read_data('telugu')
    for tweet_json in it:
        print(tweet_json['tweet'])
        input()  # wait for Enter before printing the next tweet
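Since `read_data` yields one JSON record per tweet, deduplicating by text comes down to tracking the texts already seen. The following is a minimal sketch and not part of loit: `unique_tweets` is a hypothetical helper, and the sample records are invented stand-ins for the iterator returned by `loit.read_data`. It assumes only that each record has a `'tweet'` key, as in the snippet above.

```python
from typing import Dict, Iterable, Iterator


def unique_tweets(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield only the first record seen for each distinct tweet text."""
    seen = set()
    for record in records:
        text = record["tweet"]
        if text not in seen:
            seen.add(text)
            yield record


# Invented stand-in records; in practice pass loit.read_data('telugu') instead.
sample = [
    {"tweet": "నమస్తే"},
    {"tweet": "శుభోదయం"},
    {"tweet": "నమస్తే"},  # duplicate text, dropped by unique_tweets
]

deduplicated = list(unique_tweets(sample))
print(len(deduplicated))
```

Keeping every distinct text in a set uses memory proportional to the number of unique tweets. For the full dataset, storing a hash of each text instead would reduce that, at the cost of a small collision risk.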