LOIT
This dataset contains 79,45,718 (about 7.9 million) Telugu tweets and 1,76,54,722 (about 17.7 million) Hindi tweets.
Telugu: LOIT contains most (>85%) of the Telugu tweets made between 2010-01-01 and 2019-06-25. Of these, 73,78,582 are tweets with unique text.
Hindi: LOIT contains most of the Hindi tweets made between 2017-01-13 and 2018-12-31. Of these, 1,58,07,840 are tweets with unique text.
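The counts above use Indian digit grouping (lakhs and crores). As a rough illustration that is not part of the loit package, the sketch below converts them to plain integers and computes the share of tweets with unique text. The helper names `parse_indian_number` and `unique_share` are hypothetical; only the figures come from this README.

```python
# Illustrative only: these helpers are hypothetical and not part of loit.

def parse_indian_number(text: str) -> int:
    """Convert a number with Indian digit grouping (e.g. '79,45,718') to an int."""
    return int(text.replace(",", ""))


def unique_share(total: str, unique: str) -> float:
    """Fraction of tweets whose text is unique."""
    return parse_indian_number(unique) / parse_indian_number(total)


# Figures copied from the description above.
counts = {
    "telugu": {"total": "79,45,718", "unique": "73,78,582"},
    "hindi": {"total": "1,76,54,722", "unique": "1,58,07,840"},
}

for language, c in counts.items():
    total = parse_indian_number(c["total"])
    share = unique_share(c["total"], c["unique"])
    # The ',' format specifier prints Western three-digit grouping.
    print(f"{language}: {total:,} tweets, {share:.1%} with unique text")
```

By this arithmetic, roughly 93% of the Telugu tweets and roughly 90% of the Hindi tweets have unique text.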
Installation
# Install tensorflow or tensorflow-gpu separately
pip install loit
Usage
import loit
# download data
# hindi and telugu are available as of now
loit.download('hindi', 'data')
# download fasttext cbow vectors and read them
loit.load_vectors('hindi', 'cbow')
# download fasttext skipgram vectors and read them
loit.load_vectors('hindi', 'skipgram')
# read the JSON records from the downloaded data
# returns an iterator that yields one tweet JSON at a time
it = loit.read_data('telugu')
for tweet_json in it:
    print(tweet_json['tweet'])
    # wait for Enter before showing the next tweet
    input()
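Since the dataset contains repeated tweet text, you may want to skip duplicates while iterating. The following is a minimal sketch, not part of loit: `unique_by_text` is a hypothetical helper that works on any iterable of dicts with a 'tweet' key, and the sample list stands in for the iterator returned by loit.read_data so the snippet runs without downloading anything.

```python
from typing import Dict, Iterable, Iterator


def unique_by_text(tweets: Iterable[Dict], key: str = "tweet") -> Iterator[Dict]:
    """Yield only the first record seen for each distinct tweet text.

    Hypothetical helper, not part of loit.
    """
    seen = set()
    for tweet_json in tweets:
        text = tweet_json[key]
        if text in seen:
            continue  # duplicate text, skip it
        seen.add(text)
        yield tweet_json


# Stand-in data; with loit installed you could pass loit.read_data('telugu') instead.
sample = [{"tweet": "నమస్తే"}, {"tweet": "नमस्ते"}, {"tweet": "నమస్తే"}]
unique = list(unique_by_text(sample))
print(len(unique))
```

The set keeps every distinct text in memory, so for the full multi-million-tweet dataset it may be worth storing a hash of each text instead.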