Twitter has set itself apart as a direct communication medium that reaches a very large audience. Tweets are concise and convey their message in a snappy manner, which has made Twitter popular enough to influence global affairs.
Twitter datasets can be used effectively in academic research, social projects and the study of marketing methodologies. I have compiled a list of Twitter dataset archives from various sources, which should be useful for anyone looking for a reliable source of Twitter data.
I have also mentioned a method to get specific Twitter historical data of any kind, but first, let’s discuss all the free and reliable sources.
Type of data- Miscellaneous research data (2013-2018).
This is a collection of Twitter data gathered through the stream for the purposes of research, history, testing and data retention. The archive holds a large amount of data, so we can select the data stream we need and sort and use it as required. The datasets available here can be downloaded for free.
Type of data- Twitter accounts of MNCs and influential people.
Data.world is a free Twitter dataset repository. Users can find datasets ranging from companies to influential individuals. We can simply head over to the website and browse through their collection of data sets.
Type of data- Russian troll tweets to celebrity accounts.
Like all things on GitHub, this is a free data repository. The datasets range from Elon Musk's tweets to Russian troll tweets. Users can simply head over to the mentioned URL and browse through the vast collection of Twitter datasets.
Type of data- Scientific research data.
Kaggle is a free online repository for sharing code, scientific data and Twitter data sets. It hosts a huge collection of user-submitted data sets which are available to download for free. The data ranges from environmental studies to tweets about demonetization in India.
Type of data- Academic research data.
ICWSM is a data sharing initiative which has a vast collection of Twitter data sets. The collection is free to download; users only have to register on the website and sign a disclosure under which they agree not to share the data. These data sets can be extremely beneficial in the field of academic research.
Type of data- Data related to real world events.
This collection includes 30 different data sets associated with real world events. They were collected between 2012 and 2016, using the streaming API with a set of keywords. As per Twitter TOS, all of this data is available for non-commercial purposes only.
Type of data- Old Twitter data from October 2010.
This data set contains tweets which were posted on Twitter in October 2010. Although quite old, it might still be relevant to data miners and academicians. Just click on the link to download the dataset.
Type of data- Sample of 16 million unfiltered tweets.
This archive consists of approximately 16 million tweets sampled between January 23rd and February 8th. It is unfiltered, so it contains spam as well as important tweets. You just need to sign a disclosure agreeing not to use the data for commercial purposes, after which you can download the archive right away.
Type of data- Miscellaneous.
KDnuggets is a multi-purpose portal which provides information on jobs, relevant courses and webinars, as well as free downloadable Twitter data sets. You can go directly to the link provided and browse through their collection of datasets.
10. GitHub troll tweets
Type of data- Russian troll tweets.
This GitHub archive provides a large dataset of Russian troll tweets. All the data sets are readily downloadable in CSV format.
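Once a CSV file like this has been downloaded, a first pass over it needs nothing beyond Python's standard library. The following is a minimal sketch: the inline sample and its column names (`author`, `content`, `language`) are hypothetical stand-ins, so check the header row of the real file before adapting it.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file.
# The real column names depend on the dataset.
SAMPLE_CSV = """author,content,language
user_a,First example tweet,English
user_b,Second example tweet,Russian
user_a,Third example tweet,English
"""


def count_by_column(csv_file, column):
    """Count how many rows share each distinct value in the given column."""
    counts = {}
    for row in csv.DictReader(csv_file):
        value = row[column]
        counts[value] = counts.get(value, 0) + 1
    return counts


# For a real download, replace io.StringIO(...) with
# open("your_file.csv", newline="", encoding="utf-8").
counts = count_by_column(io.StringIO(SAMPLE_CSV), "author")
print(counts)
```

Reading row by row with `csv.DictReader` keeps memory use low, which helps when the files are large.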
11. GitHub scraped public tweets
Type of data- Miscellaneous public tweets.
This dataset is a collection of scraped public Twitter updates that was used in an academic project to study the geolocation data related to the tweets.
12. Mega.nz Reddit data set
Type of data- Reddit comments data set.
This is the complete data set of Reddit's publicly available comments, which can be used for large-scale analytical research. The file size is about 250 GB compressed and over 1 TB uncompressed. The link provided points to the torrent file, which can be easily downloaded using a torrent client.
13. Kaggle customer support data sets
Type of data- Customer support tweets.
This data set consists of over 3 million tweets by the customer support accounts of various big brands and companies. It can be used for understanding conversational models and for studying modern customer support practices and their impact.
Type of data- Tweets from NASDAQ companies to UK geolocation tweet data.
Follow the hashtag provides a collection of data sets ranging from the top 100 NASDAQ companies to UK geolocation tweet data. Just click on the link to browse the datasets.
Type of data- Miscellaneous.
Lionbridge provides a comprehensive list of Twitter data sets which range from everyday news to tweets with the hashtag #Avengersendgame and so on. Just click on the link and browse through the list of their available datasets.
16. Academic torrents
Type of data- URLs posted on Twitter in October 2010.
This dataset consists of URLs that were posted on Twitter in October 2010. The link will take you to the torrent file, which can be easily downloaded through a torrent client.
17. Social computing
Type of data- Data from friendship and followership networks.
This dataset is provided by Arizona State University and consists of Twitter data on friendship/followership networks.
Type of data- Tweet sentiment analysis data.
Sentiment140 allows you to discover the sentiment of a brand, product, or topic on Twitter. It classifies tweets as positive or negative by analyzing the emoticons they contain.
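To make the emoticon idea concrete, here is a deliberately simplified sketch of labeling tweets by the emoticons they contain. It is an illustration only, not Sentiment140's actual implementation; the emoticon lists and the function name `label_by_emoticon` are assumptions made for this example.

```python
# Assumed, deliberately small emoticon lists for illustration.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(", "=(")


def label_by_emoticon(text):
    """Return "positive", "negative", or None when the signal is absent or mixed."""
    has_positive = any(e in text for e in POSITIVE_EMOTICONS)
    has_negative = any(e in text for e in NEGATIVE_EMOTICONS)
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    return None  # no emoticon, or both kinds present


for tweet in ["Love the new phone :)", "Flight delayed again :(", "Just landed"]:
    print(tweet, "->", label_by_emoticon(tweet))
```

Tweets without a clear emoticon signal are left unlabeled rather than guessed, which keeps the labels cleaner at the cost of coverage.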
Type of data- Miscellaneous.
Docnow provides a catalog of datasets that are publicly available on the web. If you would like to turn these tweet identifier datasets back into the original JSON format, first download the dataset and then use the Hydrator desktop application, or Twarc if you are comfortable working at the command line.
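A tweet identifier dataset is simply a list of IDs, and hydration means looking each ID up again to retrieve the full tweet. Tools like the ones above handle this for you, but the preparatory step can be sketched as splitting the IDs into request-sized batches. The batch size of 100 and the function name `batch_ids` are assumptions for this example, and no network call is made; check the limits of whichever tool or API you actually use.

```python
def batch_ids(tweet_ids, batch_size=100):
    """Yield consecutive slices of tweet_ids, each at most batch_size long."""
    for start in range(0, len(tweet_ids), batch_size):
        yield tweet_ids[start:start + batch_size]


# Hypothetical IDs standing in for the contents of a downloaded ID file.
ids = [str(n) for n in range(1, 251)]
batches = list(batch_ids(ids))
print([len(batch) for batch in batches])
```

Keep in mind that tweets which have been deleted or protected since the dataset was published cannot be retrieved, so a hydrated dataset may be smaller than the ID list.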
20. Harvard dataverse
Type of data- USA presidential election tweets.
This dataset contains the tweet IDs of approximately 280 million tweets related to the 2016 United States presidential election. They were collected between July 13, 2016 and November 10, 2016 from the Twitter API using Social Feed Manager.
Type of data- Miscellaneous.
This Twitter dataset contains all 40,815,975 tweets that matched at least one of 45 keywords, were posted between June 1, 2014 and May 31, 2015, and had not been deleted or protected as of July 2015. Head over to the link to find the list of the 45 keywords and download the data set.
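Selecting tweets that match "at least one keyword" can be sketched with a single regular expression. This is a minimal illustration under stated assumptions: the keywords below are placeholders rather than the 45 used for this dataset, and whole-word, case-insensitive matching is a choice made for the example, not necessarily the rule the collectors applied.

```python
import re

# Placeholder keywords; the dataset's real list is on its download page.
KEYWORDS = ["alpha", "beta"]


def matches_any_keyword(text, keywords):
    """Return True if text contains at least one keyword as a whole word."""
    if not keywords:
        return False
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


tweets = [
    "Alpha release is out",
    "nothing relevant",
    "Testing the BETA build",
    "alphabet soup",
]
matched = [t for t in tweets if matches_any_keyword(t, KEYWORDS)]
print(matched)
```

The word boundaries stop "alpha" from matching inside "alphabet", and `re.escape` keeps keywords containing punctuation from being read as regex syntax.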
For customized Twitter dataset requirements
There are instances when a more specific Twitter dataset is required. This is where TrackMyHashtag comes in. It is a tool which allows us to download a more targeted set of data. This is a paid platform; prices start from $30 and vary with the required data set.
Submit the request for your required Twitter dataset: Historical Twitter dataset request form
TrackMyHashtag (TMH) employs AI-based tracking tools which allow you to track historical as well as real-time Twitter data related to any hashtag, keyword or account. The features of TrackMyHashtag are:
1. Historical Twitter data of any time period
2. Twitter data set related to any hashtag, keyword, account or search term
3. Geo-location based Twitter data
4. Language based Twitter data
The following metadata are present in TrackMyHashtag’s data sheet of historical Twitter data:
- Tweet ID, URL and posted time.
- Tweet content.
- Tweet type and Tweet source.
- Retweets and likes received.
- Tweet location and language.
- User ID, name, username, bio, profile URL, followers, following and account creation date of the Twitter user who posted the tweet.
- Twitter account’s verification and protected status.
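When working with a sheet like this in code, it can help to give each row an explicit shape. The sketch below models a subset of the fields listed above as a Python dataclass; the field names, types and sample values are assumptions made for illustration and will not match the actual column headers of an export.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TweetRecord:
    """One row of tweet metadata (hypothetical field names, subset of the list)."""
    tweet_id: str
    url: str
    posted_time: datetime
    content: str
    retweets: int
    likes: int
    language: str
    username: str
    followers: int
    verified: bool


def engagement(record):
    """Total interactions received by a tweet: retweets plus likes."""
    return record.retweets + record.likes


# Made-up sample row for demonstration only.
sample = TweetRecord(
    tweet_id="1",
    url="https://example.com/status/1",
    posted_time=datetime(2019, 5, 1, 12, 0),
    content="Example tweet",
    retweets=5,
    likes=10,
    language="en",
    username="example_user",
    followers=120,
    verified=False,
)
print(engagement(sample))
```

Storing IDs as strings avoids precision problems when the data passes through tools that treat large numbers as floating point.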
Data is essential for research and various other academic purposes. I have included most of the Twitter data sets that had working download links. Due to changes in Twitter's TOS, it is becoming increasingly difficult to acquire this data. However, all the links mentioned in this article were working at the time of writing. Happy mining!