Key words and phrases: social media analysis – topic extraction – graph clustering – community detection – Data polishing
Analysis and modeling of the popularity dynamics of online content has been an active area of research [1,2,3,4,5,6,7,8,9,10,11]. A popular method for extracting a topic is to collect all the tweets that mention a specific word (keyword) or hashtag and analyze their temporal patterns [1,2,7,10]. While this approach makes it easy to detect the emergence of topics, we can often intuitively identify various “sub-topics” within a topic. The diversity of the content may vary greatly from topic to topic.
We focus on subtopics obtained by clustering the extracted tweet data related to a keyword (i.e., a topic). We study the topic diversity, defined as the number of subtopics in a topic (discussed in more detail in Section 3.3). Topic models, including Latent Dirichlet Allocation (LDA), are a popular method for discovering abstract “topics” from a collection of documents, and so have been applied to social media analysis for discovering topics or sub-topics [2,7,10,11]. In spite of their simplicity and usefulness, most topic model algorithms require users to specify the number of subtopics in advance. In this study, we cannot apply these models because we are interested in inferring the number of subtopics from data.
In this study, we investigate how topic diversity depends on the truthfulness of a topic (whether it is a rumor or a non-rumor), and how the topic diversity changes in time after a disaster. As a first step, we develop a method for quantifying the topic diversity of the tweet data based on text content. Our method is based on clustering a tweet graph using Data polishing [13,14] that automatically determines the number of subtopics. Then, the proposed method is applied to a Twitter dataset before and after the Great East Japan Earthquake of 2011. We find that the temporal patterns in topic diversity differ between rumor and non-rumor topics. Finally, we evaluate the performance of the method and compare its performance with several baselines.
The contributions of this paper are as follows:
- we propose a method for analyzing topic diversity (i.e., homogeneity of opinions in a topic) based on graph clustering.
- we compare topic diversity between rumor and non-rumor topics by applying the method to a Twitter dataset collected before and after the Great East Japan Earthquake of 2011.
- we compare the performance of the proposed method to other existing methods.
This paper is organized as follows. Section 2 introduces related work. Section 3 describes our proposed method. In Section 4, we apply the proposed method to a large Twitter dataset, evaluate it, and compare its performance with other existing methods. Finally, in Section 5, we conclude and discuss a direction for future research.
12] and its extensions [15,16] have been used to identify topics in social media data in a number of approaches [2,7]. These algorithms have the additional disadvantage of requiring the user to specify the number of topics as a parameter. Clustering of word features classifies online content by applying clustering algorithms to feature vectors. Rosa et al. developed a method for discovering topics by applying the K-means algorithm to TF-IDF vectors calculated from tweets and showed that their method performed better than LDA. Graph clustering classifies a large amount of online content by finding community structures. Tanev et al. showed that word co-occurrence graph clustering improved the performance of linking news events to tweets. While previous work identified the topic of each tweet and analyzed the temporal dynamics, our work additionally investigates the diversity of the content by calculating the number of topics. For this purpose, we exploit a graph clustering approach that does not require the number of topics to be specified in advance.
Recently, it has been argued that search engines and social networks potentially facilitate “filter bubble” effects, in which machine-learning algorithms amplify ideological segregation by recommending content targeted toward a user’s preexisting opinions or biases [18,19]. Puschmann pointed out that a political party or political candidate is able to exert a great influence on search results, which decreases the diversity of the content people see when they search for information about that candidate. Some work has developed algorithms to address the diversity issue [21,22]. Interestingly, Stoyanovich et al. proposed ranking algorithms that achieved diversity and fairness of the results, usually with modest costs in terms of quality. While these works focus on the diversity of the input to a user on social media, we focus on the topic diversity of a population of tweets.
In this study, we find a difference in the temporal pattern of the topic diversity between rumor and non-rumor topics, which is applicable to rumor detection algorithms. Rumor detection is an active research area and several methods have been proposed [23,24,25,26,27]. While these methods are based on machine learning, our method is based on clustering a graph of word co-occurrence. Unfortunately, it is difficult to select the most relevant features in machine learning problems when there are a large number of feature variables [28,29]. The graph-based approach has three potential benefits: 1) it is simple to implement, 2) it is applicable to large data sets, and 3) the result is easy to interpret.
Finally, we discuss the relation between this work and our previous one. In the previous work, we proposed a method for visualizing temporal topic transition based on graph clustering. Since the method was proposed for visualization, we demonstrated the topic transition of only a few topics. We neither analyzed the temporal patterns quantitatively nor examined the performance of the method. In this work, the method has been extended to quantify the content diversity of a topic by using correlation analysis and quantile regression. We also compare the temporal patterns of the topic diversity between rumor and non-rumor topics. Furthermore, we evaluate the performance of the method and demonstrate its improvement in data processing runtime over existing methods.
Our method consists of three steps (Figure 1): A. Construction of Tweet-Word Matrices, B. Tweet Graph Generation, C. Graph Clustering using Data polishing.
Construction of Tweet-Word Matrices: Fig 1A
We analyze tweet data posted from 00:00 Japan Standard Time (JST) on March 11 to 0:00 JST on March 14, a total of 72 hours. The tweet data were divided into 144 (= 72 × 2) groups based on their posting time, using a fixed time window of half an hour. We construct a sequence of Tweet-Word matrices, one for each time window. A Tweet-Word matrix records the constituent words of the tweets.
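The windowing step can be illustrated with a minimal sketch. It assumes that each tweet is already tokenized into words and that the matrix is binary (entry 1 if the tweet contains the word); the exact matrix definition is not recoverable from the text above, so both points are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # half-hour window, as in the text


def window_index(posted_at, start, window=WINDOW):
    """Return the 0-based index of the time window that contains posted_at."""
    return int((posted_at - start) // window)


def build_tweet_word_matrices(tweets, start, n_windows):
    """Group (posted_at, tokens) pairs by window and build one matrix per window.

    Returns a list of (vocab, matrix) pairs. matrix[i][j] is 1 if tweet i
    contains word vocab[j] and 0 otherwise (binary encoding is an assumption).
    """
    buckets = [[] for _ in range(n_windows)]
    for posted_at, tokens in tweets:
        k = window_index(posted_at, start)
        if 0 <= k < n_windows:  # drop tweets outside the analysis period
            buckets[k].append(set(tokens))
    result = []
    for bucket in buckets:
        vocab = sorted(set().union(*bucket)) if bucket else []
        column = {word: j for j, word in enumerate(vocab)}
        matrix = [[0] * len(vocab) for _ in bucket]
        for i, words in enumerate(bucket):
            for word in words:
                matrix[i][column[word]] = 1
        result.append((vocab, matrix))
    return result
```

With `start = datetime(2011, 3, 11, 0, 0)` and `n_windows = 144`, the windows cover the 72-hour period described above; a tweet stamped exactly 0:00 on March 14 falls outside the last window.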
We generate a tweet graph from the tweet data in each time window. We define a tweet graph as an undirected graph in which a node represents a tweet and an edge indicates that the connected tweets are similar (Figure 1): if the Jaccard coefficient  of two tweets is larger than the edge threshold , these nodes (tweets) are connected by an edge. The threshold was set to .
The main result (Figure 3) is robust to small changes in the threshold: the result did not change qualitatively for different thresholds .
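The graph construction can be sketched as follows, treating each tweet as a set of words. The threshold is passed as a parameter because its value is not given in the text above; the values used in the example are purely illustrative, and the all-pairs loop is a simplification that would be too slow for large windows.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two word sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def build_tweet_graph(tweet_words, threshold):
    """Return adjacency sets of the tweet graph.

    Tweets i and j are connected when the Jaccard coefficient of their
    word sets is strictly larger than the threshold.
    """
    adj = {i: set() for i in range(len(tweet_words))}
    for i, j in combinations(range(len(tweet_words)), 2):
        if jaccard(tweet_words[i], tweet_words[j]) > threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


# Illustrative input: tweets 0, 1 and 3 share most of their words.
words = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}, {"a", "b", "c"}]
graph = build_tweet_graph(words, threshold=0.4)  # 0.4 is a hypothetical value
```

Because the comparison is strict, a pair whose coefficient equals the threshold is not connected, which matches the "larger than" wording above.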
Clustering Tweet Graph: Fig 1C
We here briefly describe our data polishing algorithm for clustering a tweet graph (See [13,14] for more detail). This algorithm iteratively increases the density of dense subgraphs, and makes sparse subgraphs sparser. As a consequence, we obtain a graph whose dense subgraphs are all cliques, and can thus easily be enumerated by a maximal clique enumeration algorithm.
The iteration is described as follows. For arbitrary two vertices and , we consider the condition
In this paper, we analyze the tweet graph related to a topic (e.g., a non-rumor topic such as “I’m OK”, or a rumor topic). Each resulting cluster is interpreted as a subtopic, and the number of clusters is interpreted as the diversity of the topic (topic diversity). For instance, suppose that there are two topics, where Topic A consists of 30 subtopics (clusters) and Topic B consists of 5 subtopics (Figure 2). In this case, we regard Topic A as more diverse than Topic B.
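The polishing iteration and the cluster count can be sketched as follows. The update condition is not recoverable from the text above, so this sketch assumes the rule commonly described for data polishing: two vertices are connected if the Jaccard similarity of their closed neighbourhoods is at least a threshold `theta`, and disconnected otherwise. The parameters `theta`, `max_iter`, and `min_size` are hypothetical, and this is not the Nysol implementation.

```python
from itertools import combinations


def polish(adj, theta, max_iter=100):
    """Rebuild the edge set until it stops changing (or max_iter is reached).

    Edge (u, v) is present in the next graph iff the Jaccard similarity of
    the closed neighbourhoods N[u] and N[v] is at least theta (assumed rule).
    """
    nodes = sorted(adj)
    for _ in range(max_iter):
        closed = {u: adj[u] | {u} for u in nodes}
        new_adj = {u: set() for u in nodes}
        for u, v in combinations(nodes, 2):
            inter = len(closed[u] & closed[v])
            union = len(closed[u] | closed[v])
            if inter / union >= theta:
                new_adj[u].add(v)
                new_adj[v].add(u)
        if new_adj == adj:  # fixed point reached
            break
        adj = new_adj
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def topic_diversity(adj, theta, min_size=2):
    """Count the clusters (maximal cliques of the polished graph) of at least min_size nodes."""
    polished = polish(adj, theta)
    return sum(1 for clique in maximal_cliques(polished) if len(clique) >= min_size)
```

In a small graph made of a near-clique on four tweets and a triangle joined by one bridging edge, a single polishing pass with `theta = 0.35` completes the near-clique and removes the bridge, leaving two cliques, i.e. a topic diversity of 2.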
We first describe the Twitter dataset. Second, we analyze the temporal pattern of the topic diversity for rumor and non-rumor topics. Finally, we compare the performance of the proposed method with the existing clustering methods.
Our data set consists of tweets posted around the time of the Great East Japan Earthquake, which occurred at 14:46 JST on March 11, 2011. This dataset was obtained from the social media monitoring company Hottolink Inc., which tracked users who used one of 43 hashtags (for example, #jishin, #nhk, and #prayforjapan) or one of 21 keywords related to the disaster. Later, Hottolink collected all tweets posted by all of these users between March 9th (2 days prior to the earthquake) and March 29th. The dataset contains around 200 million tweets in total, making it one of the largest data sets of users’ responses to a disaster. We focused on the data from 00:00 JST on March 11 to 0:00 JST on March 14, a total of 72 hours.
We picked out 10 topics that contain the following keywords: “do” (“suru” in Japanese), “I’m OK”, “go home”, “important”, “safe”, “damage”, “Fukushima”, “Miyagi”, “Cosmo Oil”, and “Isodine”. The collected topics consist of eight non-rumor topics (“do”, “I’m OK”, “go home”, “important”, “safe”, “damage”, “Fukushima”, and “Miyagi”) and two rumor topics (“Cosmo Oil” and “Isodine”). Note that the word “suru” (an auxiliary verb often translated as “do” or ignored when translating to English) is a common function word; a user’s usage of “suru” does not depend on the topic that they are tweeting about. Analyzing the usage of “suru” gives us a baseline for a general topic. The other words were heavily used in tweets after the Great East Japan Earthquake. In particular, the two rumor topics (“Cosmo Oil” and “Isodine”) were related to well-known rumors that spread after the earthquake. A detailed description of the rumors follows.
- “Cosmo Oil”: An explosion at the Cosmo Oil plant released harmful substances into the air.
“Cosmo Oil” is the name of a Japanese oil company. The rumor about the oil tank explosion spread and frightened people. The rumor topic about the explosion at the Cosmo Oil petrochemical complex progressed through the following four stages:
- Fact: Around 15:00 JST on March 11 (just after the quake), the petrochemical complex in Chiba caught fire.
- Rumor: Around 19:00 JST on March 11, the following two tweets were posted and retweeted:
- Radiation and harmful chemicals are leaking into the air from the petrochemical complex. Be careful!
- Don’t go out! The rain contains radiation and harmful materials from the petrochemical complex explosion.
- Correction: Around 15:00 JST on March 12 (the day after the earthquake), the industry’s website and the local government’s Twitter account officially corrected the rumor.
- Disappearance: At night on March 12, the topic disappeared.
- “Isodine”: Isodine is effective for protection against radiation.
“Isodine” is the brand name of a mouthwash that includes iodopovidone. A rumor emerged that isodine protects people from radiation. It progressed in the following four stages:
- Fact: After the nuclear plant explosion occurred, Twitter users expressed fear of radioactive contamination.
- Rumor: Around 7:00 JST on March 12 (the day after the earthquake), the rumors about isodine’s protective benefits emerged.
- Correction: Around 15:00 JST on March 12 (the day after the earthquake), the government and isodine’s manufacturer corrected the rumor.
- Disappearance: At night on March 12, the topic gradually disappeared.
Figure 3 shows log-log scatter plots of the number of tweets and the topic diversity for the 10 topics and for the whole dataset. Each circle represents the number of tweets and the topic diversity in one time window. The color of a circle represents the time: the color changes from white to dark blue as time passes. For the whole dataset (Figure 3A) and the non-rumor topics (Figure 3B-I), we observe that the topic diversity is highly correlated with the number of tweets (Pearson correlation coefficients ranged from 0.953 to 0.996). This result suggests a power law relationship between the topic diversity and the number of tweets. For the rumor topics (Figure 3J and K), we observe that the points in the scatter plot disperse more widely around the linear fit than for the non-rumor ones. The topic diversity is less correlated with the number of tweets (Pearson correlation coefficients of 0.958 and 0.938 for “Cosmo Oil” and “Isodine”, respectively).
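A power law D ≈ c·N^α between the topic diversity D and the number of tweets N appears as a straight line with slope α in log-log coordinates. The following minimal sketch computes the Pearson correlation and the least-squares slope on log-transformed values. Whether the reported coefficients were computed on raw or log-transformed values is not stated above, so the log transform here is an assumption, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def loglog_fit(n_tweets, diversity):
    """Return (r, slope, intercept) of log10(diversity) against log10(n_tweets).

    Windows with zero tweets or zero clusters are skipped because their
    logarithm is undefined.
    """
    pairs = [(math.log10(n), math.log10(d))
             for n, d in zip(n_tweets, diversity) if n > 0 and d > 0]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx                 # estimate of the exponent alpha
    intercept = my - slope * mx       # estimate of log10(c)
    return pearson(xs, ys), slope, intercept
```

For data that follow an exact power law, the correlation is 1 and the slope equals the exponent; dispersion around the line lowers the correlation.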
To quantify the dispersion in the scatter plot, we applied quantile regression [35,36], which fits lines to the top and bottom data points. Figure 3 depicts the top and bottom regression lines (in blue and red, respectively) and their slopes. These slopes are close for the whole dataset (Figure 3A) and the non-rumor topics (Figure 3B-I). In contrast, these slopes differ for the rumor topics (Figure 3J and K): 0.73 vs 0.59 and 0.83 vs 0.60 for “Cosmo Oil” and “Isodine”, respectively. This result indicates that the scatter plots of the rumor topics are more dispersed than those of the non-rumor topics, which is consistent with the correlation analysis. In addition, we found that the slopes of the general topic “do” (“suru” in Japanese) were smaller than those of the specific ones. Possible reasons are that 1) there is a huge difference in the number of tweets between “do” and the other topics, and 2) the slope tends to decrease when the number of tweets is large. Thus, the slope may not be useful for comparing two topics when the numbers of tweets in the topics are not comparable.
Next, we observe how the topic diversity of the rumor topics changed before and after the earthquake. Figure 4A shows the time course of the topic diversity of the rumor about “Cosmo Oil”. We can see that the topic was not popular before the earthquake. While the number of tweets increased dramatically just after the earthquake, the topic diversity did not increase much, showing that the topic burst with relatively homogeneous opinions, or low diversity. After the rumor correction was issued, both the number of tweets and the topic diversity increased, and the topic diversity grew higher than it was before the rumor correction. Figure 4B shows the time course of the topic diversity of the rumor about “Isodine”. We can see that the temporal pattern of the topic diversity is similar to that of “Cosmo Oil”. Interestingly, both rumor topics spread with low topic diversity and were corrected with high topic diversity.
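Quantile regression at level τ fits a line by minimizing the pinball loss, which weights positive residuals by τ and negative residuals by 1 − τ. The sketch below is a brute-force illustration for a single predictor: it relies on the fact that at least one optimal line passes through two data points, so it simply tries every pair. The quantile levels used for the top and bottom lines are not given above; the values 0.95 and 0.05 in the usage note are hypothetical.

```python
from itertools import combinations


def pinball_loss(residuals, tau):
    """Sum of tau * r for r >= 0 and (tau - 1) * r for r < 0."""
    return sum(tau * r if r >= 0 else (tau - 1) * r for r in residuals)


def quantile_line(xs, ys, tau):
    """Fit y = a + b * x at quantile level tau and return (a, b).

    Brute force over all pairs of points with distinct x; the cost grows
    cubically with the number of points, which is acceptable for a few
    hundred time windows but not for large data.
    """
    points = list(zip(xs, ys))
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = a + b * x
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        loss = pinball_loss([y - (a + b * x) for x, y in points], tau)
        if best is None or loss < best[0]:
            best = (loss, a, b)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2]
```

Applied to the log-transformed number of tweets and topic diversity, `quantile_line(x, y, 0.95)` and `quantile_line(x, y, 0.05)` would give an upper and a lower line whose slopes can be compared as in the text.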
In this section, we examine the runtime and the size of subtopics generated by the proposed method and existing methods.
We examine the runtime for processing the tweet dataset by the proposed method based on Data polishing (Section 3.3), and compare the performance with four existing methods for clustering subtopics: LDA, K-means, MeanShift, and Agglomerative clustering. LDA is the most popular topic model algorithm based on word frequency across documents. K-means, MeanShift, and Agglomerative clustering are general clustering algorithms that were applied to the word vectors (Section 3.1). Data polishing is a graph clustering approach and was applied to the tweet graph (Section 3.2). We used the Nysol Python implementation of Data polishing, the scikit-learn implementations of K-means, MeanShift, and Agglomerative clustering, and the LDA implementation in Python Gensim. All our experiments were performed on a 2018 13-inch MacBook Pro with a 2.7 GHz Intel Core i7 and 16 GB of 2133 MHz memory.
Figure 5 illustrates the performance of the algorithms as the number of tweets was increased from 2,000 to 30,000. These tweets were randomly extracted from the tweet data posted during 18:30-19:00 JST on March 11. The runtime was measured by executing each method three times and averaging the results. Note that we stopped the measurement if the runtime reached 2 hours (7,200 secs). While Data polishing and MeanShift can automatically determine the number of subtopics from the data, the other methods cannot. For these methods, the number of subtopics was set to the number that Data polishing determined. The numbers of subtopics estimated by Data polishing were smaller than those estimated by MeanShift. The result shows that Data polishing is more efficient than the existing algorithms, with LDA the second most efficient. The proposed method is five times faster than LDA for the sample size of 20,000 and at least six times faster for the sample size of 30,000, where the LDA run was terminated because its runtime exceeded 2 hours.
We randomly extracted 10,000 tweets posted during 18:30-19:00 JST on March 11, and analyzed the tweet data by using the proposed method and the existing methods (LDA, K-means, MeanShift, and Agglomerative clustering). Table 1 shows the number of tweets in the 5 largest subtopics. K-means and MeanShift each generated one large cluster (more than 14% of the total tweets) as the largest subtopic. Typically, these methods tend to produce one huge subtopic and many small subtopics, which often leads to a trivial clustering or makes the results difficult to interpret. In contrast, the largest subtopic generated by the proposed method (Data polishing) was the smallest among the methods (4% of the total tweets). Though the largest subtopics of LDA and Agglomerative clustering were smaller than those of K-means and MeanShift, these methods require specifying the number of subtopics ahead of time.
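The cluster-size comparison can be reproduced from any clustering output with a few lines. The sketch assumes that each tweet carries exactly one cluster label (for a topic model this would mean assigning each tweet to its most probable topic, which is an assumption not stated above); the function name is hypothetical.

```python
from collections import Counter


def top_cluster_sizes(labels, k=5):
    """Return (size, share) for the k largest clusters.

    labels holds one cluster label per tweet; share is the fraction of
    all tweets that fall into the cluster.
    """
    counts = Counter(labels)
    total = len(labels)
    return [(size, size / total) for _, size in counts.most_common(k)]


# Illustrative labels: three clusters of 5, 3 and 2 tweets.
sizes = top_cluster_sizes([0] * 5 + [1] * 3 + [2] * 2, k=2)
```

A share of the largest cluster close to 1 signals the trivial clustering discussed above, in which a single huge subtopic absorbs most of the tweets.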
In this paper, we have proposed a method for analyzing topic diversity using graph clustering. After generating the tweet graph based on the similarity between tweets, we use the Data polishing algorithm to obtain the clusters in the graph. We interpret a cluster as a subtopic and the number of clusters as the topic diversity, i.e., the number of subtopics in a target topic. This method was applied to a dataset of millions of tweets posted before and after the Great East Japan Earthquake of 2011. The proposed method is useful for detecting situations of low topic diversity, and rumor diffusion is associated with a burst of low topic diversity, or homogeneous opinions. We have confirmed that our approach outperformed other existing clustering approaches (LDA, K-means, MeanShift, and Agglomerative clustering) in running time and clustering results. Our method has significant applications in that it could, for example, alert companies or celebrities to newly circulating rumors and help them avoid bad publicity on social media.
The main limitation is that we focused on only 10 topics in this study. We have performed a case study of rumor diffusion on Twitter during a disaster rather than a general study of rumor diffusion. We are planning to apply the proposed method to another rumor dataset and systematically investigate temporal patterns in topic diversity towards the development of a practical rumor detection algorithm.
Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon.
What is twitter, a social network or a news media?
In Proceedings of the 19th international conference on World Wide Web, pages 591–600, 2010.
Kevin Dela Rosa, Rushin Shah, Bo Lin, Anatole Gershman, and Robert Frederking.
Topical clustering of tweets.
Proceedings of the ACM SIGIR: SWSM, 63, 2011.
Yasuko Matsubara, Yasushi Sakurai, B Aditya Prakash, Lei Li, and Christos
Rise and fall patterns of information diffusion: model and implications.
In Proceedings of the 18th ACM SIGKDD international conference on Knowledge Discovery and Data mining, pages 6–14, 2012.
Takako Hashimoto, Dave Shepard, Tetsuji Kuboyama, and Kilho Shin.
Event detection from millions of tweets related to the great east japan earthquake using feature selection technique.
In 2015 IEEE International Conference on Data Mining Workshop (ICDMW), pages 7–12. IEEE, 2015.
Keisuke Ikeda, Takeshi Sakaki, Fujio Toriumi, and Satoshi Kurihara.
An examination of a novel information diffusion model: Considering of twitter user and twitter system features.
In International Conference on Autonomous Agents and Multiagent Systems, pages 180–191. Springer, 2016.
Ryota Kobayashi and Renaud Lambiotte.
Tideh: Time-dependent hawkes process for predicting retweet dynamics.
In Tenth International AAAI Conference on Web and Social Media, 2016.
Przemyslaw A Grabowicz, Niloy Ganguly, and Krishna P Gummadi.
Distinguishing between topical and non-topical information diffusion mechanisms in social media.
In Tenth International AAAI Conference on Web and Social Media, 2016.
Hongshan Jin, Masashi Toyoda, and Naoki Yoshinaga.
Can cross-lingual information cascades be predicted on twitter?
In International Conference on Social Informatics, pages 457–472. Springer, 2017.
Julia Proskurnia, Przemyslaw Grabowicz, Ryota Kobayashi, Carlos Castillo, Philippe Cudré-Mauroux, and Karl Aberer.
Predicting the success of online petitions leveraging multidimensional time-series.
In Proceedings of the 26th International Conference on World Wide Web, pages 755–764, 2017.
Takako Hashimoto, Takeaki Uno, Tetsuji Kuboyama, Kilho Shin, and Dave Shepard.
Time series topic transition based on micro-clustering.
In 2019 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 1–8. IEEE, 2019.
David Ifeoluwa Adelani, Ryota Kobayashi, Ingmar Weber, and Przemyslaw A
Estimating community feedback effect on topic choice in social media with predictive modeling.
EPJ Data Science, 9(1):25, 2020.
David M Blei, Andrew Y Ng, and Michael I Jordan.
Latent dirichlet allocation.
Journal of machine Learning research, 3(Jan):993–1022, 2003.
Takeaki Uno, Hiroki Maegawa, Takanobu Nakahara, Yukinobu Hamuro, Ryo Yoshinaka, and Makoto Tatsuta.
Micro-clustering: Finding small clusters in large diversity.
arXiv preprint arXiv:1507.03067, 2015.
Takeaki Uno, Hiroki Maegawa, Takanobu Nakahara, Yukinobu Hamuro, Ryo Yoshinaka, and Makoto Tatsuta.
Micro-clustering by data polishing.
In 2017 IEEE International Conference on Big Data (Big Data), pages 1012–1018. IEEE, 2017.
David M Blei and John D Lafferty.
Dynamic topic models.
In Proceedings of the 23rd international conference on Machine learning, pages 113–120, 2006.
Yu Wang, Eugene Agichtein, and Michele Benzi.
Tm-lda: efficient online modeling of latent topic transitions in social media.
In Proceedings of the 18th ACM SIGKDD international conference on Knowledge Discovery and Data mining, pages 123–131, 2012.
Hristo Tanev, Maud Ehrmann, Jakub Piskorski, and Vanni Zavarella.
Enhancing event descriptions through twitter mining.
In Sixth International AAAI Conference on Weblogs and Social Media, 2012.
The filter bubble: What the Internet is hiding from you.
Penguin UK, 2011.
Seth Flaxman, Sharad Goel, and Justin M Rao.
Filter bubbles, echo chambers, and online news consumption.
Public opinion quarterly, 80(S1):298–320, 2016.
Beyond the bubble: Assessing the diversity of political search results.
Digital Journalism, 7(6):824–843, 2019.
Julia Stoyanovich, Ke Yang, and HV Jagadish.
Online set selection with fairness and diversity constraints.
In Proceedings of the EDBT Conference, 2018.
Maxim Charkov and Surabhi Gupta.
Re-ranking search results for location refining and diversity, July 11 2019.
US Patent App. 16/356,811.
Sejeong Kwon, Meeyoung Cha, Kyomin Jung, Wei Chen, and Yajun Wang.
Prominent features of rumor propagation in online social media.
In 2013 IEEE 13th International Conference on Data Mining, pages 1103–1108. IEEE, 2013.
Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga.
SemEval-2017 task 8: RumourEval: Determining rumour veracity and support for rumours.
In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 69–76, 2017.
Lahari Poddar, Wynne Hsu, Mong Li Lee, and Shruti Subramaniyam.
Predicting stances in twitter conversations for detecting veracity of rumors: A neural approach.
In 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pages 65–72. IEEE, 2018.
Jing Ma, Wei Gao, and Kam-Fai Wong.
Rumor detection on twitter with tree-structured recursive neural networks.
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1980–1989, 2018.
Zen Yoshida and Masayoshi Aritsugi.
Rumor detection in twitter with social graph structures.
In Third International Congress on Information and Communication Technology, pages 589–598. Springer, 2019.
Shenkai Gu, Ran Cheng, and Yaochu Jin.
Feature selection for high-dimensional classification using a competitive swarm optimizer.
Soft Computing, 22(3):811–822, 2018.
Adam P Piotrowski and Jaroslaw J Napiorkowski.
Some metaheuristics should be simplified.
Information Sciences, 427:32–62, 2018.
Mecab: Yet another part-of-speech and morphological analyzer.
http://mecab.sourceforge.jp, 2006.
The distribution of the flora in the alpine zone. 1.
New phytologist, 11(2):37–50, 1912.
Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan.
Introduction to information retrieval.
In Proceedings of the international communication of association for computing machinery conference, page 260, 2008.
Keita Nabeshima, Junta Mizuno, Naoaki Okazaki, and Kentaro Inui.
Mining false information on twitter for a major disaster situation.
In International Conference on Active Media Technology, pages 96–109, 2014.
Roger Koenker and Kevin F Hallock.
Journal of economic perspectives, 15(4):143–156, 2001.
David Arthur and Sergei Vassilvitskii.
K-means++: The advantages of careful seeding.
In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
Keinosuke Fukunaga and Larry Hostetler.
The estimation of the gradient of a density function, with applications in pattern recognition.
IEEE Transactions on information theory, 21(1):32–40, 1975.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
The elements of statistical learning: data mining, inference, and prediction.
Springer Science & Business Media, 2009.
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al.
Scikit-learn: Machine learning in python.
Journal of Machine Learning Research, 12:2825–2830, 2011.
Radim Řehůřek and Petr Sojka.
Software Framework for Topic Modelling with Large Corpora.
In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, 2010.
Quanzhi Li, Qiong Zhang, Luo Si, and Yingchi Liu.
Rumor detection on social media: Datasets, methods and opportunities.
In Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda, pages 66–75, 2019.