Dave Shepard
shepard.david@gmail.com

Tetsuji Kuboyama
ori-bigdata2019@tk.cc.gakushuin.ac.jp

Ryota Kobayashi
r-koba@edu.k.u-tokyo.ac.jp

Takeaki Uno
uno@nii.jp

Key words and phrases: social media analysis – topic extraction – graph clustering – community detection – Data polishing

Abstract:

During a disaster, social media can be both a source of help and of danger: social media has the potential to spread rumors, and officials involved in disaster mitigation must react quickly to the spread of rumors on social media. In this paper, we investigate how topic diversity (i.e., homogeneity of opinions in a topic) depends on the truthfulness of a topic (whether it is a rumor or a non-rumor), and how the topic diversity changes over time after a disaster. To do so, we develop a method for quantifying the topic diversity of tweet data based on text content. The proposed method is based on clustering a tweet graph using Data polishing, which automatically determines the number of subtopics. We perform a case study of tweets posted after the Great East Japan Earthquake on March 11, 2011. We find that rumor topics exhibit more homogeneity of opinions in a topic during diffusion than non-rumor topics. Furthermore, we evaluate the performance of our method and demonstrate its improvement in data-processing runtime over existing methods.

Introduction

After the Great East Japan Earthquake on March 11, 2011, Twitter users reacted quickly and discussed a variety of topics, both real and imaginary. An example is the rumor about an explosion at a petrochemical complex owned by Cosmo Oil Co., Ltd. Stories of oil tanks exploding and releasing harmful substances into the air caused widespread panic until an official government announcement was released on the following day. Social media has the potential to be a source of both help and trouble during disasters. Constructing strategies for disaster mitigation requires addressing issues that arise from social media as well.

Analysis and modeling of the popularity dynamics of online content has been an active area of research [1,2,3,4,5,6,7,8,9,10,11]. A popular method for extracting a topic is to collect all the tweets that mention a specific word (keyword) or hashtag and analyze their temporal patterns [1,2,7,10]. While this approach makes it easy to detect the emergence of topics, we can often intuitively identify various “sub-topics” within a topic. The diversity of the content may vary greatly depending on the topic.

We focus on subtopics obtained by clustering the extracted tweet data related to a keyword (i.e., a topic). We study the topic diversity, defined as the number of subtopics in a topic (discussed in more detail in Section 3.3). Topic models, including Latent Dirichlet Allocation (LDA) [12], are popular methods for discovering abstract “topics” from a collection of documents, and so have been applied to social media analysis for discovering topics or sub-topics [2,7,10,11]. In spite of their simplicity and usefulness, most topic model algorithms require users to specify the number of subtopics in advance. In this study, we cannot apply these models because we are interested in inferring the number of subtopics from data.

In this study, we investigate how topic diversity depends on the truthfulness of a topic (whether it is a rumor or a non-rumor), and how the topic diversity changes in time after a disaster. As a first step, we develop a method for quantifying the topic diversity of the tweet data based on text content. Our method is based on clustering a tweet graph using Data polishing [13,14] that automatically determines the number of subtopics. Then, the proposed method is applied to a Twitter dataset before and after the Great East Japan Earthquake of 2011. We find that the temporal patterns in topic diversity differ between rumor and non-rumor topics. Finally, we evaluate the performance of the method and compare its performance with several baselines.

The contributions of this paper are as follows:

- We propose a method for analyzing topic diversity (i.e., homogeneity of opinions in a topic) based on graph clustering.
- We compare topic diversity between rumor and non-rumor topics by applying the method to a Twitter dataset from before and after the Great East Japan Earthquake of 2011.
- We compare the performance of the proposed method to that of other existing methods.

This paper is organized as follows. Section 2 introduces related work. Section 3 describes our proposed method. In Section 4, we apply the proposed method to a large Twitter dataset, evaluate it, and compare its performance with other existing methods. Finally, in Section 5, we conclude and discuss a direction for future research.

Related Work

It is essential to discover a higher-level “topic” underlying online content in order to summarize collected data (e.g., tweets or hashtags) and find subtopics in it. There are three approaches to topic identification: 1) topic models, 2) clustering word features, and 3) graph clustering. Topic models extract a latent topic from the frequency of words in a tweet by using a probabilistic model. LDA [12] and its extensions [15,16] have been used to identify topics in social media data in a number of approaches [2,7]. These algorithms have the disadvantage of requiring the user to specify the number of topics as a parameter. Clustering word features classifies online content by applying clustering algorithms to feature vectors. In Rosa et al. [2], the authors developed a method for discovering a topic by applying the K-means algorithm to TF-IDF vectors calculated from the tweets and showed that their method performed better than LDA. Graph clustering classifies a large amount of online content by finding community structures. In Tanev et al. [17], the authors showed that clustering a word co-occurrence graph improved the performance of linking news events to tweets. While previous work identified the topic of each tweet and analyzed the temporal dynamics, our work additionally investigates the diversity of the content by calculating the number of topics. For this purpose, we exploit a graph clustering approach that does not require the number of topics to be specified in advance.

Recently, it has been argued that search engines and social networks potentially facilitate “filter bubble” effects, in which machine-learning algorithms amplify ideological segregation by recommending content targeted toward a user’s preexisting opinions or biases [18,19]. In Puschmann 2019, the author pointed out that a political party or political candidate is able to exert a great influence on search results, which decreases the diversity of the content people will see when they search for information about that candidate [20]. Some work has developed algorithms to address the diversity issue [21,22]. Interestingly, Stoyanovich et al. [21] proposed ranking algorithms that achieved the diversity and fairness of the results, usually with modest costs in terms of quality. While these works focus on the diversity of the input to a user on social media, we focus on the topic diversity of a population of tweets.

In this study, we find a difference in the temporal pattern of the topic diversity between rumor and non-rumor topics, which is applicable to rumor detection algorithms. Rumor detection is an active research area and several methods have been proposed [23,24,25,26,27]. While these methods are based on machine learning, our method is based on clustering a graph of word co-occurrence. Unfortunately, it is difficult to select the most relevant features in machine learning problems when there are a large number of feature variables [28,29]. The graph-based approach has three potential benefits: 1) it is simple to implement, 2) it is applicable to large data sets, and 3) the result is easy to interpret.

Finally, we discuss the relation between this work and our previous one [10]. In the previous work, we proposed a method for visualizing temporal topic transition based on graph clustering. Since the method was proposed for visualization, we demonstrated the topic transition of a few topics. We neither analyzed the temporal patterns quantitatively nor examined the performance of the method. In this work, the method has been extended to quantify the content diversity of a topic by using correlation analysis and quantile regression. We also compare the temporal patterns of the topic diversity between rumor and non-rumor topics. Furthermore, we evaluate the performance of the method and demonstrate its improvement on the runtime for data processing over existing methods.

Proposed Method

Our method consists of three steps (Figure 1): A. Construction of Tweet-Word Matrices, B. Tweet Graph Generation, C. Graph clustering using Data polishing.

**Figure 1:** Proposed Method. Our method consists of three steps. A. Construction of Tweet-Word Matrices. B. Tweet Graph Generation. C. Clustering Tweet Graph.
$\includegraphics[width=6in]{Fig1.pdf}$

Construction of Tweet-Word Matrices: Fig 1A

We analyze tweet data posted from 00:00 Japan Standard Time (JST) on March 11 to 0:00 JST on March 14, a total of 72 hours. The tweet data were divided into 144 ( $= 72 \div 0.5$ ) groups based on their posted time and a fixed time window of half an hour. We construct a sequence of Tweet-Word matrices, one for each window: $\langle W_1, W_2, \cdots, W_{144} \rangle$ . A Tweet-Word matrix ( $W$ ) denotes the constituent words of the tweets:

$$W = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{pmatrix}$$

where $m$ and $n$ are the number of tweets and words during a time window, respectively. The element $w_{ij}$ is 1 if the $i$-th tweet contains the $j$-th word, and 0 otherwise. We used the morphological analyzer MeCab [30] to perform Japanese word segmentation.
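As an illustration of this step, the following minimal sketch assigns a tweet to its half-hour window and builds a binary Tweet-Word matrix. It assumes the tweets have already been segmented into words (for example by MeCab); the function names `window_index` and `tweet_word_matrix` and the toy tokens are hypothetical and are not taken from the original implementation.

```python
from datetime import datetime, timedelta

# Assumed start of the 72-hour analysis period (00:00 JST on March 11, 2011).
START = datetime(2011, 3, 11, 0, 0)
WINDOW = timedelta(minutes=30)


def window_index(posted_at):
    """Return the 0-based index (0..143) of the half-hour window of a tweet."""
    return int((posted_at - START) / WINDOW)


def tweet_word_matrix(tokenized_tweets):
    """Build a binary m x n matrix: entry (i, j) is 1 if tweet i contains word j."""
    vocab = sorted({word for tokens in tokenized_tweets for word in tokens})
    column = {word: j for j, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in tokenized_tweets]
    for i, tokens in enumerate(tokenized_tweets):
        for word in tokens:
            matrix[i][column[word]] = 1  # repeated words still give 1
    return vocab, matrix


# Toy example with pre-segmented tweets (hypothetical tokens).
vocab, W = tweet_word_matrix([["oil", "tank"], ["tank", "fire", "tank"]])
print(vocab)  # ['fire', 'oil', 'tank']
print(W)      # [[0, 1, 1], [1, 0, 1]]
```

A nested list is used here only to keep the sketch dependency-free; a sparse matrix representation would likely be preferable for windows containing many tweets.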

Tweet Graph Generation: Fig 1B

We generate a tweet graph from the tweet data in each time window. We define a tweet graph as an undirected graph in which a node represents a tweet and an edge indicates that the connected tweets are similar (Figure 1B): if the Jaccard coefficient [31] of two tweets is larger than the edge threshold $\theta_E$ , these nodes (tweets) are connected by an edge. The threshold was set to $\theta_E= 0.3$ .

The main result (Figure 3) is robust to a small change of the threshold: the result did not change qualitatively for a different threshold, $\theta_E= 0.2$ .
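The edge rule can be sketched as follows. This is a minimal illustration that assumes each tweet is represented by the set of words it contains (one row of the Tweet-Word matrix); the helper names `jaccard` and `build_tweet_graph` are hypothetical, and the all-pairs loop is chosen for readability rather than efficiency.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two word sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_tweet_graph(tweet_words, theta_e=0.3):
    """Connect tweets i and j when their Jaccard coefficient exceeds theta_e.

    Returns an adjacency mapping: tweet index -> set of neighboring indices.
    """
    adj = {i: set() for i in range(len(tweet_words))}
    for i, j in combinations(range(len(tweet_words)), 2):
        if jaccard(tweet_words[i], tweet_words[j]) > theta_e:
            adj[i].add(j)
            adj[j].add(i)
    return adj


tweets = [
    {"oil", "tank", "fire"},
    {"oil", "tank", "explosion"},
    {"go", "home"},
]
print(build_tweet_graph(tweets))  # {0: {1}, 1: {0}, 2: set()}
```

The first two tweets share two of four distinct words (Jaccard coefficient 0.5), so they are connected at $\theta_E = 0.3$, while the third tweet remains isolated.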

Clustering Tweet Graph: Fig 1C

Here we briefly describe the Data polishing algorithm for clustering a tweet graph (see [13,14] for more detail). This algorithm iteratively increases the density of dense subgraphs and makes sparse subgraphs sparser. As a consequence, we obtain a graph whose dense subgraphs are all cliques, which can thus easily be enumerated by a maximal clique enumeration algorithm.

The iteration is described as follows. For two arbitrary vertices $u$ and $v$ , we consider the condition

$$\left\vert N[u] \cap N[v]\right\vert / \left\vert N[u] \cup N[v]\right\vert > \theta_P.$$

This intuitively means that $u$ and $v$ belong to the same cluster, since the neighbor sets of $u$ and $v$ must overlap considerably when they belong to the same cluster. We construct a new graph by using this condition so that two vertices are connected when this condition is met in the original graph. In other words, two vertices are connected in the new graph if and only if the vertices seem to belong to the same cluster. Data polishing applies this graph reconstruction iteratively until the graph does not change. Maximal clique enumeration is then performed with an algorithm such as MACE [32] to obtain the clusters of the resulting graph. The threshold $\theta_P$ was set to $0.2$ . The threshold value affects the clustering result: the size of clusters increases as the threshold $\theta_P$ decreases.
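The iteration can be sketched in a few lines. The sketch below makes several assumptions that are not stated in the text: $N[u]$ is taken to be the closed neighborhood (the vertex itself plus its adjacent vertices), all vertex pairs are tested by brute force, and a plain Bron-Kerbosch recursion stands in for MACE. The function names are hypothetical, and the toy example uses $\theta_P = 0.4$ rather than $0.2$ because the graph is very small.

```python
from itertools import combinations


def polish_once(adj, theta_p):
    """One reconstruction step: connect u, v iff |N[u] & N[v]| / |N[u] | N[v]| > theta_p."""
    closed = {u: adj[u] | {u} for u in adj}  # assumed closed neighborhoods N[u]
    new_adj = {u: set() for u in adj}
    for u, v in combinations(adj, 2):
        overlap = len(closed[u] & closed[v]) / len(closed[u] | closed[v])
        if overlap > theta_p:
            new_adj[u].add(v)
            new_adj[v].add(u)
    return new_adj


def data_polish(adj, theta_p=0.2, max_iter=100):
    """Repeat the reconstruction until the graph stops changing (or max_iter is hit)."""
    for _ in range(max_iter):
        new_adj = polish_once(adj, theta_p)
        if new_adj == adj:
            break
        adj = new_adj
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with a basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy graph: two groups of four tweets joined by one bridge edge (3-4);
# the first group is missing the edge 0-1.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)]
graph = {u: set() for u in range(8)}
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

polished = data_polish(graph, theta_p=0.4)
clusters = maximal_cliques(polished)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(len(clusters))                        # number of subtopics: 2
```

In this toy graph the polishing step adds the missing edge inside the first group and removes the bridge, so each group becomes a clique and the number of maximal cliques gives the number of subtopics.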

In this paper, we analyze the tweet graph related to a topic (e.g., a non-rumor topic such as “I’m OK”, or a rumor topic). Each resulting cluster is interpreted as a subtopic, and the number of clusters is interpreted as the diversity of the topic (topic diversity). For instance, suppose that there are two topics, where Topic A consists of 30 subtopics (clusters) and Topic B consists of 5 subtopics (Figure 2). In this case, we interpret Topic A as more diverse than Topic B.

**Figure 2:** Interpretation of the number of the clusters (# of clusters).
$\includegraphics[width=6in]{Fig2.pdf}$

Experiments

We first describe the Twitter dataset. Second, we analyze the temporal pattern of the topic diversity for rumor and non-rumor topics. Finally, we compare the performance of the proposed method with the existing clustering methods.

Dataset

Our data set consists of tweets posted around the time of the Great East Japan Earthquake, which occurred at 14:46 JST on March 11, 2011. This dataset was obtained from the social media monitoring company Hottolink, Inc. [33], which tracked users who used one of 43 hashtags (for example, #jishin, #nhk, and #prayforjapan) or one of 21 keywords related to the disaster. Later, Hottolink collected all tweets posted by all of these users between March 9th (2 days prior to the earthquake) and March 29th. The total number of tweets is around 200 million, which makes this one of the largest data sets of users’ responses to a disaster. We focused on the data from 00:00 JST on March 11 to 0:00 JST on March 14, a total of 72 hours.

We picked out 10 topics that contain the following keywords: “do” (“suru” in Japanese), “I’m OK”, “go home”, “important”, “safe”, “damage”, “Fukushima”, “Miyagi”, “Cosmo Oil”, and “Isodine”. The collected topics consist of eight non-rumor topics (“do”, “I’m OK”, “go home”, “important”, “safe”, “damage”, “Fukushima”, and “Miyagi”) and two rumor topics (“Cosmo Oil” and “Isodine”). Note that the word “suru” (an auxiliary verb often translated as “do” or ignored when translating to English) is a common function word; a user’s usage of “suru” does not depend on the topic that they are tweeting about. Analyzing the usage of “suru” therefore gives us a baseline for a general topic. The other words were heavily used in tweets after the Great East Japan Earthquake. In particular, the two rumor topics (“Cosmo Oil” and “Isodine”) were related to well-known rumors that spread after the earthquake [34]. A detailed description of the rumors follows.

“Cosmo Oil”: An explosion at the Cosmo Oil plant released harmful substances into the air.
“Cosmo Oil” is the name of a Japanese oil company. The rumor about the oil tank explosion spread and frightened people. The rumor topic about the explosion at the Cosmo Oil petrochemical complex progressed through four stages:

Fact: Around 15:00 JST on March 11 (just after the quake), the petrochemical complex in Chiba caught fire.
Rumor: Around 19:00 JST on March 11, the following two tweets were posted and retweeted:
- Radiation and harmful chemicals are leaking into the air from the petrochemical complex. Be careful!
- Don’t go out! The rain contains radiation and harmful materials from the petrochemical complex explosion.
Correction: Around 15:00 JST on March 12 (the day after the earthquake), the industry’s website and the local government’s Twitter account officially corrected the rumor.
Disappearance: At night on March 12, the topic disappeared.

“Isodine”: Isodine was good for protecting from radiation.
“Isodine” is the brand name of a mouthwash that includes iodopovidone. A rumor emerged that isodine protects people from radiation. It progressed in the following four stages:

Fact: After the nuclear plant explosion occurred, Twitter users expressed fear of radioactive contamination.
Rumor: Around 7:00 JST on March 12 (the day after the earthquake), the rumors about isodine’s protective benefits emerged.
Correction: Around 15:00 JST on March 12 (the day after the earthquake), the government and isodine’s manufacturer corrected the rumor.
Disappearance: At night on March 12, the topic gradually disappeared.

Analyzing Temporal Patterns of Topic Diversity

We examine the temporal patterns of topic diversity of tweets including the 10 keywords (i.e., topics) from 0:00 JST on March 11 to 0:00 JST on March 14. We calculated the topic diversity for each 30-minute window by the following procedure. First, the Tweet-Word matrices were constructed from the collected tweets (Section 3.1). Second, the tweet graph was generated (Section 3.2), and the topic diversity (i.e., the number of subtopics) was obtained by applying the Data polishing algorithm to each graph (Section 3.3).

Figure 3 shows log-log scatter plots of the number of tweets and the topic diversity for the 10 topics and for the whole dataset. Each circle represents the number of tweets and the topic diversity in a time window. The color of a circle represents the time: the color changes from white to dark blue as time passes. For the whole dataset (Figure 3A) and the non-rumor topics (Figure 3B-I), we observe that the topic diversity is highly correlated with the number of tweets (the Pearson correlation coefficient ranged from 0.953 to 0.996). This result suggests a power-law relationship between the topic diversity and the number of tweets. For the rumor topics (Figure 3J and K), we observe that the points are scattered more widely around the linear fit than for the non-rumor ones. The topic diversity is less correlated with the number of tweets (Pearson correlation coefficients of 0.958 and 0.938 for “Cosmo Oil” and “Isodine”, respectively).
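A straight line on a log-log plot, $\log y = a + b \log x$, corresponds to a power law $y = 10^{a} x^{b}$. The following sketch shows how the correlation and the slope could be computed for one topic. It assumes that the correlation is taken on log-transformed counts and that windows with zero tweets or zero subtopics are excluded beforehand; neither detail is specified in the text, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def loglog_fit(n_tweets, n_subtopics):
    """Return (correlation, slope, intercept) of a least-squares fit in log10 space."""
    lx = [math.log10(v) for v in n_tweets]
    ly = [math.log10(v) for v in n_subtopics]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
             / sum((x - mx) ** 2 for x in lx))
    return pearson(lx, ly), slope, my - slope * mx


# Synthetic windows that follow an exact power law: subtopics = 2 * tweets^0.8.
tweets_per_window = [10, 100, 1000, 10000]
subtopics_per_window = [2 * n ** 0.8 for n in tweets_per_window]
r, slope, intercept = loglog_fit(tweets_per_window, subtopics_per_window)
print(round(r, 3), round(slope, 3), round(intercept, 3))
```

For this synthetic input the fitted slope recovers the exponent 0.8 and the intercept recovers $\log_{10} 2$, up to floating-point error.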

To quantify the dispersion in the scatter plot, we applied quantile regression [35,36] to fit the top and bottom $5\%$ of the data points. Figure 3 depicts the top and bottom $5\%$ regression lines (in blue and red, respectively) and their slopes. These slopes are close to each other for the whole dataset (Figure 3A) and the non-rumor topics (Figure 3B-I). In contrast, the slopes differ for the rumor topics (Figure 3J and K): 0.73 vs 0.59 and 0.83 vs 0.60 for “Cosmo Oil” and “Isodine”, respectively. This result indicates that the scatter plots of the rumor topics are more dispersed than those of the non-rumor topics, which is consistent with the correlation analysis. In addition, we found that the slopes of the general topic “do” (“suru” in Japanese) were smaller than those of the specific topics. Possible reasons are that 1) there is a huge difference in the number of tweets between “do” and the other topics, and 2) the slope tends to decrease when the number of tweets is large. Thus, the slope may not be useful for comparing two topics when the numbers of tweets in the topics are not comparable.
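Quantile regression fits a line by minimizing the pinball (check) loss, which penalizes points above and below the line asymmetrically. The sketch below is not the quantreg routine cited in the text; it is a dependency-free, brute-force illustration that relies on the known property that an optimal line for a single predictor can be chosen to pass through two data points. The function names and the synthetic fan-shaped data are hypothetical, and the inputs would be the log-transformed counts.

```python
from itertools import combinations


def pinball_loss(residuals, q):
    """Check loss: weight q for points above the line, (1 - q) for points below."""
    return sum(q * r if r >= 0 else (q - 1) * r for r in residuals)


def quantile_line(xs, ys, q):
    """Return (slope, intercept) of a q-th quantile regression line.

    Brute force over all lines through two data points; cubic in the number
    of points, which is acceptable for a few hundred time windows.
    """
    points = list(zip(xs, ys))
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical line, not a valid regression line
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = pinball_loss([y - (intercept + slope * x) for x, y in points], q)
        if best is None or loss < best[0]:
            best = (loss, slope, intercept)
    return best[1], best[2]


# Fan-shaped synthetic data: a lower envelope y = x and an upper envelope y = 3x.
xs = list(range(1, 11)) * 2
ys = [x for x in range(1, 11)] + [3 * x for x in range(1, 11)]
top_slope, _ = quantile_line(xs, ys, 0.95)
bottom_slope, _ = quantile_line(xs, ys, 0.05)
print(top_slope, bottom_slope)  # 3.0 1.0
```

When the two slopes are close, the points form a narrow band around a single line; a large gap between them, as in this synthetic example, indicates the kind of dispersion reported for the rumor topics.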

**Figure 3:** The number of tweets (# of Tweets) vs the number of subtopics (# of Subtopics) for the whole dataset (A) and for the 10 keywords (B-K). The slopes of the top 5% and bottom 5% regression lines are shown in blue and red, respectively.
$\includegraphics[width=6in]{Fig3.pdf}$

Next, we observe how the topic diversity of the rumor topics changed before and after the earthquake. Figure 4A shows the time course of the topic diversity of the rumor about “Cosmo Oil”. We can see that the topic was not popular before the earthquake. While the number of tweets increased dramatically just after the earthquake, the topic diversity did not increase much, showing that the topic burst with relatively homogeneous opinions, or low diversity. After the rumor correction was issued, both the number of tweets and the topic diversity increased, and the topic diversity grew higher than it was before the rumor correction. Figure 4B shows the time course of the topic diversity of the rumor about “Isodine”. We can see that the temporal pattern of the topic diversity is similar to that of “Cosmo Oil”. Interestingly, both rumor topics spread with low topic diversity and were corrected with high topic diversity.

**Figure 4:** Time course of Topic Diversity. Cosmo Oil (A) and Isodine (B).
$\includegraphics[width=6in]{Diversity3.pdf}$

Performance Evaluation

In this section, we examine the runtime and the size of subtopics generated by the proposed method and existing methods.

Runtime

We examine the runtime for processing the tweet dataset by the proposed method based on Data polishing (Section 3.3), and compare the performance with four existing methods for clustering subtopics: LDA [12], K-means [37], MeanShift [38], and Agglomerative clustering [39]. LDA is the most popular topic model algorithm based on word frequency across documents. K-means, MeanShift, and Agglomerative clustering are general clustering algorithms that were applied to the word vectors $\vec{w}_j= (w_{j1}, w_{j2}, \cdots, w_{jn})$ ( $j= 1, 2, \cdots , m$ ) (Section 3.1). Data polishing is a graph clustering approach and was applied to the tweet graph (Section 3.2). We used the Nysol Python [40] implementation of Data polishing, the scikit-learn [41] implementations of K-means, MeanShift, and Agglomerative clustering, and the LDA implementation in Python Gensim [42]. All our experiments were performed on a 2018 13-inch MacBook Pro with a 2.7 GHz Intel Core i7 and 16 GB of 2133 MHz memory.

Figure 5 illustrates the performance of the algorithms as the number of tweets was increased from 2,000 to 30,000. These tweets were randomly extracted from the tweet data posted during 18:30-19:00 JST on March 11. The runtime was measured by executing each method three times and averaging the results. Note that we stopped the measurement if the runtime reached 2 hours (7,200 secs). While Data polishing and MeanShift can automatically determine the number of subtopics from the data, the other methods cannot. For these methods, the number of subtopics was set to the number that Data polishing determined. The numbers of subtopics estimated by Data polishing were smaller than those estimated by MeanShift. The result shows that Data polishing is more efficient than the existing algorithms. The second most efficient one is LDA. The proposed method is five times faster than LDA for the sample size of 20,000 and at least six times faster for the sample size of 30,000, where the LDA run was terminated because its runtime exceeded 2 hours.
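The averaging procedure can be sketched with a small timing helper. This is only an illustration of the measurement protocol: the helper name `average_runtime` is hypothetical, and the 2-hour cutoff used in the experiment is not implemented here.

```python
import time


def average_runtime(func, *args, repeats=3, **kwargs):
    """Run func(*args, **kwargs) `repeats` times and return the mean wall-clock seconds."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


# Example: time a stand-in workload; a clustering call would be passed the same way.
seconds = average_runtime(sorted, list(range(100000, 0, -1)))
print(f"mean runtime over 3 runs: {seconds:.4f} s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.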

**Figure 5:** Performance Evaluation. The runtime of the proposed method (Data Polishing: blue) was compared with four existing methods (MeanShift: orange, K-means: black, Agglomerative: green, and LDA: yellow). We stopped the measurement if the runtime reached 2 hours (7,200 secs).
$\includegraphics[width=6in]{RunTime.pdf}$

Number of Tweets in Top 5 Largest Subtopics

We randomly extracted 10,000 tweets posted during 18:30-19:00 JST on March 11, and analyzed the tweet data by using the proposed method and the existing methods (LDA, K-means, MeanShift, and Agglomerative clustering). Table 1 shows the number of tweets in the top 5 largest subtopics. K-means and MeanShift each generated one large cluster (more than 14 % of the total tweets) as the largest subtopic. Typically, these methods tend to result in one huge subtopic and many small subtopics, which often leads to a trivial clustering or makes the results difficult to interpret. In contrast, the largest subtopic generated by the proposed method (Data polishing) was the smallest among the methods ( $\sim$ 4 % of the total tweets). Though the largest subtopics of LDA and Agglomerative clustering were smaller than those of K-means and MeanShift, these methods require specifying the number of subtopics ahead of time.
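The quantities compared in this subsection can be computed from any clustering result. The following sketch assumes each subtopic is given as a collection of tweet identifiers; the function names and the toy clusters are hypothetical.

```python
def top_k_subtopic_sizes(clusters, k=5):
    """Return the sizes of the k largest clusters, in decreasing order."""
    return sorted((len(c) for c in clusters), reverse=True)[:k]


def largest_share(clusters, n_tweets):
    """Fraction of all tweets that fall into the largest cluster."""
    return max(len(c) for c in clusters) / n_tweets


# Toy result: four subtopics found among 20 tweets (some tweets are unclustered).
clusters = [{0, 1, 2, 3, 4, 5}, {6, 7, 8}, {9, 10}, {11, 12, 13, 14}]
print(top_k_subtopic_sizes(clusters, k=3))  # [6, 4, 3]
print(largest_share(clusters, n_tweets=20))  # 0.3
```

A large share for the top-ranked cluster signals the "one huge subtopic" pattern described above, whereas a small share indicates more evenly sized subtopics.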

**Table 1:** The number of tweets in the top 5 largest subtopics. Data polishing (proposed method) and MeanShift, together with their results, are shown in bold because these methods can automatically determine the number of subtopics.
$\begin{tabular}[t]{c\vert c\vert c\vert c\vert c\vert c} Rank & {\bf Data polis... ... {\bf 28} & 63 \\ 5 & {\bf 123} & 172 & 145 & {\bf 23} & 59 \\ \end{tabular}$

Conclusion

In this paper, we have proposed a method for analyzing topic diversity using graph clustering. After generating the tweet graph based on the similarity between tweets, we use the Data polishing algorithm to obtain the clusters in the graph. We interpret a cluster as a subtopic and the number of clusters as the topic diversity, i.e., the number of subtopics in a target topic. This method was applied to a dataset of millions of tweets posted before and after the Great East Japan Earthquake of 2011. The proposed method is useful for detecting situations of low topic diversity, and we found that rumor diffusion is associated with a burst of tweets with low topic diversity, or homogeneous opinions. We have confirmed that our approach outperformed existing clustering approaches (LDA, K-means, MeanShift, and Agglomerative clustering) in running time and in the balance of the resulting cluster sizes. Our method has significant applications in that it could, for example, alert companies or celebrities to newly circulating rumors and help them avoid bad publicity on social media.

The main limitation is that we focused on only 10 topics in this study. We have performed a case study of rumor diffusion on Twitter during a disaster rather than a general study of rumor diffusion. We are planning to apply the proposed method to another rumor dataset [43] and systematically investigate temporal patterns in topic diversity towards the development of a practical rumor detection algorithm.

Acknowledgements

This work was partially supported by JST CREST JPMJCR1401, JST ACT-I JPMJPR16UC, JST PRESTO JPMJPR1925, JSPS KAKENHI JP17H03279, JP18K11560, JP19H01133, JP19K12125, JP18K11443 and JP17H00762.

Bibliography

1: Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon.
What is twitter, a social network or a news media?
In Proceedings of the 19th international conference on World Wide Web, pages 591–600, 2010.
2: Kevin Dela Rosa, Rushin Shah, Bo Lin, Anatole Gershman, and Robert Frederking.
Topical clustering of tweets.
Proceedings of the ACM SIGIR: SWSM, 63, 2011.
3: Yasuko Matsubara, Yasushi Sakurai, B Aditya Prakash, Lei Li, and Christos Faloutsos.
Rise and fall patterns of information diffusion: model and implications.
In Proceedings of the 18th ACM SIGKDD international conference on Knowledge Discovery and Data mining, pages 6–14, 2012.
4: Takako Hashimoto, Dave Shepard, Tetsuji Kuboyama, and Kilho Shin.
Event detection from millions of tweets related to the great east japan earthquake using feature selection technique.
In 2015 IEEE International Conference on Data Mining Workshop (ICDMW), pages 7–12. IEEE, 2015.
5: Keisuke Ikeda, Takeshi Sakaki, Fujio Toriumi, and Satoshi Kurihara.
An examination of a novel information diffusion model: Considering of twitter user and twitter system features.
In International Conference on Autonomous Agents and Multiagent Systems, pages 180–191. Springer, 2016.
6: Ryota Kobayashi and Renaud Lambiotte.
Tideh: Time-dependent hawkes process for predicting retweet dynamics.
In Tenth International AAAI Conference on Web and Social Media, 2016.
7: Przemyslaw A Grabowicz, Niloy Ganguly, and Krishna P Gummadi.
Distinguishing between topical and non-topical information diffusion mechanisms in social media.
In Tenth International AAAI Conference on Web and Social Media, 2016.
8: Hongshan Jin, Masashi Toyoda, and Naoki Yoshinaga.
Can cross-lingual information cascades be predicted on twitter?
In International Conference on Social Informatics, pages 457–472. Springer, 2017.
9: Julia Proskurnia, Przemyslaw Grabowicz, Ryota Kobayashi, Carlos Castillo, Philippe Cudré-Mauroux, and Karl Aberer.
Predicting the success of online petitions leveraging multidimensional time-series.
In Proceedings of the 26th International Conference on World Wide Web, pages 755–764, 2017.
10: Takako Hashimoto, Takeaki Uno, Tetsuji Kuboyama, Kilho Shin, and Dave Shepard.
Time series topic transition based on micro-clustering.
In 2019 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 1–8. IEEE, 2019.
11: David Ifeoluwa Adelani, Ryota Kobayashi, Ingmar Weber, and Przemyslaw A Grabowicz.
Estimating community feedback effect on topic choice in social media with predictive modeling.
EPJ Data Science, 9(1):25, 2020.
12: David M Blei, Andrew Y Ng, and Michael I Jordan.
Latent dirichlet allocation.
Journal of machine Learning research, 3(Jan):993–1022, 2003.
13: Takeaki Uno, Hiroki Maegawa, Takanobu Nakahara, Yukinobu Hamuro, Ryo Yoshinaka, and Makoto Tatsuta.
Micro-clustering: Finding small clusters in large diversity.
arXiv preprint arXiv:1507.03067, 2015.
14: Takeaki Uno, Hiroki Maegawa, Takanobu Nakahara, Yukinobu Hamuro, Ryo Yoshinaka, and Makoto Tatsuta.
Micro-clustering by data polishing.
In 2017 IEEE International Conference on Big Data (Big Data), pages 1012–1018. IEEE, 2017.
15: David M Blei and John D Lafferty.
Dynamic topic models.
In Proceedings of the 23rd international conference on Machine learning, pages 113–120, 2006.
16: Yu Wang, Eugene Agichtein, and Michele Benzi.
Tm-lda: efficient online modeling of latent topic transitions in social media.
In Proceedings of the 18th ACM SIGKDD international conference on Knowledge Discovery and Data mining, pages 123–131, 2012.
17: Hristo Tanev, Maud Ehrmann, Jakub Piskorski, and Vanni Zavarella.
Enhancing event descriptions through twitter mining.
In Sixth International AAAI Conference on Weblogs and Social Media, 2012.
18: Eli Pariser.
The filter bubble: What the Internet is hiding from you.
Penguin UK, 2011.
19: Seth Flaxman, Sharad Goel, and Justin M Rao.
Filter bubbles, echo chambers, and online news consumption.
Public opinion quarterly, 80(S1):298–320, 2016.
20: Cornelius Puschmann.
Beyond the bubble: Assessing the diversity of political search results.
Digital Journalism, 7(6):824–843, 2019.
21: Julia Stoyanovich, Ke Yang, and HV Jagadish.
Online set selection with fairness and diversity constraints.
In Proceedings of the EDBT Conference, 2018.
22: Maxim Charkov and Surabhi Gupta.
Re-ranking search results for location refining and diversity, July 11 2019.
US Patent App. 16/356,811.
23: Sejeong Kwon, Meeyoung Cha, Kyomin Jung, Wei Chen, and Yajun Wang.
Prominent features of rumor propagation in online social media.
In 2013 IEEE 13th International Conference on Data Mining, pages 1103–1108. IEEE, 2013.
24: Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga.
SemEval-2017 task 8: RumourEval: Determining rumour veracity and support for rumours.
In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 69–76, 2017.
25: Lahari Poddar, Wynne Hsu, Mong Li Lee, and Shruti Subramaniyam.
Predicting stances in twitter conversations for detecting veracity of rumors: A neural approach.
In 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pages 65–72. IEEE, 2018.
26: Jing Ma, Wei Gao, and Kam-Fai Wong.
Rumor detection on twitter with tree-structured recursive neural networks.
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1980–1989, 2018.
27: Zen Yoshida and Masayoshi Aritsugi.
Rumor detection in twitter with social graph structures.
In Third International Congress on Information and Communication Technology, pages 589–598. Springer, 2019.
28: Shenkai Gu, Ran Cheng, and Yaochu Jin.
Feature selection for high-dimensional classification using a competitive swarm optimizer.
Soft Computing, 22(3):811–822, 2018.
29: Adam P Piotrowski and Jaroslaw J Napiorkowski.
Some metaheuristics should be simplified.
Information Sciences, 427:32–62, 2018.
30: Taku Kudo.
Mecab: Yet another part-of-speech and morphological analyzer.
http://mecab.sourceforge.jp, 2006.
31: Paul Jaccard.
The distribution of the flora in the alpine zone. 1.
New phytologist, 11(2):37–50, 1912.
32: Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan.
Introduction to information retrieval.
In Proceedings of the international communication of association for computing machinery conference, page 260, 2008.
33: Hottolink, inc.
http://www.hottolink.co.jp/english/, 2020.
34: Keita Nabeshima, Junta Mizuno, Naoaki Okazaki, and Kentaro Inui.
Mining false information on twitter for a major disaster situation.
In International Conference on Active Media Technology, pages 96–109, 2014.
35: Roger Koenker and Kevin F Hallock.
Quantile regression.
Journal of economic perspectives, 15(4):143–156, 2001.
36: Quantile regression.
https://cran.r-project.org/web/packages/quantreg/quantreg.pdf, 2019.
37: David Arthur and Sergei Vassilvitskii.
K-means++: The advantages of careful seeding.
In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
38: Keinosuke Fukunaga and Larry Hostetler.
The estimation of the gradient of a density function, with applications in pattern recognition.
IEEE Transactions on information theory, 21(1):32–40, 1975.
39: Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
The elements of statistical learning: data mining, inference, and prediction.
Springer Science & Business Media, 2009.
40: NYSOL Python.
https://www.nysol.jp/, 2020.
41: Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al.
Scikit-learn: Machine learning in python.
Journal of Machine Learning Research, 12:2825–2830, 2011.
42: Radim Řehůřek and Petr Sojka.
Software Framework for Topic Modelling with Large Corpora.
In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, 2010.
43: Quanzhi Li, Qiong Zhang, Luo Si, and Yingchi Liu.
Rumor detection on social media: Datasets, methods and opportunities.
In Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda, pages 66–75, 2019.