**The Beginning – In the beginning there was an Idea**

The idea of topic-transition modelling developed during a bus trip. The motivation was simple: to model the randomness in topic transitions and see whether it is generalizable. In short, are our ‘random talks’ really that random, and if they are, can they be modeled using available methods?

**What, Then, is a Conversation?**

No generally accepted definition of conversation exists, beyond the fact that a conversation involves at least two people talking together. Consequently, the term is often defined by what it is not. A ritualized exchange such as a mutual greeting is not a conversation, and an interaction that includes a marked status differential (such as a boss giving orders) is also not a conversation. An interaction with a tightly focused topic or purpose is also generally not considered a conversation. Summarizing these properties, one authority writes that “Conversation is the kind of speech that happens informally, symmetrically, and for the purposes of establishing and maintaining social ties.”

**Initial Idea – TwatterKeeper Archives**

Twatter Keeper was a very interesting tool that mimicked the way many people use Twitter searches. It let you specify a hashtag, searched Twitter for every occurrence of that hashtag, and saved the matching tweets for you: a very useful tool if you wanted to read all tweets about, say, a conference. You simply created a new archive with the relevant hashtag and waited for Twatter Keeper to bring you all the tweets with that tag. Sadly, it was shut down by Twitter around 2011. Had Twatter Keeper still been available, a lot of the time spent mining and cleaning Twitter data could have been saved. The R community developed many interesting generalizations around the Twatter Keeper data using the Gephi/igraph packages. (A very good example can be found here: http://www.r-bloggers.com/generating-graphs-of-retweets-and-messages-on-twitter-using-r-and-gephi/)

**Objective**

Our analysis focused on a temporal study of how conversations flow over time and how one thing leads to another. Formally stated, the objectives were:

1. to identify the latent topics that underlie a conversation, and

2. to study their distribution across the entire time span and arrive at stationary probabilities.

The probability distributions obtained could then be modeled to understand the transitions of the conversation from one topic to another. To achieve the objective, certain assumptions were made.

**Model Assumptions**

- Topic transition is a homogeneous Markov chain process. A Markov chain is a random process that undergoes transitions from one state to another on a state space. It must possess the property usually characterized as “memorylessness”: the probability distribution of the next state depends only on the current state, not on the sequence of events that preceded it. This specific kind of “memorylessness” is called the Markov property.

- A set of consecutive comments is a document
- For topic modeling, the order of words in a document is neglected (i.e., words are treated as exchangeable), and the order of documents in the corpus is also neglected.
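The memoryless transition process assumed above can be sketched in a few lines. The matrix below is purely illustrative (the project's own transition probabilities were estimated later from data, and the project code is in R; this sketch uses Python for brevity):

```python
import numpy as np

# Toy 3-topic transition matrix (illustrative numbers, not from the project):
# row i gives P(next topic | current topic i).
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
])

def simulate_chain(P, start, n_steps, rng):
    # Markov property: each next state is drawn using only the current state.
    states = [start]
    for _ in range(n_steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

rng = np.random.default_rng(0)
path = simulate_chain(P, start=0, n_steps=10, rng=rng)
```

Note that `simulate_chain` never looks at anything but the last state, which is exactly the homogeneous-Markov assumption made for topic transitions.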

**Data Collection**

Data was scraped from the national discussion forum https://mygov.in.

- Crawlers were built in Java.
- The data consists of comments from people across the country.
- The comments span a period of one year.

**Data Cleaning**

- Removal of Hindi words using a self-prepared dictionary.
- Correction of misspelled English words using the `aspell` function provided by the utils package in R.
- Extraction of different parts of speech (POS) using the `annotate` function provided by the NLP and openNLP packages in R.
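The first cleaning step (dictionary-based removal of Hindi words) can be sketched as follows. The project used a self-prepared word list in R; the entries and comment below are placeholders, and the sketch is in Python:

```python
# Minimal sketch of the dictionary-based cleaning step. The actual Hindi
# dictionary was self-prepared by the project; these entries are placeholders.
hindi_words = {"namaste", "dhanyavad"}

def remove_hindi(comment, blacklist):
    # Keep only tokens that are not in the Hindi word list.
    return " ".join(t for t in comment.split() if t.lower() not in blacklist)

cleaned = remove_hindi("namaste please improve roads", hindi_words)
# cleaned == "please improve roads"
```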

**Topic Modeling – Latent Dirichlet Allocation (LDA)**

We selected Latent Dirichlet Allocation (LDA) as our primary algorithm for modeling the inherent topics in a corpus of comments scraped online. LDA is essentially a hierarchical Bayesian framework for factorizing a document-term matrix into a matrix of topic distributions over words and a matrix of document distributions over topics. From this perspective LDA behaves like a latent factor model in which topics act as latent factors. A topic, as modeled by LDA, is a cluster of words representing a coherent idea, based on the words' unique pattern of co-occurrence in the corpus of documents. Our basic reason for choosing LDA is its unique ability to find clusters of words connected by a cohesive thematic content in the given corpus. LDA acts on a document-term matrix and requires the number of topics to be specified in advance. The most important outputs from the model are a distribution over words for each topic and a distribution over topics for each document in the corpus. We suspend further discussion of LDA for brevity and direct interested readers to the original paper by David M. Blei, Andrew Y. Ng, and Michael I. Jordan.

# running LDA with 10 topics on the document-term matrix

`dtm_LDA <- LDA(dtm2, k = 10)  # LDA() from the topicmodels package`

**Topic Labeling : Need for a Dictionary**

Topics generated by LDA are just collections of words. For the topics to make sense to a human, they have to be mapped to real-world topics. This exercise can also be thought of as recognizing the direction in which the collection of words in a topic inherently points.

**Methodology : Forming dictionary**

This necessitated forming a dictionary to serve as a yardstick for categorizing the topics in terms of real-world topics. To form the dictionary, we first parsed PDFs on the major reference subjects into a set of documents and ran LDA on them to produce topics. The words of each topic were then sorted by probability and the top 1000 were chosen. Doing this over a variety of subjects gave us a comparative frame of reference.

**Forming the dictionary : Pitfalls to avoid**

LDA reports its probability outputs as log transforms, which have to be converted back to probabilities before being sorted in descending order and used. Some topic distributions may be close to uniform, making the cut-off threshold (here, 1000 words) difficult to decide. It is better to first inspect a distribution visually and then select a threshold for the number of representative words in a topic. As a general principle, thick-tailed distributions should be avoided: texts and the number of topics should be chosen so as to obtain a “sharp” word probability distribution.
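The log-to-probability conversion and top-k selection described above amount to the following (a Python sketch with toy words and weights; the project did this in R with a cut-off of 1000 words):

```python
import numpy as np

# Toy per-topic word weights on the log scale (LDA implementations commonly
# report log probabilities); the words and values are illustrative only.
words = ["tax", "reform", "budget", "the"]
log_probs = np.log([0.5, 0.3, 0.15, 0.05])

probs = np.exp(log_probs)                  # back to the probability scale
order = np.argsort(probs)[::-1]            # sort descending
top_words = [words[i] for i in order[:3]]  # keep the top-k (project used 1000)
```

Inspecting `probs` before choosing the cut-off is where a near-uniform (thick-tailed) distribution would be caught.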

Running LDA over a number of texts and performing the subsequent “cleaning” should be done on a system with competitive hardware, so as to meet the computationally intensive demands of the process.

**Topic Transition Modeling**

**Markov Property**

The next piece of the puzzle was to model the temporal relationship between topics once we had compressed snippets of a conversation into logical topics. We started this task guided by a number of assumptions and the tools available off-hand in R. Our basic assumption was that at any point in the conversation, the future state depends solely on the present state, i.e., any given commentator comments only in response to the last comment (the last group of comments, actually). In the field of stochastic modelling this is known as the Markov property. The assumption seemed plausible because we modeled the state of the conversation using a collection of comments distributed over time; this ensured that sporadic transitions were smoothed out and we ultimately modeled only statistically significant patterns in the comments.

The number of comments representing one state of the conversation was chosen heuristically, depending on the size of each comment, the total number of reference topics, and the time over which the comments were spread. In the present case, the roughly 30k scraped comments were spread over a year and a half, each about 40-80 words long, with 10 reference topics for comparison. Assuming a uniform distribution of comments, the average number of comments per day was about 55, amounting to roughly 3300 raw words. This number (55 comments clubbed together) looked decent enough to capture the context of the conversation, so we started with it for creating one time-stamp in the flow of the conversation. Later we experimented with this number in order to understand and neutralize the effect of the heuristic.
What we found was that even 250-300 words were good enough to capture the context of the conversation while still giving us enough time-stamps to be useful for understanding the long-term dynamics of the conversation. So we clubbed 5 consecutive comments together and applied LDA over them to get a probability distribution of topics at each time-stamp.
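The clubbing step above is just fixed-size batching of the time-ordered comments (a Python sketch with placeholder comments; the project did this in R with batches of 5):

```python
# Sketch of clubbing consecutive comments into one time-stamp document
# (the project settled on batches of 5; the comments here are placeholders).
comments = [f"comment {i}" for i in range(12)]
batch_size = 5

batches = [" ".join(comments[i:i + batch_size])
           for i in range(0, len(comments), batch_size)]
# Each batch is treated as one document for LDA, yielding one topic
# distribution per time-stamp.
```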

**Discrete Time Markov Chain**

Building on the Markov property assumption, we initially tried modelling conversations as a continuous-time Markov process. While this allowed us to relax the constraint of knowing the exact times of state transitions, it brought an implicit and even stronger assumption: that the process is Markov as well as time-homogeneous (constant transition probabilities) over any given period. This assumption was too strong for the real-life data we were trying to model, which showed quite clearly when our simulations failed to converge in fitting the data. To relax this constraint we tried Discrete-Time Markov Chain (DTMC) modelling, which models a stochastic process as a temporal sequence of states drawn from a countable collection of states called the state space of the process. It required the assumptions of knowing the exact transition times between states and of time-homogeneity of the Markov chain, which essentially means that the transition probabilities between states over any fixed number of steps remain constant throughout. One additional assumption was that the sampling times are non-informative apart from being exact with respect to the transitions, i.e., the sampling times have no other significance for the system beyond capturing the exact transition times of the states. On this foundation we used R primarily for the modelling.
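Under the exact-transition-times and time-homogeneity assumptions, the DTMC transition matrix has a simple maximum-likelihood estimate: count transitions and normalize each row. A toy Python sketch (the state sequence is illustrative, not project data):

```python
import numpy as np

# With exactly observed transitions, the MLE of the DTMC transition matrix is
# the count of i -> j transitions divided by the total transitions out of i.
seq = [0, 0, 1, 1, 1, 2, 0, 1, 2, 2]   # toy state sequence over 3 topics
n_states = 3

counts = np.zeros((n_states, n_states))
for a, b in zip(seq[:-1], seq[1:]):
    counts[a, b] += 1

P_mle = counts / counts.sum(axis=1, keepdims=True)  # row-normalize
```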

**MSM (Multi-State Markov Model)**

Initially we experimented with a number of R packages for Markov chain modelling, such as markovchain, DTMCPack, SemiMarkov, and msm, but finally chose msm (Multi-State Markov Model) as the most appropriate for our purpose. This package can fit continuous-time, discrete-time, and hidden Markov models given longitudinal data on the states of the system. Since we wanted to model our conversational data as a DTMC, we needed to convert the probability distribution of topics at each timestamp into one unambiguous state; for this purpose we simply chose the state with maximum probability as the state of the conversation at each timestamp. In another variant of this method we also created multiple batches of training data in which, instead of picking the state with maximum probability, we sampled a topic from the probability distribution of topics at each timestamp. Either way, we end up with one state for the conversation at each timestamp, which is simply the most relevant topic in the conversation at that time. The next task was to feed this sequence of states, along with a few structural constraints such as a matrix specifying which state transitions are possible, into the msm package to estimate a transition probability matrix. This matrix can be used to predict the probability distribution of topics in the given conversation after any specified number of steps, and also to compute the stationary distribution of the Markov chain: the long-term probability distribution of states that remains constant once the chain achieves stationarity.

# fitting the multi-state model with msm

`topic.msm = msm(formula = state ~ time, subject = id, data = df_msm, qmatrix = qmatrix, gen.inits = TRUE, obstype = 2, exacttimes = TRUE, control = list(fnscale = 2500, reltol = 1e-16, ndeps = rep(1e-6, 90)))`
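The step of collapsing each timestamp's topic distribution into one state, in both the deterministic (maximum probability) and stochastic (sampling) variants described above, looks like this (a Python sketch with toy distributions; the project did this in R before calling msm):

```python
import numpy as np

# Toy topic distributions at four time-stamps (each row sums to 1). The state
# fed to the Markov model is one topic per time-stamp.
theta = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
])

states = theta.argmax(axis=1)  # deterministic variant: most probable topic

# Stochastic variant: sample one topic per time-stamp from its distribution.
rng = np.random.default_rng(0)
sampled = [int(rng.choice(3, p=row)) for row in theta]
```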

**Time-Inhomogeneity**

Once we fitted the sequence of topics to a DTMC using msm, we observed that while the fitted model predicted stationarity within 15-20 steps, the observed chain did not achieve a constant probability distribution. This was a clear hint that one or more of our assumptions, specifically time-homogeneity, had failed. To check this assumption, instead of fitting the DTMC model over the complete data, we fitted it separately over windows of 100 timestamps and then compared the transition probabilities. As we had suspected, the transition probabilities from the separate chains were different, but interestingly they behaved like white noise: the probabilities looked like numbers generated randomly around a fixed mean with a fixed variance. This meant that if we gathered more batches of data and averaged the transition probabilities over all of the chains, we should obtain an unbiased estimate of the true probabilities, which could ultimately be used to estimate the long-term stationary distribution.
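The averaging argument above, and the stationary distribution it ultimately feeds, can be sketched as follows (a Python sketch with toy 2-state matrices; the project's matrices came from msm fits over 100-timestamp windows):

```python
import numpy as np

# If per-window transition matrices fluctuate like white noise around a fixed
# mean, averaging them gives an unbiased estimate (toy 2-state matrices here).
window_matrices = [
    np.array([[0.8, 0.2], [0.3, 0.7]]),
    np.array([[0.6, 0.4], [0.5, 0.5]]),
    np.array([[0.7, 0.3], [0.4, 0.6]]),
]

P_hat = np.mean(window_matrices, axis=0)  # rows still sum to 1

# Long-run stationary distribution: iterate pi <- pi @ P_hat to convergence.
pi = np.array([0.5, 0.5])
for _ in range(200):
    pi = pi @ P_hat
```

For this toy `P_hat` the iteration converges to pi = (4/7, 3/7), the unique distribution satisfying pi = pi P.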

**Conclusion and Future Scope**

Finally, at the end of all our analyses, we arrived at a methodology for modelling conversations as a Discrete-Time Markov Chain, which can be used to predict the future stochastic movement of a conversation over the relevant topics of discussion, based on its past data. The methodology can also estimate the stationary distribution of the conversation over the reference topics, providing a quantitative benchmark for comparing the long-run relative importance of the various reference topics in a given conversation.

This model rests on a number of simplifying assumptions that need further probing and refinement. The heuristic of clubbing a group of comments to represent the state of the process can be made more rigorous by automatically detecting clusters of comments in the time-sequenced data that represent one coherent idea; this would also help minimize complications from the time-inhomogeneous nature of the Markov chain by capturing the actual transitions of states. The list of reference topics needs to be expanded further, keeping in mind the broad categorization of information. Moreover, we need to develop a methodology by which the distribution of words associated with each reference topic can be refined online based on popular usage. We also need to automate the choice of the number of topics used while fitting LDA, as a subset of all available reference topics, based on some pre-computation on the corpus of comments. Finally, more effort is required in evaluating the topics generated by LDA and automating countermeasures so the system can work in an unsupervised online setting.

**Visualization**

The following video shows the temporal dynamics of the content of an online conversation thread scraped from www.mygov.in. The sizes of the nodes correspond to the distribution of topics over time. The visualization was created using the ‘igraph’ package in R to generate JPG frames and then the ‘ffmpeg’ multimedia converter to assemble the still frames into an MP4 video file.

Note: Time appearing in video *doesn’t* correspond to the chronological time when the comments were made.

PGDBA, Computing for Data Sciences, Group #11

- Aashish Jhamtani
- Pradeep Mooda
- Somya Ranjan Sahu
- Shashank Kumar

Final_Uploaded_CDS_Project_Code