slide 1: © 2013 ExcelR Solutions. All Rights Reserved
Text Mining
Clustering
slide 2:
Text Mining - Importance
• Avenues of textual unstructured data
− Call transcripts
− Email to customer service
− Social media outreach
− Speech transcripts
− Field agents, salespeople
− Interviews, surveys
[Chart: Structured 20 / Unstructured 80]
slide 3:
Bag-of-Words
“All the world’s a stage, and all the men and women merely players:
They have their exits and their entrances;
And one man in his time plays many parts…”
Two readers of the same text: English professor vs. statistician
Term:   World  Stage  Men  Women  Play  Exit  Entrance  Time
Count:  1      1      2    1      2     1     1         1
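The statistician's view of the passage can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the base-form map below are hand-written for this one passage (they are assumptions, not part of the original material), chosen so that "man" folds into "men" and "players"/"plays" fold into "play", as the counts on the slide suggest.

```python
import re
from collections import Counter

TEXT = (
    "All the world's a stage, and all the men and women merely players: "
    "They have their exits and their entrances; "
    "And one man in his time plays many parts"
)

# Hypothetical, hand-written resources for this single passage (assumptions).
STOP_WORDS = {"all", "the", "a", "s", "and", "merely", "they", "have",
              "their", "one", "in", "his", "many", "parts"}
BASE_FORM = {"man": "men", "players": "play", "plays": "play",
             "exits": "exit", "entrances": "entrance"}

def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, map to base forms, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [BASE_FORM.get(t, t) for t in tokens if t not in STOP_WORDS]
    return Counter(kept)

counts = bag_of_words(TEXT)
for term, n in counts.items():
    print(term, n)
```

Word order is discarded on purpose: a bag-of-words keeps only which terms occur and how often.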
slide 4:
Terminology & Pre-processing
• Each row is called a ‘Document’; even an empty row is considered a document
• The collection of all these documents is called a ‘Corpus’
• Quirks of languages
− Terms with typos, e.g. ‘musc’
− Terms in lowercase, proper case, and uppercase, e.g. usb, Usb, USB
− Punctuation and special symbols
− Filler words, connectors, and pronouns: ‘all’, ‘for’, ‘of’, ‘my’, ‘to’, etc.
• Stemming: the process of reducing words to their stem, e.g. for ‘jumping’ and ‘jumped’ the stem word is ‘jump’
Let me show you “Amazon customer reviews”
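A minimal sketch of these pre-processing steps, assuming a tiny stop-word list (just the examples from the slide) and a toy suffix stripper in place of a real stemming algorithm such as Porter's. The function names and the sample corpus are hypothetical.

```python
import re

# Only the filler words named on the slide; a real list would be much longer.
STOP_WORDS = {"all", "for", "of", "my", "to"}

def naive_stem(word):
    """Toy stemmer: strip one common suffix (an assumption, not Porter stemming)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Lowercase, remove punctuation/symbols, drop stop words, then stem."""
    lowered = document.lower()
    no_punct = re.sub(r"[^a-z0-9\s]", " ", lowered)
    tokens = no_punct.split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

# A three-document corpus; the empty row still counts as a document.
corpus = ["All of my USB ports failed!", "", "Jumping, jumped: usb Usb USB."]
cleaned = [preprocess(doc) for doc in corpus]
print(cleaned)
```

Typos such as ‘musc’ are not handled here; correcting them would need a spell-checking step that this sketch leaves out.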
slide 5:
DTM (Document-Term Matrix) / TDM (Term-Document Matrix)
• Let us understand a 100-document corpus of Xbox
DTM weighting
TF - Regular term counts
TFIDF - Discounts the TF by document frequency
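The two weightings can be sketched on a toy corpus. The three tokenized documents below are invented for illustration, and the weighting uses one common TF-IDF variant, tf × log(N / df); libraries differ in smoothing and normalization, so treat the exact formula as an assumption.

```python
import math
from collections import Counter

# Hypothetical, already pre-processed documents (not the Xbox corpus itself).
docs = [
    ["xbox", "controller", "great"],
    ["xbox", "game", "great", "game"],
    ["controller", "broken"],
]

vocab = sorted({term for doc in docs for term in doc})

# DTM: one row per document, one column per term, cells hold raw counts (TF).
dtm = []
for doc in docs:
    term_counts = Counter(doc)
    dtm.append([term_counts[term] for term in vocab])

# TDM is simply the transpose: one row per term, one column per document.
tdm = [list(column) for column in zip(*dtm)]

# Document frequency: in how many documents each term appears.
n_docs = len(docs)
df = [sum(1 for row in dtm if row[j] > 0) for j in range(len(vocab))]

# TF-IDF: discount each count by how widespread the term is.
tfidf = [
    [row[j] * math.log(n_docs / df[j]) for j in range(len(vocab))]
    for row in dtm
]

print(vocab)
print(dtm)
```

A term that appears in every document gets log(1) = 0 under this variant, which is the "discount by document frequency" idea taken to its limit.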
slide 6:
Corpus-Level Word Cloud
slide 7:
Positive Word Cloud
slide 8:
Negative Word Cloud
slide 9:
Clinical Trials Project
slide 10:
Clinical Trials – Text Mining
• Stage 1: Animals
• Stage 2: Humans - very few with that specific disease
• Stage 3: Humans - who have other diseases
• Stage 4: Humans - larger audience
• Stage 5: US FDA
• Stage 6: Adverse events
slide 11:
Clinical Trials – Project in brief
Business Objective: Increase the success rate of the clinical trials
Project Brief Description:
Phase 1: Collected the data from open source forums such as “https://clinicaltrials.gov/”
Phase 2: Data Cleansing on XML files by extracting relevant fields from the clinical trials
Phase 3: Segregated the data into structured and unstructured data
Phase 4: Performed word cloud and sentiment analysis on the unstructured data to identify the reasons for termination of clinical trials
Techniques used:
Term Frequency (TF), Term Frequency-Inverse Document Frequency (TFIDF), positive and negative word clouds, dendrogram, semantic network, k-means clustering
slide 12:
Unigram Word Cloud & Dendrogram
• Key words standing out from the rest are Accrual, Enrollment, Slow, Safety, Efficacy, Sponsor, Lack, Low, etc.
• These words should be seen in context to gain business value
• When we see this word cloud in conjunction with the dendrogram, we notice that slow accrual, slow enrollment, poor efficacy, and sponsor funding seem to be the broad themes for termination of clinical trials
slide 13:
Semantic Network & Bigram
• The semantic network shows the relationships between the words; the key themes mentioned in the previous slide become more relevant
• One key thing is safety concerns. At first sight it sounds as if safety concerns were a reason for termination, but when we see it in context, more termination reasons say that there are “No Safety Concerns”
• A bi-gram looks at 2 words together to extract business value; the key themes mentioned earlier are more evident here
Bi-gram Word Cloud & Semantic Network
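The "safety concerns" point can be illustrated with a small sketch that compares unigram and bigram counts. The three termination reasons below are made up for illustration and are not taken from the clinical trials data.

```python
import re
from collections import Counter

# Hypothetical termination reasons (invented examples).
reasons = [
    "Terminated due to slow accrual",
    "Slow accrual; no safety concerns",
    "Sponsor decision, no safety concerns",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def bigrams(text):
    """Adjacent word pairs, in order of appearance."""
    tokens = tokenize(text)
    return list(zip(tokens, tokens[1:]))

unigram_counts = Counter(t for r in reasons for t in tokenize(r))
bigram_counts = Counter(bg for r in reasons for bg in bigrams(r))

# On its own, "safety" looks like a frequent termination theme...
print(unigram_counts["safety"])
# ...but the bigram shows it is always preceded by "no".
print(bigram_counts[("no", "safety")])
```

The bigram counts can also be read as weighted edges between words, which is one simple way to build the kind of semantic network shown on this slide.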
slide 14:
K-Means Clustering & Scree Plot
• The scree plot (elbow plot) shows a clear bend at 2 clusters; hence we consider that the data can be segregated into 2 clusters (categories)
Note: The analysis was done considering the slight bend at the 2nd cluster and considering the steep bend at the 4th cluster; however, it did not provide any meaningful insights
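How a scree (elbow) plot is produced can be sketched as follows: run k-means for several values of k and record the within-cluster sum of squares (WSS) each time. The 2-D points stand in for document vectors and are invented; the deterministic "evenly spaced points" initialization is a simplification chosen so the sketch is reproducible, not how production k-means usually starts.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Put every point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
        clusters[nearest].append(p)
    return clusters

def mean(cluster):
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def kmeans_wss(points, k, iterations=10):
    """Lloyd's algorithm; returns the within-cluster sum of squares."""
    # Simplified deterministic start: k evenly spaced points from the list.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = assign(points, centers)
        centers = [mean(cl) if cl else centers[i] for i, cl in enumerate(clusters)]
    clusters = assign(points, centers)
    return sum(sq_dist(p, centers[i]) for i, cl in enumerate(clusters) for p in cl)

# Hypothetical data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]

wss_by_k = {k: kmeans_wss(points, k) for k in (1, 2, 3)}
for k, wss in wss_by_k.items():
    print(k, wss)
```

On this toy data the WSS drops sharply from k = 1 to k = 2 and barely changes afterwards; that bend is what the elbow plot makes visible.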
slide 15:
Word Cloud & Dendrogram - First Cluster
• The word cloud clearly highlights that this cluster is mainly about Accrual: a term referring to the number of patients in a study or clinical trial
• Even the dendrogram clearly shows Accrual, Enrollment, and Slow as a major cluster
slide 16:
Word Cloud & Dendrogram - Second Cluster
• A few key highlights from this word cloud are early and premature termination
• The dendrogram mostly mentions things related to premature closure of clinical trials
slide 17:
Web & Social Media Extraction
slide 18:
NLP - Agenda
01 LDA in Text Mining
02 Topic extraction using LDA
03 Structured information extraction
04 Sentiment extraction in a narrative
05 Lexicons & Emotion Mining
slide 19:
NLP
• Data collection / Information Retrieval
• Feature extraction
• Lexical analysis / Entity analysis
• Cleaning / Normalization
• Extraction of insight
slide 20:
Latent Dirichlet Allocation (LDA)
• It assumes that each document is a mixture of a small number of topics
• Each document may be viewed as a mixture of various topics
• Each word’s presence is attributable to one of the document’s topics
• LDA can be viewed as a Bayesian model where each item is modeled as a result of a mixture of an underlying set of topics
• LDA is a generative model: a model for randomly generating observable data values
• Observations are words collected into documents
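The generative story above can be sketched with the standard library alone. This shows only the forward (generating) direction, not the inference that recovers topics from real text. The two topics, their word probabilities, and the `alpha` values are invented for illustration, and the topic-word distributions are held fixed here, whereas full LDA also draws them from a Dirichlet prior.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document: pick a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)          # this document's topic mixture
    assignments, words = [], []
    for _ in range(n_words):
        z = rng.choices(list(range(len(topics))), weights=theta)[0]  # topic for this word
        vocab, probs = zip(*topics[z].items())
        w = rng.choices(vocab, weights=probs)[0]                     # word from that topic
        assignments.append(z)
        words.append(w)
    return theta, assignments, words

# Hypothetical topics as word -> probability maps (assumptions).
topics = [
    {"accrual": 0.5, "enrollment": 0.3, "slow": 0.2},
    {"safety": 0.4, "efficacy": 0.4, "sponsor": 0.2},
]

rng = random.Random(42)
theta, assignments, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta)
print(list(zip(assignments, words)))
```

Each generated word carries exactly one topic assignment, while the document as a whole is described by the mixture `theta`, which is the distinction the bullets above draw.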
slide 21:
Latent Dirichlet Allocation (LDA) vs. Clustering
LDA
• Unsupervised learning algorithm
• Mixture model where a document can be assigned to one or more topics
• Each topic is a culmination of multiple documents
Clustering (K-means)
• Unsupervised learning algorithm
• Specify an optimal ‘k’ that allows us to extract topics or segments from the data
• Does a raw partition of the data
• Resultant clusters are disjoint from each other
Example
• A popular example using term usage: “A man sees a boy with a telescope”
• Who has the telescope?
• In this example, a term’s usage leads to confusion owing to its placement
• In the same way, a sentence from a corpus could imply a different meaning in conjunction with another sentence
slide 22:
Structured data extraction
• Many sources of data contain a large amount of artifacts that carry a lot of information
• Text data can be subjected to methods that help mine structured information
• This is information retrieval using previously generated labeled data
1. Raw Text
2. Parser
3. Names / Entities
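One very simple reading of "raw text, parser, entities" is a lookup against previously labeled terms. The sketch below assumes a hypothetical hand-labeled dictionary (`LABELED_TERMS`) and an invented sample sentence; real entity extraction would typically rely on a trained parser or tagger rather than exact word matches.

```python
import re

# Hypothetical labeled data: known terms and their entity types (assumptions).
LABELED_TERMS = {
    "chase": "ORG",
    "barclays": "ORG",
    "amex": "ORG",
    "india": "LOCATION",
}

def extract_entities(text):
    """Scan raw text and return structured records for every known term."""
    entities = []
    for match in re.finditer(r"[A-Za-z]+", text):
        label = LABELED_TERMS.get(match.group().lower())
        if label:
            entities.append(
                {"text": match.group(), "label": label, "start": match.start()}
            )
    return entities

raw_text = "I use Amex for groceries and Chase for travel."
entities = extract_entities(raw_text)
for entity in entities:
    print(entity)
```

The output is a list of records (text, label, position), which is the structured form that can then be joined with other tabular data.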
slide 23:
Lexicons
• Lexicons serve as dictionaries for extracting sentiment from raw, unlabeled data
• These are useful in estimating semantic orientation (polarity)
• They are applied to polarity prediction tasks and serve as a bag of words that helps assign a score/label to terms in text
Three lexicons used in this session are:
– Bing: developed by Professor Bing Liu
– AFINN: Informatics and Mathematical Modelling, Technical University of Denmark
– NRC: Dr. Saif M. Mohammad
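Lexicon-based scoring can be sketched as a dictionary lookup followed by a sum. The mini-lexicon below is invented in the style of a scored word list; its words and values are assumptions and are not the actual Bing, AFINN, or NRC entries.

```python
import re

# Hypothetical scored word list (NOT real AFINN values).
TOY_LEXICON = {
    "worst": -3, "scam": -2, "late": -1,
    "best": 3, "rewards": 2, "benefit": 2,
}

def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of all words in the text; unknown words score 0."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)

def polarity(score):
    """Turn a numeric score into a polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for text in ["This is the worst card ever", "The best rewards", "I have a card"]:
    score = sentiment_score(text)
    print(text, score, polarity(score))
```

Because the text is treated as a bag of words, negation ("not the best") is ignored by this approach unless it is handled separately.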
slide 24:
This methodology allows the extraction of the most negative and the most positive sentiment-bearing documents from the text:
• negative <- s_v[which.min(afinn_s_v)]
• negative
• 1 "I fully agree with you. This is the worst card ever and they are running a late charge fee scam here. This is what they do - on my first
statement I made full payment. On the second statement they charged me a late payment fee and interest. I wrote back to them to tell them
they made mistake but never heard back from them. I made full payment on the second statement which was around 26 and when the
third statement came I was charged with another late payment fee and compounded interests. This time I got on the phone and spoke to
some guy in India who starts off each statement with \"So how do you want to make payment sir\" After something like hearing that damn
statement for the 10th time I got so pissed with trying to get an explanation on the late fees I was transferred to a supervisor who then said
that my first payment got rejected. I then asked why was it rejected for which they had no clue and said that I should check with my bank
but got no where but what I was really pissed about was that the unpaid amount due in the first statement was never reflected on the second
statement like other typical credit cards from REPUTABLE banks. Had this been done I would have known and paid in full with the
second payment In the end they said that they would waive the interests which was only a few dollars but gad to charge me the late fee of
35 per month 70 in total. Can you imagine if they run this \"sweat shop\" practice from India and suckered in 10000 people with this
practice That would have been 700000 into their pockets without breaking a SWEAT Worst still they filed my 2 months delinquencies
with the credit unions and I lost 50-70 credit score points Bastards ”
• positive <- s_v[which.max(afinn_s_v)]
• positive
• 1 "\"Well you got me beat. I have 11 currently. However I do have several Chase cards CSP Marriott Rewards and the Freedom and
to be honest I rarely use my Amazon.com Rewards due to better reward options elsewhere Barclays SallieMae is the best card for
Amazon.com purchases. 5 back on up to 750 in purchases each month. It also has 5 on gas/groceries on up to 250 on purchases in
each category per month. I really do not like the small cap of 250 on groceries and use Amex BCE for that purpose but its still an extra
benefit to have."
Emotion Mining
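The `which.min` / `which.max` selection on this slide has a direct Python analogue: score every document, then pick the index of the lowest and highest score. The sentences and the scored word list below are invented for illustration and are not real AFINN values.

```python
import re

# Hypothetical scored word list (NOT real AFINN values).
TOY_LEXICON = {"worst": -3, "scam": -2, "best": 3, "great": 3, "rewards": 2}

def sentiment_score(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(TOY_LEXICON.get(token, 0) for token in tokens)

# Stand-in for the vector of documents (called s_v on the slide).
sentences = [
    "This is the worst card ever and a scam.",
    "I rarely use this card.",
    "It is the best card with great rewards.",
]

scores = [sentiment_score(s) for s in sentences]

# Index of the minimum / maximum score (0-based here, unlike R's 1-based which.min).
min_index = min(range(len(scores)), key=lambda i: scores[i])
max_index = max(range(len(scores)), key=lambda i: scores[i])

most_negative = sentences[min_index]
most_positive = sentences[max_index]
print(most_negative)
print(most_positive)
```

If several documents tie for the extreme score, this sketch returns the first one, which matches how `which.min` and `which.max` behave in R.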
slide 25:
This method generates a narrative timeline for the text and shows how the positive and negative emotional valence has moved across the corpus
The NRC framework allows mining text that matches eight emotions in the writing:
1. Trust
2. Anticipation
3. Joy
4. Sadness
5. Fear
6. Anger
7. Surprise
8. Disgust
The same can be implemented for the different sources from which content is generated for a client, and the outcomes can be compared and evaluated to understand which source provides what form of understanding
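Both ideas on this slide, emotion counts and a valence arc over narrative time, can be sketched with toy lookups. The word-to-emotion and word-to-valence maps below are invented for illustration and are not actual NRC lexicon entries; the four sentences are made up as well.

```python
import re
from collections import Counter

# Hypothetical associations (NOT the real NRC lexicon).
TOY_EMOTIONS = {
    "worst": {"anger", "disgust", "sadness"},
    "scam": {"anger", "disgust"},
    "charged": {"anger"},
    "best": {"joy", "trust"},
    "rewards": {"joy", "anticipation"},
    "benefit": {"joy", "trust"},
}
TOY_VALENCE = {"worst": -1, "scam": -1, "charged": -1,
               "best": 1, "rewards": 1, "benefit": 1}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def emotion_counts(sentences):
    """Count how often each emotion is triggered across all sentences."""
    counts = Counter()
    for sentence in sentences:
        for token in tokenize(sentence):
            counts.update(TOY_EMOTIONS.get(token, ()))
    return counts

def valence_arc(sentences, n_bins):
    """Mean valence per equal-sized slice of narrative time."""
    scores = [sum(TOY_VALENCE.get(t, 0) for t in tokenize(s)) for s in sentences]
    arc = []
    for b in range(n_bins):
        chunk = scores[b * len(scores) // n_bins:(b + 1) * len(scores) // n_bins]
        arc.append(sum(chunk) / len(chunk) if chunk else 0.0)
    return arc

sentences = [
    "This is the worst card ever.",
    "They charged me a late fee, what a scam.",
    "The rewards are an extra benefit.",
    "It is the best card for purchases.",
]

counts = emotion_counts(sentences)
arc = valence_arc(sentences, n_bins=2)
print(counts)
print(arc)
```

Running the same two functions on text from different sources gives comparable emotion profiles and arcs, which is the comparison the last paragraph of the slide describes.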
Arcs Emotion
slide 26:
THANK YOU