Data Science Certification in Pune

Presentation Description

With growing demand for data science skills, the Data Science Certification in Pune at ExcelR continues to expand its reach. ExcelR provides job assistance and placement support throughout the course.

Presentation Transcript

slide 1:

© 2013 ExcelR Solutions. All Rights Reserved
Text Mining and Clustering

slide 2:

Text Mining - Importance
• Avenues of textual, unstructured data
  − Call transcripts
  − Emails to customer service
  − Social media outreach
  − Speech transcripts
  − Field agents, salespeople
  − Interviews, surveys
Chart: Structured 20%, Unstructured 80%

slide 3:

Bag-of-Words
“All the world’s a stage, and all the men and women merely players: They have their exits and their entrances, And one man in his time plays many parts…”
English Professor vs. Statistician

Term      Count
World     1
Stage     1
Men       2
Women     1
Play      2
Exit      1
Entrance  1
Time      1
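The counts on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the stop-word list and the NORMALIZE map are hand-written assumptions standing in for a real stop-word list and stemmer, chosen so that "men"/"man" and "players"/"plays" collapse the way the slide shows.

```python
import re
from collections import Counter

QUOTE = (
    "All the world's a stage, and all the men and women merely players: "
    "They have their exits and their entrances, "
    "And one man in his time plays many parts"
)

# Assumed stop-word list (illustrative only, not from the slides)
STOP_WORDS = {"all", "the", "a", "and", "they", "have", "their",
              "one", "in", "his", "many", "merely", "s"}

# Assumed hand-written normalization map, standing in for a real stemmer
NORMALIZE = {"men": "man", "players": "play", "plays": "play",
             "exits": "exit", "entrances": "entrance", "parts": "part"}


def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, normalize, and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = [NORMALIZE.get(t, t) for t in tokens if t not in STOP_WORDS]
    return Counter(terms)


bow = bag_of_words(QUOTE)
for term, count in sorted(bow.items()):
    print(term, count)
```

Word order and grammar are discarded; only the term counts remain. The sketch also counts "part", which the slide's table leaves out.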

slide 4:

Terminology and Pre-processing
• Each row is called a ‘Document’; even an empty row is considered a document
• The collection of all these documents is called the ‘Corpus’
• Quirks of languages
  − Terms with typos, e.g. ‘musc’
  − Terms in lowercase, proper case, uppercase, e.g. usb, Usb, USB
  − Punctuation and special symbols
  − Filler words, connectors, pronouns: ‘all’, ‘for’, ‘of’, ‘my’, ‘to’, etc.
• Stemming: the process of considering only stem words, e.g. for jumping and jumped the stem word is ‘jump’
Let me show you “Amazon customer reviews”
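The pre-processing steps listed above (case folding, punctuation removal, stop-word removal, stemming) can be sketched as follows. The crude_stem function is a deliberately simple suffix stripper used here as an assumption; a real project would more likely use an established stemmer such as Porter's.

```python
import re

# Filler words taken from the slide's examples
STOP_WORDS = {"all", "for", "of", "my", "to"}


def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(document):
    """Lowercase, drop punctuation and stop words, then stem each term."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


# Each row is a document; the empty row is still a document in the corpus.
corpus = ["Jumping, jumped for my USB!", "usb Usb USB", ""]
cleaned = [preprocess(doc) for doc in corpus]
print(cleaned)
```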

slide 5:

DTM / TDM
• Let us understand a 100-document corpus of Xbox
DTM weighting
− TF: regular term counts
− TFIDF: discounts the TF by document frequency
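To make the two weightings concrete, here is a small sketch that builds a document-term matrix and applies TF-IDF. The three documents are invented for illustration, and tf * log(N / df) is one common TF-IDF variant; the slides do not say which variant was used.

```python
import math
from collections import Counter

# Hypothetical mini-corpus (the 100-document Xbox corpus is not in the slides)
docs = [
    "xbox controller broke",
    "xbox game great game",
    "xbox shipping slow",
]


def document_term_matrix(documents):
    """Rows are documents, columns are vocabulary terms, cells are TF counts."""
    counts = [Counter(doc.split()) for doc in documents]
    vocab = sorted(set().union(*counts))
    dtm = [[c[term] for term in vocab] for c in counts]
    return vocab, dtm


def tfidf(dtm):
    """Weight each count by log(N / document frequency) of its term."""
    n_docs = len(dtm)
    n_terms = len(dtm[0])
    df = [sum(1 for row in dtm if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in dtm]


vocab, dtm = document_term_matrix(docs)
weights = tfidf(dtm)
print(vocab)
print(dtm)
print(weights)
```

A term that occurs in every document, such as "xbox" here, gets a TF-IDF weight of zero, which is the discounting the slide refers to. Transposing the DTM gives the TDM.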

slide 6:

Corpus-Level Word Cloud

slide 7:

Positive Word Cloud

slide 8:

Negative Word Cloud

slide 9:

Clinical Trials Project

slide 10:

Clinical Trials – Text Mining
• Stage 1: Animals
• Stage 2: Humans - very few with that specific disease
• Stage 3: Humans - who have other diseases
• Stage 4: Humans - larger audience
• Stage 5: US FDA
• Stage 6: Adverse events

slide 11:

Clinical Trials – Project in brief
Business Objective: Increase the success rate of the clinical trials
Project Brief Description:
• Phase 1: Collected the data from open source forums such as “https://clinicaltrials.gov/”
• Phase 2: Data cleansing on XML files by extracting relevant fields from the clinical trials
• Phase 3: Segregated the data into structured and unstructured data
• Phase 4: Performed word cloud and sentiment analysis on unstructured data to identify the reasons for termination of clinical trials
Techniques used:
• Term Frequency (TF)
• Term Frequency Inverse Document Frequency (TFIDF)
• Positive and Negative Word Cloud
• Dendrogram
• Semantic Network
• k-Means clustering
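Phases 2 and 3 can be sketched with Python's standard XML parser. The sample record, its element names (overall_status, why_stopped, enrollment), and the placeholder ID are assumptions made for illustration; the slides do not list which fields were actually extracted.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; element names and the ID are assumed placeholders
SAMPLE_XML = """<clinical_study>
  <nct_id>NCT00000000</nct_id>
  <overall_status>Terminated</overall_status>
  <enrollment>12</enrollment>
  <why_stopped>Slow accrual, no safety concerns</why_stopped>
</clinical_study>"""


def extract_fields(xml_text):
    """Split one trial record into structured fields and free text."""
    root = ET.fromstring(xml_text)
    structured = {
        "nct_id": root.findtext("nct_id"),
        "status": root.findtext("overall_status"),
        "enrollment": int(root.findtext("enrollment", default="0")),
    }
    # Free-text termination reason: the input for text mining in Phase 4
    unstructured = root.findtext("why_stopped", default="")
    return structured, unstructured


structured, unstructured = extract_fields(SAMPLE_XML)
print(structured)
print(unstructured)
```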

slide 12:

Unigram Word Cloud and Dendrogram
• Key words standing out from the rest are Accrual, Enrollment, Slow, Safety, Efficacy, Sponsor, Lack, Low, etc.
• These words should be seen in context to gain business value
• When we see this word cloud in conjunction with the dendrogram, we notice that slow accrual, slow enrollment, poor efficacy, and sponsor funding seem to be the broad themes for termination of clinical trials

slide 13:

Semantic Network and Bigram
• The semantic network shows the relationships between the words; the key themes mentioned in the previous slide become more relevant
• One key thing is safety concerns. At first sight it sounds as if safety concerns were a reason for termination, but when we see the words in context, more termination reasons say that there are “No Safety Concerns”
• A bi-gram looks at 2 words together to extract business value; the key themes mentioned earlier are more evident here
Bi-gram Word Cloud and Semantic Network
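The "No Safety Concerns" point is easy to see with a small unigram vs. bigram count. The termination reasons below are invented examples, not data from the project.

```python
import re
from collections import Counter

# Hypothetical termination reasons, written for illustration only
reasons = [
    "Slow accrual",
    "Terminated due to slow enrollment, no safety concerns",
    "Sponsor decision, no safety concerns",
    "Lack of efficacy",
]


def ngrams(text, n):
    """Return the n-word sequences of a text after lowercasing."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


unigrams = Counter(t for r in reasons for t in ngrams(r, 1))
bigrams = Counter(b for r in reasons for b in ngrams(r, 2))

print(unigrams["safety"])      # "safety" alone looks like a termination theme
print(bigrams["no safety"])    # the bigram shows it is mostly negated
print(bigrams.most_common(3))
```

In this toy corpus every occurrence of "safety" is preceded by "no", which the unigram counts alone cannot reveal.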

slide 14:

K-Means Clustering Scree Plot
• The scree plot (or elbow plot) shows a clear bend at 2 clusters; hence we consider that the data can be segregated into 2 clusters (categories)
Note: Analysis was done considering the slight bend at the 2nd cluster and the steep bend at the 4th cluster; however, it did not provide any meaningful insights
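The numbers behind a scree plot are the within-cluster sums of squares (WSS) for each candidate k. The sketch below computes them with a plain k-means loop on invented 2-D points; the farthest-point initialization is an assumption chosen to keep the example deterministic, and the project's real input would be document vectors rather than these toy points.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def init_centroids(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans_wss(points, k, max_iter=100):
    """Run k-means and return the within-cluster sum of squares."""
    centroids = init_centroids(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        updated = [mean(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Two well-separated toy groups, so the elbow should appear at k = 2
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
wss = {k: kmeans_wss(points, k) for k in range(1, 5)}
for k, value in wss.items():
    print(k, value)
```

Plotting WSS against k gives the scree plot; the k after which the curve flattens is the elbow.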

slide 15:

Word Cloud and Dendrogram - First Cluster
• The word cloud clearly highlights that this cluster is mainly about Accrual: a term referring to the number of patients in a study or clinical trial
• Even the dendrogram clearly shows Accrual, Enrollment, Slow as a major cluster

slide 16:

Word Cloud and Dendrogram - Second Cluster
• A few key highlights from this word cloud are early and premature termination
• The dendrogram mostly groups terms related to premature closure of clinical trials

slide 17:

Web and Social Media Extraction

slide 18:

NLP - Agenda
01 LDA in Text Mining
02 Topic extraction using LDA
03 Structured information extraction
04 Sentiment extraction in a narrative
05 Lexicons, Emotion Mining

slide 19:

NLP
• Data collection / Information retrieval
• Feature extraction
• Lexical analysis / Entity analysis
• Cleaning / Normalization
• Extraction of insight

slide 20:

Latent Dirichlet Allocation (LDA)
• It assumes that each document is a mixture of a small number of topics
• Each document may be viewed as a mixture of various topics
• Each word’s presence is attributable to one of the document’s topics
• LDA can be viewed as a Bayesian model where each item is modeled as the result of a mixture of an underlying set of topics
• LDA is a generative model: a model for randomly generating observable data values
• Observations are words collected into documents
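The generative story can be sketched directly: draw a topic mixture for the document, then for each word pick a topic from that mixture and a word from that topic. The two topics, their word probabilities, and the alpha values are invented for illustration. Full LDA also draws the topic-word distributions from a Dirichlet prior and then infers all of this from data; the sketch only shows the forward (generating) direction.

```python
import random

# Hypothetical topics with fixed word probabilities (assumed for illustration)
TOPICS = {
    "gaming": {"xbox": 0.5, "controller": 0.3, "game": 0.2},
    "billing": {"fee": 0.4, "payment": 0.4, "card": 0.2},
}


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, rng):
    """Generate one document: topic mixture first, then topic and word per slot."""
    names = list(TOPICS)
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=theta)[0]
        dist = TOPICS[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return theta, words


rng = random.Random(0)
theta, words = generate_document(20, alpha=[0.5, 0.5], rng=rng)
print(theta)
print(" ".join(words))
```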

slide 21:

Latent Dirichlet Allocation (LDA) vs. Clustering
LDA
• Unsupervised learning algorithm
• Mixture model where a document can be assigned to one or more topics
• Each topic is a culmination of multiple documents
Clustering (K-means)
• Unsupervised learning algorithm
• Specify an optimal ‘k’ that allows us to extract topics or segments from the data
• Does a raw partition of the data
• Resultant clusters are disjoint from each other
A popular example using term usage
• A man sees a boy with a telescope
• Who has the telescope?
• In this example, a term’s usage leads to confusion owing to its placement
• In the same way, a sentence from a corpus could convey a different meaning in conjunction with another sentence

slide 22:

Structured data extraction
• Many sources of data contain a large amount of artifacts that lend a lot of information
• Text data can be subjected to methods that help mine structured information
• This is information retrieval using previously generated labeled data
Steps: 1. Raw Text, 2. Parser, 3. Names, Entities

slide 23:

Lexicons
• Lexicons serve as dictionaries for extracting sentiment from raw, unlabeled data
• These are useful in estimating semantic orientation (polarity)
• They are applied to polarity prediction tasks and serve as a bag of words that helps assign a score/label to terms in text
Three lexicons used in this session are:
– Bing: developed by Professor Bing Liu
– AFINN: Informatics and Mathematical Modelling, Technical University of Denmark
– NRC: Dr. Saif M. Mohammad

slide 24:

Emotion Mining
This methodology allows the extraction of the most negative to the most positive sentiment-bearing documents from text:
• negative <- s_v[which.min(afinn_s_v)]
• negative
• [1] "I fully agree with you. This is the worst card ever and they are running a late charge fee scam here. This is what they do - on my first statement I made full payment. On the second statement they charged me a late payment fee and interest. I wrote back to them to tell them they made mistake but never heard back from them. I made full payment on the second statement which was around 26 and when the third statement came I was charged with another late payment fee and compounded interests. This time I got on the phone and spoke to some guy in India who starts off each statement with \"So how do you want to make payment sir\" After something like hearing that damn statement for the 10th time I got so pissed with trying to get an explanation on the late fees I was transferred to a supervisor who then said that my first payment got rejected. I then asked why was it rejected for which they had no clue and said that I should check with my bank but got no where but what I was really pissed about was that the unpaid amount due in the first statement was never reflected on the second statement like other typical credit cards from REPUTABLE banks. Had this been done I would have known and paid in full with the second payment In the end they said that they would waive the interests which was only a few dollars but gad to charge me the late fee of 35 per month 70 in total. Can you imagine if they run this \"sweat shop\" practice from India and suckered in 10000 people with this practice That would have been 700000 into their pockets without breaking a SWEAT Worst still they filed my 2 months delinquencies with the credit unions and I lost 50-70 credit score points Bastards "
• positive <- s_v[which.max(afinn_s_v)]
• positive
• [1] "\"Well you got me beat. I have 11 currently.
However I do have several Chase cards CSP Marriott Rewards and the Freedom and to be honest I rarely use my Amazon.com Rewards due to better reward options elsewhere Barclays SallieMae is the best card for Amazon.com purchases. 5 back on up to 750 in purchases each month. It also has 5 on gas/groceries on up to 250 on purchases in each category per month. I really do not like the small cap of 250 on groceries and use Amex BCE for that purpose but its still an extra benefit to have."
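The same idea, scoring every document with a lexicon and then picking the minimum and maximum, can be sketched in Python. The tiny lexicon and its scores are made up for illustration and are not real AFINN values, and the three reviews are shortened, invented examples.

```python
import re

# Assumed AFINN-style lexicon: term -> integer valence (values are made up)
LEXICON = {"worst": -3, "scam": -2, "late": -1,
           "best": 3, "better": 2, "benefit": 2}

# Hypothetical, shortened reviews
reviews = [
    "This is the worst card ever and a late fee scam",
    "The card arrived on Tuesday",
    "The best card with better rewards and an extra benefit",
]


def sentiment_score(text):
    """Sum the lexicon scores of all terms found in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(LEXICON.get(t, 0) for t in tokens)


scores = [sentiment_score(r) for r in reviews]

# Equivalent of which.min / which.max: index of the extreme score
most_negative = reviews[scores.index(min(scores))]
most_positive = reviews[scores.index(max(scores))]

print(scores)
print(most_negative)
print(most_positive)
```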

slide 25:

• This method generates narrative time for the text and shows how the positive and negative emotional valence has varied across the corpus
• The NRC framework allows mining text that matches eight emotions in the writing:
  1. Trust
  2. Anticipation
  3. Joy
  4. Sadness
  5. Fear
  6. Anger
  7. Surprise
  8. Disgust
• The same can be implemented for the different sources from which content is generated for a client, and the outcomes can be compared and evaluated to understand which source provides what form of understanding
Arcs, Emotion
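A rough sketch of emotion counting along narrative time follows. The word-to-emotion map is a small invented stand-in for the NRC lexicon, and the sentences are hypothetical; each sentence is treated as one step in narrative time.

```python
import re
from collections import Counter

EMOTIONS = ["trust", "anticipation", "joy", "sadness",
            "fear", "anger", "surprise", "disgust"]

# Assumed NRC-style lexicon: term -> emotions (entries are made up)
LEXICON = {
    "trust": ["trust"],
    "enjoy": ["joy"],
    "rewards": ["joy", "anticipation"],
    "nasty": ["disgust", "anger"],
    "surprise": ["surprise"],
    "fear": ["fear"],
}

# Hypothetical narrative, one sentence per time step
sentences = [
    "I trust this bank and enjoy the rewards",
    "The late fee was a nasty surprise",
    "I fear another charge",
]


def emotion_counts(text):
    """Count how many terms in the text map to each of the eight emotions."""
    counts = Counter({e: 0 for e in EMOTIONS})
    for token in re.findall(r"[a-z]+", text.lower()):
        counts.update(LEXICON.get(token, []))
    return counts


# One Counter per sentence gives the emotion profile over narrative time
arc = [emotion_counts(s) for s in sentences]
totals = sum(arc, Counter())

for step, counts in enumerate(arc, start=1):
    print(step, {e: counts[e] for e in EMOTIONS if counts[e]})
print({e: totals[e] for e in EMOTIONS})
```

Running the same counting on text from different sources gives per-source emotion profiles that can then be compared.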

slide 26:

THANK YOU
