Next Steps This concludes the exploratory analysis. The number of lines and total number of words are as follows: Unigram Analysis The first analysis we will perform is a unigram analysis. Introduction the milestone report for week 2 in the Exploratory Analysis section is from the Coursera Data Science Capstone project. The Coursera Data Science Capstone involves predictive text analytics. Exploratory Analysis Top 10 used words. The data set includes text files from various languages.

Coursera Data Science Capstone: Clean up the corpus by removing special characters, punctuation, numbers etc. This command can be used for obtaining text stats and is available on every Unix based system. Rmd, which can be found in my GitHub repository https: If none is found, then the 3-gram model is used, and so forth. The goal of the Data Science Capstone Project is to use the skills acquired in the specialization in creating an application based on a predictive model for text. For the Shiny app, the plan is to create an app with a simple interface where the user can enter a string of text.

Coursera Data Science Capstone: Week 2 Milestone Report

For each Term Document Matrix, we list the most common unigrams, bigrams, trigrams and fourgrams. This milestone report is based on exploratory data analysis of the SwifKey data provided in the context of the Coursera Data Science Capstone.

I use the tm package to construct functions that tokenize the sample and construct matrices of uniqrams, bigrams, and trigrams. Based on testing of the N-grams, it is clear that further work is required to improve the predictive power of the alorithm.


Data Science Capstone Milestone Report

The data set includes text files from various languages. Rda” ggplot head trigram. After that we can use RWeka package to create one-gram, bi-grams,tri-grams sets, and sort them to get the top 10 used n-grames. In this case, we created four different N-grams as follows:. Our prediction model will then give a list of suggested words to update the next word.

coursera capstone project milestone report

This will create a unigram Dataframe, which we will then manipulate so we can chart the frequencies using ggplot. This section describes the process to create a sample file foursera dataset from the three raw data files.

coursera capstone project milestone report

Build basic n-gram model. To get a sense of what the data looks like, I summerized the main information from each of the 3 datasets Blog, News and Twitter.

coursera capstone project milestone report

Calculate Frequencies of N-Grams Step 7: Our target files are: Convert text to lowercase and remove punctuation and numbers. Does including it in the training data contribute to more accurate predictions? The predictive model will be trained using a corpus, a collection of written texts, called the HC Corpora which has been filtered by language.

After reducing the size of each data set that were loaded sampled data is used to create a corpus, and following clean up steps are performed. We also remove profanity that we do not want to predict. As a next step a model will be created and integrated into a Shiny prouect for word prediction.


To summarize the all info until now, I seleted an small subset of each data and compared with the main files.

RPubs – Coursera Data Science Capstone: Milestone Report

Another assumption is that the command wc is available in the target system. The first analysis we ca;stone perform projetc a unigram analysis. While the strategy for modeling and prediction has not been finalized, the n-gram model with a frequency look-up table might be used based on the analysis above. Corpus process Step 5: This makes intuitive sense. Future Work My next steps will be: Sample Summary A summary for the sample can be seen on the table below. In this case, we created four different N-grams as follows: I made a wordcloud.

Milestone Report for Coursera Data Science Specialization SwiftKey Capstone

A possible method of prediction is to use the 4-gram model to find the most likely next word first. The specific sections are as follows: It is assumed that the data has been downloaded, unzipped and placed into the active R directory, maintaining the folder structure. The numbers have been calculated by using the wc command.

Each of these N-grams is transformed into a two column dataframe containing the following columns: