Best Online Data Science Experts Help

  • Data Preprocessing
    The textual data we collect is often messy and unprocessed. The messier the data we train our model on, the poorer the model’s prediction accuracy will be. For this reason, preprocessing is one of the most important steps in the process. Cleaning textual data is quite different from cleaning other kinds of data, since it consists of vocabulary words from a language rather than numbers. There are various preprocessing steps one can follow for better results; some of these are described below:
    1: Tokenization
    Tokenization is the process of splitting a textual corpus into a set of words or sentences known as tokens. A corpus split into words is called word tokenization, and one split into sentences is known as sentence tokenization. There are many libraries in Python as well as in R which handle tokenization, for example keras, sklearn, textblob, and preprocessor.
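As a rough sketch of the idea only, both kinds of tokenization can be approximated with regular expressions. The rules below are deliberate simplifications (real tokenizers such as NLTK's handle contractions, abbreviations, and punctuation far more carefully):

```python
import re

def word_tokenize(text):
    # Minimal word tokenizer: grab runs of letters, digits, and apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text)

def sent_tokenize(text):
    # Naive sentence tokenizer: split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

text = "Tokenization is easy. Isn't it? Let's try!"
print(word_tokenize(text))   # word tokens
print(sent_tokenize(text))   # sentence tokens
```

This naive splitter would break on abbreviations like "e.g." or "Dr.", which is exactly why dedicated libraries are preferred in practice.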
    2: Remove punctuation
    Normally, when we extract text from an online source, say Twitter, the data contains a lot of punctuation, since people tend to build emoticons out of punctuation symbols. People also sometimes use unnecessary punctuation everywhere, for example stringing many full-stop symbols together to form a line. All of this needs to be removed from the dataset, as it is useless and makes no sense. We can remove these punctuation marks simply by using a regex library.
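A minimal sketch of this step using Python's built-in re and string modules (the sample tweet is made up for illustration):

```python
import re
import string

def remove_punctuation(text):
    # Delete every character listed in string.punctuation.
    # re.escape makes the character class safe to embed in a regex.
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text)

tweet = "Wow!!! This is great :-) ... #nlp"
cleaned = remove_punctuation(tweet)
cleaned = " ".join(cleaned.split())  # collapse the leftover whitespace
print(cleaned)
```

Note that stripping punctuation leaves gaps behind, so collapsing whitespace afterwards (as above) is usually a good idea.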
    3: Stemming
    A stemming algorithm is a process of linguistic normalisation, in which the variant forms of a word are reduced to a common form. For example, “connected”, “connecting”, and “connections” all reduce to the stem “connect”.
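To make the idea concrete, here is a toy suffix-stripping stemmer. It is only an illustration of the principle: a real stemmer such as NLTK's PorterStemmer applies an ordered set of rules with conditions on the remaining stem.

```python
def simple_stem(word):
    # Toy stemmer: strip the first matching suffix, longest first,
    # but only if a reasonably long stem (>= 3 chars) remains.
    for suffix in ("ions", "ing", "ion", "ed", "es", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connected", "connecting", "connection", "connections"]:
    print(simple_stem(w))  # all four reduce to "connect"
```

The stem produced need not be a dictionary word; stemming only guarantees that variant forms map to the same string.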
    4: Lemmatization
    Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming, but it brings context to the words, linking words with similar meaning to one word. Text preprocessing includes both stemming and lemmatization, and many people find the two terms confusing; some even treat them as the same. In practice, lemmatization is often preferred over stemming because it performs a morphological analysis of the words. Lemmatization can be done in Python using the NLTK framework.
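The key difference from stemming is that a lemmatizer consults a morphological lexicon rather than chopping suffixes. The sketch below fakes that lexicon with a tiny hand-made lookup table (the table entries are illustrative assumptions, not a real lexicon — NLTK's WordNetLemmatizer uses the full WordNet database):

```python
# Toy lookup table standing in for a real morphological lexicon.
LEMMA_TABLE = {
    "better": "good",   # a stemmer could never recover this mapping
    "geese": "goose",   # irregular plural
    "ran": "run",       # irregular past tense
    "running": "run",
}

def lemmatize(word):
    # Return the dictionary form (lemma) if known, else the word itself.
    return LEMMA_TABLE.get(word, word)

print([lemmatize(w) for w in ["geese", "ran", "better", "cats"]])
```

Irregular forms like "better" → "good" show why lemmatization needs a lexicon: no suffix-stripping rule could produce that result.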
    5: Removing html
    Sometimes, when we extract data from a web page or through an API, the data we receive contains many HTML tags such as <h1></h1> or <p></p>. These tags add no meaning to our data, so they need to be removed.

  • Feature Selection
    Choosing the correct features is directly linked to how well an algorithm will perform. Different researchers have used different features for different problems, so choosing the right features for a given problem is very important.
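The HTML-tag removal step described above can be sketched with Python's standard-library html.parser, which keeps only the text content between tags:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects text content while discarding every HTML tag."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are skipped.
        self.chunks.append(data)

def strip_html(raw):
    parser = TagStripper()
    parser.feed(raw)
    return "".join(parser.chunks)

print(strip_html("<h1>Title</h1><p>Some <b>bold</b> text.</p>"))
```

A parser-based approach is safer than a regex like `<[^>]+>`, which can be fooled by angle brackets inside attribute values.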
    After selecting the features, they need to be vectorized, since the model only understands numbers. This can be done in a number of ways; some recommended methods include:
    1 : Using a CountVectorizer (sklearn.feature_extraction.text.CountVectorizer)
    CountVectorizer is part of Python’s sklearn package. It provides a simple way to tokenize a collection of text documents and build a vocabulary of known words, and also to encode new documents using that vocabulary.
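To show what CountVectorizer does internally, here is a from-scratch sketch of the same fit/transform idea using only the standard library (real CountVectorizer uses smarter tokenization and sparse matrices; this version naively splits on whitespace):

```python
from collections import Counter

def fit_vocabulary(docs):
    # Build a sorted vocabulary of all words in the corpus,
    # mapping each word to a column index (what .fit() does).
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def transform(docs, vocab):
    # Encode each document as a vector of word counts (what .transform() does).
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(w, 0) for w in vocab])
    return rows

docs = ["the cat sat", "the cat sat on the mat"]
vocab = fit_vocabulary(docs)
print(sorted(vocab))              # ['cat', 'mat', 'on', 'sat', 'the']
print(transform(docs, vocab))     # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Note that a word unseen during fitting simply gets a zero count when transforming a new document, which matches CountVectorizer's behaviour.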
    2 : Using a TFIDF vectorizer(sklearn.feature_extraction.text.TfidfVectorizer)
    Tf-idf stands for term frequency-inverse document frequency. The tf-idf weight is often used in information retrieval and text mining; it is a statistical measure of how important a word is to a document in a collection or corpus.
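A minimal sketch of the textbook formula, tf(t, d) × log(N / df(t)), using only the standard library. Note that sklearn's TfidfVectorizer uses a smoothed idf and L2-normalises each row, so its numbers will differ slightly from this version:

```python
import math
from collections import Counter

def tfidf(docs):
    # Textbook tf-idf: weight(t, d) = tf(t, d) * log(N / df(t)),
    # where df(t) is the number of documents containing term t.
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(w for doc in tokenized for w in set(doc))
    vocab = sorted(df)
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, rows

vocab, weights = tfidf(["the cat sat", "the dog sat"])
print(vocab)    # ['cat', 'dog', 'sat', 'the']
print(weights)
```

Words that appear in every document ("the", "sat" above) get a weight of exactly zero, which is the point of the idf factor: ubiquitous words carry no discriminative information.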
