Natural Language Processing

With the growing importance of the Web and other text-heavy application areas, demand for and interest in both text mining and natural language processing (NLP) have been rising. We focus on solutions to the data deluge problem that replace or supplement the human reader with automatic systems undeterred by the text explosion.

We develop software that analyses large collections of documents to discover previously unknown information. The information might be relationships or patterns that are buried in the document collection and which would otherwise be extremely difficult, if not impossible, to discover.

Some of our work in NLP includes:


Sentiment Analysis on User Reviews through Lexicon and Rule-based Approach

Computers need data and humans need information. Converting data into useful information requires analysis. Customer reviews are valuable because they are an important source of data for multiple purposes. However, this feedback is subjective, so extracting information from it is not an easy task. This paper presents a different method of sentiment analysis on reviews. The main focus is mining data from multiple trustworthy sites and categorizing it. The results are efficient and better than those of several available approaches. The paper concludes with recommendations and future work that give a new direction to ontology-based Opinion Mining.
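
As a rough illustration of what a lexicon and rule-based scorer can look like, here is a minimal sketch. The toy lexicon, the negator list and the single negation rule are assumptions made for this example and are not taken from the paper.

```python
# Hypothetical toy lexicon; a real system would use a curated resource.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATORS = {"not", "never", "no"}

def score_review(text):
    """Sum lexicon scores, flipping the sign of the word right after a negator."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    total = 0.0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(tok, 0.0)
        if negate:
            value = -value
        negate = False  # the rule only affects the immediately following word
        total += value
    return total

def classify(text):
    """Map the summed score to a coarse sentiment label."""
    score = score_review(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For instance, classify("The food was not good") applies the negation rule to "good" and yields a negative label.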

A Novel Approach for Searching Linguistic Synonyms through Parts of Speech Tagging

Synonym-based searching is recognized to be a complicated problem, as text mining from unstructured web data is challenging. Finding useful information that matches the user's need among the bulk of web pages is a cumbersome task. In this paper, a novel and practical synonym retrieval technique is proposed to address this problem. For the replacement of semantics, user intent is taken into consideration. To realize the technique, patterns are generated with the help of Parts-of-Speech tagging and Web Scraping. Two approaches were built, i.e. Non-Context Based Searching and Context-Based Searching; the latter proved more efficient than the former in dealing with intent-based linguistic semantics. The paper concludes with recommendations and future work by giving a new direction to natural language

Using Distributional Semantics for Automatic Taxonomy Creation

This paper explains the construction of taxonomies of specialized domains using a language-independent, statistical methodology. The methodology relies on the distributional semantics of terms. The algorithm captures term co-occurrence in large corpora. In the first step, the terms’ syntagmatic relations are analyzed, which provides the seed terms that form the basis for taxonomy construction. The results include a list of hypernym candidates for each seed term. The second step involves the analysis of the paradigmatic relations of the terms, that is, the relation between a hypernym term and its co-occurring term. The results of the second step are more refined and appropriate hypernym lists. In the final step, the taxonomy is constructed using the resulting hypernym lists. The terms are connected asymmetrically to the taxonomy at a specific depth. The proposed idea is discussed and demonstrated on a sample corpus to show its effectiveness. The recall and precision of the proposed algorithm are 78.6% and 79.8% respectively. The proposed algorithm significantly improves the quality of the results.
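
The co-occurrence statistics that a distributional approach starts from can be sketched as follows. This only counts how often two terms share a sentence; the whitespace tokenization and the sentence-level window are assumptions for illustration, and the sketch is not the paper's hypernym extraction algorithm.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count, for each unordered term pair, the sentences containing both terms."""
    counts = Counter()
    for sentence in sentences:
        # Deduplicate and sort so each pair has one canonical key per sentence.
        terms = sorted(set(sentence.lower().split()))
        for first, second in combinations(terms, 2):
            counts[(first, second)] += 1
    return counts

# Tiny hypothetical corpus.
corpus = ["cat is animal", "dog is animal", "cat chases dog"]
counts = cooccurrence_counts(corpus)
```

Pairs that co-occur often, such as ("animal", "is") in this toy corpus, are the raw material from which candidate relations would then be filtered.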

Multi-Objective Model Selection (MOMS) based Semi-Supervised Framework for Sentiment Analysis

Sentiment analysis has emerged as an active research field due to the rapid growth of user-generated content on the Internet. This research area analyzes the opinions and attitudes of the masses toward products, movies, topics, individuals, and services. Various machine learning and text mining algorithms have been used for sentiment analysis and classification. Recent research concludes that domain-specific lexicons perform significantly better than domain-independent lexicons. The proposed research aims at improving the performance of general-purpose lexicons using machine learning algorithms. A semi-supervised framework based on “MOMS” is introduced to determine the feature weights by incorporating SentiWordNet, a well-known general-purpose sentiment lexicon. The feature weights are learned by a support vector machine, and the classification performance is enhanced by a Multi-Objective Model Selection procedure. A subjectivity criterion is used to select the desired features, and the effects of feature selection with respect to their part-of-speech information are studied comprehensively. Experimental evaluation is performed on seven different benchmark datasets, including the Large movie review dataset, the Multi-domain sentiment dataset, and the Cornell movie review dataset. The proposed approach is compared with state-of-the-art techniques, lexicon-based approaches, and other methods for sentiment analysis. The proposed framework achieves high performance when compared to other research in this field.

SWIMS: Semi-Supervised Subjective Feature Weighting and Intelligent Model Selection for Sentiment Analysis

Sentiment Analysis, also called Opinion Mining, is currently one of the most studied research fields. Its aim is to analyze the public’s sentiments, opinions, attitudes, etc. towards different elements such as topics, products, individuals, organizations, or services. Sentiment classification can be achieved by machine learning or lexicon-based methodologies or a combination of both. In an effort to improve the performance of domain-independent lexicons, this research combines machine learning with a lexicon-based approach, introducing a new framework called SWIMS to determine the feature weights based on a well-known general-purpose sentiment lexicon, SentiWordNet. A support vector machine is used to learn the feature weights, and an intelligent model selection approach is employed to enhance the classification performance. The features are selected based on their subjectivity, and the effects of feature selection with respect to their part-of-speech information are studied extensively. Seven benchmark datasets have been used in this research, including the large movie review dataset, the multi-domain sentiment dataset and the Cornell movie review dataset, all of which are available online. An in-depth performance comparison is conducted with state-of-the-art machine learning approaches and lexicon-based methodologies. The evaluation of performance measures proves that the proposed framework outperforms other techniques for sentiment analysis.
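
Subjectivity-based feature selection can be illustrated with a small sketch. SentiWordNet assigns each entry a positive and a negative score, with the remainder counted as objective, so one plausible subjectivity measure is the sum of the two. The scores, the threshold of 0.25 and the function names below are hypothetical and are not taken from the paper.

```python
# Hypothetical (positive, negative) scores per term/part-of-speech pair.
SCORES = {
    ("excellent", "a"): (0.875, 0.0),
    ("table", "n"): (0.0, 0.0),
    ("awful", "a"): (0.0, 0.75),
    ("fine", "a"): (0.125, 0.0),
}

def subjectivity(pos_score, neg_score):
    """Subjective mass of an entry: whatever is not objective."""
    return pos_score + neg_score

def select_subjective(scores, threshold=0.25):
    """Keep the term/POS pairs whose subjectivity reaches the threshold."""
    return {
        key
        for key, (pos_score, neg_score) in scores.items()
        if subjectivity(pos_score, neg_score) >= threshold
    }
```

With these toy scores, only ("excellent", "a") and ("awful", "a") survive the filter; the retained pairs would then be the features passed on to the classifier.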

URWF: User Reputation based Weightage Framework for Twitter Micropost Classification

Sentiment analysis is an emerging field that helps in understanding the sentiments of users on microblogging sites. Researchers have proposed many sentiment analysis techniques that classify and analyze the sentiments of microposts posted by various users. Most of these techniques perform text-based classification, which does not allow the impact of a micropost to be predicted. Further, it is very difficult to analyze the huge volume of online content produced each day. Therefore, an effective technique for sentiment analysis is required that not only performs precise text-based classification but also makes the analysis easier by reducing the volume of data. Moreover, micropost impact must also be determined in order to separate the high impact microposts in the corpus. In the present study, we present a sentiment analysis framework that can incorporate any text-based classification and separates high impact microposts from low impact ones by calculating a user reputation factor. This user reputation is calculated from multiple factors regarding user activities and may help organizations learn customer opinions and views about their products and services. In this way, the volume of data to be analyzed becomes small, since only microposts posted by high impact users are considered. Multiple text classification classes are introduced, instead of just positive, negative and neutral, for precise sentiment classification. The proposed framework also calculates the accumulated weight of each micropost by multiplying the user reputation by the assigned sentiment score. The user reputation calculation factors are validated using the Spearman rho and Kendall tau correlation coefficients. The framework is further evaluated using the Sanders topic-based corpus, and the results are presented.
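
The accumulated weight described above is a simple product, which the following sketch makes concrete. The field names, the reputation scale and the filtering threshold are assumptions for illustration; how the reputation itself is computed is not reproduced here.

```python
def accumulated_weight(user_reputation, sentiment_score):
    """Weight of a micropost: user reputation multiplied by its sentiment score."""
    return user_reputation * sentiment_score

def high_impact(microposts, threshold):
    """Keep only microposts whose author reputation reaches the threshold."""
    return [post for post in microposts if post["reputation"] >= threshold]

# Hypothetical microposts with precomputed reputation and sentiment score.
posts = [
    {"id": 1, "reputation": 0.5, "score": 2.0},
    {"id": 2, "reputation": 0.25, "score": -2.0},
    {"id": 3, "reputation": 0.75, "score": -1.0},
]
selected = high_impact(posts, threshold=0.5)
weights = {p["id"]: accumulated_weight(p["reputation"], p["score"]) for p in selected}
```

Filtering first and weighting afterwards mirrors the idea of shrinking the data before analysis: only posts 1 and 3 are weighted in this toy example.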

A Semi-Supervised Approach to Sentiment Analysis using Revised Sentiment Strength based on SentiWordNet

An immense amount of data has become available with the advent of social media in the last decade. This data can be used for sentiment analysis and decision making. The data present on blogs, news/review sites, social networks, etc. is so enormous that manual labeling is not feasible and an automatic approach is required for its analysis. The sentiment of the masses can be understood by analyzing this large-scale, opinion-rich data. The major issues in the application of automated approaches are data unavailability, data sparsity, domain independence and inadequate performance. This research proposes a semi-supervised sentiment analysis approach that combines a lexicon-based methodology with machine learning in order to improve sentiment analysis performance. Mathematical models such as Information Gain and Cosine Similarity are employed to revise the sentiment scores defined in SentiWordNet. This research also emphasizes the importance of nouns and employs them as semantic features along with other parts of speech. The evaluation of performance measures and the comparison with state-of-the-art techniques prove that the proposed approach is superior.
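
For readers unfamiliar with Cosine Similarity, here is a minimal sketch over sparse term vectors stored as dictionaries. How the paper applies the measure to revise SentiWordNet scores is not described in this summary, so the example shows only the measure itself; the vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty vectors
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors.
doc_a = {"good": 1, "film": 1}
doc_b = {"good": 1}
doc_c = {"boring": 2}
```

Vectors with no shared terms, such as doc_a and doc_c, have similarity 0, while identical vectors have similarity 1.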

eSAP: A Decision Support Framework for Enhanced Sentiment Analysis and Polarity Classification

Sentiment analysis or opinion mining is an important research area of natural language processing. It is used to determine the writer’s attitude or speaker’s opinion towards a particular person, product or topic. Polarity or subjectivity classification is the process of categorizing a piece of text into positive or negative classes. In recent years, various supervised and unsupervised methods have been presented to accomplish sentiment polarity detection. SentiWordNet (SWN) has been extensively used as a lexical resource for opinion mining. This research uses SWN as the labeled training corpus, where the sentiment scores are extracted based on the part-of-speech information. A vocabulary, SWN-V, with revised sentiment scores generated from SWN, is then used for the Support Vector Machine model learning and classification process. Based on this vocabulary, a framework named “Enhanced Sentiment Analysis and Polarity Classification (eSAP)” is proposed. Training, testing and evaluation of the proposed eSAP are conducted on seven benchmark datasets from various domains. The 10-fold cross-validated accuracy, precision, recall, and f-measure results for the proposed framework, averaged over the seven datasets, are 80.82%, 80.83%, 80.94% and 80.81% respectively. A notable average performance improvement of 13.4% in accuracy, 14.2% in precision, 6.9% in recall and 11.1% in f-measure is observed when evaluating the proposed eSAP against the baseline SWN classifier. A comparison with state-of-the-art methods is also conducted, which verifies the superiority of the proposed eSAP framework.
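
The four reported measures follow the standard definitions for binary classification, which the sketch below computes from true and predicted labels. The label names and the example data are hypothetical; the sketch does not reproduce the paper's evaluation protocol or its cross-validation.

```python
def binary_metrics(y_true, y_pred, positive="pos"):
    """Accuracy, precision, recall and f-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Hypothetical labels: two hits, one miss, one false alarm.
truth = ["pos", "pos", "pos", "neg"]
predicted = ["pos", "pos", "neg", "pos"]
metrics = binary_metrics(truth, predicted)
```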

Lexicon based Semantic Detection of Sentiments using Expected Likelihood Estimate Smoothed Odds Ratio

Sentiment analysis is an active research area today due to the abundance of opinionated data present on online social networks. Semantic detection is a sub-category of sentiment analysis which deals with the identification of sentiment orientation in any text. Many sentiment applications rely on lexicons to supply features to a model. Various machine learning algorithms and sentiment lexicons have been proposed in research in order to improve sentiment categorization. Supervised machine learning algorithms and domain-specific sentiment lexicons generally perform better than unsupervised or semi-supervised domain-independent lexicon-based approaches. The core hindrance in the application of supervised algorithms or domain-specific sentiment lexicons is the unavailability of sentiment-labeled training datasets for every domain. On the other hand, the performance of algorithms based on general-purpose sentiment lexicons needs improvement. This research is focused on building a general-purpose sentiment lexicon in a semi-supervised manner. The proposed lexicon defines word semantics based on the Expected Likelihood Estimate Smoothed Odds Ratio, which is then combined with a supervised machine learning based model selection approach. A comprehensive performance comparison verifies the superiority of our proposed approach.
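
One common reading of an Expected Likelihood Estimate (ELE) smoothed odds ratio is sketched below: ELE adds 0.5 to every count before normalizing, and the log odds ratio then compares a word's smoothed probability in positive versus negative text. The exact formulation used in the paper may differ; the function names and the counts are assumptions for illustration.

```python
import math

def ele_probability(count, total, vocab_size):
    """Expected Likelihood Estimate: add 0.5 to each of vocab_size counts."""
    return (count + 0.5) / (total + 0.5 * vocab_size)

def log_odds_ratio(pos_count, neg_count, pos_total, neg_total, vocab_size):
    """Log odds ratio of a word between positive and negative text."""
    p = ele_probability(pos_count, pos_total, vocab_size)
    q = ele_probability(neg_count, neg_total, vocab_size)
    return math.log((p * (1 - q)) / ((1 - p) * q))

# Hypothetical counts: a word seen 10 times in positive text, never in negative.
positive_leaning = log_odds_ratio(10, 0, 100, 100, 20)
negative_leaning = log_odds_ratio(0, 10, 100, 100, 20)
balanced = log_odds_ratio(5, 5, 100, 100, 20)
```

The smoothing matters for the zero count: without it the ratio would divide by zero, whereas here an unseen word still receives a finite score.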

Senti-CS: Building a Lexical Resource for Sentiment Analysis using the Subjective Feature Selection and Normalized Chi-Square based Feature Weight Generation

Sentiment analysis involves the detection of the sentiment content of text using natural language processing. Natural language processing is a very challenging task due to syntactic ambiguities, named entity recognition, and the use of slang, jargon, sarcasm, abbreviations and contextual sensitivity. Sentiment analysis can be performed using supervised as well as unsupervised approaches. As the amount of data grows, unsupervised approaches become vital as they cut down on the learning time and remove the requirement for a labelled dataset. Sentiment lexicons make it easy to apply unsupervised algorithms to text classification. SentiWordNet is a lexical resource widely employed by researchers for sentiment analysis and polarity classification. However, the reported performance levels need improvement. The proposed research is focused on raising the performance of SentiWordNet3.0 by using it as a labelled corpus to build another sentiment lexicon, named Senti-CS. The part-of-speech information, usage-based ranks and sentiment scores are used to calculate a Chi-Square-based feature weight for each unique subjective term/part-of-speech pair extracted from SentiWordNet3.0. This weight is then normalized to a range of −1 to +1 using min–max normalization. A Senti-CS-based sentiment analysis framework is presented and applied to a large dataset of 50000 movie reviews. The results are then compared with the baseline SentiWordNet, Mutual Information and Information Gain techniques. A state-of-the-art comparison is performed on the Cornell movie review dataset. The analysis of the results indicates that the proposed approach outperforms state-of-the-art classifiers.
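
The min–max normalization step can be sketched as a linear rescaling of the raw weights onto the target range. The raw weights below are hypothetical, and the Chi-Square computation that would produce them is not shown.

```python
def min_max_normalize(weights, low=-1.0, high=1.0):
    """Linearly rescale weights so the minimum maps to low and the maximum to high."""
    smallest = min(weights.values())
    largest = max(weights.values())
    if largest == smallest:
        # Degenerate case: all weights equal, so map them to the midpoint.
        return {key: (low + high) / 2 for key in weights}
    scale = (high - low) / (largest - smallest)
    return {key: low + (value - smallest) * scale for key, value in weights.items()}

# Hypothetical raw weights per term/part-of-speech pair.
raw = {("bad", "a"): 0.0, ("okay", "a"): 5.0, ("good", "a"): 10.0}
normalized = min_max_normalize(raw)
```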

SentiMI: Introducing Point-wise Mutual Information with SentiWordNet to Improve Sentiment Polarity Detection

Supervised learning has attracted much attention in recent years. As a consequence, many of the state-of-the-art algorithms are domain dependent, as they require a labeled training corpus to learn the domain features. This requires labeled corpora, and creating them is a cumbersome task in itself. However, for text sentiment detection SentiWordNet (SWN) may be used. It is a vocabulary where terms are arranged in synonym groups called synsets. This research makes use of SentiWordNet and treats it as the labeled corpus for training. A sentiment dictionary, SentiMI, is built from the mutual information calculated from these terms. A complete framework is developed by using feature selection and extracting mutual information from SentiMI for the selected features. Training, testing and evaluation of the proposed framework are conducted on a large dataset of 50,000 movie reviews. A notable performance improvement of 7% in accuracy, 14% in specificity and 8% in f-measure is achieved by the proposed framework compared to the baseline SentiWordNet classifier. A comparison with state-of-the-art classifiers is also performed on the widely used Cornell Movie Review dataset, which also demonstrates the effectiveness of the proposed approach.
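
Point-wise mutual information between a term and a sentiment class can be computed from co-occurrence counts, as in the following sketch. It uses the standard definition PMI(t, c) = log2(P(t, c) / (P(t) P(c))); the counts are hypothetical, and how SentiMI derives its counts from SentiWordNet is not reproduced here.

```python
import math

def pmi(joint_count, term_count, class_count, total):
    """PMI in bits, with probabilities estimated as count / total."""
    if joint_count == 0:
        return float("-inf")  # the term never occurs with the class
    return math.log2(joint_count * total / (term_count * class_count))

# Hypothetical counts out of 100 observations, 50 of them positive.
independent = pmi(joint_count=10, term_count=20, class_count=50, total=100)
always_positive = pmi(joint_count=20, term_count=20, class_count=50, total=100)
```

A score of 0 means the term and the class are statistically independent, while a term that appears only with the positive class scores 1 bit in this example.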

Building Normalized SentiMI to enhance semi-supervised sentiment analysis

Sentiment analysis and polarity detection is a type of text classification where natural language opinion is analyzed in order to classify it into either positive or negative categories. Classifying text into sentiment labels is a very difficult task, as opinions expressed in natural language may contain abbreviations, slang, sarcasm, irony and/or idioms. The proposed research focuses on the use of SentiWordNet3.0 as a labeled corpus for training purposes. We present a complete framework based on a dictionary named Normalized SentiMI (nSentiMI), which is created by calculating point-wise mutual information for each term/part-of-speech pair extracted from SentiWordNet. The proposed framework is applied to a dataset of 50,000 movie reviews to identify the value of a weight factor α and then evaluated on an unseen test dataset of 2000 movie reviews. A comparison with state-of-the-art techniques also confirms the superiority of the proposed approach.

SentiView: A Visual Sentiment Analysis Framework

In the past few years, micro-blogging platforms such as Twitter have become some of the most popular online social networks. Different opinions and news about various aspects and occasions can be shared using these micro-blogging platforms. Twitter is therefore considered a rich source of data that can be used for different text analysis and decision making tasks. The main focus of sentiment analysis is the classification of text into positive, negative or neutral feelings based on its polarity. The opinions and thoughts in Twitter feeds can be expressed in any language. Previous techniques in the field of sentiment analysis have limitations such as low accuracy, difficulty with sarcasm, and incorrect classification of tweets. The proposed research focuses on these difficulties and presents a framework for the sentiment detection of Twitter feeds which achieves high accuracy and real-time performance. Various pre-processing steps are applied to Twitter feeds to refine them before sentiment classification. The pre-processing replaces slang and abbreviations with complete words. Three different classification techniques are then used: emoticon analysis, Bag of Words and SentiWordNet. The experimental evaluation confirms that the proposed algorithm increases the precision, recall, f-measure and, most importantly, accuracy when compared with other similar techniques.
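
Two of the steps mentioned above, replacing slang with complete words and emoticon analysis, can be sketched as simple dictionary lookups. The slang table, the emoticon table and the whitespace tokenization are assumptions for illustration and are not the framework's actual resources.

```python
# Hypothetical lookup tables.
SLANG = {"gr8": "great", "u": "you", "thx": "thanks"}
EMOTICONS = {":)": 1, ":-)": 1, ":D": 1, ":(": -1, ":-(": -1}

def expand_slang(tweet):
    """Replace known slang or abbreviation tokens with complete words."""
    return " ".join(SLANG.get(token.lower(), token) for token in tweet.split())

def emoticon_polarity(tweet):
    """Label a tweet from the summed polarity of its emoticons."""
    score = sum(EMOTICONS.get(token, 0) for token in tweet.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, expand_slang("thx u r gr8") returns "thanks you r great", and a tweet without any known emoticon is labeled neutral, which is where the other classifiers would take over.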

TOM: Twitter opinion mining framework using hybrid classification scheme

Twitter has recently become one of the most popular micro-blogging platforms. Millions of users can share their thoughts and opinions about different aspects and events on the micro-blogging platform. Therefore, Twitter is considered a rich source of information for decision making and sentiment analysis. Sentiment analysis refers to a classification problem where the main focus is to predict the polarity of words and then classify them into positive and negative feelings, with the aim of identifying attitudes and opinions that are expressed in any form or language.

Sentiment analysis over Twitter offers organisations a fast and effective way to monitor the public’s feelings towards their brand, business, directors, etc. A wide range of features and methods for training sentiment classifiers on Twitter datasets have been researched in recent years, with varying results. The primary issues in previous techniques are classification accuracy, data sparsity and sarcasm, as they misclassify most tweets, with a very high percentage incorrectly classified as neutral. This research paper focuses on these problems and presents an algorithm for classifying Twitter feeds based on a hybrid approach. The proposed method includes various pre-processing steps before feeding the text to the classifier. Experimental results show that the proposed technique overcomes the previous limitations and achieves higher accuracy when compared to similar techniques.

Related Publication:

Khan, Farhan Hassan, Saba Bashir, and Usman Qamar. “TOM: Twitter opinion mining framework using hybrid classification scheme.” Decision Support Systems 57 (2014): 245-257.