Data Science

Data mining and machine learning are key technologies for advanced data analytics. We are developing practical data mining processes and also seek to create new business opportunities in which data analysis and knowledge discovery are crucial factors.

Our main focus is on:

  1. Theoretical research in data mining and machine learning
  2. Development of data mining software (i.e., data mining engines)
  3. Establishment of data mining processes for business applications

The first activity is research into the fundamental principles needed to invent novel data mining algorithms. We study core elements such as data modeling and optimization from mathematical and statistical viewpoints. The second is the development of efficient and scalable data mining algorithms: we review the results of the first activity in terms of computational, memory, and maintenance efficiency, and implement them as software libraries. The last is the technology transfer of these research and development results to real-world business applications.

Some of our work includes:

Imparting Data Knowledge in discrete data volumes using crowded agent approach for multi-perspective and visualized big data

The modern world faces the issues and concerns of business intelligence. Methodologies and techniques have been developed to facilitate the process of business analysis and comprehension, and one such scientific field focuses on preparing intelligent data before it can be used for intelligent analysis. The current volume of information is huge, and the analysis tasks built on it present a complex situation. These challenges can be handled by using the right and optimal techniques from artificial intelligence. This paper focuses on a multi-agent architecture that turns raw and inconsistent data into data intelligence and opportunities. The multi-agent system (MAS) technique is used to speed up data processing and to impart data with knowledge of its own.

A Rough Set Based Feature Selection Approach Using Random Feature Vectors

Feature selection is the process of selecting a subset of features that provides the maximum of the information otherwise present in the entire dataset. The process is especially helpful when the input to tasks such as classification, clustering, and rule extraction is large. Rough Set Theory, right from its emergence, has been widely used for feature selection due to its analysis-friendly nature, and various approaches exist in the literature for this purpose. However, the majority of them are computationally expensive and suffer from a significant performance bottleneck. In this paper we propose a new feature selection approach based on rough set theory that uses a random feature vector generation method. The proposed approach is a two-step method: first, it generates a random feature vector and verifies its suitability as a potential candidate solution; if it fulfills the criteria, it is selected and optimized, otherwise a new subset is formed. The proposed approach was verified using five publicly available datasets. Results show that the proposed approach is computationally more efficient and produces optimal results.
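
The two-step idea can be illustrated with a minimal sketch: draw random feature subsets, keep one whose rough-set dependency matches that of the full feature set, and then prune redundant features from it. The dependency function below is an illustrative reconstruction using pandas grouping, not the paper's actual implementation, and the subset-sampling probability and retry limit are arbitrary.

```python
import random
import pandas as pd

def dependency(df, features, decision):
    """Rough-set dependency: fraction of rows whose condition-attribute
    values determine the decision value unambiguously (positive region)."""
    if not features:
        return 0.0
    consistent = df.groupby(list(features))[decision].transform("nunique") == 1
    return consistent.sum() / len(df)

def random_vector_selection(df, decision, max_tries=100, seed=0):
    rng = random.Random(seed)
    features = [c for c in df.columns if c != decision]
    target = dependency(df, features, decision)   # dependency of the full set
    for _ in range(max_tries):
        # Step 1: draw a random feature vector (random subset)
        subset = [f for f in features if rng.random() < 0.5]
        if dependency(df, subset, decision) < target:
            continue                              # not a candidate; draw again
        # Step 2: optimize -- drop features whose removal preserves dependency
        for f in list(subset):
            if dependency(df, [g for g in subset if g != f], decision) >= target:
                subset.remove(f)
        return subset
    return features                               # fall back to all features
```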

An incremental dependency calculation technique for feature selection using rough sets

In many fields, such as data mining, machine learning and pattern recognition, datasets containing large numbers of features are often involved. In such cases, feature selection is necessary. Feature selection is the process of selecting a feature subset on behalf of the entire dataset for further processing. Recently, rough set-based approaches, which use attribute dependency to carry out feature selection, have been prominent. However, this dependency measure requires the calculation of the positive region, which is a computationally expensive task. In this paper, we propose a new concept called the "Incremental Dependency Class" (IDC), which calculates the attribute dependency without using the positive region. IDCs define the change in attribute dependency as we move from one record to another. IDCs, by avoiding the positive region, can be an ideal replacement for the conventional dependency measure in feature selection algorithms, especially for large datasets. Experiments conducted using various publicly available datasets from the UCI repository have shown that calculating dependency using IDCs reduces the execution time by 54%, while in the case of feature selection algorithms using IDCs, the execution time was reduced by almost 66%. Overall, a 68% decrease in required runtime memory was also found.
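
A minimal sketch of the record-by-record idea is given below: rather than materializing the positive region, it walks the records once and tracks, for each combination of condition-attribute values, whether the decision value has stayed consistent. The function name, record layout, and toy data are illustrative assumptions, not the paper's code.

```python
from collections import defaultdict

def incremental_dependency(records, condition_attrs, decision_attr):
    """Single-pass dependency estimate: update per-class consistency flags
    as each record arrives instead of computing the positive region."""
    seen_decision = {}               # condition tuple -> first decision value
    consistent = {}                  # condition tuple -> still consistent?
    counts = defaultdict(int)        # condition tuple -> number of records
    for rec in records:
        key = tuple(rec[a] for a in condition_attrs)
        counts[key] += 1
        if key not in seen_decision:
            seen_decision[key] = rec[decision_attr]
            consistent[key] = True
        elif rec[decision_attr] != seen_decision[key]:
            consistent[key] = False  # this equivalence class became inconsistent
    total = sum(counts.values())
    in_positive_region = sum(n for k, n in counts.items() if consistent[k])
    return in_positive_region / total if total else 0.0

# Illustrative usage on a toy dataset (dictionaries standing in for records)
data = [
    {"outlook": "sunny", "windy": True,  "play": "no"},
    {"outlook": "sunny", "windy": True,  "play": "yes"},
    {"outlook": "rain",  "windy": False, "play": "yes"},
]
print(incremental_dependency(data, ["outlook", "windy"], "play"))  # 1/3
```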

Self Adapting and prioritizing database algorithm for providing big data insight in domain knowledge and processing of volume based instructions based on scheduled and contextual shifting of data

The modern world is not only about software and technology; as it advances, it is becoming more data-oriented and mathematical in nature. The volume of information that is brought in and processed is large and complex. Handling that volume does not simply mean using every single data point that is reported: the information needs to be sized down and understood according to the application at hand. Data size is one issue; the other is the knowledge or information that needs to be extracted from it in order to obtain purposeful meaning from the data. In-memory and column-oriented databases have provided viable and efficient solutions for optimizing query time and column compression. Beyond storing and retrieving data, the information world has stepped into big data, with terabytes of data arriving every second and a corresponding outflow of required responses. The world now needs both systems that store huge volumes of data efficiently and application-layer algorithms that are efficient enough to extract meaning from layered or topologically dependent data. This paper focuses on analyzing the column-store technique for managing the mathematical and scientific big data involved in multiple markets, using topological data meaning to analyze and understand information from adaptive database systems. For efficient storage, the column-oriented approach to big data analytics and its query layers are analyzed and optimized.

Evolutionary testing using particle swarm optimization in IOT applications

The Internet of Things (IoT) is emerging in a major way, connecting physical objects and managing their communications and interactions. These highly informative and data-intensive applications are critical to both create and manage. The research under consideration proposes an evolutionary algorithm that uses particle swarm optimization (PSO) to cover a wide search space matching the IoT data space. The testing search space contains particles that are candidate solutions to predicted errors, covering both encountered and un-encountered error possibilities. For each search space, particle speed and velocity moments are calculated and adjusted in perturbed iterations, depending on the expected level of discrepancy or on the influx of data change and correlation. This research implements the PSO algorithm for optimizing IoT applications over dynamic periods of time. IoT is the future and thus needs to be both protected and tested so that the more comprehensive advantages of IoT applications can be realized.
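
For reference, a plain PSO loop of the kind this work builds on looks roughly like the sketch below; the inertia and acceleration coefficients, the bounds, and the sphere fitness function are illustrative placeholders for an IoT test-generation objective, not the paper's actual setup.

```python
import numpy as np

def pso_minimize(fitness, dim, n_particles=30, iters=100,
                 w=0.7, c1=1.5, c2=1.5, bounds=(-10.0, 10.0), seed=0):
    """Plain particle swarm optimization: each particle keeps a velocity and a
    personal best, and is pulled toward the swarm's global best position."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.apply_along_axis(fitness, 1, pos)
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.apply_along_axis(fitness, 1, pos)
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Illustrative use: a sphere function standing in for a test-coverage objective
best, best_val = pso_minimize(lambda x: float(np.sum(x ** 2)), dim=5)
```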

A Bayesian Classifiers based Combination Model for Automatic Text Classification

Text classification deals with allocating a text document to a predetermined class. Generally, this involves learning about a class from representations of documents belonging to that class. In this paper, we propose a classifier combination that uses a Multinomial Naïve Bayes (MNB) classifier along with a Bayesian Network (BN) classifier. The results of the two classifiers are combined by taking the average of the probability distributions calculated by each of them. Feature extraction and selection techniques are incorporated into the model to find the most discriminating terms for classification. The classification model has been tested on three real text datasets. According to the experiments, this approach shows better performance, and its overall accuracy is higher than the accuracies of the two constituent classifiers. The technique also surpasses the accuracy of other well-known standard classifiers. This approach differs from previous classification techniques in that it successfully incorporates the MNB and BN classifiers and shows significantly better results than using either of the two classifiers separately. A comparative study of previous approaches against our method indicates a significant improvement over a number of techniques that were evaluated on the same dataset.
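
The averaging step can be sketched as follows. scikit-learn provides MultinomialNB but no Bayesian-network classifier, so BernoulliNB is used here purely as a stand-in for the BN component; the part that mirrors the paper is the combination rule of averaging the two predicted class distributions. The toy training texts are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

def train_combined(texts, labels):
    """Fit two probabilistic classifiers on the same features and combine
    them by averaging their predicted class distributions."""
    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    X = vec.fit_transform(texts)
    mnb = MultinomialNB().fit(X, labels)
    bn_stand_in = BernoulliNB().fit(X, labels)   # stand-in for the BN classifier
    def predict(new_texts):
        Xn = vec.transform(new_texts)
        proba = (mnb.predict_proba(Xn) + bn_stand_in.predict_proba(Xn)) / 2.0
        return mnb.classes_[np.argmax(proba, axis=1)]
    return predict

# Illustrative usage on toy data
predict = train_combined(["good movie", "bad plot", "great acting"], ["pos", "neg", "pos"])
print(predict(["good acting"]))
```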

Identification and Correction of Misspelled Drugs’ Names in Electronic Medical Records (EMR)

Medications are an important element of medical records, but they usually contain significant data errors. This situation may result from haphazard or careless storage of valuable information. In either case, misspelled data can cause serious health problems for patients and can put their lives at major risk. The correctness of medication data is therefore important so that potential harms can be identified and steps taken to prevent or mitigate them. In this paper, a novel and practical method is proposed for the automated detection and correction of spelling errors in electronic medical records (EMRs). To realize this technique, the major relevant aspects are taken into consideration with the help of part-of-speech tagging and regular expressions. The paper concludes with recommendations and future work for giving a new direction to the emendation of drug nomenclature.
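
A toy sketch of the detection-and-correction step is shown below. It uses a regular expression to pull candidate tokens and approximate string matching (difflib) against a small drug lexicon, rather than the paper's part-of-speech-tagging pipeline; the lexicon, token pattern, and similarity cutoff are illustrative assumptions.

```python
import re
import difflib

# Illustrative lexicon of correctly spelled drug names
DRUG_LEXICON = ["amoxicillin", "ibuprofen", "metformin", "omeprazole", "paracetamol"]

def correct_drug_names(note, cutoff=0.8):
    """Extract word tokens that could be drug mentions and snap each one to
    the closest lexicon entry if the similarity exceeds the cutoff."""
    corrections = {}
    for token in re.findall(r"[A-Za-z]{4,}", note):   # simple candidate-token pattern
        match = difflib.get_close_matches(token.lower(), DRUG_LEXICON, n=1, cutoff=cutoff)
        if match and match[0] != token.lower():
            corrections[token] = match[0]
    return corrections

print(correct_drug_names("Patient prescribed amoxicilin 500mg and paracetmol as needed."))
# {'amoxicilin': 'amoxicillin', 'paracetmol': 'paracetamol'}
```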

A hybrid feature selection approach based on heuristic and exhaustive algorithms using Rough set theory

A dataset may have many irrelevant and unnecessary features, which not only increase computational space but also lead to the critical phenomenon called the curse of dimensionality. The feature selection process aims to select some relevant features for further processing on behalf of the entire dataset. However, extracting such information is a non-trivial task, especially for large datasets. Many feature selection approaches have been proposed in the literature, and recently rough set based heuristic approaches have become prominent. However, these approaches do not ensure an optimal solution. In this paper, a hybrid approach for feature selection is proposed, based on a heuristic algorithm and exhaustive search. The heuristic algorithm finds an initial feature subset, which is then further optimized by exhaustive search. We use genetic algorithms and particle swarm optimization as preprocessors and relative dependency for optimization. Experiments show that our proposed approach is more effective and efficient than the conventional relative dependency based approach.
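
The relative dependency measure used in the optimization step can be sketched as follows: it is the number of equivalence classes induced by a feature subset divided by the number induced by the subset plus the decision attribute, and a candidate subset produced by the heuristic search is pruned for as long as this ratio stays at the level of the full attribute set. The code below is an illustrative reconstruction assuming the data sits in a pandas DataFrame; it is not the paper's implementation.

```python
import pandas as pd

def partitions(df, attrs):
    """Number of equivalence classes induced by a set of attributes."""
    return 1 if not attrs else df.groupby(list(attrs)).ngroups

def relative_dependency(df, subset, decision):
    """kappa(R) = |U/IND(R)| / |U/IND(R + {d})|; a subset behaves like a reduct
    when its relative dependency equals that of the full attribute set."""
    return partitions(df, subset) / partitions(df, list(subset) + [decision])

def refine(df, candidate, decision):
    """Exhaustive-style pruning of a candidate subset produced by a heuristic
    search (e.g. GA or PSO): drop any attribute whose removal keeps the
    relative dependency at the level of the full feature set."""
    full = [c for c in df.columns if c != decision]
    target = relative_dependency(df, full, decision)
    reduct = list(candidate)
    for a in list(reduct):
        trimmed = [b for b in reduct if b != a]
        if trimmed and relative_dependency(df, trimmed, decision) >= target:
            reduct = trimmed
    return reduct
```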

Rule Induction Using Enhanced RIPPER Algorithm for Clinical Decision Support System

With the availability of large amounts of data since the emergence of computers and the internet, data mining has become popular for predictive analysis in every field of life, such as business, health, and disaster management. As more and more data becomes available, it becomes difficult to extract useful information from it, and without such extraction that tremendous amount of data is of little use. Data mining helps us extract useful information from the data, which can then be used for decision making. This paper presents a model that helps in the diagnosis of diseases by analyzing patients' data. The patients' attributes are analyzed and association rules are extracted from these attributes. Association rule based classification is used for disease diagnosis and is thus helpful in clinical decision making: a patient is classified as healthy or sick based on his or her attributes. A Disease Mining Model (DMM) based on association rule mining (ARM) is proposed. This model is globally optimized using Weighted Association Rule Mining (WARM) as an Optimized Disease Mining Model (ODMM), which provides improved accuracy of disease prediction for every disease dataset. Both DMM and ODMM are tested on nine datasets of different diseases, and the diagnosis results are verified against real diagnoses. WARM improves the accuracy of diagnosis and thus outperforms ARM. In this work, classification using the RIPPER algorithm is therefore much improved by weight optimization.
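
An associative-classification step of this kind can be sketched with the mlxtend library: mine rules whose consequent is a class-label item, weight them, and classify a patient by the best matching rule. The confidence-times-support weighting below merely stands in for WARM's weighting scheme, the one-hot transaction layout is an assumption, and the RIPPER rule induction itself is not reproduced here.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

def mine_class_rules(onehot, class_items, min_support=0.2, min_conf=0.6):
    """Mine association rules whose consequent is a class-label item
    (e.g. 'diagnosis=sick'), the building block of associative classification.
    `onehot` is a boolean indicator DataFrame: one row per patient, one column
    per item such as 'fever=yes' or 'diagnosis=sick'."""
    frequent = apriori(onehot, min_support=min_support, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=min_conf)
    class_rules = rules[rules["consequents"].apply(
        lambda c: len(c) == 1 and next(iter(c)) in class_items)]
    # Illustrative weighting (confidence x support) standing in for WARM's weights
    return class_rules.assign(weight=class_rules["confidence"] * class_rules["support"])

def classify(patient_items, class_rules):
    """Return the class of the highest-weighted rule whose antecedent the patient satisfies."""
    matching = class_rules[class_rules["antecedents"].apply(
        lambda a: a.issubset(patient_items))]
    if matching.empty:
        return None
    best = matching.sort_values("weight", ascending=False).iloc[0]
    return next(iter(best["consequents"]))
```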

Content-Specific Unigrams and Syntactic Phrases to Enhance SentiWordNet Based Sentiment Classification

Sentiment classification intelligently detects the polarity of documents by ascertaining the polar values encapsulated in a document in order to classify it as positive or negative. A machine learning classifier relies entirely on the orientation of its feature set. SentiWordNet is a lexical resource in which each term is associated with numerical scores for subjective and objective sentiment information. A SentiWordNet based sentiment classifier uses sentiment features generated from the 7% of terms in the resource that are subjective. Sentiment features carry a generic orientation across multiple domains but lack comprehensive coverage; for example, a text unit with few or no sentiment features reflects ambiguous or null sentiment. Using content-specific unigrams and syntactic phrases along with sentiment features ensures consistency in classification while enhancing performance. The model proposed in this research is validated on sentiment and polarity datasets, and its results completely outperform previous approaches and methods.
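
A minimal sketch of combining SentiWordNet scores with content-specific unigrams (syntactic phrases omitted) is given below, using NLTK's sentiwordnet corpus reader and a scikit-learn vectorizer. The toy training texts, the choice of logistic regression, and the simple first-synset scoring are illustrative assumptions, not the paper's setup; the wordnet and sentiwordnet corpora must be downloaded first.

```python
import numpy as np
from nltk.corpus import sentiwordnet as swn  # requires nltk.download("wordnet"), nltk.download("sentiwordnet")
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def swn_score(text):
    """Net positive-minus-negative SentiWordNet score over all tokens."""
    score = 0.0
    for token in text.lower().split():
        synsets = list(swn.senti_synsets(token))
        if synsets:
            score += synsets[0].pos_score() - synsets[0].neg_score()
    return score

def build_features(texts, vectorizer=None):
    """Content-specific unigram counts augmented with one SentiWordNet column."""
    if vectorizer is None:
        vectorizer = CountVectorizer(ngram_range=(1, 1)).fit(texts)
    unigrams = vectorizer.transform(texts)
    senti = csr_matrix(np.array([[swn_score(t)] for t in texts]))
    return hstack([unigrams, senti]), vectorizer

# Illustrative usage on toy data
train = ["an excellent and moving film", "a dull and tedious plot"]
X, vec = build_features(train)
clf = LogisticRegression().fit(X, ["pos", "neg"])
X_test, _ = build_features(["a moving and excellent story"], vec)
print(clf.predict(X_test))
```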

Flexibility and Privacy Control By Cookie Management

The privacy of internet users is continuously at stake from various directions as technology evolves. Modern internet technology poses serious threats to users' privacy. Unfortunately, while surfing the internet, we are careless about our privacy and allow intrusions to a great extent without objection. This lets advertisers track user activities on the web through third-party cookies. Researchers have been conducting vigorous research on this topic and have presented solutions to control the leakage of privacy without user consent. Surprisingly, however, major research activities are confined to the desktop platform, and little is known about web tracking on mobile devices. We survey current technologies and propose a novel approach for Android-based mobile devices that controls excessive tracking of users. Further, Mozilla Firefox add-ons and other related proposals dealing with cookies and privacy are analyzed.

Inference Engine for Classification of Expert Systems Using Keyword Extraction Technique

Because of the fast-growing demands of automated document processing, a reliable system for the automatic identification of keywords embedded in an electronic document is of immense concern. The paper presents an innovative approach for classifying multiple Expert System (ES) methodologies at a time on the basis of keyword extraction, using the commercial text mining tools WordStat and CompareSuite Pro. These ES methodologies comprise eleven categories: rule-based systems, database methodology, case-based reasoning, intelligent agents, knowledge-based systems, fuzzy expert systems, object-oriented methodology, neural networks, system architecture, systems modelling, and ontology. The keywords are selected on the basis of frequency analysis and the position of the most recurring word in context within the article title, abstract, and keywords of the respective ES methodology. Based on the extracted keywords, an inference engine has been designed in Java. This software compares the keywords established from the articles of an individual ES methodology with all articles of the remaining methodologies through the generation of association rules. The inference engine was first calibrated on 100 of 160 articles and then validated on the remaining 60 articles. The validation results show an accuracy of up to 85 percent. The paper concludes that the classification of Expert Systems using the keyword extraction technique, outperforming a baseline, is more accurate, reliable, and time-efficient than other orthodox text mining methods. Finally, it is concluded that the technique may be further improved by addressing the design constraints of the tools adopted in this research in future endeavours.

Developing an expert system based on association rules and predicate logic for earthquake prediction

Expert systems (ES) are a branch of applied artificial intelligence. The basic idea behind ES is simply that expertise, the vast body of task-specific knowledge, is transferred from a human to a computer. ES provide powerful and flexible means for obtaining solutions to a variety of problems that often cannot be dealt with by other, more traditional and orthodox methods. Thus, their use is proliferating in many sectors of our social and technological life, where their applications are proving to be critical in the process of decision support and problem solving. Earthquake professionals have for many decades recognized the benefits to society of reliable earthquake predictions, but uncertainties regarding source initiation, rupture phenomena, and the accuracy of both the timing and magnitude of earthquake occurrence have often seemed very difficult or impossible to overcome. This research proposes and implements an expert system to predict earthquakes from historical data. This is achieved by applying association rule mining to earthquake data from 1972 to 2013. These associations are refined using predicate-logic techniques to derive production rules for use with a rule-based expert system. The proposed expert system was able to predict all earthquakes that actually occurred within at most 12 hours.
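
How mined associations become production rules can be illustrated with a toy forward-matching loop: each rule fires when every attribute-value pair in its antecedent holds for the current observation. The attributes and rules below are invented for illustration only and are not the system's actual rule base.

```python
# Toy production rules: (antecedent of attribute-value pairs, conclusion)
RULES = [
    ({"foreshock_count": "high", "radon_level": "elevated"},
     "alert: quake likely within 12 hours"),
    ({"foreshock_count": "low"}, "no alert"),
]

def infer(observation, rules=RULES):
    """Return the conclusions of all rules whose antecedents match the observation."""
    fired = []
    for antecedent, conclusion in rules:
        if all(observation.get(attr) == value for attr, value in antecedent.items()):
            fired.append(conclusion)
    return fired

print(infer({"foreshock_count": "high", "radon_level": "elevated", "region": "X"}))
```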

A Rough-Set Feature Selection Model for Classification and Knowledge Discovery

Feature selection aims to remove features unnecessary to the target concept. Rough set theory (RST) eliminates unimportant or irrelevant features, generating a set of attributes smaller than the original with the same, or nearly the same, classificatory power. This paper analyses the effects of rough sets on classification using 10 datasets, each including a decision attribute. Classification accuracy is mapped to the type and number of attributes in both the original and the reduced datasets, which yields a framework for applying rough sets for classification purposes. Rough sets are then used for knowledge discovery in classification, and the conclusions indicate a significant result: removing individual numeric attributes affects classification accuracy far more than removing categorical attributes.
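
The accuracy-versus-attributes protocol can be sketched as follows: score a classifier on the full attribute set and again after removing selected attributes. The breast cancer dataset, the decision tree, and the dropped attributes are stand-ins chosen for illustration, not the paper's ten datasets or its reducts.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def accuracy_with_and_without(X, y, feature_names, dropped):
    """Cross-validated accuracy on the full attribute set vs. a reduced set
    with some attributes removed, mirroring the comparison protocol."""
    clf = DecisionTreeClassifier(random_state=0)
    full = cross_val_score(clf, X, y, cv=5).mean()
    keep = [i for i, name in enumerate(feature_names) if name not in dropped]
    reduced = cross_val_score(clf, X[:, keep], y, cv=5).mean()
    return full, reduced

data = load_breast_cancer()
full_acc, reduced_acc = accuracy_with_and_without(
    data.data, data.target, list(data.feature_names), {"mean area", "mean radius"})
print(f"full: {full_acc:.3f}  reduced: {reduced_acc:.3f}")
```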

Texture Classification Using Rotation- and Scale-Invariant Gabor Texture Features

This letter introduces a novel approach to rotation- and scale-invariant texture classification. The proposed approach is based on Gabor filters, which have the capability to collapse the filter responses according to the scale and orientation of the textures. These characteristics are exploited to first calculate the homogeneous texture of images, followed by the rearrangement of features as a two-dimensional (scale and orientation) matrix, where scaling and rotation of images correspond to shifting in this matrix. The shift invariance property of the discrete Fourier transform is then used to propose rotation- and scale-invariant image features. The performance of the proposed feature set is evaluated on the Brodatz texture album. Experimental results demonstrate the superiority of the proposed descriptor compared to the other methods considered in this letter.
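
A rough sketch of the feature construction is shown below, using scikit-image's Gabor filter: mean response energy is arranged in a scale-by-orientation matrix, and the magnitude of its 2-D DFT is taken, since rotation and scaling of the texture approximately correspond to circular shifts along the two axes of that matrix. The filter-bank frequencies, the number of orientations, and the test image are illustrative choices, not the paper's exact configuration.

```python
import numpy as np
from skimage import data
from skimage.filters import gabor

def gabor_dft_features(image, frequencies=(0.1, 0.2, 0.3, 0.4), n_orient=6):
    """Mean Gabor response energy arranged as a (scale x orientation) matrix;
    the 2-D DFT magnitude of this matrix is invariant to circular shifts,
    which texture rotation and scaling approximately induce."""
    matrix = np.zeros((len(frequencies), n_orient))
    for i, freq in enumerate(frequencies):
        for j in range(n_orient):
            theta = j * np.pi / n_orient
            real, imag = gabor(image, frequency=freq, theta=theta)
            matrix[i, j] = np.sqrt(real ** 2 + imag ** 2).mean()   # response energy
    return np.abs(np.fft.fft2(matrix)).ravel()                     # shift-invariant descriptor

features = gabor_dft_features(data.camera())
```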

Global Optimization Ensemble Model for Classification Methods

Supervised learning is the process of data mining that deduces rules from training datasets. A broad array of supervised learning algorithms exists, each with its own advantages and drawbacks. Some basic issues affect the accuracy of a classifier when solving a supervised learning problem, such as the bias-variance tradeoff, the dimensionality of the input space, and noise in the input data. All of these problems affect classifier accuracy and are the reason there is no globally optimal method for classification; nor is there a generalized improvement method that can increase the accuracy of any classifier while addressing all of the problems stated above. This paper proposes a global optimization ensemble model for classification methods (GMC) that can improve the overall accuracy for supervised learning problems. Experimental results on various public datasets showed that the proposed model improved the accuracy of the classification models by 1% to 30%, depending on the algorithm complexity.
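
As a point of reference for ensemble combination, the sketch below uses a scikit-learn soft-voting ensemble over heterogeneous base classifiers; it illustrates the general idea of averaging probability outputs to offset individual bias and variance weaknesses, and does not reproduce the GMC combination scheme itself. The iris dataset and the particular base learners are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Soft-voting ensemble: average the predicted class probabilities of the base learners
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",
)

X, y = load_iris(return_X_y=True)
for name, model in [("ensemble", ensemble),
                    ("logistic", LogisticRegression(max_iter=1000)),
                    ("naive bayes", GaussianNB()),
                    ("tree", DecisionTreeClassifier(random_state=0))]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```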