In the current era, with the advancement of technology, more and more data is available in digital form. Most of this data (approximately 85%) is unstructured text, so it has become essential to develop better techniques and algorithms for extracting useful and interesting information from this large amount of textual data. Text mining is the process of extracting useful information from unstructured text. The algorithms used for text mining each have advantages and disadvantages. In addition, the issues in the field of text mining that affect the accuracy and relevance of the results are identified.
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA... (IJDKP)
Huge volumes of data from domain-specific applications such as medical, financial, library, telephone, shopping, and individual records are generated regularly. Sharing these data has proved beneficial for data mining applications. On one hand, such data is an important asset for business decision making when analyzed; on the other hand, privacy concerns may prevent data owners from sharing information for data analysis. To share data while preserving privacy, the data owner must find a solution that achieves the dual goals of privacy preservation and accuracy on data mining tasks such as clustering and classification. An efficient and effective approach is proposed that aims to protect the privacy of sensitive information while clustering the data with minimum information loss.
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach (IJECEIAES)
The document presents a new approach called Bat-Cluster (BC) for automated graph clustering. BC combines the Fast Fourier Domain Positioning (FFDP) algorithm and the Bat Algorithm. FFDP positions graph nodes, then Bat Algorithm optimizes clustering by finding configurations that minimize the Davies-Bouldin Index. BC is tested on four benchmark graphs and outperforms Particle Swarm Optimization, Ant Colony Optimization, and Differential Evolution in providing higher clustering precision.
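The Davies-Bouldin Index that Bat-Cluster minimizes can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance and points given as coordinate tuples, not the paper's implementation:

```python
from math import dist

def davies_bouldin(points, labels, k):
    """Davies-Bouldin Index: mean over clusters of the worst-case ratio of
    within-cluster scatter to between-centroid separation (lower is better)."""
    members = {i: [p for p, l in zip(points, labels) if l == i] for i in range(k)}
    cents = {i: tuple(sum(col) / len(pts) for col in zip(*pts)) for i, pts in members.items()}
    scat = {i: sum(dist(p, cents[i]) for p in pts) / len(pts) for i, pts in members.items()}
    return sum(
        max((scat[i] + scat[j]) / dist(cents[i], cents[j]) for j in range(k) if j != i)
        for i in range(k)
    ) / k

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(round(davies_bouldin(pts, [0, 0, 1, 1], 2), 4))  # 0.0707: two tight, distant clusters
```

An optimizer such as the Bat Algorithm would evaluate candidate labelings with a function like this and keep the configuration with the lowest score.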
This document summarizes literature on using bio-inspired algorithms to optimize fuzzy clustering. It describes the general architecture of how bio-inspired optimization algorithms can be applied to optimize parameters of fuzzy clustering algorithms and improve clustering quality. The document reviews several popular bio-inspired optimization algorithms and analyzes literature on optimization fuzzy clustering, identifying China, India, and the United States as the top publishing countries. Network analysis is applied to literature on the topic to identify clusters in the research.
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING (ijcsa)
This document provides a survey of optimization approaches that have been applied to text document clustering. It discusses several clustering algorithms and categorizes them as partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, frequent pattern-based clustering, and constraint-based clustering. It then describes several soft computing techniques that have been used as optimization approaches for text document clustering, including genetic algorithms, bees algorithms, particle swarm optimization, and ant colony optimization. These optimization techniques perform a global search to improve the quality and efficiency of document clustering algorithms.
Feature Subset Selection for High Dimensional Data Using Clustering Techniques (IRJET Journal)
The document discusses feature subset selection for high dimensional data using clustering techniques. It proposes the FAST algorithm which has three steps: 1) remove irrelevant features, 2) divide features into clusters using DBSCAN, and 3) select the most representative feature from each cluster. DBSCAN is a density-based clustering algorithm that can identify clusters of varying densities and detect outliers. The FAST algorithm is evaluated to select a small number of discriminative features from high dimensional data in an efficient manner. It aims to remove irrelevant and redundant features to improve predictive accuracy while handling large feature sets.
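FAST's final step, picking one representative feature from each cluster, can be illustrated as below. As a simplification, absolute Pearson correlation with the target stands in for the paper's relevance measure, and the feature values and clusters are hypothetical:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def select_representatives(features, target, clusters):
    """From each feature cluster keep the single feature most relevant to the target."""
    return [max(idxs, key=lambda i: abs(pearson(features[i], target))) for idxs in clusters]

feats = {
    0: [1, 2, 3, 4],   # tracks the target exactly
    1: [2, 4, 7, 8],   # near-duplicate of feature 0 with a little noise
    2: [5, 5, 5, 4],   # weakly related
}
target = [10, 20, 30, 40]
print(select_representatives(feats, target, [[0, 1], [2]]))  # one pick per cluster: [0, 2]
```

The redundant near-copy (feature 1) is dropped because its cluster already has a stronger representative.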
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING (IJDKP)
This document discusses a hybrid data mining approach called combined mining that can generate informative patterns from complex data sources. It proposes applying three techniques: 1) Using the Lossy-counting algorithm on individual data sources to obtain frequent itemsets, 2) Generating incremental pair and cluster patterns using a multi-feature approach, 3) Combining FP-growth and Bayesian Belief Network using a multi-method approach to generate classifiers. The approach is tested on two datasets to obtain more useful knowledge and the results are compared.
Predictive job scheduling in a connection limited system using parallel genet... (Mumbai Academisc)
The document discusses predictive job scheduling in a connection limited system using parallel genetic algorithms. It introduces the problem of job scheduling in parallel computing systems and describes existing non-predictive greedy algorithms. The proposed approach uses genetic algorithms to develop a predictive model for job scheduling that learns from previous experiences to improve scheduling efficiency over time. The goal is to schedule jobs in a way that optimizes system metrics like utilization and throughput while minimizing user metrics like turnaround time.
At present, a huge amount of data is generated every minute and transferred frequently. Although the data is sometimes static, most commonly it is dynamic and transactional: newly generated data is constantly added to the old, existing data. To discover knowledge from this incremental data, one approach is to re-run the algorithm on the modified data set each time, which is time consuming. Proper analysis of such data sets also requires an efficient classifier model, whose objective is to classify unlabeled data into appropriate classes. The paper proposes a dimension reduction algorithm that can be applied in a dynamic environment to generate a reduced attribute set as a dynamic reduct, and an optimization algorithm that uses the reduct to build the corresponding classification system. The method analyzes each new data set as it becomes available and modifies the reduct accordingly to fit the entire data set, from which interesting optimal classification rule sets are generated. The concepts of discernibility relation, attribute dependency, and attribute significance from Rough Set Theory are integrated to generate the dynamic reduct set, and optimal classification rules are selected using PSO, which not only reduces complexity but also helps achieve higher accuracy of the decision system. The proposed method has been applied to benchmark data sets from the UCI repository; the dynamic reduct is computed and optimal classification rules are generated from it. Experimental results show the efficiency of the proposed method.
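The rough-set notion of attribute dependency used for reduct generation can be sketched as a dependency-degree computation: the fraction of objects whose condition-attribute equivalence class maps to a single decision value (the positive region). The toy decision table below is hypothetical:

```python
def dependency(table, cond, dec):
    """Rough-set dependency degree gamma_cond(dec): |POS| / |U|."""
    classes = {}
    for row in table:
        # Group rows by their values on the condition attributes.
        classes.setdefault(tuple(row[a] for a in cond), set()).add(row[dec])
    # A row is in the positive region if its equivalence class is consistent.
    pos = sum(1 for row in table if len(classes[tuple(row[a] for a in cond)]) == 1)
    return pos / len(table)

rows = [
    {"a": 0, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "yes"},
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},   # conflicts with the row above on {a, b}
]
print(dependency(rows, ["a", "b"], "d"))  # 0.5
print(dependency(rows, ["a"], "d"))       # 0.0: attribute a alone decides nothing
```

A reduct is a minimal attribute subset that preserves this dependency degree, which is why dropping a significant attribute (here, b) visibly lowers it.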
Review of Existing Methods in K-means Clustering Algorithm (IRJET Journal)
This document reviews existing methods for improving the K-means clustering algorithm. K-means is widely used but has limitations such as sensitivity to outliers and initial centroid selection. The document summarizes several proposed approaches, including using MapReduce to select initial centroids and form clusters for large datasets, reducing execution time by cutting off iterations, improving cluster quality by selecting centroids systematically, and using sampling techniques to reduce I/O and network costs. It concludes that improved algorithms address K-means limitations better than the traditional approach.
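For reference, the classic Lloyd's iteration that these improvements build on can be sketched as follows. Initial centroids are passed in explicitly, since their choice is exactly the sensitivity the reviewed methods address:

```python
from math import dist

def kmeans(points, centroids, iters=20):
    """Plain Lloyd's iteration: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            groups[min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))].append(p)
        centroids = [tuple(sum(col) / len(g) for col in zip(*g)) if g else cent
                     for g, cent in zip(groups, centroids)]
    return centroids

pts = [(1, 1), (1, 2), (9, 9), (8, 9)]
print(kmeans(pts, [(0, 0), (10, 10)]))  # [(1.0, 1.5), (8.5, 9.0)]
```

With well-placed seeds the iteration converges to the obvious two groups; a poor seed can converge to a worse local optimum, which motivates the initialization strategies surveyed above.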
This document provides an overview and summary of Pankaj Jajoo's 2008 master's thesis on improving document clustering algorithms. The thesis explores two approaches: 1) preprocessing the graph representation of documents to remove noise before applying standard graph partitioning algorithms, and 2) clustering words first before clustering documents to reduce noise. Experimental results on three datasets show these approaches improve clustering quality over standard K-Means clustering. The thesis provides background on clustering, reviews existing document clustering methods, and describes the two new algorithms and evaluation of their performance.
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR... (cscpconf)
Data mining is one of the most significant tools for discovering association patterns that are useful in many knowledge domains. Yet there are drawbacks in existing mining techniques. Three main weaknesses of current data-mining techniques are: 1) the entire database must be re-scanned whenever new attributes are added; 2) an association rule may hold at a certain granularity but fail at a smaller one, and vice versa; 3) current methods can find either frequent rules or infrequent rules, but not both at the same time. This research proposes a novel data schema and an algorithm that address the above weaknesses while improving the efficiency and effectiveness of data mining strategies. The crucial mechanisms in each step are clarified in this paper. Finally, experimental results regarding the efficiency, scalability, and information loss of the proposed approach are presented to demonstrate its advantages.
With the development of databases, the volume of stored data increases rapidly, and much important information is hidden in these large amounts of data. If that information can be extracted from the database, it will create a lot of profit for the organization. The question organizations are asking is how to extract this value; the answer is data mining. Many technologies are available to data mining practitioners, including artificial neural networks, genetic algorithms, fuzzy logic, and decision trees. Many practitioners are wary of neural networks due to their black-box nature, even though they have proven themselves in many situations. This paper is an overview of artificial neural networks and questions their position as a preferred tool of data mining practitioners.
A Survey on Constellation Based Attribute Selection Method for High Dimension... (IJERA Editor)
Attribute selection is an important topic in data mining because it is an effective way to reduce dimensionality, remove irrelevant and redundant data, and increase the accuracy of the data. It is the process of identifying a subset of the most useful attributes that produces results compatible with the original entire attribute set. Cluster analysis, or clustering, is the task of grouping a set of objects so that objects in the same group (a cluster) are more similar to each other than to those in other groups (clusters). There are various approaches and techniques for attribute subset selection, namely the wrapper approach, the filter approach, the Relief algorithm, distributional clustering, etc., but each has disadvantages, such as an inability to handle large volumes of data, computational complexity, no guarantee of accuracy, and difficulty in evaluation and redundancy detection. To address some of these issues in attribute selection, this paper proposes a technique that aims to design an effective clustering-based attribute selection method for high-dimensional data. Initially, attributes are divided into clusters using a graph-based clustering method such as a minimum spanning tree (MST). In the second step, the most representative attribute strongly related to the target classes is selected from each cluster to form the attribute subset. The purpose is to increase accuracy, reduce dimensionality, shorten training time, and improve generalization by reducing overfitting.
Mining Frequent Item set Using Genetic Algorithm (ijsrd.com)
Rule mining algorithms such as Apriori generate frequent itemsets from large data sets, but computing all frequent itemsets takes a great deal of computation time. This problem can be solved much more efficiently using a Genetic Algorithm (GA), which performs a global search and has lower time complexity than other algorithms. Genetic Algorithms are adaptive heuristic search and optimization methods for solving both constrained and unconstrained problems, based on the evolutionary ideas of natural selection and genetics. The main aim of this work is to find all frequent itemsets in a given data set using a genetic algorithm and to compare the results with those of other algorithms. Population size, number of generations, crossover probability, and mutation probability are the GA parameters that affect the quality of the result and the computation time.
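A natural GA fitness for this task is an itemset's support, i.e. the fraction of transactions containing it. A minimal sketch (the bitstring encoding and GA loop are omitted, and the transactions are hypothetical):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the candidate itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# The GA would evolve a population of candidate itemsets, using support as fitness.
txns = [frozenset(t) for t in [
    {"milk", "bread"}, {"milk", "bread", "egg"}, {"bread"}, {"milk"},
]]
print(support(frozenset({"milk", "bread"}), txns))  # 0.5
print(support(frozenset({"bread"}), txns))          # 0.75
```

An itemset is "frequent" when this value meets a user-chosen minimum-support threshold, which is the same criterion Apriori uses exhaustively.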
Applying genetic algorithms to information retrieval using vector space model (IJCSEA Journal)
The document describes a study that applied genetic algorithms to information retrieval using the vector space model. The study used an adaptive genetic algorithm approach with two proposed fitness functions (cosine and Jaccard's), adaptive crossover and mutation probabilities. Experimental results on a test corpus showed improvements in precision and recall compared to traditional approaches, with the proposed cosine fitness function performing best. Precision generally decreased as recall increased. The modifications made to the genetic algorithm and fitness functions led to better weighting of query terms and improved results.
Applying Genetic Algorithms to Information Retrieval Using Vector Space Model (IJCSEA Journal)
Genetic algorithms are commonly used in information retrieval (IR) systems to enhance the retrieval process and increase the efficiency of optimal retrieval, in order to meet users' needs and help them find exactly what they want among the growing volume of available information. Improved adaptive genetic algorithms help retrieve the information the user needs accurately and exclude irrelevant files. In this study, the researcher explored the problems embedded in this process and attempted to find solutions, such as the choice of mutation probability and fitness function. The Cranfield English corpus test collection on mathematics was used; this collection was compiled by Cyril Cleverdon and used at the University of Cranfield in 1960, and it contains 1400 documents and 225 queries for simulation purposes. The researcher used cosine and Jaccard's similarity to compute the similarity between the query and documents, along with two proposed adaptive fitness functions, mutation operators, and adaptive crossover. The process aimed at evaluating the effectiveness of the results according to the measures of precision and recall. Finally, the study concluded that several improvements may be obtained when using adaptive genetic algorithms.
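The two similarity measures used as fitness functions can be sketched over term-weight dictionaries. The cosine and generalized (Tanimoto-style) Jaccard forms below are standard formulations, not the paper's exact code; the query and document weights are hypothetical:

```python
def cosine(q, d):
    """Cosine similarity between sparse term-weight vectors."""
    dot = sum(q[t] * d.get(t, 0) for t in q)
    nq = sum(v * v for v in q.values()) ** 0.5
    nd = sum(v * v for v in d.values()) ** 0.5
    return dot / (nq * nd) if nq and nd else 0.0

def jaccard(q, d):
    """Generalized Jaccard (Tanimoto) similarity: dot / (|q|^2 + |d|^2 - dot)."""
    dot = sum(q[t] * d.get(t, 0) for t in q)
    nq = sum(v * v for v in q.values())
    nd = sum(v * v for v in d.values())
    denom = nq + nd - dot
    return dot / denom if denom else 0.0

query = {"genetic": 1.0, "retrieval": 1.0}
doc = {"genetic": 2.0, "retrieval": 1.0, "model": 1.0}
print(round(cosine(query, doc), 3), jaccard(query, doc))  # 0.866 0.6
```

In a GA-based IR setup, the chromosome encodes query-term weights and either measure scores a candidate query against the judged relevant documents.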
This document presents a feature clustering algorithm to reduce the dimensionality of feature vectors for text classification. The algorithm groups words in documents into clusters based on similarity, with each cluster characterized by a membership function. Words not similar to existing clusters form new clusters. This avoids specifying features in advance and the need for trial and error. Experimental results showed the method can classify text faster and with better extracted features than other methods.
Incremental learning from unbalanced data with concept class, concept drift a... (IJDKP)
Recently, stream data mining applications have drawn vital attention from several research communities. Stream data is a continuous form of data distinguished by its online nature. Traditionally, the machine learning community has developed learning algorithms under certain assumptions about the underlying distribution of the data, such as the data following a predetermined distribution. Such constraints on the problem domain enable the development of learning algorithms whose performance is theoretically verifiable, but real-world situations differ from this restricted model. Applications usually suffer from problems such as unbalanced data distributions. Additionally, data drawn from non-stationary environments is also common in real-world applications, resulting in the "concept drift" associated with data stream examples. These issues have been addressed separately by researchers, and the joint problem of class imbalance and concept drift has received relatively little attention. If the final objective of intelligent machine learning techniques is to address a broad spectrum of real-world applications, then the need for a universal framework for learning from, and adapting to, environments where concept drift may occur and unbalanced data distributions are present can hardly be exaggerated. In this paper, we first present an overview of the issues observed in stream data mining scenarios, followed by a comprehensive review of recent research addressing each issue.
Extended pso algorithm for improvement problems k means clustering algorithm (IJMIT JOURNAL)
Clustering is an unsupervised process and one of the most common data mining techniques. Its purpose is to group similar data together, so that instances within a cluster are as similar to each other as possible and as different as possible from instances in other clusters. In this paper we focus on partitional k-means clustering which, thanks to its ease of implementation and high speed on large data sets, remains very popular thirty years after its development. To address the problem of k-means becoming trapped in local optima, we propose an extended PSO algorithm named ECPSO. Our new algorithm is able to escape local optima and, with high probability, produce the problem's optimal answer. The results show that the proposed algorithm performs better than other clustering algorithms, especially on two indices: clustering accuracy and clustering quality.
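The canonical PSO update that such extensions build on can be sketched as follows; the injectable rng and the single two-dimensional particle are illustration-only assumptions, not details from the paper:

```python
import random

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=random.random):
    """One canonical PSO update per particle:
    v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x);  x <- x + v."""
    for i, (x, v) in enumerate(zip(positions, velocities)):
        velocities[i] = [w * vj + c1 * rng() * (pb - xj) + c2 * rng() * (gb - xj)
                         for xj, vj, pb, gb in zip(x, v, pbest[i], gbest)]
        positions[i] = [xj + vj for xj, vj in zip(x, velocities[i])]
    return positions, velocities

# Deterministic rng for the sake of a reproducible example.
pos, vel = pso_step([[0.0, 0.0]], [[0.0, 0.0]],
                    pbest=[[1.0, 1.0]], gbest=[2.0, 2.0], rng=lambda: 0.5)
print(pos)  # [[2.25, 2.25]]
```

When hybridized with k-means, each particle typically encodes a full set of cluster centroids, so this update perturbs centroid positions toward the swarm's best-known clustering and away from local optima.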
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH (IJDKP)
Text mining is an emerging research field evolving from information retrieval. Clustering and classification are two data mining approaches that may also be used to perform text classification and text clustering; the former is supervised while the latter is unsupervised. In this paper, our objective is to perform text clustering by defining an improved distance metric to compute the similarity between two text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality. The improved distance metric may also be used for text classification. The metric is validated for the worst, average, and best case situations [15]. The results show the proposed distance metric outperforms existing measures.
Text document clustering and similarity detection are a major part of document management, where every document should be identified by its key terms and domain knowledge. Based on similarity, documents are grouped into clusters. Several approaches to document similarity calculation have been proposed in existing systems, but they are either term based or pattern based and suffer from several problems. To make a revolution in this challenging environment, the proposed system presents an innovative model for document similarity that applies a back-propagation time stamp algorithm, named BPTT. It discovers patterns in text documents as higher-level features and creates a network for fast grouping. It also detects the most appropriate patterns based on their weight, and BPTT performs the document similarity measures. Using this approach, documents can be categorized easily and training-process problems are reduced. The BPTT framework has been implemented and evaluated on the .NET platform with different sets of data.
Nature Inspired Models And The Semantic Web (Stefan Ceriu)
In this paper we present a series of nature inspired models used as alternative solutions for Semantic Web concerns. Some of the methods presented in this article perform better than classic algorithms by enhancing response time and computational costs. Others are just proof of concept, first steps towards new techniques that will improve their respective field. The intricate nature of the Semantic Web urges the need for faster, more intelligent algorithms and nature inspired models have been proven to be more than suitable for such complex tasks.
Introduction to feature subset selection method (IJSRD)
Data mining is a computational process for discovering patterns in large data sets. Among its important techniques is classification, which has recently received great attention in the database community and can solve problems in fields such as medicine, industry, business, and science. PSO is an optimization method based on social behaviour. Feature Selection (FS) involves finding a subset of prominent features to improve predictive accuracy and remove redundant features. Rough Set Theory (RST) is a mathematical tool for dealing with the uncertainty and vagueness of decision systems.
Feature selection in high-dimensional data sets is considered a complex and time-consuming problem. To enhance classification accuracy and reduce execution time, Parallel Evolutionary Algorithms (PEAs) can be used. In this paper, we review the most recent work on the use of PEAs for feature selection in large data sets. We classify the algorithms in these papers into four main classes: Genetic Algorithms (GA), Particle Swarm Optimization (PSO), Scatter Search (SS), and Ant Colony Optimization (ACO). Accuracy is adopted as the measure for comparing the efficiency of these PEAs. Notably, Parallel Genetic Algorithms (PGAs) are the most suitable algorithms for feature selection in large data sets, since they achieve the highest accuracy. On the other hand, we found that parallel ACO is time-consuming and less accurate than the other PEAs.
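The GA operators at the heart of parallel GA feature selection act on bit-mask chromosomes, where bit i means "keep feature i". A minimal sketch with hypothetical masks:

```python
import random

def crossover(a, b, point):
    """Single-point crossover of two feature-mask chromosomes."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(mask, p, rng=random.random):
    """Flip each selection bit independently with probability p."""
    return [1 - bit if rng() < p else bit for bit in mask]

c1, c2 = crossover([1, 1, 0, 0], [0, 0, 1, 1], 2)
print(c1, c2)                       # [1, 1, 1, 1] [0, 0, 0, 0]
print(mutate([1, 0, 1, 0], p=1.0))  # every bit flipped -> [0, 1, 0, 1]
```

A parallel GA evaluates the fitness of many such masks concurrently (e.g. classifier accuracy on the selected columns), which is where the speedup on large data sets comes from.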
This document describes a proposed modified cluster-based fuzzy-genetic data mining algorithm. The algorithm aims to mine both association rules and membership functions from quantitative transaction data. It uses a genetic algorithm approach that represents each set of membership functions as a chromosome. Chromosomes are clustered using a modified k-means approach to reduce computational costs, and the representative chromosome of each cluster is used to calculate fitness values. Offspring are produced through genetic operators and selected through roulette wheel selection. The algorithm iterates until it obtains a set of membership functions with high fitness, which are then used to mine multilevel fuzzy association rules from the transaction data. The algorithm is illustrated through a simple example involving transaction data containing purchases of items such as milk and bread.
Novel Ensemble Tree for Fast Prediction on Data Streams (IJERA Editor)
Data streams are sequential sets of data records. When data arrives continuously and at high speed, predicting the class in a timely manner is essential. Ensemble modeling techniques are currently growing rapidly in data stream classification. Ensemble learning is widely adopted because of its ability to manage huge volumes of streaming data and to handle concept drift. Prior work mostly focused on the accuracy of the ensemble model; prediction efficiency has received little attention, since existing ensemble models predict in linear time, which is adequate for small applications, and available models work by integrating only a few classifiers. Real-time applications, however, involve huge data streams, so base classifiers that recognize dissimilar models are needed to build a high-grade ensemble. To address these challenges we developed the Ensemble Tree, a height-balanced tree index over base classifiers for quick prediction on data streams using ensemble modeling techniques. The Ensemble Tree manages ensembles as geodatabases and uses an R-tree-like structure to achieve sub-linear time complexity.
Ontology based clustering algorithms aim to standardize clustering by incorporating domain knowledge through ontologies. They calculate similarity matrices between objects using ontology-based methods, then merge the closest clusters and recalculate the matrix in an iterative process. Several ontology based clustering algorithms are discussed, including Apriori, which generates frequent item sets to cluster data, and algorithms that use ontologies to weight features or perform recursive mining on an FP-tree. These algorithms integrate distributed semantic web data through ontologies to improve search, classification and reuse of knowledge resources.
Data mining is knowledge discovery in databases; the goal is to extract patterns and knowledge from large amounts of data. An important area within data mining is text mining, which extracts high-quality information from text, typically via statistical pattern learning. "High quality" in text mining refers to some combination of relevance, novelty, and interestingness. Tasks in text mining include text categorization, text clustering, entity extraction, and sentiment analysis. Applications of natural language processing and analytical methods are highly preferred to turn text into data for analysis.
The document summarizes text mining techniques in data mining. It discusses common text mining tasks like text categorization, clustering, and entity extraction. It also reviews several text mining algorithms and techniques, including information extraction, clustering, classification, and information visualization. Several literature papers applying these techniques to domains like movie reviews, research proposals, and e-commerce are also summarized. The document concludes that text mining can extract useful patterns from unstructured text through techniques like clustering, classification, and information extraction.
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION (cscpconf)
Feature clustering is a powerful method for reducing the dimensionality of feature vectors for text classification. In this paper, fast fuzzy feature clustering for text classification is proposed, based on the framework introduced by Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee in 2011. Each word in a document's feature vector is grouped into a cluster in fewer iterations: the number of iterations required to obtain the cluster centers is reduced by transforming the cluster-center dimension from n dimensions to 2 dimensions, using Principal Component Analysis with a slight change for dimension reduction. Experimental results show that this method improves performance by significantly reducing the number of iterations required to obtain the cluster centers, which is verified on three benchmark data sets.
Survey on evolutionary computation techniques and its application in dif... (ijitjournal)
In computer science, 'evolutionary computation' is an algorithmic tool based on evolution. It implements random variation, reproduction, and selection by altering and moving data within a computer, and it helps in building, applying, and studying algorithms based on the Darwinian principles of natural selection. In this paper, different evolutionary computation techniques used in applications such as image processing, cloud computing, and grid computing are studied briefly. This work is an effort to help researchers from different fields learn about the evolutionary computation techniques applicable in the above-mentioned areas.
Particle Swarm Optimization based K-Prototype Clustering Algorithm (iosrjce)
This document summarizes a research paper that proposes a new Particle Swarm Optimization (PSO) based K-Prototype clustering algorithm to cluster mixed numeric and categorical data. It begins with background information on clustering algorithms like K-Means, K-Modes, and K-Prototype. It then describes the K-Prototype algorithm, PSO, and discrete binary PSO. Related work integrating PSO with other clustering algorithms is also reviewed. The proposed approach uses binary PSO to select improved initial prototypes for K-Prototype clustering in order to obtain better clustering results than traditional K-Prototype and avoid local optima.
This document discusses using particle swarm optimization to improve the k-prototype clustering algorithm. The k-prototype algorithm clusters data with both numeric and categorical attributes but can get stuck in local optima. The proposed method uses particle swarm optimization, a global optimization technique, to guide the k-prototype algorithm towards better clusterings. Particle swarm optimization models potential solutions as particles that explore the search space. It is integrated with k-prototype clustering to avoid locally optimal solutions and produce better clusterings. The method is tested on standard benchmark datasets and shown to outperform traditional k-modes and k-prototype clustering algorithms.
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT... - IJDKP
Many applications of automatic document classification require learning accurately with little training data. Semi-supervised classification techniques use both labeled and unlabeled data for training. This approach has been shown to be effective in some cases; however, the use of unlabeled data is not always beneficial.
On the other hand, the emergence of web technologies has given rise to the collaborative development of ontologies. In this paper, we propose the use of ontologies to improve the accuracy and efficiency of semi-supervised document classification.
We use support vector machines, one of the most effective algorithms studied for text. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results applying our algorithm to three different datasets. Our experiments show an accuracy improvement of 4% on average, and up to 20%, in comparison with the traditional semi-supervised model.
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase... - ijaia
Feature selection and classification are essential tasks when dealing with large data sets that comprise a large number of input attributes. Many search methods and classifiers have been used to find the optimal number of attributes. The aim of this paper is to find the optimal set of attributes and improve classification accuracy by adopting an ensemble rule classifier method. The research process involves two phases: finding the optimal set of attributes, and applying the ensemble classifier method to the classification task. Results are reported as percentage accuracy, the number of selected attributes, and the number of rules generated. Six datasets were used for the experiment. The final output is an optimal set of attributes combined with the ensemble rule classifier method. The experimental results, conducted on public real-world datasets, demonstrate that the ensemble rule classifier method consistently improves classification accuracy on the selected datasets. Significant improvement in accuracy and an optimal set of selected attributes are achieved by adopting the ensemble rule classifier method.
The document summarizes research on multi-document summarization using EM clustering. It begins with an introduction to the topic and issues with existing techniques. It then proposes using Expectation-Maximization (EM) clustering to identify clusters, which improves over other methods by identifying latent semantic variables between sentences. The architecture involves preprocessing, EM clustering, mutual reinforcement ranking algorithms RARP and RDRP, summarization, and post-processing. Experimental results on DUC2007 data show EM clustering identifies more clusters and sentences than affinity propagation clustering. The technique aims to improve summarization accuracy by better capturing semantic relationships between sentences.
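For readers unfamiliar with EM clustering, the toy sketch below fits a two-component one-dimensional Gaussian mixture by alternating expectation and maximization steps. This is our simplified illustration only; the summarization system above clusters sentences via latent semantic variables, not scalar values.

```python
import math

def em_gmm_1d(xs, iters=50):
    # EM for a 2-component 1-D Gaussian mixture.
    mu = [min(xs), max(xs)]   # crude initialization at the data extremes
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate mixture weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return mu, var, pi
```

Unlike hard assignment in k-means, the E-step's soft responsibilities are what let EM expose latent structure between data points, which is the property the summarization technique exploits.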
AUTOMATED INFORMATION RETRIEVAL MODEL USING FP GROWTH BASED FUZZY PARTICLE SW... - ijcseit
Mining relevant facts from the web at the time of need is a tedious task. Research in diverse fields is fine-tuning methodologies toward this goal, extracting the information most relevant to a user's search query. The methodology proposed in this paper eases search complexity by tackling the severe issues that hinder the performance of traditional approaches. It finds all possible semantically relatable frequent sets with the FP-Growth algorithm, whose output in turn fuels a bio-inspired fuzzy PSO that finds the optimal attractor points around which web documents cluster, meeting the requirements of the search query without losing relevance. On the whole, the proposed system optimizes an objective function that minimizes intra-cluster differences and maximizes inter-cluster distances while keeping all possible relationships with the search context intact. The major contribution is that the system finds all possible combinations matching the user's search transaction, thereby making the system more meaningful. These relatable sets form the set of particles for both fuzzy clustering and PSO, so the system remains unbiased and behaves consistently as new additions follow the herd behaviour. Evaluations reveal that the proposed methodology fares well as an optimized and effective enhancement over conventional approaches.
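The stated objective, low intra-cluster differences and high inter-cluster distances, can be made concrete with a small sketch. This is our illustration under the simplifying assumption that each document has been reduced to a single score; the paper's system works on richer semantic representations.

```python
def clustering_objective(clusters):
    # clusters: a list of clusters, each a list of 1-D document scores.
    # Returns (intra, inter): a good clustering has low intra-cluster
    # spread and high inter-cluster (centroid-to-centroid) distance.
    centroids = [sum(c) / len(c) for c in clusters]
    intra = sum(abs(x - m) for c, m in zip(clusters, centroids) for x in c)
    inter = sum(abs(centroids[i] - centroids[j])
                for i in range(len(centroids))
                for j in range(i + 1, len(centroids)))
    return intra, inter
```

In a PSO setting, each particle would propose a clustering, and a fitness combining these two terms (for example `inter - intra`) would guide the swarm.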
Similar to Survey on Efficient Techniques of Text Mining (20)
Understanding the Impact and Challenges of Corona Crisis on Education Sector... - vivatechijri
In the second week of March 2020, state governments across the country suddenly declared the temporary closure of all colleges and schools as an immediate measure to stop the spread of the novel coronavirus pandemic. Almost a month has now passed, with no certainty about when they will reopen. A pandemic like this sets alarm bells ringing in the field of education, where a huge impact can be seen on the teaching and learning process as well as on the entire education sector. The disruption has, in fact, given today's educators time to really think about the sector. Through the present research article, the author highlights the possible impact of coronavirus on the education sector, the future challenges it raises, and possible suggestions.
LEADERSHIP ONLY CAN LEAD THE ORGANIZATION TOWARDS IMPROVEMENT AND DEVELOPMENT vivatechijri
This document discusses the importance of leadership in leading an organization towards improvement and development. It states that leadership is responsible for providing a clear vision and strategy to successfully achieve that vision. Effective leadership can impact the success of an organization by controlling its direction and motivating employees. Leadership is different from traditional management in that it guides employees towards organizational goals through open communication and motivation, rather than simply directing work. The paper concludes that only leadership can lead an organization to change according to its evolving environment, while management may simply follow old rules. Leadership is key to adapting to new market needs and trends.
The assignment problem is a classical problem in mathematics that is widely encountered in the real physical world. In this paper we implement a new method for solving assignment problems, presenting the algorithm and its solution steps. Using the new method alongside two existing methods, we analyse a numerical example and compare the optimal solutions obtained. The proposed method may serve as a standardized technique that is simple to apply to assignment problems.
Structural and Morphological Studies of Nano Composite Polymer Gel Electroly... - vivatechijri
The document summarizes research on a nano composite polymer gel electrolyte containing SiO2 nanoparticles. Key points:
1. Polyvinylidene fluoride-co-hexafluoropropylene polymer was used as the base polymer mixed with propylene carbonate, magnesium perchlorate, and SiO2 nanoparticles to synthesize the nano composite polymer gel electrolyte.
2. The electrolyte was characterized using XRD, SEM, and FTIR which confirmed the homogeneous dispersion of SiO2 nanoparticles and increased amorphous nature of the electrolyte, enhancing its ion conductivity.
3. XRD showed decreased crystallinity and disappearance of polymer peaks upon addition of SiO2. SEM revealed
Theoretical study of two dimensional Nano sheet for gas sensing application - vivatechijri
This study focuses on various two-dimensional materials for sensing different gases, with a theoretical view toward new research in gas-sensing applications. In this paper we review various two-dimensional sheets, such as graphene, boron nitride nanosheets, and MXene, and their applications in sensing the various gases present in the atmosphere.
METHODS FOR DETECTION OF COMMON ADULTERANTS IN FOOD - vivatechijri
Food is essential for living. Food adulteration deceives consumers and can endanger their health. The purpose of this document is to list methods for detecting common food adulterants found in India. An adulterant is a substance found in other substances such as food, cosmetics, pharmaceuticals, fuels, or other chemicals that compromises the safety or effectiveness of that substance. The addition of adulterants is called adulteration. The most common reason for adulteration is the use by manufacturers of undeclared materials that are cheaper than the correct, declared ones. Adulterants can be harmful, can reduce the effectiveness of the product, or can be harmless.
Novel ideas are the key for any entrepreneur entering the hustle, but developing an idea from its core requires a systematic plan, time management, time investment, and most importantly client attention. The time required for development varies with the idea and the strength of the team. Leadership, to build a team and manage it throughout the peak of development, is the main quality needed. Innovation and techniques to clear the hurdles are another aspect of business development and client retention.
Technology for supporting well-being has long been a focus of numerous disciplines, including computer science, psychology, and human-computer interaction. However, the meaning of well-being is not always clear, and this has implications for how we design and evaluate technologies that aim to foster it. Here, we discuss current definitions of well-being and how it relates to, and sometimes results from, self-transcendence. We then focus on how technologies can support well-being through experiences of self-transcendence, closing with possible future directions.
An Alternative to Hard Drives in the Coming Future: DNA-BASED DATA STORAGE - vivatechijri
Demand for data storage is growing exponentially, but the capacity of existing storage media is not keeping up, so there is a need for a storage medium with high capacity, high storage density, and the ability to withstand extreme environmental conditions. According to research in 2018, every minute Google conducted 3.88 million searches, people posted 49,000 photos on Instagram, sent 159,362,760 e-mails, tweeted 473,000 times, and watched 4.33 million videos on YouTube. In 2020, an estimated 1.7 megabytes of data were created per second per person globally, which translates to about 418 zettabytes in a single year. The magnetic or optical data-storage systems that currently hold this volume of 0s and 1s typically cannot last for more than a century, and running data centres takes vast amounts of energy. In short, we are close to having a substantial data-storage problem that will only become more severe over time. Deoxyribonucleic acid (DNA) can potentially be used for this purpose, because storing information in it is not much different from the method used in a computer. DNA's information density is remarkable: 215 petabytes, or 215 million gigabytes, of data can be stored in just one gram of DNA. First we encode all data at a molecular level, then store it in a medium that will last a long time and not become outdated like floppy disks. Thanks to improved techniques for reading and writing DNA, the amount of data that can be stored in DNA is increasing rapidly.
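The abstract's headline figure can be sanity-checked with simple arithmetic. The sketch below is our back-of-envelope calculation; the 2020 world population of roughly 7.8 billion is our assumption, not a number stated in the source.

```python
# Reproduce the "~418 zettabytes per year" estimate from
# "1.7 megabytes of data per second per person".
MB = 1e6                              # bytes in a megabyte (decimal)
SECONDS_PER_YEAR = 365 * 24 * 3600    # about 3.15e7 seconds
POPULATION = 7.8e9                    # assumed 2020 world population

per_person_per_year = 1.7 * MB * SECONDS_PER_YEAR   # bytes per person
world_total = per_person_per_year * POPULATION      # bytes globally per year
zettabytes = world_total / 1e21                     # 1 ZB = 1e21 bytes
```

Running the numbers gives roughly 418 ZB per year, consistent with the figure quoted above.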
The usage of chatbots has increased tremendously over the past few years. A conversational interface is an interface that the user can interact with by means of a conversation; the conversation can occur by speech but also by text input. When a chat interface uses text, it is also described as a chatbot or a conversational medium. In this study, the user-experience factors of these so-called chatbots were investigated. The prime objective is "to identify the state of the art in chatbot usability and applied human-computer interaction methodologies, and to research how to assess chatbot usability". Two kinds of chatbots were formulated, one with and one without personalisation factors. The design of this research is a two-by-two factorial design: the independent variables are the two chatbots (unpersonalised versus personalised) and the specific task or goal the user can accomplish with the chatbot in the financial field (a simple versus a complex task). The results show that there was no noteworthy interaction effect between personalisation and task on the user experience of chatbots. A significant difference was found between the two tasks with regard to the user experience of chatbots; however, this variation was not due to personalisation.
Smart glasses technology (SGT), a form of wearable computing, aims to bring computing devices into everyday life. Smart glasses are wearable computer glasses that add information alongside what the wearer sees; they may also be able to change their optical properties at runtime. SGT is one of the modern computing paradigms that amalgamates humans and machines with the help of information and communication technology. Smart glasses mainly consist of an optical head-mounted display, or embedded wireless glasses with a transparent heads-up display or augmented reality (AR) overlay. In recent years they have been used in medical and gaming applications, and also in the education sector. This report focuses on smart glasses, a category of wearable computing that is currently very popular in the media and is expected to be a big market in the coming years. It evaluates the differences between smart glasses and other smart devices, introduces many possible applications from different companies for different audiences, and gives an overview of the smart glasses that are available at present and those expected over the next few years.
Future Applications of Smart IoT Devices - vivatechijri
With the Internet of Things (IoT) gradually emerging as the next phase of the Internet's development, it becomes critical to identify the various potential areas for IoT applications and the research challenges associated with them, ranging from smart cities to healthcare services, smart agriculture, logistics, and retail. IoT is expected to permeate virtually all aspects of our daily life. Although the current IoT-enabling technologies have improved immensely in recent years, there are still numerous issues that require attention. Since the IoT concept arises from heterogeneous technologies, many research challenges will emerge; accordingly, IoT is opening new dimensions of research. This paper presents the recent development of IoT technologies and discusses future applications.
Cross Platform Development Using Flutter - vivatechijri
Today, cross-platform mobile application development is in a state of compromise: developers must either build the same app many times for many operating systems, or accept a lowest-common-denominator solution that trades native speed and accuracy for portability. Flutter is an open-source SDK for creating high-performance, high-fidelity mobile apps for iOS and Android. Significant features of Flutter include just-in-time (JIT) compilation and ahead-of-time (AOT) compilation into native (system-dependent) machine code, so that the resulting binary can execute natively. Flutter's hot-reload functionality makes it quick and easy to experiment, build UIs, add features, and fix bugs; hot reload works by injecting updated source code files into the running Dart Virtual Machine (VM). With Flutter, we believe we have a solution that gives us the best of both worlds: hardware-accelerated graphics and UI, powered by native ARM code, targeting both popular mobile operating systems.
The Internet today has become an important part of our lives. The World Wide Web, once a small and inaccessible data-storage service, is now large and valuable. Activities that are partially or completely integrated with the physical world can be carried out to a higher standard, and the activities of our daily life are mapped and linked to counterparts in the digital world. The world has seen great strides both in the Internet and in 3D stereoscopic displays, and the time has come to unite the two to bring a new level of experience to users. The 3D Internet is a concept yet to be realised; it requires browsers equipped with in-depth visualization and artificial intelligence, and when these elements are included, the concept may become the reality discussed in this paper. In this paper we discuss the features, possible implementation methods, applications, and advantages and disadvantages of the 3D Internet. With this paper we aim to provide a clear view of the 3D Internet and its potential benefits, weighed against the investment needed to adopt it.
The Recommender System (RS) has emerged as a significant research interest, aiming to assist users in finding items online by providing suggestions that closely match their interests. A recommender system is an information-filtering technology that presents items on internet sites according to the interests of users, and it is applied to movies, music, venues, books, research articles, tourism, and social media in general. Recommender-systems research is usually based on comparisons of predictive accuracy: the higher the evaluation scores, the better the recommender. One of the leading approaches is the use of recommendation systems to proactively recommend scholarly papers to individual researchers. In today's world time is valuable, and researchers have little of it to spend searching for the right articles in their research domain. Recommender systems are designed to suggest the items that best fit a user's needs and preferences, and they typically produce a list of recommendations in one of two ways: through collaborative or content-based filtering. Additionally, both publicly and privately used descriptive metadata are employed; the scope of a recommendation is therefore limited to documents that are either publicly available or covered by copyright permits. Recommendation systems support users and developers of various computer and software systems in overcoming information overload, performing information-discovery tasks, and approximating computation, among others.
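The content-based filtering mentioned above can be sketched in a few lines: represent the user profile and each item as term-weight vectors and rank items by cosine similarity. This is a minimal illustration of the general technique, not the system any particular paper describes; the item names and weights are invented for the example.

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse term-weight vectors (dicts).
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(profile, items, k=2):
    # Rank candidate (name, vector) items by similarity to the profile.
    ranked = sorted(items, key=lambda item: -cosine(profile, item[1]))
    return [name for name, _ in ranked[:k]]
```

Collaborative filtering, the other approach named above, would instead compare users to users (or items to items) based on shared interaction history rather than on content vectors.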
This study of LiFi (Light Fidelity) demonstrates how the technology can be used as a communication medium similar to WiFi. LiFi is a recent technology proposed by Harald Haas in 2011. The paper explains the process of transmitting data through the illumination of an LED bulb, and the speeds at which data can be transmitted. The author discusses the technology and explains how WiFi might be replaced by LiFi: WiFi is generally used for wireless coverage within buildings, while LiFi is suited to high-density wireless data coverage in limited areas with no obstacles. This paper presents an introduction to LiFi technology, its performance, modulation, and challenges, and can serve as a reference for developing LiFi technology.
Social media platform and Our right to privacy - vivatechijri
The advancement of Information Technology has hastened the ability to disseminate information across the globe. In particular, the recent trends in ‘Social Networking’ have led to a spark in personally sensitive information being published on the World Wide Web. While such socially active websites are creative tools for expressing one’s personality it also entails serious privacy concerns. Thus, Social Networking websites could be termed a double edged sword. It is important for the law to keep abreast of these developments in technology. The purpose of this paper is to demonstrate the limits of extending existing laws to battle privacy intrusions in the Internet especially in the context of social networking. It is suggested that privacy specific legislation is the most appropriate means of protecting online privacy. In doing so it is important to maintain a balance between the competing right of expression, the failure of which may hinder the reaping of benefits offered by Internet technology
THE USABILITY METRICS FOR USER EXPERIENCE - vivatechijri
The Google File System was innovatively created by Google engineers and was ready for production in record time. Google's success is attributed to its efficient search algorithm and also to the underlying commodity hardware. As Google ran a growing number of applications, its goal became to build a vast storage network out of inexpensive commodity hardware, so Google created its own file system, named the Google File System (GFS), one of the largest file systems in operation. GFS is a scalable distributed file system for large, distributed, data-intensive applications. Its design assumes that component failures are common, that files are huge, and that files are mutated mostly by appending data. The entire file system is organized hierarchically in directories, with files identified by pathnames. The architecture comprises multiple chunkservers, multiple clients, and a single master. Files are divided into chunks, and the chunk size is a key design parameter. GFS also uses leases and mutation order in its design to achieve atomicity and consistency. As for fault tolerance, GFS is highly available: replicas of the chunkservers and of the master exist.
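The chunk-based layout described above is easy to picture with a small client-side sketch. This is our illustration, not Google's code; it shows only how a file byte offset maps to a chunk index (GFS used fixed 64 MB chunks), which is the translation a client performs before asking the master which chunkservers hold the chunk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS fixed chunk size: 64 MB

def chunk_index(byte_offset):
    # Translate a file byte offset into a chunk index; the client then
    # asks the single master which chunkservers replicate that chunk.
    return byte_offset // CHUNK_SIZE

def chunks_for_range(start, length):
    # All chunk indices touched by reading `length` bytes at `start`.
    return list(range(chunk_index(start), chunk_index(start + length - 1) + 1))
```

Because the mapping is pure arithmetic, clients cache it and contact the master only for chunk locations, which keeps the single master from becoming a bottleneck.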
A Study of Tokenization of Real Estate Using Blockchain Technology - vivatechijri
Real estate is by far one of the most trusted investments; being lucrative, it provides a steady source of income in the form of leases and rents. Although there are numerous advantages, one of the key downsides of real estate investment is its lack of liquidity. Thus, even though global real estate investments amount to about twice the size of investments in stock markets, the number of investors in the real estate market is significantly lower. Blockchain technology has real potential to address the issues of liquidity and transparency, opening the market even to retail investors. Owing to the functionality and flexibility of creating security tokens, which are backed by real-world assets, real estate can be made liquid with the help of Special Purpose Vehicles. Tokens of the ERC 777 standard, representing fractional ownership of the real estate, can be purchased by investors, and these tokens can also be listed on secondary exchanges. The robustness of smart contracts enables the efficient transfer of tokens and the seamless distribution of earnings among investors. This work describes Ethereum blockchain-based solutions to make the existing real estate investment system much more efficient.
Understanding Cybersecurity Breaches: Causes, Consequences, and Prevention - Bert Blevins
Cybersecurity breaches are a growing threat in today’s interconnected digital landscape, affecting individuals, businesses, and governments alike. These breaches compromise sensitive information and erode trust in online services and systems. Understanding the causes, consequences, and prevention strategies of cybersecurity breaches is crucial to protect against these pervasive risks.
Cybersecurity breaches refer to unauthorized access, manipulation, or destruction of digital information or systems. They can occur through various means such as malware, phishing attacks, insider threats, and vulnerabilities in software or hardware. Once a breach happens, cybercriminals can exploit the compromised data for financial gain, espionage, or sabotage. Causes of breaches include software and hardware vulnerabilities, phishing attacks, insider threats, weak passwords, and a lack of security awareness.
The consequences of cybersecurity breaches are severe. Financial loss is a significant impact, as organizations face theft of funds, legal fees, and repair costs. Breaches also damage reputations, leading to a loss of trust among customers, partners, and stakeholders. Regulatory penalties are another consequence, with hefty fines imposed for non-compliance with data protection regulations. Intellectual property theft undermines innovation and competitiveness, while disruptions of critical services like healthcare and utilities impact public safety and well-being.
OCS Training Institute is pleased to co-operate with a global provider of Rig Inspection/Audits, Commissioning, Compliance & Acceptance as well as & Engineering for Offshore Drilling Rigs, to deliver Drilling Rig Inspection Workshops (RIW), which teach the inspection and maintenance procedures required to ensure equipment integrity. Candidates learn to implement the relevant standards and understand industry requirements so that they can verify the condition of a rig's equipment and improve safety, thus reducing the number of accidents and protecting the asset.
In May 2024, globally renowned natural diamond crafting company Shree Ramkrishna Exports Pvt. Ltd. (SRK) became the first company in the world to achieve GNFZ's final net zero certification for existing buildings, for its two flagship crafting facilities, SRK House and SRK Empire. Initially targeting 2030 to reach net zero, SRK joined forces with the Global Network for Zero (GNFZ) to accelerate its target to 2024, a trailblazing achievement toward emissions elimination.
Annual meeting of the Splunk community, where we discussed all the news presented at Splunk's annual conference, .conf24, held this June in Las Vegas.
In this video, I cover the key points of the meeting, such as:
- AI Assistant for use with SPL
- SPL2 for use in Data Pipelines
- Ingest Processor
- Enterprise Security 8.0 (the biggest update in its release history)
- Federated Analytics
- Integration with Cisco XDR and Cisco Talos
- And much more.
I also include some links to interesting reports and content that may help clarify the products and features.
https://www.splunk.com/en_us/campaigns/the-hidden-costs-of-downtime.html
https://www.splunk.com/en_us/pdfs/gated/ebooks/building-a-leading-observability-practice.pdf
https://www.splunk.com/en_us/pdfs/gated/ebooks/building-a-modern-security-program.pdf
Our official Splunk group:
https://usergroups.splunk.com/sao-paulo-splunk-user-group/
Social media management system project report.pdf - Kamal Acharya
The project "Social Media Platform in Object-Oriented Modeling" aims to design and model a robust and scalable social media platform using object-oriented modeling principles. In the age of digital communication, social media platforms have become indispensable for connecting people, sharing content, and fostering online communities; however, their complex nature requires meticulous planning and organization. This project addresses the challenge of creating a feature-rich and user-friendly social media platform by applying key object-oriented modeling concepts. It entails the identification and definition of essential objects such as "User," "Post," "Comment," and "Notification," each encapsulating specific attributes and behaviors. Relationships between these objects, such as friendships, content interactions, and notifications, are meticulously established. The project emphasizes encapsulation to maintain data integrity, inheritance for shared behaviors among objects, and polymorphism for flexible content handling. Use case diagrams depict user interactions, sequence diagrams showcase the flow of interactions during critical scenarios, and class diagrams provide an overarching view of the system's architecture, including classes, attributes, and methods. By undertaking this project, we aim to create a modular, maintainable, and user-centric social media platform that adheres to best practices in object-oriented modeling. Such a platform will offer users a seamless and secure online social experience while facilitating future enhancements and adaptability to changing user needs.
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large... - YanKing2
Pre-trained Large Language Models (LLMs) have achieved remarkable successes in several domains. However, code-oriented LLMs are often computationally heavy, scaling quadratically with the length of the input code sequence. Toward simplifying the input program for an LLM, the state-of-the-art approach filters input code tokens based on the attention scores given by the LLM. However, the decision to simplify the input program should not rely on the attention patterns of an LLM, as these patterns are influenced by both the model architecture and the pre-training dataset. Since the model and dataset belong to the solution domain, not the problem domain to which the input program belongs, the outcome may differ when the model is trained on a different dataset. We propose SlimCode, a model-agnostic code simplification solution for LLMs that depends on the nature of the input code tokens. In an empirical study on LLMs including CodeBERT, CodeT5, and GPT-4 for two main tasks, code search and summarization, we report that 1) the reduction ratio of code has a near-linear relation with the saving ratio on training time, 2) the impact of categorized tokens on code simplification can vary significantly, 3) the impact of categorized tokens on code simplification is task-specific but model-agnostic, and 4) the above findings hold for the prompt-engineering and interactive in-context learning paradigms; this study can reduce the cost of invoking GPT-4 by 24% per API query. Importantly, SlimCode simplifies the input code with a greedy strategy and can run up to 133 times faster than the state-of-the-art technique, with a significant improvement. This paper calls for a new direction in code-based, model-agnostic code simplification solutions to further empower LLMs.
20CDE09- INFORMATION DESIGN
UNIT I INCEPTION OF INFORMATION DESIGN
Introduction and Definition
History of Information Design
Need of Information Design
Types of Information Design
Identifying audience
Defining the audience and their needs
Inclusivity and Visual impairment
Case study.
Development of Chatbot Using AI/ML Technologies
The rapid advancements in artificial intelligence and natural language processing have significantly transformed human-computer interactions. This thesis presents the design, development, and evaluation of an intelligent chatbot capable of engaging in natural and meaningful conversations with users. The chatbot leverages state-of-the-art deep learning techniques, including transformer-based architectures, to understand and generate human-like responses.
Key contributions of this research include the implementation of a context-aware conversational model that can maintain coherent dialogue over extended interactions. The chatbot's performance is evaluated through both automated metrics and user studies, demonstrating its effectiveness in various applications such as customer service, mental health support, and educational assistance. Additionally, ethical considerations and potential biases in chatbot responses are examined to ensure the responsible deployment of this technology.
The findings of this thesis highlight the potential of intelligent chatbots to enhance user experience and provide valuable insights for future developments in conversational AI.
A brief introduction to quadcopter (drone) working. It provides an overview of flight stability, dynamics, general control system block diagram, and the electronic hardware.
LeetCode Database problems solved using PySpark.pdf
Survey on Efficient Techniques of Text Mining
Volume 1, Issue 1 (2018), Article No. 7, PP 1-7, www.viva-technology.org/New/IJRI
Survey on Efficient Techniques of Text Mining
Sunita Naik¹, Samiksha Gharat¹, Saraswati Shenoy¹, Rohini Kamble¹
¹(Computer, VIVA Institute of Technology / Mumbai University, India)
Abstract: In the current era, with the advancement of technology, more and more data is available in digital form. Most of this data (approximately 85%) is in unstructured textual form, so it has become essential to develop better techniques and algorithms to extract useful and interesting information from this large amount of textual data. Text mining is the process of extracting useful data from unstructured text. Each algorithm used for text mining has its own advantages and disadvantages. Moreover, the issues in the field of text mining that affect the accuracy and relevance of the results are identified.
Keywords – MWO, Consensus, PSO, Text mining, Bisecting K-means
1. INTRODUCTION
Data mining is the process of sorting through large data sets to identify patterns and establish relationships in order to solve problems through data analysis. The size of data is increasing at an exponential rate day by day, and almost every type of organization stores its data electronically, in binary form. Text mining plays an important role in search engines, where every text is digitally stored. Data mining extracts predictive information from databases; it helps companies focus on the most important information in their databases and examine historical data to find useful information. Clustering is one of the popular techniques of data mining. It is the task of dividing data into a number of similar clusters, that is, grouping a set of objects so that objects in the same group are more similar to each other than to objects in other groups. Data clustering finds similar hidden patterns in a given data set: it obtains clusters of items without class labels, based on the proximity of items within a cluster. Clustering is often applied to very large data sets containing many records with high dimensionality, and nowadays it is used for identifying useful information in historical data. Optimization is used to find a globally optimal solution. In the real world, optimization problems are often dynamic: the goal is not only to find the global optimum but also to track the trajectory of the changing optimum over time. An optimization technique yields an optimal, or at least good, solution to a complex optimization problem.
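The clustering task described above can be sketched with a minimal k-means loop. This is an illustrative implementation, not code from any of the surveyed papers; the data points and parameters are made up for the example.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch: partition points into k clusters by
    alternating an assignment step and a centroid-update step."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters

# Two well-separated groups of points should yield two distinct clusters.
data = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.1), (9.2, 8.9)]
centers, clusters = kmeans(data, k=2)
```

Note that the result depends on the random initial centers, which is exactly the initialization weakness that the optimization-based hybrids surveyed below try to address.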
2. Data Mining Techniques
2.1 A Review on Clustering Analysis based on Optimization Algorithm for Data Mining [5]
Clustering analysis is one of the important concepts of data mining. It divides the data into classes according to the main attributes of the data set, but it has drawbacks such as finding the optimal path and initializing the cluster centers. In this approach, after applying k-means, bisecting k-means is applied to the obtained clusters to find k clusters of the given data set. An optimization algorithm is then applied to find the optimal path of the clustering and increase the accuracy of the integrated hybrid algorithm.
Here the bisecting k-means technique is used along with PSO, and together they are good at maintaining the final clusters.
2.2 Bisecting K-means Algorithm for Text Clustering [14]
Three steps are used here. The first is text pre-processing, which makes natural language documents easier to compare. The second is the application of text mining techniques, such as clustering, classification, summarization, and information extraction. The third is the analysis of text, in which the outputs are analyzed to discover knowledge.
This paper gives an idea about the basics of text mining.
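The pre-processing step mentioned above can be sketched as a small tokenize-and-filter pipeline. The stopword list and example sentence here are illustrative choices, not taken from the paper.

```python
import re

# A tiny illustrative stopword list; real pipelines use much larger ones.
STOPWORDS = frozenset({"the", "is", "a", "an", "of", "and", "to"})

def preprocess(text, stopwords=STOPWORDS):
    """Pre-processing sketch: lowercase, tokenize on alphanumeric runs,
    and drop stopwords before any mining technique runs."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]

tokens = preprocess("Text mining is the process of extracting useful data!")
# → ['text', 'mining', 'process', 'extracting', 'useful', 'data']
```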
2.3 Algorithm of Group Members' Consensus Orienting to Discussion Dynamic Process [6]
To handle this dynamic expansion process, the authors proposed a new algorithm of group members' consensus oriented to the dynamics of the discussion. Based on the extraction and clustering of experts' discussion information, expert weights change dynamically as the discussion proceeds, and at the same time the consensus state of the group discussion changes dynamically. For claim C1, if focus = 4, the value is 0.1538 and the exact consensus value is 3.3846.
This paper presents an algorithm for calculating a consensus value based on cluster analysis and the value of modality, and the method is shown to be feasible and effective.
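The consensus formula is garbled in this copy of the survey, but from the analysis table it appears to be a weight-times-modality sum. The sketch below assumes that reading; the expert weights and modality scores are hypothetical numbers, not the paper's C1 example.

```python
def consensus(weights, modalities):
    """Consensus value of claim c_j as the expert-weighted sum of
    modalities: Consensus(c_j) = sum_i lambda_i * v_ij, where lambda_i
    is expert i's weight and v_ij is expert i's modality toward c_j."""
    return sum(w * v for w, v in zip(weights, modalities))

# Hypothetical figures: three experts whose weights lambda_i would be
# updated dynamically as the discussion evolves.
weights = [0.5, 0.3, 0.2]    # lambda_i (normalized expert weights)
modalities = [4, 3, 2]       # v_ij, expert i's modality toward claim c_j
value = consensus(weights, modalities)   # 0.5*4 + 0.3*3 + 0.2*2 = 3.3
```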
2.4 Stability of Distributed Adaptive Algorithms I: Consensus Algorithms [7]
Performance analysis (convergence and mean squared error measures) is pursued under two regimes: fixed gain (a.k.a. short memory) and vanishing gain (a.k.a. long memory), with more similarity results holding in the vanishing-gain regime than in the fixed-gain regime. Two types of noise are considered: white noise, which has equal intensity at all frequencies, and colored noise, which is correlated random data.
Since this algorithm is good at removing noise or erroneous outputs, it can be used after applying k-means.
2.5 A Modified Particle Swarm Optimization with Dynamic Particles Re-initialization Period [8]
Particle swarm optimization (PSO) is an algorithm that searches for better solutions in the solution space by attracting particles to converge toward the particle with the best fitness. To overcome its problems, the authors propose an improved PSO algorithm that re-initializes particles dynamically when the swarm is trapped in a local optimum. Moreover, the particle re-initialization period can be adjusted to suit the problem. The proposed technique was tested on benchmark functions and gives more satisfactory search results than standard PSO variants on those functions [9]. PSO has many advantages, such as rapid convergence, simplicity, and few parameters to adjust; its main disadvantages are trapping in local optima and premature convergence. The improved PSO technique is good at initializing cluster centers.
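A sketch of PSO with stagnation-triggered re-initialization is shown below. The constants follow the settings reported in the analysis table (η1 = η2 = 1.496180, ω = 0.729844); the stagnation criterion (re-scatter positions when the global best has not improved for `period` iterations, while keeping the personal-best memory) is an assumed simplification of the paper's adjustable re-initialization period, and the test function is a generic sphere benchmark.

```python
import random

def pso_reinit(f, dim, bounds, swarm=20, iters=200, period=30, seed=1):
    """PSO sketch with dynamic particle re-initialization: if the global
    best has not improved for `period` iterations, particle positions
    are re-scattered to escape a local optimum (personal bests kept)."""
    rng = random.Random(seed)
    w, c1, c2 = 0.729844, 1.496180, 1.496180
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(swarm)]
    vel = [[0.0] * dim for _ in range(swarm)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(swarm), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    stall = 0
    for _ in range(iters):
        for i in range(swarm):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
                    stall = 0
        stall += 1
        if stall >= period:  # swarm stagnated: re-initialize particles
            pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(swarm)]
            vel = [[0.0] * dim for _ in range(swarm)]
            stall = 0
    return gbest, gbest_val

# Minimizing the sphere function should drive the best value near zero.
best, best_val = pso_reinit(lambda x: sum(v * v for v in x), dim=2, bounds=(-5, 5))
```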
2.6 Mussels Wandering Optimization: An Ecologically Inspired Algorithm for Global Optimization [1]
Over the past few years, various complex optimization problems have arisen. To overcome problems of text mining, mussels wandering optimization is used and compared with various algorithms to observe which gives a better solution. This paper uses a novel metaheuristic algorithm, mussels wandering optimization (MWO), inspired by the locomotion behavior of mussels as they form bed patterns in their habitat; it models the mussels and finds their density in the habitat. One of the most significant merits of MWO is that it provides an open framework for tackling hard optimization problems.
2.7 A Data Clustering Algorithm Based on Mussels Wandering Optimization [2]
Clustering algorithms such as k-means are used to form clusters, but they have drawbacks in searching for an optimal solution. Considering these drawbacks and limitations, this paper proposes a new algorithm based on k-means and mussels wandering optimization. The aim of this algorithm is to reach an optimal solution by mathematically modeling mussels.
In k-MWO, each mussel represents a set of centers of k classes. The algorithm first initializes N mussels and evaluates each mussel's fitness using the squared sum error; according to the fitness values, the top mussels are found and their positions updated.
This paper gives the idea of merging MWO with other algorithms, reporting accuracy in tabular form. By combining the two algorithms, full use can be made of the global optimization ability of MWO and the local search ability of k-means.
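The squared-sum-error fitness used to score each mussel (a candidate set of k centers) can be sketched as follows; the example points and centers are made up for illustration.

```python
import math

def sse_fitness(centers, points):
    """Squared-sum-error fitness for a candidate set of cluster centers
    (one 'mussel' in k-MWO encodes k centers): each point contributes
    its squared distance to the nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
good = sse_fitness([(0.0, 1.0), (10.0, 10.0)], points)  # tight centers
bad = sse_fitness([(5.0, 5.0), (6.0, 6.0)], points)     # poor centers
# Lower SSE means a fitter mussel, so here good < bad.
```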
2.8 A Survey Paper for Finding Frequent Pattern in Text Mining [3]
Text mining is a very important method for finding important information in large amounts of data. In data mining there are important rules for finding frequent data patterns, including frequent patterns and association rules. This paper uses frequent pattern rules for temporal text mining, a technique that involves data mining and information extraction. Disadvantages of pattern-based methods are low frequency and misinterpretation; many noisy patterns may be discovered, and to solve this problem term-based methods are used.
2.9 Text Mining: Techniques, Applications and Issues [9]
This paper presents a review of text mining. Over 80% of information consists of unstructured and semi-structured data. Content mining is the procedure of extracting information from huge data sets; by choosing a good strategy we can increase the speed and lessen the time and effort required to extract the information or content. Some techniques used for text mining are information extraction, information retrieval, clustering, and text summarization. Applications of text mining include academia and research, digital libraries, business intelligence, and social media.
This paper highlights the techniques, applications, and issues of text mining. Nowadays text mining is applied in every field. NLP and entity recognition techniques reduce the issues that occur during the text mining process. Text mining tools are also used in the life sciences, e.g. in the biomedical field, where they provide an opportunity to extract important information and the associations and relationships among various diseases, species, genes, etc.
2.10 A Comparative Analysis of Particle Swarm Optimization and K-means Algorithm for Text Clustering Using Nepali WordNet [10]
This paper discusses particle swarm optimization and the k-means algorithm, investigating three approaches: k-means, particle swarm optimization, and hybrid PSO + k-means clustering. Clustering is characterized as the collection of information into bunches or groups such that the documents in each group are similar to one another and dissimilar from those in other groups. The hybrid PSO + k-means algorithm combines two modules, a PSO module and a k-means module: it first executes the PSO clustering algorithm as a global search, and PSO terminates when the set number of iterations is done. The hybrid algorithm thus combines both advantages, the global search of PSO and the fast convergence of k-means.
This paper highlights the k-means, bisecting k-means, and hybrid PSO + k-means algorithms. The k-means algorithm was compared with PSO and hybrid PSO + k-means, and hybrid PSO + k-means performs better than both PSO and k-means. The similarity between two documents needs to be computed in a clustering analysis; several similarity measures are available, such as Euclidean distance, Manhattan distance, and cosine similarity, and among these, cosine similarity has been used.
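The cosine similarity measure chosen in that comparison can be sketched over simple bag-of-words term-frequency vectors; the two example documents are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents represented as
    bag-of-words term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Two of four terms overlap, so the similarity is 2 / (2 * 2) = 0.5.
sim = cosine_similarity("text mining with clustering", "clustering of text data")
```

Unlike Euclidean or Manhattan distance, cosine similarity ignores document length, which is why it is popular for comparing text documents.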
2.11 Review on Clustering Web Data Using PSO [11]
This paper describes clustering techniques for web data mining, where text extraction and clustering are the main challenging tasks. The literature reviews a bio-inspired swarm intelligence algorithm, particle swarm optimization, used to improve results. Web information is often conflicting, unstructured, and fragmented; such issues can be solved by preprocessing, which turns raw information into a proper arrangement. After preprocessing, the PSO algorithm is applied to the web data for the purpose of web text clustering.
This paper highlights the particle swarm optimization algorithm as well as clustering techniques such as partition clustering, hierarchical clustering, density-based clustering, grid-based clustering, model-based
clustering, and fuzzy clustering. PSO is also compared with two other algorithms, the genetic algorithm and ACO, but PSO gives better results in terms of time and speed, and it has low memory requirements and low computational cost.
2.12 A Limited Iteration Bisecting K-means for Fast Clustering of Large Datasets [12]
This paper describes the bisecting k-means algorithm compared with k-means, with a limited number of iterations; it maintains the clustering quality despite the iteration limit. Bisecting k-means divides a cluster into two using k-means with k = 2, and this bisecting process continues until the total number of clusters reaches k. Bisecting k-means is an improvement over k-means both in clustering quality and in efficiency on large data sets; each 2-means run starts with a different pair of initial centers.
This paper highlights the limited iteration bisecting k-means for clustering large datasets. The original version of bisecting k-means performs multiple runs of 2-means. Bisecting k-means produces better and more efficient clustering than k-means.
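The bisecting process described above can be sketched as follows. This is an illustrative simplification, not the paper's LIBKM implementation: it splits the largest cluster with a 2-means run whose iteration count is capped, and it assumes the input points are distinct.

```python
import math
import random

def two_means(points, iters, rng):
    """One limited-iteration run of k-means with k = 2 on distinct points."""
    centers = rng.sample(points, 2)
    best = [points]
    for _ in range(iters):  # iteration count is capped, as in LIBKM
        halves = [[], []]
        for p in points:
            nearer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            halves[nearer].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split: keep the previous partition
        best = halves
        centers = [tuple(sum(x) / len(h) for x in zip(*h)) for h in halves]
    return best

def bisecting_kmeans(points, k, iters=10, seed=0):
    """Bisecting k-means sketch: repeatedly split the largest cluster
    with 2-means until k clusters are reached."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()       # split the largest cluster
        clusters.extend(two_means(largest, iters, rng))
    return clusters

data = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (10.2, 0.1)]
clusters = bisecting_kmeans(data, k=3)
```

Because each split only ever runs 2-means, the per-split cost stays low even on large data sets, which is the efficiency argument the paper makes.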
3. ANALYSIS TABLE
Table 1: Analysis Table

| Sr. No. | Title | Technique/Methods | Parameter | Accuracy |
|---|---|---|---|---|
| 1 | Mussels Wandering Optimization: An Ecologically Inspired Algorithm for Global Optimization | Mussels Wandering Optimization (MWO) | Function f1: μ (d = 20) | If μ = 1.5: Best = 273.99, Mean = 1.47e+4 |
| 2 | A Data Clustering Algorithm Based on Mussels Wandering Optimization | k-means and Mussels Wandering Optimization (MWO) | DI: measures the ratio between distance and diameter of a cluster | DI: Max 0.1128, Min 0.1009, Mean 0.1021; DBI: Max 0.4375, Min 0.3916, Mean 0.4231 |
| 3 | A Survey Paper for Finding Frequent Pattern in Text Mining | Frequent pattern rules, information extraction rules | | |
| 4 | Mussels Wandering Algorithm Based Training of Artificial Neural Network for Pattern Classification | MWO applied to an artificial neural network | Classification accuracy, training time | Classification accuracy: 78.3; training time: 1.48 s |
| 5 | A Review on Clustering Analysis based on Optimization Algorithm for Data Mining | Bisecting k-means and Particle Swarm Optimization (used to overcome the method's dependency on cluster initialization) | Distance between cluster 1 and cluster 2: if dist1 > dist2, divide cluster 1 into two more clusters; if dist2 > dist1, divide cluster 2 into two more clusters | |
| 6 | Bisecting K-means Algorithm for Text Clustering | Bisecting k-means with time complexity | Computes two clusters with k = 2; run-time complexity of the algorithm is O((K-1)IN) | |
| 7 | Algorithm of Group Members' Consensus Orienting to Discussion Dynamic Process | Consensus building algorithm | Consensus value of claim cj is Consensus(c) = Σ λi · vij, where λi is expert i's weight and vij is expert i's modality toward claim cj | For claim C1, if focus = 4, the value is 0.1538 and the exact consensus value is 3.3846 |
| 8 | Stability of Distributed Adaptive Algorithms I: Consensus Algorithms | Analysis of a consensus-based distributed LMS algorithm under colored-noise assumptions | If μ[λmax(L) + max_k λmax(Rx,k)] < 2, then for each node E(w̃k,t) → w* | |
| 9 | A Modified Particle Swarm Optimization with Dynamic Particles Re-initialization Period | Particle Swarm Optimization | Acceleration constants η1 = η2 = 1.496180; inertia weight ω = 0.729844; population 20; maximum iterations 5000 | |
| 10 | Text Mining: Techniques, Applications and Issues | Information extraction, information retrieval, clustering, and text summarization | | |
| 11 | A Comparative Analysis of Particle Swarm Optimization and K-means Algorithm for Text Clustering Using Nepali WordNet | k-means, PSO, and hybrid PSO + k-means | | For 50 documents, hybrid PSO + k-means gives 6.964 for intra-cluster and 0.952 for inter-cluster |
| 12 | Review on Clustering Web Data Using Particle Swarm Optimization | PSO, GA, and ACO | Better cost, memory requirement, simplicity, etc. | |
| 13 | A Limited Iteration Bisecting K-means for Fast Clustering of Datasets | Bisecting k-means; limited iteration bisecting k-means (LIBKM) | Bisecting k-means is better than k-means; keeps the iteration count limited | LIBKM divides into 2 clusters using k-means with k = 2; improves clustering quality by removing error and validating the clusters |
| 14 | A Survey on Particle Swarm Optimization Algorithm Application in Text Mining | PSO-based data clustering method | | PSO compared with GA and SA; PSO gives better results in terms of accuracy and efficiency |
4. CONCLUSION
This paper presents the significance of text mining and a study of the techniques used for it. Classification and clustering techniques are also presented in the survey. The survey also includes information on the different data mining algorithms, giving detailed information about text mining and clarifying the advantages and disadvantages of the data mining approaches. Different text mining techniques are applied to unstructured informational collections that reside in the form of text documents; such techniques permit building a better search engine using database knowledge together with filter, wrapper, or even ontology methods. Open areas, challenging issues, and research directions in text mining are also described.
REFERENCES
[1] Jing An, Qi Kang, Lei Wang, Qidi Wu, "Mussels Wandering Optimization: An Ecologically Inspired Algorithm for Global Optimization", IEEE International Conference on Networking, Sensing and Control.
[2] Peng Yan, ShiYao Lui, Bing zyao Huang, "A Data Clustering Algorithm Based on Mussels Wandering Optimization", IEEE International Conference, 2014.
[3] Ms. Sonam Tripathi, Asst. Prof. Tripathi Sharma, "A Survey Paper for Finding Frequent Pattern in Text Mining", International Journal of Advanced Research in Computer Engineering & Technology (IJRCET).
[4] Ahmed A. Abusnaina, Rosni Abdullah, "Mussels Wandering Optimization Algorithm Based Training of Artificial Neural Networks for Pattern Classification", International Conference on Computing and Information (ICOCI), 2013.
[5] Rashmi P. Dagde, Snehlata Dongre, "A Review on Clustering Analysis based on Optimization Algorithm for Data Mining", IJCSN International Journal of Computer Science and Network, Volume 6, Issue 1, February 2017.
[6] Zhang Zhen, Chen Chao, Chen Jun-liang, "Algorithm of Group Members' Consensus Orienting to Discussion Dynamic Process", IEEE.
[7] Victor Solo, "Stability of Distributed Adaptive Algorithms I: Consensus Algorithms", IEEE, 2015.
[8] Chiabwoot Ratanavilisagul, Boontee Kruatrachue, "A Modified Particle Swarm Optimization with Dynamic Particles Re-initialization Period", Springer International Publishing Switzerland, 2014.
[9] Ramzan Talib, Muhammad Kashif Mani, Shaeela Ayesha, Fakeeha Fatima, "Text Mining: Techniques, Applications and Issues", IJACSA, 2016.
[10] Sarkar, Arindam Roy, B. S. Purkayastha, "A Comparative Analysis of Particle Swarm Optimization and K-means Algorithm for Text Clustering Using Nepali WordNet", IJNLC, June 2014.
[11] Jayshree Ghorpade-Aher, Roshan Bagdiya, "Review on Clustering Web Data Using PSO", International Journal of Computer Applications, December 2014.
[12] Yu Zhuang, YuMau, Xinchen, "A Limited Iteration Bisecting K-means for Fast Clustering of Large Datasets", IEEE TrustCom, 2016.
[13] Rekha Dahiya, Anshima Singh, "A Survey on Application of Particle Swarm Optimization in Text Mining", International Journal of Innovative Research & Development, May 2014.
[14] Nikita P. Katariya, Prof. M. S. Chaudhari, "Bisecting K-means Algorithm for Text Clustering", IJARCSSE, February 2015.