
I.J. Information Technology and Computer Science, 2016, 9, 32-38

Published Online September 2016 in MECS. DOI: 10.5815/ijitcs.2016.09.05

Combining Naïve Bayes and Modified Maximum Entropy Classifiers for Text Classification

Hanika Kashyap

R. N. Modi Institute of Technology, Kota, India

E-mail: hanikakashyap5@

Dr. Bala Buksh

R. N. Modi Institute of Technology, Kota, India

E-mail: balabuksh@

Abstract—Text classification is done mainly through classifiers proposed over the years, Naïve Bayes and Maximum Entropy being the most popular of all. However, the individual classifiers show limited applicability according to their respective domains and scopes. Recent research works have found that combinations of classifiers perform better than the individual ones. This work introduces a modified Maximum Entropy-based classifier. Maximum Entropy classifiers provide a great deal of flexibility for parameter definitions and follow assumptions closer to real-world scenarios. This classifier is then combined with a Naïve Bayes classifier. Naïve Bayes classification is a very simple and fast technique, and its assumption model is opposite to that of Maximum Entropy. The combination of classifiers is done through operators that linearly combine the results of the two classifiers to predict the class of documents in a query. The validity of the 7 proposed modifications (4 modifications of Maximum Entropy, 3 combined classifiers) is demonstrated through implementation and experiments on real-life datasets.

Index Terms—Text Classification, Combination of Classifiers, Naïve Bayes Classifier, Maximum Entropy Classifier, Accuracy.

I. INTRODUCTION

The amount of text available for analysis has increased hugely in recent years due to social networking, micro-blogging and various messaging/bulletin board systems. Besides these, many articles, news feeds and documents are now available in soft copy. An important step in text classification is to classify the text documents among some known set of classes/categories. The task of data mining can be done through two processes, classification and clustering. While clustering is an unsupervised learning approach, classification is a supervised form of machine learning. It helps to classify the given text into different categories using efficient classification algorithms.

The classification process is itself a very detailed process consisting of various stages, and each stage has a set of methods to choose from depending on the text and the given classification problem. The final stage is the classification stage. Algorithms, called classifiers, are trained using documents already classified (manually) and then used to predict the category of a new text document. Over the years, many classification algorithms have been proposed, of which Naïve Bayes [1], Maximum Entropy [2], K-Nearest Neighbor (KNN) [3] and Support Vector Machines (SVM) [4] are still commonly used.

Each classifier is restricted to its own scope of classification, which makes it difficult to classify a given text effectively. Extensions of the classifiers have also been proposed to overcome their drawbacks. Yigit and Baykan [5] used the KNN classifier for detecting news related to Turkey among different news channels; the classification carried out by the KNN classifier was found to be 90% accurate. Similarly, Naïve Bayes outperforms SVMs for authorship attribution in [6]. Such research works lead to the conclusion that each classifier works well only on specific applications, so every upcoming application has to be tested against the various available classifiers to find which one works well. A generalized classification algorithm that suits every application is therefore needed. Hybridization of classifiers, which brings out the best of the combined classifiers, is found to be a promising approach in this direction. The results of combinations, when compared to the results of the individual classifiers, are visibly better, which gives a boost to this area of research. The effects of combination can be seen clearly in [7,8,9,10,11,12,13,14].

In this paper, Naïve Bayes [1] and Maximum Entropy [2] classifiers are considered for combination for the purpose of text classification. Whereas Naïve Bayes is extremely simple, the Maximum Entropy classifier provides great flexibility and uniformity. The assumption models of the two differ completely: Naïve Bayes assumes total independence between the words in a document (which is realistically impossible), unlike the Maximum Entropy classifier, whose assumptions approximate real-world scenarios. Modifications to the traditional Maximum Entropy classifier are proposed to make it more efficient, and the modified versions of the Maximum Entropy classifier are then combined with the Naïve Bayes classifier using three merging operators: Max, Average and Harmonic Mean. The performance is measured on different datasets chosen such that no individual classifier has clearly better performance over all of them.

II. RELATED WORK

This section reviews some relevant hybrid approaches for text classification. Recent research works [7,8,9,10,11,12,13,14] in the direction of combining classifiers for text classification assert that a combination is always better than using the individual classifiers.

As early as 1996, Larkey and Croft [7] proposed the combination of three classifiers, KNN, relevance feedback and Bayesian independence classifiers, to be used in the medical domain for automatic assignment of ICD9 codes. The task was done first with the individual classifiers and then with the combination to check the effectiveness of both approaches, and the hybrid approach was concluded to be better. The performance of the classifiers was measured based on document ranks; this is an example where classifiers are used for document ranking. The approach used is a weighted linear combination. Bennett, Dumais and Horvitz [8] proposed a probabilistic method to combine classifiers such that the contribution of a classifier depends on its reliability. The reliability is measured through reliability indicators which are linked to the regions where a classifier might perform relatively well or poorly. Instead of the rank of the document, the indicators are based on the performance of the classifier itself, thus making the proposal more generalized.

In 2004, Grilheres, Brunessaux and Leray [9] published a detailed study of the effect of combining classifiers to classify multimedia documents into heterogeneous classes. Various combinations were applied to a five-thousand-webpage collection from the European Research Project NetProtect II, and the experimental results show that, with prior knowledge of the classifiers, better filtering performance is possible. The approaches used for combining are both voting-based and logic-based.

Besides the conventional style of linear or voting-based combination, a new technique based on Dempster-Shafer theory was proposed by Sarinnapakorn and Kubat [10]. Their main aim is the fusion of sub-classifiers, since the application is multi-label classification.

Isa et al. in their two successive papers [11] and [12] proposed a novel idea as to how the meta-outputs of a Naïve Bayes technique can be used with SVM and Self-Organizing Maps (SOM) respectively. The Bayes formula is used to convert a text document into a vector space where the values denote the probabilities of the document belonging to each class depending on the features contained. This is called the vectorisation phase of the classifier and is common to both approaches. In [11], SVM is then applied on this vector space model for the final classification output; the proposal improved classification accuracy compared to the pure Naïve Bayes classification approach. In [12], the probability distributions obtained by the Bayes technique are followed by an indexing step done through SOM to retrieve the best matching cases. SOM is similar to clustering of documents based on a similarity measure between the documents, such as Euclidean distance.

Miao et al. [13] considered a very different combination of classifiers, namely the KNN and Rocchio methods. A variable precision rough set is used to partition the feature space into lower and upper bounds of each class, and each subspace is classified through the Rocchio technique. This fails when an arriving document is in the boundary region, where kNN is used instead. This presents a new style of combining classification methods so that they overcome each other's drawbacks.

A more recent research work by Fragos, Belsis and Skourlas [14] also concludes in favor of combining different approaches for text classification. The methods that the authors combined belong to the same paradigm, the probabilistic one: Naïve Bayes and Maximum Entropy classifiers are chosen and tested on applications where the individual performance is good. Merging operators are applied on top of the individual results; the Maximum and Harmonic Mean operators are used, and the performance of the combination is better than that of the individual classifiers.

Keretna, Lim and Creighton [15] worked on recognizing named entities from a medical dataset containing informal and unstructured text. For this, they combine the individual results of Conditional Random Field (CRF) and Maximum Entropy (ME) classifiers on the medical text, each classifier trained using a different set of features: CRF concentrates on the contextual features and ME on the linguistic features of each word. The combined results were better than the individual results of both classifiers based on the recall rate performance measure.

Ramasundaram [16] aimed to improve the N-grams classification algorithm by applying the Simulated Annealing (SA) search technique to the classifier. The hybrid classifier NGramsSA brought an improvement to the original NGrams classifier while inheriting all the advantages of the N-grams approach. A feature reduction method is used, but its multivariate value among the n-grams affects the performance of the classifier.

III. PROPOSED CLASSIFIERS

This section discusses the document model used for representing the text documents, the modified classifiers considered for the combination, and the proposed classification process.

A. Representing Document Model

For representing documents, a term frequency matrix is used, which records the number of times a particular term appears in a document. Each document is an M-tuple of values, where each value is the frequency of the corresponding term occurring in the document, that is, d_i = (tf_i1, tf_i2, ..., tf_iM), as shown in the following matrix:

[ tf_11  tf_12  ...  tf_1M ]
[ tf_21  tf_22  ...  tf_2M ]
[  ...    ...   ...   ...  ]
[ tf_N1  tf_N2  ...  tf_NM ]

The notation used below is as follows:

C represents the number of classes;
M represents the number of features/terms;
N_R represents the number of documents in the training set;
N_E represents the number of documents in the testing set;
N represents the total number of documents, that is, N = N_R + N_E.

B. Naïve Bayes Classifier

The Naive Bayes classifier [1] is considered one of the simplest probabilistic models, showing how the data is generated under the following assumption:

"Given the context of the class, all attributes of the text are independent of each other."

This technique starts by taking text documents as word counts. It calculates the class-conditional probability followed by the classification probability, or posterior probability, which the trained classifier uses to predict the class of any document.

For every term t_i and class c, the class-conditional probability P̂(t_i | c), considering only one training set, is given as follows:

P̂(t_i | c) = ( Σ_{d∈c} tf(t_i, d) + α ) / ( Σ_t Σ_{d∈c} tf(t, d) + α·M )   (1)

Where,

Σ_{d∈c} tf(t_i, d): the sum of raw term frequencies of word t_i over all documents in the training sample that belong to class c;

α: an additive smoothing parameter;

Σ_t Σ_{d∈c} tf(t, d): the sum of all term frequencies in the training dataset for class c.

The posterior probability of a document d belonging to any class c is the product of the individual class-conditional probabilities of all terms contained in the query document:

P(c | d) = Π_{i=1..M} P̂(t_i | c)^tf(t_i, d)   (2)

Once all these probabilities have been computed, the maximum probability towards a class indicates that the query document belongs to that class:

c* = argmax_c P(c | d)   (3)
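To make Eqns (1)-(3) concrete, the following is a minimal Python sketch of a term-frequency Naïve Bayes classifier with additive smoothing, assuming documents are already represented as term-frequency dictionaries. The function names (train_nb, predict_nb) and the inclusion of a class prior term are illustrative additions of this sketch, not details taken from the paper's implementation.

```python
import math
from collections import defaultdict

def train_nb(docs, labels, vocab, alpha=1.0):
    """Estimate class priors and smoothed class-conditional term probabilities (Eqn 1)."""
    class_term_counts = defaultdict(lambda: defaultdict(float))
    class_totals = defaultdict(float)
    class_doc_counts = defaultdict(int)
    for tf, c in zip(docs, labels):            # tf: {term: frequency} for one document
        class_doc_counts[c] += 1
        for t, f in tf.items():
            class_term_counts[c][t] += f
            class_totals[c] += f
    priors = {c: n / len(docs) for c, n in class_doc_counts.items()}
    cond = {c: {t: (class_term_counts[c][t] + alpha) / (class_totals[c] + alpha * len(vocab))
                for t in vocab}
            for c in class_doc_counts}
    return priors, cond

def predict_nb(tf, priors, cond):
    """Score classes with log posteriors (Eqn 2) and pick the argmax (Eqn 3)."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for t, f in tf.items():
            if t in cond[c]:
                score += f * math.log(cond[c][t])
        scores[c] = score
    return max(scores, key=scores.get), scores
```

Working in log space avoids numerical underflow when the product of many small probabilities in Eqn (2) is evaluated.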

C. Maximum Entropy Classifier

The Maximum Entropy classifier [2] is based on the principle that the model generating the training set should be the most uniform among all candidate models, while all constraints derived from the training set are satisfied by the model.

Let f_i(d, c) be a feature function of document d with class c, P(c | d) be the required probability that assigns class c to document d, and P̃(d, c) be the empirical probability distribution. The maximum entropy principle states that the expected value of f_i(d, c) is the same under both P(c | d) and P̃(d, c). This can be treated as a constraint, which makes

P(c | d) = (1 / Z(d)) · exp[ Σ_i λ_i f_i(d, c) ]   (4)

Here, Z(d) = Σ_c exp[ Σ_i λ_i f_i(d, c) ] is the normalization factor and λ_i is the weight for each feature f_i(d, c).

D. Modified Maximum Entropy Classifier

The ME classifier is modified in three aspects: the weights, the feature functions, and the prediction probability. The weights can be computed using any of the weighting methods Gini Index, Chi-square, CMFS or DIA instead of the conventional method of optimizing the objective function in the ME classifier. These weighting methods are discussed below:

• Gini Index

Suppose S is the sample set, C_i is the subset of S that belongs to class c_i, and |C_i| is the sample number of that subset; then the Gini index of set S is:

Gini(S) = 1 − Σ_{i=1..C} ( |C_i| / |S| )²   (6)

• Chi-Square (CHI)

The chi-square formula is defined as follows:

χ²(t, c) = N · (A·D − C·B)² / ( (A+C)·(B+D)·(A+B)·(C+D) )   (7)

where N is the number of documents in the training set; A is the frequency with which feature t occurs in category c; B is the frequency with which feature t occurs in all categories except c; C is the frequency with which category c occurs without containing feature t; and D is the number of times neither c nor t occurs.

• DIA Association Factor (DIA)

The DIA association factor is defined by

z(t, c) = P(c | t)   (8)

where P(c | t) refers to the conditional probability that a document belongs to category c when the feature t occurs.

• Comprehensive Measurement Feature Selection (CMFS)

CMFS(t, c) = P(t | c) · P(c | t)   (9)

• Global Feature Selection (GFS)

This feature selection algorithm is the global version of CMFS and sums CMFS over all classes, thereby improving the performance of the classification (a brief code sketch is given after Eqn (11)). It is defined by

GFS(t) = Σ_{i=1..C} P(t | c_i) · P(c_i | t)   (10)

The feature functions are computed as the contribution of a feature towards a class using

f(t_i, c) = tf(t_i, c) / Σ_t tf(t, c)   (11)
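To illustrate how the GFS scores of Eqn (10) can rank features so that only the top K are retained (as done in the feature extraction stage later), here is a hedged sketch. It estimates P(t|c) and P(c|t) from summed term frequencies, which is one plausible reading; the paper may instead estimate them from document counts, and all function names are illustrative.

```python
from collections import defaultdict

def gfs_scores(docs, labels):
    """Score each term by GFS(t) = sum over classes of P(t|c) * P(c|t)  (Eqn 10)."""
    term_class = defaultdict(lambda: defaultdict(float))   # term frequency per (term, class)
    class_total = defaultdict(float)                        # total term mass per class
    term_total = defaultdict(float)                         # total term mass per term
    for tf, c in zip(docs, labels):
        for t, f in tf.items():
            term_class[t][c] += f
            class_total[c] += f
            term_total[t] += f
    scores = {}
    for t in term_class:
        s = 0.0
        for c, f in term_class[t].items():
            p_t_given_c = f / class_total[c]
            p_c_given_t = f / term_total[t]
            s += p_t_given_c * p_c_given_t                  # CMFS(t, c), Eqn (9)
        scores[t] = s
    return scores

def top_k_features(docs, labels, k):
    """Keep only the K best-ranked terms, as in the feature extraction stage."""
    scores = gfs_scores(docs, labels)
    return set(sorted(scores, key=scores.get, reverse=True)[:k])
```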

The prediction probability of the ME classifier has been modified as

P(c | d) = Σ_i [ λ(t_i, c) · f(t_i, c) · tf(t_i, d) ] / Σ_{c'} Σ_i [ λ(t_i, c') · f(t_i, c') · tf(t_i, d) ]   (12)

Our proposed modification multiplies the weights λ(t_i, c) and the feature function f(t_i, c) with the term frequency tf(t_i, d) when calculating the prediction probability, unlike the conventional method used for the ME classifier.
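A minimal sketch of the modified prediction probability of Eqn (12) is given below. It assumes the weights λ(t, c) have already been obtained with one of the weighting methods above and that f(t, c) comes from Eqn (11); both are passed in as dictionaries keyed by (term, class). The names and the uniform fallback for a zero denominator are assumptions of this sketch.

```python
def modified_me_posterior(tf, classes, weights, feat):
    """P(c|d) per Eqn (12): weight * feature function * term frequency, normalised over classes."""
    raw = {}
    for c in classes:
        raw[c] = sum(weights.get((t, c), 0.0) * feat.get((t, c), 0.0) * f
                     for t, f in tf.items())
    z = sum(raw.values())
    if z == 0.0:
        return {c: 1.0 / len(classes) for c in classes}   # no evidence: fall back to uniform
    return {c: v / z for c, v in raw.items()}
```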

E. Proposed Combined Classifiers

A stage-by-stage representation of the classification process is illustrated in Fig. 1.

The proposed classification process consists of the following stages:

• Preprocessing stage

• Feature Extraction stage

• Individual Classification stage

• Combining Classification stage

• Final Results

The classification process starts with preprocessing of the text to make it ready for classification, followed by extracting relevant features through the Global Feature Selection (GFS) method, which ranks the features according to their importance so that the top K features can be extracted. After the feature extraction process, the Naïve Bayes (NB) and Maximum Entropy (ME) classifiers are used individually for classification. The later stage combines both classifiers using three combination operators: Average, Harmonic Mean and Max. The combining operators are used to compensate for the errors of each classifier and to improve performance. Equations (13), (14) and (15) show the Average, Max and Harmonic Mean operators respectively.

Fig.1. Classification Process

P_avg(c | d) = avg( P_NB(c | d), P_ME(c | d) )   (13)

P_max(c | d) = max( P_NB(c | d), P_ME(c | d) )   (14)

P_hm(c | d) = 2 · P_NB(c | d) · P_ME(c | d) / ( P_NB(c | d) + P_ME(c | d) )   (15)
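The three merging operators of Eqns (13), (14) and (15) translate directly into code; a short sketch with illustrative names follows.

```python
def combine(p_nb, p_me, op="max"):
    """Combine NB and ME posteriors for one class: Average (13), Max (14) or Harmonic Mean (15)."""
    if op == "avg":
        return (p_nb + p_me) / 2.0
    if op == "max":
        return max(p_nb, p_me)
    if op == "hm":
        return 0.0 if p_nb + p_me == 0.0 else 2.0 * p_nb * p_me / (p_nb + p_me)
    raise ValueError("unknown operator: " + op)
```

The combined score is computed per class, and the class with the maximum combined value is predicted, as formalised in the algorithm below.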

The results of the combination then give the final result. Formally, the proposed classification algorithm can be detailed as:

ALGORITHM: CLASSIFICATION

INPUT: Training data D_R as term frequencies and class labels, a test document d, and the number of classes C.
OUTPUT: The predicted class c of document d.

Step 1: Train the NB classifier's class-conditional probabilities using D_R and the class labels as per Eqn (1).

Step 2: Compute the NB posterior class probability P_NB(c_i | d) for every class c_i as per Eqn (2).

Step 3: Train the ME classifier: compute the weights using any of the weighting schemes (Chi-square, DIA factor, Gini Index or CMFS) and the feature function as in Eqn (11).

Step 4: Compute the posterior probability using ME, P_ME(c_i | d), for every class c_i using Eqn (12).

Step 5: Combining results: compute, for every class c_i, the probability that d belongs to c_i as


P(c_i | d) = COMB( P_NB(c_i | d), P_ME(c_i | d) )

where COMB represents any one of the operators Max, Harmonic Mean or Average.

Step 6: Predict the class of d as the class c_i for which P(c_i | d) is maximum, as in Eqn (3).
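Putting Steps 5 and 6 together with the Max operator of Eqn (14) as an example, the final decision can be sketched as follows; p_nb and p_me are assumed to be per-class posterior dictionaries obtained from Eqns (2) and (12), and the function name is illustrative.

```python
def predict_combined(p_nb, p_me):
    """Steps 5-6 with the Max operator: combine per-class posteriors and return the argmax class."""
    combined = {c: max(p_nb[c], p_me[c]) for c in p_nb}
    return max(combined, key=combined.get)
```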

IV. IMPLEMENTATION RESULTS

A. Classification Performance

For evaluation and comparison purposes, various classifiers and their combinations are used, as listed in Table 1 together with the codes used for them in this paper. The combined classifiers used by Fragos et al. in [14] have also been implemented for comparison.

Table 1. List of Codes with their Details

The real-life datasets used are the Reuters-21578 and the DBWorld stemmed datasets. Both are available in the UCI Machine Learning Repository [17]. The Reuters dataset is a collection of documents that appeared on the Reuters newswire in 1987, assembled and indexed with categories, with 21578 instances. The DBWorld datasets, the first focusing on email bodies and the other on email subjects, are already stemmed and collect 64 emails from the DBWorld mailing list, to be classified between "announcements of conferences" and "everything else".

The two datasets are reduced in dimensionality by extracting the top K features ranked using GFS. The accuracy of the classifiers, individually and in combination, is noted. The value of K is varied from 20 to 2000. Instead of using the popular ModApte split, a uniform distribution is considered where 70% of the documents in each category are included in the training set and the remaining 30% form the test set.
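The uniform per-category split described above can be realised with a short routine such as the sketch below; the shuffling, the seed and the function name are assumptions made for illustration.

```python
import random
from collections import defaultdict

def per_category_split(docs, labels, train_frac=0.7, seed=0):
    """Put train_frac of each category's documents in the training set, the rest in the test set."""
    by_class = defaultdict(list)
    for d, c in zip(docs, labels):
        by_class[c].append(d)
    rng = random.Random(seed)
    train, test = [], []
    for c, items in by_class.items():
        rng.shuffle(items)
        cut = int(round(train_frac * len(items)))
        train += [(d, c) for d in items[:cut]]
        test += [(d, c) for d in items[cut:]]
    return train, test
```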

Table 2 shows the results of experiments on the Reuters dataset for K = 20, 50, 100, 200 and 500. Increasing K has no observable effect on the accuracy of the Maximum Entropy classifier, whereas the Naïve Bayes classifier shows a roughly linear increase in accuracy; hence the combination that selects the maximum of the two performs better on this dataset.

Table 2. Implementation Results on Reuters Dataset

The DBWorld Bodies dataset is split into a uniform distribution of 80:20, where 80% of the documents in each category are included in the training set and the remaining 20% form the test set. Results on the DBWorld Bodies stemmed dataset are illustrated in Table 3 for K = 20, 50, 100, 200, 500, 700, 1000, 1500 and 2000. In this case, the performance of the Maximum Entropy classifier is better than that of Naïve Bayes. Thus, the combination which picks the maximum of the two again gives the better result.

Table 3. Implementation Results for DB World Bodies Dataset

Table 4 shows the results of the implementations on the DBWorld Subjects dataset. The splitting ratio is the same as that of the DBWorld Bodies dataset, that is, 80% of the documents in the training set and 20% in the test set. Since the dataset consists of 229 features, the value of K is varied from 20 to 200. This is the case where the performance of both classifiers is equivalent; hence all combination operators provide the same result.

Table 4. Implementation Results for DB World Subjects Dataset

Among the Naïve Bayes and Maximum Entropy classifiers tested individually, the Naïve Bayes classifier (Classifier NB) gives the best results in most cases, in spite of its assumption of total independence of words in the document, which does not actually hold in real-world scenarios. The experiments lead to the conclusion that the proposed combination classifier using the Max operator, that is, Combo Classifier MNBME, gives the best results.

B. Run-Time Analysis

Fig.2. Growth of training time for increasing K

Fig.3. Growth of testing time for increasing K

The modifications to ME are proposed to overcome the drawback of the original ME classifier related to its high training time complexity. The training time complexity can be observed empirically by plotting the training time and testing time for an increasing number of selected features, which increases the amount of work done in training. Figs. 2 and 3 show the plots of training time and testing time respectively for K varying from 20 to 500 for Classifiers MEC, MED and MEG. The growth observed in both plots is linear in K, and the shape of the plots is 3-piece linear because the increments in K are not uniform.

Thus, it can be claimed that the proposed modifications to the ME classifier do not affect the overall complexity of the classifier. Also, the growth of the runtime does not depend on the method used for weighting features.

V. CONCLUSION

In this paper, a combination of classifiers with some major modifications has been presented. Naïve Bayes is combined with modified Maximum Entropy classifiers; Naïve Bayes for its simplicity and the Maximum Entropy classifier for its flexibility and appropriateness to real-world scenarios. The two classifiers are opposite with respect to their assumption models: the former assumes total independence between terms, while the latter considers the entropy relation among terms. The modified versions of the Maximum Entropy classifier keep the original Maximum Entropy classifier but use new methods for the computation of the weights, feature functions and prediction probability. Datasets are split by placing a specified percentage of documents in the training set and the remaining documents in the test set; the ratio of distribution may vary across datasets. The modified versions of the Maximum Entropy classifier are combined with Naïve Bayes using any of the Average, Max or Harmonic Mean operators. The results of the implementations, individually and in combination, are compared with Fragos et al.'s work.

The datasets for the experiments have been selected such that in some cases Naïve Bayes performs better than the Modified Maximum Entropy classifier, in some cases the opposite holds, and in others both classifiers have equivalent performance. In all such cases, the proposed combination classifier with the Max combining operator gives the best accuracy.

REFERENCES

[1] D. Lewis, "Naive Bayes at Forty: The Independence Assumption in Information Retrieval", Proc. ECML-98, 10th European Conference on Machine Learning, 1998.

[2] K. Nigam, J. Lafferty, and A. McCallum, "Using Maximum Entropy for Text Classification", IJCAI-99 Workshop on Machine Learning for Information Filtering, pp. 61-67, 1999.

[3] G. Guo, H. Wang, D. Bell, Y. Bi, and K. Greer, "KNN Model-Based Approach in Classification", Proc. ODBASE, pp. 986-996, 2003.

[4] C. Basu and M. S. Waters, "Support Vector Machines for Text Categorization", Proc. 36th Annual Hawaii International Conference on System Sciences, 2003.

[5] Ferruh Yiğit and Ömer Kaan Baykan, "Detection of The News about Turkey with Web-based Text Mining System", International Journal of Information Technology & Computer Science (IJITCS), Volume 11, Issue No. 2, pp. 56-6-, 2013.

[6] Fatma Howedi and Masnizah Mohd, "Text Classification for Authorship Attribution Using Naive Bayes Classifier with Limited Training Data", Computer Engineering and Intelligent Systems, Vol. 5, No. 4, 2014.

[7] L. S. Larkey and W. B. Croft, "Combining classifiers in text categorization", Proc. SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zurich, CH), pp. 289-297, 1996.

[8] Paul N. Bennett, Susan T. Dumais, and Eric Horvitz, "Probabilistic Combination of Text Classifiers Using Reliability Indicators: Models and Results", Proc. 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, ACM Press, August 2002.

[9] B. Grilheres, S. Brunessaux, and P. Leray, "Combining classifiers for harmful document filtering", RIAO '04: Coupling Approaches, Coupling Media and Coupling Languages for Information Retrieval, pp. 173-185, 2004.

[10] Kanoksri Sarinnapakorn and Miroslav Kubat, "Combining Subclassifiers in Text Categorization: A DST-Based Solution and a Case Study", IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 12, December 2007.

[11] Dino Isa, Lam Hong Lee, V. P. Kallimani, and R. Raj Kumar, "Text Documents Preprocessing with the Bayes Formula for Classification Using the Support Vector Machine", IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 9, pp. 1264-1272, September 2008.

[12] Dino Isa, V. P. Kallimani, and Lam Hong Lee, "Using the Self Organizing Map for Clustering of Text Documents", Expert Systems with Applications, Vol. 36, No. 5, pp. 9584-9591, July 2009.

[13] Duoqian Miao, Qiguo Duan, Hongyun Zhang, and Na Jiao, "Rough set based hybrid algorithm for text classification", Expert Systems with Applications, Vol. 36, No. 5, pp. 9168-9174, July 2009.

[14] K. Fragos, P. Belsis, and C. Skourlas, "Combining Probabilistic Classifiers for Text Classification", Procedia - Social and Behavioral Sciences, Vol. 147, pp. 307-312, 3rd International Conference on Integrated Information (IC-ININFO), doi: 10.1016/j.sbspro.2014.07.098, 2014.

[15] S. Keretna, C. P. Lim, and D. Creighton, "Classification Ensemble to Improve Medical Named Entity Recognition", 2014 IEEE International Conference on Systems, Man, and Cybernetics, San Diego, CA, USA, 2014.

[16] S. Ramasundaram, "NGramsSA Algorithm for Text Categorization", International Journal of Information Technology & Computer Science (IJITCS), Volume 13, Issue No. 1, pp. 36-44, 2014.

[17] UCI Machine Learning Repository.

Authors' Profiles

Hanika Kashyap has been a lecturer and researcher at R. N. Modi Engineering College, Rajasthan, India since February 2015. She obtained her certificate of engineer in computer science in 2012. She is pursuing her Master's degree at Rajasthan Technical University, Rajasthan, India.

Her research interests include clustering, operating system simulation and cryptography.

Dr. Bala Buksh received his M.Sc. in Computer Science from Garhwal University, an M.S. in Computer Science from Birmingham University, U.K., and a doctoral degree in Services for Activities in Group Editing from Birmingham, U.K.

In 1991, he joined the Oil and Natural Gas Corporation of India, where he worked until 2010 on various activities: planning, designing and implementing the SCADA (Remote Plant Monitoring and Control) system for the Bombay offshore production platforms as part of the Air Force team; setting up a new computer centre for drilling applications and implementing the Drilling Operation Monitoring Information System; IT infrastructure standardization; setting up EPINET (Exploration & Production Network), including system installation and implementation as Head of EPINET at the Corporate Centre, Dehra Dun, and looking after the duties of Chief EPINET Coordinator; implementing the LIBNET project connecting 35 ONGC libraries for resource sharing; and implementing the digitization and soft copying of the E&P Physical Intellectual Assets project for sharing information across the organization. He has held the post of Director, R. N. Modi Engineering College, Kota since 2011.

Some of his notable publications include Data Warehousing and Knowledge Management (Petrotech 2007), Double Encryption Using FHE for Tamper Deduction in Incremental Documents (ICTIS, November 2015) and Similarity Measures Used in Data Mining (29th National Conference, ETICE, February 2015). His research interests include computer networks, large databases and ERP systems.

Dr. Buksh is a life member of CSI and has been Chairman and Vice Chairman of the CSI Dehra Dun and Vadodara Chapters, respectively.

How to cite this paper: Hanika Kashyap, Bala Buksh, "Combining Naïve Bayes and Modified Maximum Entropy Classifiers for Text Classification", International Journal of Information Technology and Computer Science (IJITCS), Vol. 8, No. 9, pp. 32-38, 2016. DOI: 10.5815/ijitcs.2016.09.05
