ISSN 2394-5125
 

Review Article 


AN EXTRACTIVE BASED MULTI-DOCUMENT SUMMARIZATION USING WEIGHTED TF-IDF AND CENTROID BASED K-MEANS CLUSTERING (TF-IDF: CBC) FOR LARGE TEXT DATA

JEBAMALAI ROBINSON, V. SARAVANAN

Abstract
In text mining and classification research, many factors determine the basis on which classification is performed. These factor variables are termed features. The difficulty of visualizing the training data grows directly with the number of features. In most cases, the features are highly correlated and redundant. Dimensionality reduction reduces the number of features under consideration by deriving a set of principal variables. In previous work, an automated feature extraction technique using weighted TF-IDF was proposed. Although that method performed well, some of the generated features were correlated with each other, which led to high dimensionality and, in turn, greater time complexity and memory usage. This paper proposes an automatic text summarization method that uses the weighted TF-IDF model and K-means clustering to reduce the dimensionality of the extracted features. Various similarity measures are used to identify the similarity between the sentences of a document, and the sentences are then grouped into clusters on the basis of the term frequency-inverse document frequency (TF-IDF) values of their words. Experiments were carried out on student text data from the US educational data hub, and the results were compared with other dimensionality reduction methods in terms of co-selection, content-based, weight-based and term significance parameters. The proposed method was found to be efficient in terms of memory usage and time complexity.

Key words: Text Mining, Classification, Dimension Reduction, Text Summarization, Weighted TF-IDF, K-Means Clustering.
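
The pipeline outlined in the abstract (TF-IDF sentence vectors, similarity-based grouping with K-means, and selection of representative sentences) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the weighting scheme, the similarity measure, or the initialization, so the smoothed TF-IDF formula, cosine similarity, evenly spaced initial centroids, and the function names `tfidf_vectors`, `cosine`, `mean_vector`, `kmeans` and `summarise` are all assumptions made for illustration. The sketch also assumes 1 <= k <= number of sentences.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one L2-normalised TF-IDF vector (term -> weight) per sentence.

    Assumption: plain smoothed TF-IDF stands in for the paper's weighted variant.
    """
    tokenised = [s.lower().split() for s in sentences]
    n = len(tokenised)
    # Document frequency: number of sentences containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vec = {t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Component-wise mean of sparse vectors, used as the cluster centroid."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, k, iterations=20):
    """K-means over sentence vectors, assigning by highest cosine similarity.

    Assumption: initial centroids are evenly spaced sentences, chosen only
    to keep the sketch deterministic.
    """
    n = len(vectors)
    centroids = [dict(vectors[i * n // k]) for i in range(k)]
    assignment = None
    for _ in range(iterations):
        new_assignment = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                          for v in vectors]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return assignment, centroids


def summarise(sentences, k):
    """Pick, per cluster, the sentence closest to its centroid (original order)."""
    vectors = tfidf_vectors(sentences)
    assignment, centroids = kmeans(vectors, k)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            chosen.append(max(members,
                              key=lambda i: cosine(vectors[i], centroids[c])))
    return [sentences[i] for i in sorted(chosen)]
```

Calling `summarise(sentences, k)` on a list of sentences returns at most `k` sentences, one per non-empty cluster, in their original order. Choosing the sentence nearest each centroid is one common way to turn clusters into an extractive summary. The paper may use a different selection rule.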


 


How to Cite this Article
Pubmed Style

ROBINSON J, SARAVANAN V. AN EXTRACTIVE BASED MULTI-DOCUMENT SUMMARIZATION USING WEIGHTED TF-IDF AND CENTROID BASED K-MEANS CLUSTERING (TF-IDF: CBC) FOR LARGE TEXT DATA. JCR. 2020; 7(1): 135-140. doi:10.22159/jcr.07.01.24


Web Style

ROBINSON J, SARAVANAN V. AN EXTRACTIVE BASED MULTI-DOCUMENT SUMMARIZATION USING WEIGHTED TF-IDF AND CENTROID BASED K-MEANS CLUSTERING (TF-IDF: CBC) FOR LARGE TEXT DATA. http://www.jcreview.com/?mno=302645209 [Access: September 14, 2020]. doi:10.22159/jcr.07.01.24


AMA (American Medical Association) Style

ROBINSON J, SARAVANAN V. AN EXTRACTIVE BASED MULTI-DOCUMENT SUMMARIZATION USING WEIGHTED TF-IDF AND CENTROID BASED K-MEANS CLUSTERING (TF-IDF: CBC) FOR LARGE TEXT DATA. JCR. 2020; 7(1): 135-140. doi:10.22159/jcr.07.01.24



Vancouver/ICMJE Style

ROBINSON J, SARAVANAN V. AN EXTRACTIVE BASED MULTI-DOCUMENT SUMMARIZATION USING WEIGHTED TF-IDF AND CENTROID BASED K-MEANS CLUSTERING (TF-IDF: CBC) FOR LARGE TEXT DATA. JCR. (2020), [cited September 14, 2020]; 7(1): 135-140. doi:10.22159/jcr.07.01.24



Harvard Style

ROBINSON, J. & SARAVANAN, V. (2020) AN EXTRACTIVE BASED MULTI-DOCUMENT SUMMARIZATION USING WEIGHTED TF-IDF AND CENTROID BASED K-MEANS CLUSTERING (TF-IDF: CBC) FOR LARGE TEXT DATA. JCR, 7 (1), 135-140. doi:10.22159/jcr.07.01.24



Turabian Style

ROBINSON, JEBAMALAI, and V. SARAVANAN. 2020. AN EXTRACTIVE BASED MULTI-DOCUMENT SUMMARIZATION USING WEIGHTED TF-IDF AND CENTROID BASED K-MEANS CLUSTERING (TF-IDF: CBC) FOR LARGE TEXT DATA. Journal of Critical Reviews, 7 (1), 135-140. doi:10.22159/jcr.07.01.24



Chicago Style

ROBINSON, JEBAMALAI, and V. SARAVANAN. "AN EXTRACTIVE BASED MULTI-DOCUMENT SUMMARIZATION USING WEIGHTED TF-IDF AND CENTROID BASED K-MEANS CLUSTERING (TF-IDF: CBC) FOR LARGE TEXT DATA." Journal of Critical Reviews 7 (2020), 135-140. doi:10.22159/jcr.07.01.24



MLA (The Modern Language Association) Style

ROBINSON, JEBAMALAI, and V. SARAVANAN. "AN EXTRACTIVE BASED MULTI-DOCUMENT SUMMARIZATION USING WEIGHTED TF-IDF AND CENTROID BASED K-MEANS CLUSTERING (TF-IDF: CBC) FOR LARGE TEXT DATA." Journal of Critical Reviews 7.1 (2020), 135-140. Print. doi:10.22159/jcr.07.01.24



APA (American Psychological Association) Style

ROBINSON, J. & SARAVANAN, V. (2020) AN EXTRACTIVE BASED MULTI-DOCUMENT SUMMARIZATION USING WEIGHTED TF-IDF AND CENTROID BASED K-MEANS CLUSTERING (TF-IDF: CBC) FOR LARGE TEXT DATA. Journal of Critical Reviews, 7 (1), 135-140. doi:10.22159/jcr.07.01.24