ISSN 2394-5125
 

Research Article 


ON CREATION OF DOGRI LANGUAGE CORPUS

Sonam Gandotra, Bhavna Arora

Abstract
A corpus, defined as a large collection of structured text, is a prerequisite for any Natural Language Processing (NLP)
task. Dogri is one of the official languages of India, but it is under-resourced in terms of the computational
resources needed for NLP. This paper proposes a methodology for constructing a standard corpus that can be used for
language processing tasks such as stemming, part-of-speech tagging, and information retrieval. Digitized Dogri text
for building such a corpus is scarce because few online resources contain Dogri text. The only online source
available is the Dogri newspaper "Jammu Prabhat". The text is therefore extracted from Portable Document Format (PDF)
files of that newspaper, which are first converted to images. Tesseract, an open-source tool, is then used to
extract the text from the images. The methodology used to create the Dogri language corpus is discussed in detail
in the paper, along with the challenges faced during the research and the results obtained.

Keywords: Corpus, Tesseract-OCR, Dogri Language, Language Resource.
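
The pipeline described in the abstract (PDF pages converted to images, then text extracted with Tesseract) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the `pdftoppm` (Poppler) and `tesseract` command-line tools are installed, that the newspaper text is in Devanagari script, and that the Hindi model (`hin`) is an acceptable stand-in for recognition; the abstract does not say which language model was used. The function names and file layout are hypothetical.

```python
import subprocess
from pathlib import Path


def pdf_to_images_cmd(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm command that renders each PDF page to a PNG image."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, out_base, lang="hin"):
    """Build a tesseract command; output is written to <out_base>.txt.

    lang="hin" is an assumption (a Devanagari-script model), not a
    setting taken from the paper.
    """
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def extract_text(pdf_path, work_dir, lang="hin"):
    """Convert a PDF to page images, OCR each page, and return the joined text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / "page"

    # Step 1: PDF -> one PNG per page (page-1.png, page-2.png, ...).
    subprocess.run(pdf_to_images_cmd(pdf_path, prefix), check=True)

    # Step 2: OCR each page image and collect the recognized text.
    pages = []
    for image in sorted(work_dir.glob("page-*.png")):
        out_base = image.with_suffix("")
        subprocess.run(ocr_cmd(image, out_base, lang), check=True)
        pages.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

Calling `extract_text("issue.pdf", "ocr_work")` (with a hypothetical input file) would return the raw OCR text for one newspaper issue. Raw OCR output of this kind would normally still need manual or automated cleaning before being added to a corpus.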


 
How to Cite this Article

Sonam Gandotra, Bhavna Arora. ON CREATION OF DOGRI LANGUAGE CORPUS. Journal of Critical Reviews (JCR). 2020; 7(9): 2337-2343. doi:10.31838/jcr.07.09.380
http://www.jcreview.com/?mno=107375