On the Characterization of Natural Language Structure and Literary Stylometry - A Network Science Approach
Abstract
Natural language processing (NLP) techniques have advanced considerably in recent years, and linguists and scientists have used these techniques to address
many challenges related to written language and literature. Problems such as finding
the genetic relationships among languages, attributing the author of a text, and categorizing texts by genre have traditionally been treated with conventional
statistical methods, for instance bag of words (BoW), N-grams, word
frequencies, and the lexical distance between words. By considering written language
as a complex system, network science tools and techniques can be used to address
those problems. This dissertation proposes a unified methodology for
this task: (i) propose a framework for characterizing written language as a complex system; (ii) define three language-related fields to be addressed by
the proposed methodology; and (iii) for each field: review the related literature to
establish a solid background on the subject; collect and process the data, then construct
the networks; extract network measures and statistics to build the dataset; apply machine learning algorithms to cluster and classify the datasets; and compare and
contrast the results with those obtained from traditional methods.
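
As an illustration of the network construction and measure extraction steps, the following minimal Python sketch builds a network from a short text and computes a few basic measures. The abstract does not specify how the networks are constructed, so the word adjacency model used here (linking words that appear next to each other) is an assumption, as are the function names, the sample sentence, and the choice of measures.

```python
from collections import defaultdict


def build_word_adjacency_network(text):
    """Build an undirected graph linking words that are adjacent in the text.

    Assumption: a simple word adjacency model; the dissertation may use a
    different construction.
    """
    # Lowercase each token and strip surrounding punctuation.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    graph = defaultdict(set)
    for left, right in zip(words, words[1:]):
        if left != right:  # skip self-loops from repeated words
            graph[left].add(right)
            graph[right].add(left)
    return dict(graph)


def local_clustering(graph, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Each link between two neighbors is seen twice, once from each end.
    links = sum(1 for u in neighbors for v in graph[u] if v in neighbors) / 2
    return 2.0 * links / (k * (k - 1))


def network_features(graph):
    """Collect a small feature vector that could form one row of a dataset."""
    n = len(graph)
    m = sum(len(nb) for nb in graph.values()) // 2
    return {
        "nodes": n,
        "edges": m,
        "average_degree": 2.0 * m / n if n else 0.0,
        "average_clustering": (
            sum(local_clustering(graph, v) for v in graph) / n if n else 0.0
        ),
    }


# Hypothetical sample text, used only for illustration.
sample = "the cat sat on the mat and the dog sat on the rug"
features = network_features(build_word_adjacency_network(sample))
print(features)
```

Under these assumptions, one such feature vector per text would form a row of the dataset passed to the clustering and classification algorithms.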