Written information is a gold mine for analyzing and understanding aspects of human behavior that may not otherwise be evident. For instance, the extent of a subject’s vocabulary or the average length of the words they use when writing may be correlated with their education level. Likewise, spelling errors, punctuation marks, emoticons and the way upper/lowercase letters are used can be indicative of the age of the writer. In fact, it is nowadays possible, to some extent, to determine the gender, age, occupation and place of residence of people simply by analyzing their writing patterns.
For the forensic analysis instrument of TeSLA, the interest is in discovering patterns that capture the writing style of authors in such a way that authorship can be confirmed. But what are the patterns that determine the writing style of a subject, and how can these patterns be exploited to determine the authorship of documents?
Generally speaking, it is well known in the Language Technologies community that character n-grams are among the most useful patterns when modelling writing style. These are simply sequences of n consecutive characters; for example, the character 3-grams of the word "text" are "tex" and "ext". The effectiveness of character n-grams lies in the fact that they can account for phenomena that are associated with writing style and that vary across subjects. For instance, character n-grams can capture spelling errors, the way in which upper/lowercase letters are used, punctuation marks, verbal forms (e.g., whether an author uses the gerund form), abbreviations (e.g., lol, ok, fyi, etc.) and other linguistic patterns that may help to characterize writing style. The frequency of occurrence of these patterns, combined with artificial intelligence techniques, makes it possible for the forensic instrument to determine the authorship of documents.
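To make the idea concrete, the following is a minimal Python sketch of how character n-grams can be extracted and turned into a frequency profile of a text. The function names (`char_ngrams`, `ngram_profile`, `cosine_similarity`) are hypothetical and chosen for illustration only. Comparing two profiles with cosine similarity is an assumption used here as a simple stand-in for the artificial intelligence techniques mentioned above; it is not a description of how the TeSLA forensic instrument is implemented.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text, n=3):
    """Map each character n-gram in `text` to its relative frequency."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    if total == 0:
        return {}  # text shorter than n: no n-grams to count
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two n-gram profiles (illustrative stand-in,
    not the method used by TeSLA)."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Spaces and case are kept, so habits such as "lol ok" leave a trace.
print(char_ngrams("lol ok", 3))  # ['lol', 'ol ', 'l o', ' ok']

# Hypothetical usage: compare a questioned text against a known sample.
known = ngram_profile("ok, i'll send it asap lol")
questioned = ngram_profile("ok i'll check it asap")
print(cosine_similarity(known, questioned))
```

Because the n-grams preserve case, punctuation and whitespace, the same profile picks up several of the stylistic cues listed above at once. A real system would typically feed such frequencies into a trained classifier rather than rely on a single similarity score.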
Language Technologies and Natural Language Processing provide tools for mining the gold that textual information conveys. Thanks to progress in these fields, we can extract valuable information from users that can be used to enhance everyone’s life, as TeSLA does. Yet we should be aware of the privacy policies and ethics of every place where we post written information, as flaws in these aspects may allow unauthorized parties to learn information that can be used for malicious purposes.
Hugo Jair Escalante, Manuel Montes, Pastor López (INAOE team)
TeSLA is coordinated by Universitat Oberta de Catalunya (UOC) and funded by the European Commission’s Horizon 2020 ICT Programme. This website reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.