A sentence is typically treated as the minimal syntactic unit used to extract valuable information from long text. However, in written Thai, there are no explicit sentence markers. Some prior works use machine learning; however, a deep learning approach has never been employed. We propose a deep learning model for sentence segmentation that includes three main contributions. First, we integrate n-gram embedding as a local representation to capture word groups near sentence boundaries. Second, to focus on the keywords of dependent clauses, we combine the model with a distant representation obtained from self-attention modules. Finally, due to the scarcity of labeled data, for which annotation is difficult and time-consuming, we also investig...
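The abstract above combines a local n-gram representation with a distant representation from self-attention. A minimal numpy sketch of that fusion, with hypothetical vocabulary size, embedding width, and window length (the actual model's layers and training objective are not specified here):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Distant representation: single-head scaled dot-product attention
    # lets each token attend to keywords anywhere in the sequence.
    d = X.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def ngram_embed(ids, table, n=2):
    # Local representation: average the embeddings inside each token's
    # trailing n-gram window, capturing word groups near boundaries.
    return np.stack([table[ids[max(0, t - n + 1):t + 1]].mean(axis=0)
                     for t in range(len(ids))])

# Toy setup (all sizes are illustrative assumptions).
vocab, d = 50, 16
table = rng.standard_normal((vocab, d))
ids = rng.integers(0, vocab, size=8)       # one 8-token "sentence"

local = ngram_embed(ids, table)            # (8, 16) local n-gram features
distant = self_attention(table[ids])       # (8, 16) distant attention features
features = np.concatenate([local, distant], axis=-1)
print(features.shape)                      # per-token fused features: (8, 32)
```

A boundary classifier would then score each fused per-token vector as boundary / non-boundary.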
© 2020. Published by ACL. This is an open access article available under a Creative Commons licence...
Myanmar sentences are written as contiguous sequences of syllables with no characters delimiting the w...
In this work we address the problems of sentence segmentation and tokenization. Informally the task ...
Word segmentation is a problem in several Asian languages that have no explicit word boundary delimi...
Thai is a low-resource language, so it is often the case that data is not available in sufficient qu...
© 2021 The Authors. Published by ACL. This is an open access article available under a Creative Com...
The aim of this thesis is to design and implement a computational linguistic module for analysing Th...
Unlike English, Thai has no explicit sentence markers. Conventionally, a space is pl...
For languages without word boundary delimiters, dictionaries are needed for segmenting running texts...
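The entry above notes that dictionaries are needed to segment running text in languages without word delimiters. The classic dictionary-based approach is greedy longest matching; a minimal sketch, using a toy English-without-spaces example in place of a real Thai lexicon:

```python
def longest_match_segment(text, dictionary):
    """Greedy longest-matching segmentation against a word dictionary."""
    words, i = [], 0
    while i < len(text):
        match = None
        # Try the longest candidate substring first, shrinking toward length 1.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # out-of-vocabulary character: emit it alone
        words.append(match)
        i += len(match)
    return words

dictionary = {"thai", "text", "has", "no", "spaces"}
print(longest_match_segment("thaitexthasnospaces", dictionary))
# → ['thai', 'text', 'has', 'no', 'spaces']
```

Greedy longest matching fails when an early long match swallows the start of the next word, which is why the literature above moves to statistical and neural segmenters.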
A Thai written text is a string of symbols without explicit word boundary markup. A method for a dev...
The sentence segmentation task is the task of segmenting a text corpus into sentences. Segmenting we...
In Thai, word boundaries are not explicitly marked; therefore, word segmentation is needed ...
The Thai written language is one of the languages that do not have word boundaries. In order to di...
Tokenization is widely regarded as a solved problem due to the high accuracy t...
Abstract This paper discusses a Thai corpus, TaLAPi, fully annotated with word segmentation (WS), pa...