Toward a More Global and Coherent Segmentation of Texts

Title: Toward a More Global and Coherent Segmentation of Texts
Publication type: Journal article
Authors: Lamprier, Sylvain; Amghar, Tassadit; Levrat, Bernard; Saubion, Frédéric
Publisher: Taylor & Francis
Type: Scientific article in a peer-reviewed journal
Year: 2008
Language: English
Date: 2008
Issue: 3
Pages: 208-234
Volume: 22
Journal title: Applied Artificial Intelligence
ISSN: 0883-9514
Abstract (English)

The automatic text segmentation task consists of identifying the most important thematic breaks in a document in order to cut it into homogeneous passages. Text segmentation has motivated a large amount of research. We focus here on the statistical approaches that rely on an analysis of the distribution of the words in the text. Usually, texts are segmented sequentially on the basis of very local clues. However, such an approach prevents the text from being considered globally, particularly with respect to the degree of granularity at which it expresses the different topics it addresses. We thus propose two new segmentation algorithms, ClassStruggle and SegGen, which use criteria that give a global view of the text. ClassStruggle is based on an initial clustering of the sentences of the text, which allows similarities to be considered within a group rather than between individual sentences. It relies on the distribution of the occurrences of the members of each class to segment the text. SegGen evaluates candidate segmentations of the whole text using a genetic algorithm. It attempts to find a segmentation that optimizes two criteria: maximizing the internal cohesion of the segments and minimizing the similarity between adjacent ones. According to experimental results, both approaches appear to be very competitive compared to existing methods.
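
As an illustration of the two criteria the abstract attributes to SegGen, the following is a minimal sketch, not the authors' implementation. It assumes a bag-of-words cosine similarity (the abstract does not say which measures the article uses), and all function names are hypothetical. The genetic search over candidate segmentations is not shown; the sketch only scores one given segmentation.

```python
from collections import Counter
from math import sqrt


def bag(text):
    """Bag-of-words vector for a piece of text (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def split_segments(sentences, boundaries):
    """Cut the sentence list before each index in the sorted list `boundaries`."""
    cuts = [0] + list(boundaries) + [len(sentences)]
    return [sentences[s:e] for s, e in zip(cuts, cuts[1:])]


def internal_cohesion(segments):
    """Mean pairwise sentence similarity inside segments (to be maximized).

    Assumption: segments with a single sentence are skipped, since they
    have no sentence pairs to compare.
    """
    scores = []
    for seg in segments:
        bags = [bag(s) for s in seg]
        pairs = [(i, j) for i in range(len(bags)) for j in range(i + 1, len(bags))]
        if pairs:
            scores.append(sum(cosine(bags[i], bags[j]) for i, j in pairs) / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0


def adjacent_similarity(segments):
    """Mean similarity between consecutive segments (to be minimized)."""
    bags = [bag(" ".join(seg)) for seg in segments]
    sims = [cosine(a, b) for a, b in zip(bags, bags[1:])]
    return sum(sims) / len(sims) if sims else 0.0


def score(sentences, boundaries):
    """Return the pair (cohesion, adjacent similarity) for one candidate."""
    segs = split_segments(sentences, boundaries)
    return internal_cohesion(segs), adjacent_similarity(segs)


if __name__ == "__main__":
    toy = ["cats purr", "cats sleep", "stocks fall", "stocks rise"]
    print(score(toy, [2]))  # break at the topic change
    print(score(toy, [1]))  # break inside the first topic
```

On this toy input, the break at the topic change scores higher on cohesion and lower on adjacent similarity than the break placed inside the first topic, which is the kind of trade-off a two-criteria search would weigh across whole-text candidates.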

Record URL: http://okina.univ-angers.fr/publications/ua4282
DOI: 10.1080/08839510701881391
Link to the document

http://dx.doi.org/10.1080/08839510701881391