A Method for Identifying Classification Attributes of Electronic Text Documents

 

Alexei Kebkalo, Anton Mikhailyuk

 

Specialized computer systems department, National Technical University of Ukraine “KPI”, 37, Peremohy ave, Kyiv, Ukraine, kebka@mail.ru

 

Automated classification of text documents is a pressing problem today. One approach to classification takes the structure of a document into account and identifies the parts that contain classification attributes; classification is then performed on the basis of these attributes. This approach is needed in information systems that work with documents of specific types. In this case the classification procedure consists of two steps: identifying the parts of the document that contain classification attributes, and the classification itself.

This paper proposes an approach to the first step: identifying the parts of a document for subsequent classification. The approach is based on forming sets of markers for the parts of a document and searching for these markers in the document under examination.
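
As a rough illustration of the marker idea, the following minimal Python sketch assigns lines of a document to parts by looking for marker strings. The part names, the marker values, and the function `identify_parts` are hypothetical and chosen only for this example; the abstract does not specify how markers are formed or matched, so plain case-insensitive substring search is an assumption.

```python
from typing import Dict, List, Set

# Hypothetical marker sets: each document part is described by words or
# phrases assumed to introduce it. These values are illustrative only.
MARKERS: Dict[str, Set[str]] = {
    "title": {"title:", "subject:"},
    "author": {"author:", "prepared by"},
    "date": {"date:", "dated"},
}


def identify_parts(
    text: str, markers: Dict[str, Set[str]] = MARKERS
) -> Dict[str, List[str]]:
    """Return, for each part, the lines that contain one of its markers."""
    found: Dict[str, List[str]] = {part: [] for part in markers}
    for line in text.splitlines():
        lowered = line.lower()  # assumption: matching ignores case
        for part, marker_set in markers.items():
            if any(marker in lowered for marker in marker_set):
                found[part].append(line.strip())
    return found


sample = "Subject: Annual report\nPrepared by J. Smith\nDated 12 May\nBody text here."
parts = identify_parts(sample)
```

The lines collected for each part could then be passed to the second step, the classification itself. A practical system would likely need richer markers, such as position in the document or formatting cues, rather than substrings alone.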

The proposed approach can be used not only for classification but also for summarization or automatic text analysis, for example, to identify attributes automatically when a new document is added to a document-oriented system.