Sentence segmentation, that is, splitting continuous text into sentences, is a necessary first step for further analysis in any natural language processing system.
What is a sentence?
The first answer that comes to mind is: something ending in “.”, “!”, or “?”. On closer inspection, however, “.” is used not only to mark the end of a sentence but also in abbreviations, and sometimes it plays both roles at once. Even so, in about 90% of cases a period indicates the end of a sentence (Riley 1989).
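To illustrate why the naive definition falls short, here is a minimal Python sketch that splits after every “.”, “!”, or “?” followed by whitespace. The function name `naive_split` and the sample text are made up for this illustration.

```python
import re

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Prof. Smith arrived at 5 p.m. yesterday. He was late!"
for fragment in naive_split(sample):
    print(fragment)
```

The sample contains two sentences, but the naive rule produces four fragments, because the periods in "Prof." and "p.m." are treated as sentence ends.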
There are further exceptions to be aware of: sometimes other punctuation marks delimit fragments that we could treat as sentences. Such fragments are set off on one side (and sometimes on both) by marks such as “:”, “;”, and “-”, as in this simple example:
“The scene was written quickly and efficiently: the author was in a good mood while in Venice”
Another problem is the typing convention, common in many organizations, of placing the closing quotation mark after the period; in that case the quotation mark must be included in the sentence.
Based on these observations, many systems implement their own sentence segmentation algorithms, but most resemble the following:
- Place a sentence boundary marker after every occurrence of ".", "!", and "?" (and possibly after ":", ";", and "-").
- Move the boundary after the closing quotation mark, if one follows.
- Remove the boundary after a period in the following cases:
  - The preceding word is a known abbreviation that is not normally used at the end of a sentence, for example "prof.", "ul.", "d."
  - The preceding word is a known abbreviation and is not followed by a capitalized word, for example "etc.", "ml."
- Remove the boundary after "?" or "!" if it is followed by a word that does not start with a capital letter.
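The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the abbreviation sets `NEVER_FINAL` and `MAYBE_FINAL` are hypothetical placeholders (a real system needs lists suited to its language and domain), tokens are obtained by splitting on whitespace, and the optional boundaries after ":", ";", and "-" are not handled.

```python
# Hypothetical abbreviation lists, for illustration only.
NEVER_FINAL = {"prof.", "dr.", "vs."}   # normally not used at the end of a sentence
MAYBE_FINAL = {"etc.", "jr."}           # may occur mid-sentence or sentence-finally
OPENERS = "\"'“‘("
CLOSERS = "\"'”’)"

def split_sentences(text):
    """Heuristic sentence splitter following the rule list above."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        # Closing quotes stay attached to the sentence they end.
        core = token.rstrip(CLOSERS)
        if not core or core[-1] not in ".!?":
            continue  # not a candidate boundary
        nxt = tokens[i + 1].lstrip(OPENERS) if i + 1 < len(tokens) else ""
        next_capitalized = nxt[:1].isupper()
        word = core.lstrip(OPENERS).lower()
        if core[-1] == ".":
            if word in NEVER_FINAL:
                continue  # abbreviation that does not end sentences
            if word in MAYBE_FINAL and nxt and not next_capitalized:
                continue  # abbreviation followed by a lowercase word
        elif nxt and not next_capitalized:
            continue      # "?" or "!" followed by a lowercase word
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = 'Prof. Smith bought apples, pears, etc. for the party. "Is it late?" he asked. Yes! He left.'
for sentence in split_sentences(text):
    print(sentence)
```

With these placeholder lists the sample text is split into four sentences, and the quoted question stays inside the sentence that contains it.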
However, such rules, even with minor modifications, do not work equally well in every domain: when the conventions for preparing documents or the people who type them change, the rules have to be adjusted to maintain segmentation quality.
There is also work on using statistical data to detect sentence boundaries. Riley (1989) used a statistical classification tree to determine the boundaries of a sentence. As features he used the length and the case of the words preceding the end of the sentence, although building the tree required a fairly large amount of labeled data. Other approaches can be found on the Internet, based on neural networks and on entropy calculations, which report sentence boundary detection accuracies of 98-99% and 99.25%, respectively.
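As a rough illustration of the kind of input such a classifier works with, the sketch below extracts the length and case of the word preceding each period. It is not Riley's actual feature set or implementation; the function names and the case categories are assumptions made for this example, and no classifier is trained here.

```python
def case_of(word):
    """Classify capitalization as 'upper', 'lower', 'capitalized', or 'other'."""
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return "other"
    if all(c.isupper() for c in letters):
        return "upper"
    if all(c.islower() for c in letters):
        return "lower"
    if letters[0].isupper():
        return "capitalized"
    return "other"

def candidate_features(text):
    """Return one feature row per whitespace token that ends with a period."""
    rows = []
    for token in text.split():
        if token.endswith("."):
            prev_word = token[:-1]  # the word immediately before the period
            rows.append({
                "token": token,
                "prev_length": len(prev_word),
                "prev_case": case_of(prev_word),
            })
    return rows

for row in candidate_features("Prof. Smith left at noon."):
    print(row)
```

Rows like these, paired with hand-labeled answers (boundary or not), are what a classification tree would be trained on, which is why a fairly large labeled corpus is needed.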
References:
- Riley, Michael D. 1989. "Some applications of tree-based modelling to speech and language indexing." In Proceedings of the DARPA Speech and Natural Language Workshop, pp. 339-352. Morgan Kaufmann.