Natasha is an analogue of the Tomita-parser for Python (the Yargy-parser) plus a set of ready-made rules for extracting names, addresses, dates, sums of money and other entities. This article shows how to use Natasha's ready-made rules and, most importantly, how to add your own using the Yargy-parser. Let's start with the ready-made rules and extract names:
from natasha import NamesExtractor
from natasha.markup import show_markup, show_json

extractor = NamesExtractor()
text = '''...'''  # Russian text mentioning several people
matches = extractor(text)
spans = [_.span for _ in matches]
facts = [_.fact.as_json for _ in matches]
show_markup(text, spans)
show_json(facts)
>>> (show_markup prints the text with the matched names wrapped in [[ ]])
[
  {
    "first": "...",
    "middle": "...",
    "last": "..."
  },
  {
    "first": "...",
    "middle": "...",
    "last": "..."
  },
  {
    "first": "...",
    "middle": "...",
    "last": "..."
  },
  {
    "first": "...",
    "middle": "..."
  }
]
For addresses, NamesExtractor is changed to AddressExtractor:

from natasha import AddressExtractor
from natasha.markup import show_markup, show_json

extractor = AddressExtractor()
text = '''...'''  # Russian text with several addresses and the phone number 7 881 574-10-02
matches = extractor(text)
spans = [_.span for _ in matches]
facts = [_.fact.as_json for _ in matches]
show_markup(text, spans)
show_json(facts)
>>> (show_markup prints the text with the matched addresses wrapped in [[ ]]; the phone number is left unmarked)
[
  {
    "parts": [
      {"name": "...", "type": "..."},
      {"number": "2"}
    ]
  },
  {
    "parts": [
      {"name": "...", "type": "..."},
      {"number": "51", "type": "..."}
    ]
  },
  {
    "parts": [
      {"name": "...", "type": "..."},
      {"name": "...", "type": "..."},
      {"name": "...", "type": "..."},
      {"number": "8", "type": "..."},
      {"number": "4", "type": "..."}
    ]
  }
]
The quality of AddressExtractor on the dataset of comments to the DSS can be checked in the repository with examples. Besides AddressExtractor and NamesExtractor, Natasha has DatesExtractor and MoneyExtractor; they can all be run over the same text:

from natasha import (
    NamesExtractor,
    AddressExtractor,
    DatesExtractor,
    MoneyExtractor
)
from natasha.markup import show_markup, show_json

extractors = [
    NamesExtractor(),
    AddressExtractor(),
    DatesExtractor(),
    MoneyExtractor()
]
text = '''...'''  # Russian text with a name, a birth date, an address and a sum of money
spans = []
facts = []
for extractor in extractors:
    matches = extractor(text)
    spans.extend(_.span for _ in matches)
    facts.extend(_.fact.as_json for _ in matches)
show_markup(text, spans)
show_json(facts)
>>> (show_markup prints the text with all matched entities wrapped in [[ ]])
[
  {
    "first": "...",
    "middle": "...",
    "last": "..."
  },
  {
    "last": "..."
  },
  {
    "parts": [
      {"name": "...", "type": "..."},
      {"name": "...", "type": "..."},
      {"number": "5/1", "type": "..."}
    ]
  },
  {
    "year": 1970,
    "month": 1,
    "day": 10
  },
  {
    "integer": 8000,
    "currency": "RUB",
    "coins": 0
  }
]
All the extractors share the same interface: e = Extractor(); r = e(text); ...
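For example, running the dates extractor follows exactly this pattern (a minimal sketch; the sample string here is just an illustration):

from natasha import DatesExtractor

extractor = DatesExtractor()               # e = Extractor()
matches = extractor('8 января 2014 года')  # r = e(text)
for match in matches:
    print(match.span, match.fact.as_json)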
No settings are exposed to the user. In practice, the ready-made rules are rarely enough. For example, Natasha will not understand the date «21» April 2017, because the rules do not expect the day number to be in quotes. The library will not parse the address "Lyubertsy district, village Motyakovo, d. 61/2", because it contains no street name. In such cases, new rules are written with the Yargy-parser, the same tool Natasha's own rules are built on. Let's start with a simple grammar for dates like 2018-02-23:

from yargy import rule, and_, Parser
from yargy.predicates import gte, lte

DAY = and_(
    gte(1),
    lte(31)
)
MONTH = and_(
    gte(1),
    lte(12)
)
YEAR = and_(
    gte(1),
    lte(2018)
)
DATE = rule(
    YEAR,
    '-',
    MONTH,
    '-',
    DAY
)

parser = Parser(DATE)
text = '''
2018-02-23,
2015-12-31;
8 916 364-12-01'''
for match in parser.findall(text):
    print(match.span, [_.value for _ in match.tokens])
>>> [1, 11) ['2018', '-', '02', '-', '23']
>>> [13, 23) ['2015', '-', '12', '-', '31']
>>> [33, 42) ['364', '-', '12', '-', '01']

Note that the tail of the phone number, 364-12-01, matches the rule as well.
The rule is similar to the regular expression r'\d\d\d\d-\d\d-\d\d', but it will not accept nonsense like "1234-56-78".
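A quick check of the difference, reusing the DATE parser defined above:

import re

# the naive regex happily accepts an impossible date
print(re.findall(r'\d\d\d\d-\d\d-\d\d', '1234-56-78'))
>>> ['1234-56-78']

# the DATE rule rejects it: 56 fails the MONTH predicate, 78 fails DAY
print(list(parser.findall('1234-56-78')))
>>> []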
gte and lte in the example above are predicates. Many ready-made predicates are built into the parser, and there is a way to add your own (a sketch follows the next example). Pymorphy2 is used to determine the morphology of words: for example, the predicate gram('NOUN') matches nouns, and normalized('январь') matches all forms of the word "January". Let's add rules for dates like "8 января 2014 года" and "15 июня 2001 г.":

from yargy import or_
from yargy.predicates import caseless, normalized, dictionary

MONTHS = {
    'январь', 'февраль', 'март', 'апрель',
    'май', 'июнь', 'июль', 'август',
    'сентябрь', 'октябрь', 'ноябрь', 'декабрь'
}
MONTH_NAME = dictionary(MONTHS)
YEAR_WORDS = or_(
    rule(caseless('г'), '.'),
    rule(normalized('год'))
)

DATE = or_(
    rule(
        YEAR,
        '-',
        MONTH,
        '-',
        DAY
    ),
    rule(
        DAY,
        MONTH_NAME,
        YEAR,
        YEAR_WORDS.optional()
    )
)

parser = Parser(DATE)
text = '''
8 января 2014 года, 15 июня 2001 г.,
31 февраля 2018'''
for match in parser.findall(text):
    print(match.span, [_.value for _ in match.tokens])
>>> [21, 36) ['15', 'июня', '2001', 'г', '.']
>>> [1, 19) ['8', 'января', '2014', 'года']
>>> [38, 53) ['31', 'февраля', '2018']
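Custom predicates are plain functions wrapped in yargy.predicates.custom. A minimal sketch, assuming the wrapped function receives the token's string value (this toy even-number predicate is illustrative, not part of Natasha):

from yargy.predicates import custom

def is_even(value):
    # guard against non-numeric tokens, since the predicate sees every token
    return value.isdigit() and int(value) % 2 == 0

EVEN = custom(is_even)

parser = Parser(rule(EVEN))
for match in parser.findall('1 2 3 4'):
    print([_.value for _ in match.tokens])
>>> ['2']
>>> ['4']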
It is convenient when the parser returns not bare spans and tokens but ready-made objects such as Date(month=5, day=8) or Name(first='...', last='...'). Yargy provides an interpretation procedure for this. The raw result of the parser is a parse tree:

match = parser.match('05 мая 2011 года')
match.tree.as_dot
To get objects instead of trees, the grammar is annotated with the .interpretation(...) method:

from yargy.interpretation import fact

Date = fact(
    'Date',
    ['year', 'month', 'day']
)

DAY = and_(
    gte(1),
    lte(31)
).interpretation(
    Date.day
)
MONTH = and_(
    gte(1),
    lte(12)
).interpretation(
    Date.month
)
YEAR = and_(
    gte(1),
    lte(2018)
).interpretation(
    Date.year
)
MONTH_NAME = dictionary(
    MONTHS
).interpretation(
    Date.month
)

DATE = or_(
    rule(YEAR, '-', MONTH, '-', DAY),
    rule(
        DAY,
        MONTH_NAME,
        YEAR,
        YEAR_WORDS.optional()
    )
).interpretation(Date)

match = parser.match('05 мая 2011 года')
match.tree.as_dot

The nodes of the parse tree now carry the attribute labels, and match.fact assembles them into a Date object:
parser = Parser(DATE)
text = '''8 января 2014 года, 2018-12-01'''
for match in parser.findall(text):
    print(match.fact)
>>> Date(year='2018', month='12', day='01')
>>> Date(year='2014', month='января', day='8')
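The fields of the resulting record are plain attributes; at this point their values are still raw token strings:

record = match.fact  # the last match from the loop above
print(record.year, record.month, record.day)
>>> 2014 января 8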
Inside .interpretation(...) the user specifies how to normalize the fields:

from datetime import date

Date = fact(
    'Date',
    ['year', 'month', 'day']
)

class Date(Date):
    @property
    def as_datetime(self):
        return date(self.year, self.month, self.day)

MONTHS = {
    'январь': 1,
    'февраль': 2,
    'март': 3,
    'апрель': 4,
    'май': 5,
    'июнь': 6,
    'июль': 7,
    'август': 8,
    'сентябрь': 9,
    'октябрь': 10,
    'ноябрь': 11,
    'декабрь': 12
}

DAY = and_(
    gte(1),
    lte(31)
).interpretation(
    Date.day.custom(int)
)
MONTH = and_(
    gte(1),
    lte(12)
).interpretation(
    Date.month.custom(int)
)
YEAR = and_(
    gte(1),
    lte(2018)
).interpretation(
    Date.year.custom(int)
)
MONTH_NAME = dictionary(
    MONTHS
).interpretation(
    Date.month.normalized().custom(MONTHS.__getitem__)
)

DATE = or_(
    rule(YEAR, '-', MONTH, '-', DAY),
    rule(
        DAY,
        MONTH_NAME,
        YEAR,
        YEAR_WORDS.optional()
    )
).interpretation(Date)

parser = Parser(DATE)
text = '''8 января 2014 года, 2018-12-01'''
for match in parser.findall(text):
    record = match.fact
    print(record, repr(record.as_datetime))
>>> Date(year=2018, month=12, day=1) datetime.date(2018, 12, 1)
>>> Date(year=2014, month=1, day=8) datetime.date(2014, 1, 8)

match = parser.match('31 февраля 2014 г.')
match.fact.as_datetime
>>> ValueError: day is out of range for month
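The ValueError can serve as a final validity filter. A small helper sketch (the extract_dates function is ours, not part of the library): it yields only calendar-valid dates and silently skips matches like "31 февраля":

def extract_dates(text):
    for match in parser.findall(text):
        try:
            yield match.fact.as_datetime
        except ValueError:
            # grammatically valid but impossible date, e.g. February 31
            continue

print(list(extract_dates('8 января 2014 года, 31 февраля 2014 г.')))
>>> [datetime.date(2014, 1, 8)]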
Now let's extract names. Pymorphy2 has the grammeme Name for first names and Surn for surnames. Let's define a name as a pair of words, Name Surn or Surn Name:

from yargy.predicates import gram

Name = fact(
    'Name',
    ['first', 'last']
)

FIRST = gram('Name').interpretation(
    Name.first.inflected()
)
LAST = gram('Surn').interpretation(
    Name.last.inflected()
)
NAME = or_(
    rule(
        FIRST,
        LAST
    ),
    rule(
        LAST,
        FIRST
    )
).interpretation(
    Name
)
parser = Parser(NAME)
text = '''...'''  # Russian text mentioning several people
for match in parser.findall(text):
    print(match.fact)
>>> Name(first='...', last='...')
>>> Name(first='...', last='...')
>>> Name(first='...', last='...')
One of the three matches is spurious: its first name and surname do not agree. With the .match(...) method, the user adds restrictions to the rules; here gnc_relation requires the first name and the surname to agree in gender, number and case, which removes the spurious match:

from yargy.relations import gnc_relation

gnc = gnc_relation()  # agreement in gender, number and case

Name = fact(
    'Name',
    ['first', 'last']
)

FIRST = gram('Name').interpretation(
    Name.first.inflected()
).match(gnc)
LAST = gram('Surn').interpretation(
    Name.last.inflected()
).match(gnc)
NAME = or_(
    rule(
        FIRST,
        LAST
    ),
    rule(
        LAST,
        FIRST
    )
).interpretation(
    Name
)

parser = Parser(NAME)
text = '''...'''  # the same text as above
for match in parser.findall(text):
    print(match.fact)
>>> Name(first='...', last='...')
>>> Name(first='...', last='...')
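The same machinery extends to patronymics. A hedged sketch, assuming pymorphy2's Patr grammeme for patronymics and that a single gnc relation can tie all three words together (this variant is not from the article):

from yargy import or_, rule
from yargy.interpretation import fact
from yargy.predicates import gram
from yargy.relations import gnc_relation

gnc = gnc_relation()

Name = fact(
    'Name',
    ['first', 'middle', 'last']  # extra field for the patronymic
)

FIRST = gram('Name').interpretation(
    Name.first.inflected()
).match(gnc)
MIDDLE = gram('Patr').interpretation(  # Patr marks patronymics in pymorphy2
    Name.middle.inflected()
).match(gnc)
LAST = gram('Surn').interpretation(
    Name.last.inflected()
).match(gnc)

NAME = or_(
    rule(FIRST, MIDDLE, LAST),  # longer variants listed first
    rule(LAST, FIRST, MIDDLE),
    rule(FIRST, LAST),
    rule(LAST, FIRST)
).interpretation(
    Name
)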
In the worst case, the Yargy-parser runs in O(n³) time, where n is the number of tokens. The code is written in pure Python, with an emphasis on readability rather than optimization. In short, the library is slow: on the task of extracting names, for example, Natasha is 10 times slower than the Tomita-parser. In practice, you can live with it.
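How slow is slow depends on the machine and the text; a rough timing sketch (the sample text is a placeholder):

import time

from natasha import NamesExtractor

extractor = NamesExtractor()  # build the extractor once and reuse it across documents

text = '...'  # any reasonably long Russian document
start = time.time()
for _ in range(100):
    list(extractor(text))
print('seconds per document:', (time.time() - start) / 100)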
Natasha is installed with pip install natasha. The library is tested on Python 2.7, 3.3, 3.4, 3.5, 3.6, PyPy and PyPy3.

Source: https://habr.com/ru/post/349864/