
A Twitter bot based on Markov chains and phrases from TV series



I was looking through forums for questions that Python programmers get asked at interviews and came across a really wonderful one. I'll quote it loosely: "They asked me to write a nonsense generator based on an n-th order Markov chain." "But I don't have such a generator yet!" my inner voice shouted. "Hurry up, open Sublime and write one!" it insisted. Well, I had to obey.

And here I will tell you how I did it.

It was immediately decided that the generator would post all of its thoughts to Twitter and to its own website. I chose Flask and PostgreSQL as the main technologies; they talk to each other through SQLAlchemy.
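Throughout the post, app and db refer to the usual Flask-SQLAlchemy wiring. A minimal sketch of what that could look like (the connection string below is a placeholder, not the project's actual config):

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
# placeholder connection string -- the real project keeps its own config
app.config['SQLALCHEMY_DATABASE_URI'] = 'postgresql://user:password@localhost/bot'
db = SQLAlchemy(app)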

Structure.


So. The models look like this:
class Srt(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    set_of_words = db.Column(db.Text())
    list_of_words = db.Column(db.Text())

class UpperWords(db.Model):
    word = db.Column(db.String(40), index=True, primary_key=True, unique=True)

    def __repr__(self):
        return self.word

class Phrases(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    created = db.Column(db.DateTime, default=datetime.datetime.now)
    phrase = db.Column(db.String(140), index=True)

    def __repr__(self):
        return str(self.phrase)

As source text, I decided to take subtitles from popular TV shows. The Srt class stores the ordered list of all words from the cleaned-up subtitles of a single episode, together with the unique set of the same words (without repetitions). This makes it easier for the bot to search for a phrase in specific subtitles: first it checks whether the phrase's words are all contained in the subtitles' word set, and only then whether they appear there in the right order.

The first word of a phrase is a random word that starts with a capital letter. The UpperWords model serves to store these words, likewise without repetitions.

And the Phrases class is needed to store the already generated tweets.
The structure is outrageously simple.

The parser for subtitles in the .srt format lives in a separate module, add_srt.py. There is nothing extraordinary in it, but if anyone is interested, all the source code is on GitHub.
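The parser itself isn't shown in this post, but a rough sketch of what boiling an .srt file down to a word list could look like (the function name and the regex are my own guesses, not code from add_srt.py):

import re

def srt_to_words(path):
    words = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            # skip blank lines, cue numbers and "00:00:01,600 --> 00:00:04,200" timecodes
            if not line or line.isdigit() or '-->' in line:
                continue
            # drop formatting tags like <i>...</i>
            line = re.sub(r'<[^>]+>', '', line)
            words.extend(line.split())
    return words

The Srt model would then store ' '.join(words) as list_of_words and ' '.join(set(words)) as set_of_words.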

Generator.


First you need to pick the first word of the tweet. As mentioned earlier, it will be any word from the UpperWords model. Its selection is implemented in this function:
def add_word(word_list, n):
    if not word_list:
        # first word: a random capitalized word (PostgreSQL; use func.rand() on MySQL)
        word = db.session.query(models.UpperWords).order_by(func.random()).first().word
    elif len(word_list) <= n:
        word = get_word(word_list, len(word_list))
    else:
        word = get_word(word_list, n)
    if word:
        word_list.append(word)
        return True
    else:
        return False

The choice of this word is implemented directly by the line:

word = db.session.query(models.UpperWords).order_by(func.random()).first().word

If you use MySQL, you need func.rand() instead of func.random(). That is the only difference in this implementation; everything else works exactly the same.

If the first word already exists, the function looks at the length of the chain and, depending on it, decides how many of our list's trailing words to compare against the text (a chain of n-th order) to obtain the next word.

And we get the next word in the get_word function:
def get_word(word_list, n):
    queries = models.Srt.query.all()
    query_list = list()
    # keep only the subtitles whose word set contains every word of our chain
    for query in queries:
        if set(word_list) <= set(query.set_of_words.split()):
            query_list.append(query.list_of_words.split())
    if query_list:
        text = list()
        for lst in query_list:
            text.extend(lst)
        # positions of the words that follow each occurrence of the last n words
        indexies = [i+n for i, j in enumerate(text[:-n]) if text[i:i+n] == word_list[len(word_list)-n:]]
        word = text[random.choice(indexies)]
        return word
    else:
        return False

First of all, the script runs over all loaded subtitles and checks whether our set of words is a subset of the word set of each specific subtitle file. Then the texts of the matching subtitles are concatenated into one list, every occurrence of the whole phrase is searched for in it, and the positions of the words following those occurrences are collected. It all ends with a blind (random) choice of a word. Just like in life.
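To make the index arithmetic concrete, here is a toy run of the same comprehension with a made-up text and a chain of order n = 2:

text = "so how you doin . how you like that . how you doin today".split()
word_list = ['how', 'you']
n = 2

# positions of the words that immediately follow each occurrence of the chain
indexies = [i+n for i, j in enumerate(text[:-n]) if text[i:i+n] == word_list[len(word_list)-n:]]
print([text[i] for i in indexies])  # ['doin', 'like', 'doin']

Note that 'doin' appears twice among the candidates, so random.choice picks it twice as often: the word frequencies of the source text carry over to the generator for free.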
This is how words get added to the list. The tweet itself is assembled by this function:
def get_twit():
    word_list = list()
    n = N
    while len(' '.join(word_list)) < 140:
        if not add_word(word_list, n):
            break
        if len(' '.join(word_list)) > 140:
            word_list.pop()
            break
    # trim until the tweet ends with sentence-final punctuation
    while word_list[-1][-1] not in '.?!':
        word_list.pop()
    return ' '.join(word_list)

It's very simple: the tweet must not exceed 140 characters and must end with a punctuation mark that closes a sentence. That's it. The generator has done its job.

Display on the site.


The views.py module is responsible for the display on the site.
@app.route('/')
def index():
    return render_template("main/index.html")

It just renders the template; all tweets are pulled into it with JS.
@app.route('/page')
def page():
    page = int(request.args.get('page'))
    diff = int(request.args.get('difference'))
    limit = 20
    phrases = models.Phrases.query.order_by(-models.Phrases.id).all()
    pages = math.ceil(len(phrases)/float(limit))
    count = len(phrases)
    phrases = phrases[page*limit+diff:(page+1)*limit+diff]
    return json.dumps({'phrases': phrases, 'pages': pages, 'count': count}, cls=controllers.AlchemyEncoder)

Returns the tweets of a specific page, which is needed for infinite scrolling. Everything is pretty ordinary. diff is the number of tweets added after the page was loaded and picked up by the update; the selection of tweets for a page has to be shifted by that amount (for example, with limit = 20, page = 1 and diff = 3 the slice is phrases[23:43]).

And the update itself:
@app.route('/update')
def update():
    last_count = int(request.args.get('count'))
    phrases = models.Phrases.query.order_by(-models.Phrases.id).all()
    count = len(phrases)
    if count > last_count:
        phrases = phrases[:count-last_count]
        return json.dumps({'phrases': phrases, 'count': count}, cls=controllers.AlchemyEncoder)
    else:
        return json.dumps({'count': count})

On the client side it is called every n seconds and pulls in newly added tweets in real time. That is how the display of our tweets works. (If anyone is curious, the AlchemyEncoder class in controllers.py is used to serialize the tweets received from SQLAlchemy.)
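The class itself isn't shown in this post; a common shape for such an encoder looks something like this (a sketch, not necessarily the author's exact code, which lives in controllers.py in the repository):

import json
from sqlalchemy.ext.declarative import DeclarativeMeta

class AlchemyEncoder(json.JSONEncoder):
    def default(self, obj):
        # serialize SQLAlchemy models column by column; everything else as usual
        if isinstance(obj.__class__, DeclarativeMeta):
            return {c.name: str(getattr(obj, c.name)) for c in obj.__table__.columns}
        return json.JSONEncoder.default(self, obj)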

Adding tweets to the database and posting to Twitter.


For posting to Twitter I used tweepy. A very handy batteries-included library: plug it in and you're off.

What it looks like:
def twit():
    phrase = get_twit()
    twited = models.Phrases(phrase=phrase)
    db.session.add(twited)
    db.session.commit()
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth)
    api.update_status(status=phrase)

I call this function in cron.py at the root of the project and, as you might guess, it runs on cron. Every half hour a new tweet is added to the database and to Twitter.
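The crontab entry for such a schedule would look something like this (the project path here is hypothetical):

*/30 * * * * cd /path/to/project && python cron.py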

It all worked!

Finally.


At the moment I have loaded all the subtitles for the series "Friends" and "The Big Bang Theory". So far I have chosen a Markov chain of order two (as the subtitle base grows, the order will increase). You can see how it works on Twitter, and all the sources are available on GitHub. I am intentionally not posting a link to the site itself; whoever really needs it will certainly find it.

Thank you all for your attention. See you again!

Source: https://habr.com/ru/post/249637/

