
A quote-collecting script: recognizing on-screen text from video in Python

This post is about a script that downloads videos from YouTube and recognizes the text shown in them. I decided to start right away with a practical implementation: "Vdudictionary", a Python script that collects the quotes of the guests of "vDud". Yury Dud and his project "vDud" need no introduction: these are the hottest interviews, and they are fun to watch. Yury Alexandrovich knows how to make an interesting show, whether you know the guest of a particular episode, are a fan, or are hearing the name for the first time.

How many cm do you have? What do you say to Putin? Do you listen to OXY?


These and many other questions are now firmly associated with Dud. When an interviewee utters a phrase full of wisdom, caring editors display it on our screens to make sure we get the point. My goal was to crystallize this wisdom of generations and compile a dictionary, the “Vududexicon” or “Vdudictionary”.

Naturally, nobody, even someone not burdened with an IT background, should have to collect these sayings by hand. So I sketched out a script in Python.


First of all, we need a file to process. To download videos from YouTube, I used the pytube module.

    pip install pytube

An example of downloading a file from YouTube:

    from pytube import YouTube

    a = YouTube('https://www.youtube.com/watch?v=RNbXm8WKmow')
    a.streams.first().download()

The file is downloaded. Now we methodically look for frames with concentrated meaning: the wisdom of our contemporaries, the sayings of the heroes of our time.



Older episodes had no rectangular caption plate, so there we can simply look for text at the bottom of the screen. In newer episodes you can use good old OpenCV to find the rectangle that frames the quote.

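    # findContours returns 3 values in OpenCV 3.x and 2 in 4.x;
    # the contour list is the second-to-last item in both
    contours = cv2.findContours(gray3, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2]
    for i in contours:
        cv2.drawContours(gray3, [i], 0, (0, 0, 255), 1)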

To install cv2 for Python 3 on a Raspberry Pi 3, I had to install many packages because of dependencies. Some of them may be redundant; blame my inexperience, this is simply how it all started.

    sudo apt-get install build-essential cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
    sudo apt-get -y install libopencv-dev
    sudo apt-get -y install build-essential checkinstall cmake pkg-config yasm
    sudo apt-get -y install libtiff4-dev libjpeg-dev libjasper-dev
    sudo apt-get -y install libavcodec-dev libavformat-dev libswscale-dev libdc1394-22-dev libxine-dev libgstreamer0.10-dev libgstreamer-plugins-base0.10-dev libv4l-dev
    sudo apt-get -y install python-dev python-numpy
    sudo apt-get -y install libtbb-dev
    sudo apt-get -y install libqt4-dev libgtk2.0-dev
    sudo apt-get -y install libfaac-dev libmp3lame-dev libopencore-amrnb-dev libopencore-amrwb-dev libtheora-dev libvorbis-dev libxvidcore-dev
    pip install opencv-python

Next we install Tesseract, the OCR engine needed for optical text recognition.

    sudo apt-get install tesseract-ocr
    sudo pip3 install pytesseract
    sudo pip3 install tesseract

The episodes use a very specific typeface, which makes recognition difficult. I used the general-purpose data set for the Cyrillic alphabet: download it and copy it to /usr/share/tesseract-ocr/tessdata.



The script takes the address of a video on YouTube as input. It downloads the file and processes one frame every 5 seconds. If there is a rectangle in the frame, the script cuts it out, converts it to grayscale, increases the contrast, and runs recognition. Strings no longer than 15 characters are discarded. You could, of course, keep the shorter strings too, but as one of the show's heroines said:
- I don't know, boys, how you live with such short strings.
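
The full script below hard-codes the video ID rather than parsing an address. As a minimal sketch of the "takes the address as input" step, assuming only the two common URL shapes (`youtube.com/watch?v=...` and `youtu.be/...`), the ID could be extracted with the standard library. The helper name `extract_video_id` is hypothetical and is not part of the original script.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url: str) -> str:
    """Return the video ID from a youtube.com/watch or youtu.be address."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links keep the ID in the path: https://youtu.be/<id>
        video_id = parsed.path.lstrip("/")
    elif host.endswith("youtube.com"):
        # Regular links keep it in the query string: ...watch?v=<id>
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        video_id = ""
    if not video_id:
        raise ValueError(f"cannot find a video ID in {url!r}")
    return video_id


if __name__ == "__main__":
    print(extract_video_id("https://www.youtube.com/watch?v=RNbXm8WKmow"))
```

The returned value can then play the role of `nameofvideo` in the script.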

The text, the time, and a link to that moment on YouTube are written to the log file. The script advances in 5-second steps (do not ask why this particular figure came to mind first; a check found no two different quotes shown within that interval). After that you can delete the video file and move on to the next episode.
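
To make the logging step concrete, here is a small sketch of how one log entry could be built. It assumes the log is an HTML table row, as in the full script, and it escapes the recognized text so that stray `<` or `&` characters from OCR do not break the markup. The function names and the arrow used as link text are illustrative assumptions.

```python
from html import escape


def youtube_link(video_id: str, second: int) -> str:
    """Address of the video that starts playback at the given second."""
    return f"https://www.youtube.com/watch?v={video_id}&t={second}"


def log_row(video_id: str, second: int, text: str) -> str:
    """One HTML table row: time, recognized text, link to that moment."""
    link = escape(youtube_link(video_id, second), quote=True)
    return (
        f"<tr><td>{second}</td><td>{escape(text)}</td>"
        f'<td><a href="{link}">&rarr;</a></td></tr>'  # arrow is an assumed link text
    )


if __name__ == "__main__":
    print(log_row("RNbXm8WKmow", 95, "LOVE KNITWEAR!"))
```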

Full script code:

Python 3 Script
    import os

    import cv2
    import numpy as np
    import pytesseract
    from pytube import YouTube

    nameofvideo = "RNbXm8WKmow"

    # Download the video and rename it to a fixed file name
    a = YouTube('https://www.youtube.com/watch?v=' + nameofvideo)
    a.streams.first().download()
    title = a.title
    title2 = title.replace("/", "").replace(",", "").replace(".", "") + ".mp4"
    os.rename(title2, "youtubefile.mp4")
    print(title)

    # The log is an HTML table: time, recognized text, link to the moment
    f = open('/var/www/python/' + str(nameofvideo) + '.txt', 'w')
    f.write(title + "<br>")
    f.write('<table><tr><td></td><td></td><td></td></tr>')

    # Characters to strip from the recognized text
    spisoksimvolovpodudalenie = ["*", "/", "|", "\\", ")", "(", "}", "{", "+", "`", "~", "№", ":", "$", "#", "@", "%", "[", "]", "&", "^", "' "]

    def udaleniesimvolov(stroka):
        for element in spisoksimvolovpodudalenie:
            stroka = stroka.replace(element, "")
        return stroka

    vidcap = cv2.VideoCapture('youtubefile.mp4')
    # Jump to the end of the file to read the duration
    vidcap.set(cv2.CAP_PROP_POS_AVI_RATIO, 1)
    durationsec = int(vidcap.get(cv2.CAP_PROP_POS_MSEC) / 1000)
    print("duration: " + str(durationsec) + " sec")

    for thissec in range(0, durationsec, 5):
        vidcap.set(cv2.CAP_PROP_POS_MSEC, thissec * 1000)
        success, image = vidcap.read()
        if not success:
            # Skip unreadable frames instead of crashing in cvtColor
            continue
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # discolor
        gray2 = cv2.addWeighted(gray, 1.5, gray, 0, 0.5)  # increase contrast
        gray3 = gray2[450:670, 0:1280]                    # lower part of the frame
        print(str(thissec) + " sec.")
        text = udaleniesimvolov(pytesseract.image_to_string(gray3, lang='rus'))
        if len(text) > 15:
            print(text)
            # Link text: the arrow seen in the sample output
            f.write('<tr><td>' + str(thissec) + '</td><td>' + text + '</td><td><a href="https://www.youtube.com/watch?v=' + nameofvideo + "&t=" + str(thissec) + '">&rarr;</a></td></tr>')
            print("----")

    f.write('</table>')
    f.close()
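
The crop `gray2[450:670, 0:1280]` in the script presumably assumes a 1280x720 frame. As a hedged sketch, the same band could be computed proportionally for other resolutions. The helper `caption_region` and its default fractions (taken from 450/720 and 670/720) are assumptions for illustration, not part of the original script.

```python
from typing import Tuple


def caption_region(height: int, width: int,
                   top_frac: float = 450 / 720,
                   bottom_frac: float = 670 / 720) -> Tuple[int, int, int, int]:
    """Return (top, bottom, left, right) of the caption band for a frame size.

    The default fractions reproduce the 450:670 band of a 720-pixel-high frame.
    """
    top = round(height * top_frac)
    bottom = round(height * bottom_frac)
    return top, bottom, 0, width


if __name__ == "__main__":
    top, bottom, left, right = caption_region(1080, 1920)
    # With a NumPy frame this would be used as: band = frame[top:bottom, left:right]
    print(top, bottom, left, right)
```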


Script work example
Tolokonnikova - bisexuality, FACE, prison / vDud
Time, c.Quote
95“ACTIVISTS DON'T SHOULD HATE MENT.
THEY SHOULD ATTEMPT TO UNDER
WHAT MENTA DECIDED TO BE A MENT »
→
195PETER VERZILOV - PARTICIPANT OF PUSSY RIOT
FORMER HUSBAND HUSBAND
→
255Ekaterina Samutsevich→
570LOVE KNITWEAR!→
595'vlyaDYMTSR sorbYPn
‚
→
990PETER VERZILOV IN YOUTH LIVED IN JAPAN TOGETHER WITH PARENTS.
FATHER PETRA - NUCLEAR PHYSICS
→
995PETER VERZILOV IN YOUTH LIVED IN JAPAN TOGETHER WITH PARENTS.
FATHER PETRA - NUCLEAR PHYSICS
→
127011 SEPTEMBER 2018 PETR Vrrzipov WAS hospitalization
ToksikovdnimdtsionovU BRANCH
City Clinical Hospital named after Vdhrushins
→
1275SEPTEMBER 15 WAS DELIVERED BY A PRIVATE AIRPLANE
TO BERLIN CLINIC SNASHTE
→
128018 SEPTEMBER 2018 BERLIN DOCTORS
RATHER FIRST, VERZILOV WAS POISONED WITH SCOPOLAMINE.
SEPTEMBER 26 WAS DRAWN FROM BERLIN CLINICS
→
1285SEPTEMBER 18, 2018 BERLIN DOCTORS DECLARED
RATHER FIRST, VERZILOV WAS POISONED WITH SCOPOLAMINE.
SEPTEMBER 26 WAS DRAWN FROM BERLIN CLINICS
→
1395"MEDIAZONA" - intvrnvt-publishing about the courts,
ARRESTS and rorsia. founded in 2014
HOPES of Tol_Konnikov and MARIA Alekhinoi
→
1590“If something is a scary advantage? None, '
that you MUST FROM YOURSELF
→
1760Yoko it - PUBLIC FIGURE, WEDDING JOE "...
PE
VICA, ARTIST,
→
2040"IF SOME PARENTS ARE MAD,
THIS, RATHER FOR A RESPECT! - "
→
2330"Maternal"
→
2425GRAD KITEZH - A BATHING CITY, _DISTENED, according to the commitment,
IN THE EVERY PART OF THE NIZHNY NOVGOROD REGION,
ON THE SHORES OF THE SVETLOYAR LAKE
→
2515"We are a lovers and lovers
key and WRITERS »
→
2550NOW- IN THE HARD OPPOSITION OF ROOSII ›. '
LIVES AND WORKS IN THE USA
→
2745TOLOKONNIKOVA TRAINED IN PRISON 661 DAY.
. FROM 3 MARCH 2012 TO DECEMBER 23, 2013
At md.
→
2985VPTN - TERM, IDENTIFIED ON SLENG
LGBT COMMUNITY MALE GIRL'S GIRL '‚
00 SHEETS "UNDER MALOYI_K_A"
→
2990VTSTSN - TERM, IDENTIFY ON SLENG _
LGBT COMMUNITY MALE GIRL LIKE
WITH CUTTING "UNDER THE BOY"
→
3280"SUCH RUSSIAN TIMES" YOU
; „B?
→
3290SHIZO - PENAL INSULATOR. DEPARTMENT OF CORRECTIONAL INSTITUTION,
WHERE THE CAMERAS FOR NARSHYTELEYI CONTENT MODE.
PERSONS LOCATED IN FINE INSULATOR,
SIGNIFICANTLY LIMITED TO RIGHTS
‚›, - "
→
3315“A MAN who sits for a long time,“ E;
REVIEWS HIS LIFE ”; 3
→
3510AFTER THE SHARE IN THE TEMPLE OF CHRIST - '‚PASTER WERE DETAINED AND
THERE ARE CONDEMNED THREE PARTICIPANTS OF KYUT RPZZU
HOPE TOLOKONNIKOVA, MARIA ALEKHIN and EKATERINA SAMUTSEVICH
→
3540EKATERINA SAMUM
GOT TWO YEARS CONDITIONALLY
→
3660“YOUNG PEOPLE are wildly trashing. __
that there is NO SEXUDAL ACCOUNT for the VAT; '
→
3740HOPE OF TOLOKONNIKOV DOUBLE DECLINED HUNGER
IN THE MORDIC COLONY N ° 14 ON THE REQUIREMENT OF TRANSFER
IN ANOTHER PLACE OF DEPARTURE OF PUNISHMENT
→
4275SPEECH ON SHARES OF RYUT RYUTS SMILIZINGER JOINS THE GAME »_
In the final of the 2018 World Cup in Moscow
→
4495‹
'
"COMBINATION OF GAME AND POSITION"
→
4735“IF I WILL BE BAD RHYTHM AND GOOD,
I WILL CHOOSE BAD. ”
→
4755“ZOO PARK HISTORY”
→
4800BERNIE SANDERS - CANDIDATE IN US PRESIDENTS
FOR THE ELECTION 2016_ODA. LOST PREMISES
DEMOCRATIC PARTY HIPPARI CLINTON
-
→
4820. ZADRTS S IN
Persistence
_umvdiv speak nd RaznBŃ–h yazydkh
→
4865“Nice torch”
→
5055"" "
“REP is a comprehension of reality”
→
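
The output repeats some captions on neighboring samples (for example at 990 and 995 seconds), because a caption can stay on screen longer than the 5-second step. One possible post-processing step, sketched here with the standard `difflib` module, is to drop a row when its text is nearly identical to the previously kept one. The function name `drop_repeats` and the 0.8 similarity threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def drop_repeats(rows, threshold=0.8):
    """Keep the first row of each run of near-identical captions.

    rows is a list of (second, text) pairs in chronological order.
    """
    kept = []
    for second, text in rows:
        if kept:
            similarity = SequenceMatcher(None, kept[-1][1], text).ratio()
            if similarity >= threshold:
                continue  # same caption seen again on the next sample
        kept.append((second, text))
    return kept


if __name__ == "__main__":
    sample = [
        (990, "PETER VERZILOV IN YOUTH LIVED IN JAPAN"),
        (995, "PETER VERZILOV IN YOUTH LIVED IN JAPAN"),
        (1270, "11 SEPTEMBER 2018"),
    ]
    for second, text in drop_repeats(sample):
        print(second, text)
```

A fuzzy comparison is used instead of strict equality because OCR rarely reads the same caption identically twice.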


The script clearly has some trouble recognizing the show's “specific” typeface. I see the solution in refining the dictionary file for the OCR engine and in post-processing the text with PyEnchant.
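
The idea of dictionary-based post-processing can be sketched without PyEnchant. The example below uses `difflib.get_close_matches` from the standard library as a stand-in for a real spell checker, and a tiny hand-made vocabulary; both are assumptions for illustration, and a real run would need a proper word list for the recognized language.

```python
from difflib import get_close_matches


def correct_words(text, vocabulary, cutoff=0.7):
    """Replace each token with its closest vocabulary word, if one is close enough."""
    lookup = {word.lower(): word for word in vocabulary}
    fixed = []
    for token in text.split():
        match = get_close_matches(token.lower(), list(lookup), n=1, cutoff=cutoff)
        # Tokens with no sufficiently similar word are left untouched
        fixed.append(lookup[match[0]] if match else token)
    return " ".join(fixed)


if __name__ == "__main__":
    vocabulary = ["PETR", "VERZILOV", "WAS", "HOSPITALIZED"]  # hypothetical word list
    print(correct_words("PETR Vrrzipov WAS", vocabulary))
```

This sketch splits on whitespace only, so punctuation stuck to a word lowers its similarity score; a fuller version would tokenize more carefully.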

With a little refinement, the script could be used to find subtitles embedded in the picture, recognize them, and automatically translate them into another language.

If you can help Yury find out about this experiment, please do so without delay. #habr #vdudictionary VK, FB.

Thanks for your attention! The script and this post are the result of a flight of fancy while I was exploring OpenCV for my project, a robot that collects golf balls.

Source: https://habr.com/ru/post/428147/

