
Machines have recently won a series of convincing victories over humans: they already outplay us at Go, chess, and even Dota 2. Algorithms compose music and write poetry. Scientists and entrepreneurs around the world predict a future in which artificial intelligence will far surpass humans. In a few decades we will quite probably live in a world where robots not only drive cars and work in factories but also entertain us. Humor is an important part of our lives, and it is commonly believed that only humans can joke. Nevertheless, many scientists, engineers, and even ordinary people wonder: can a computer be taught to joke?
Gentleminds, a developer of machine learning and computer vision systems, together with FunCorp tried to build a generator of funny image captions using the iFunny meme database. Since the app is English-language and used primarily in the United States, the captions are in English. Details under the cut.
Unlike musical composition, which is governed by the laws of harmony, the nature of what makes us laugh is very hard to describe. Sometimes we can hardly explain it ourselves. Many researchers believe a sense of humor is one of the last frontiers artificial intelligence must cross to get as close as possible to humans.
Studies suggest that the human sense of humor evolved over a long time under the influence of sexual selection, which may explain the positive correlation between intelligence and a sense of humor. Even today we treat humor as a good marker of intelligence. The ability to joke draws on complex skills such as command of language and breadth of knowledge; language proficiency is essential for certain kinds of humor (British humor, for example), which rely heavily on wordplay. In short, teaching an algorithm to joke is no easy task.
Researchers around the world have tried to teach computers to joke. Janelle Shane, for instance, created a neural network that writes knock-knock jokes. It was trained on a dataset of 200 of them. On the one hand, this is a fairly easy task for AI, since all the jokes share the same structure. On the other hand, the neural network merely finds associations between words in a small dataset without assigning those words any meaning. The result is jokes cut from the same template that in most cases can hardly be called funny.
Researchers from the University of Edinburgh, in turn, presented a method for teaching a computer jokes of the form "I like my X like I like my Y: Z." The main contribution of this work is the first fully unsupervised humor generation system. The resulting model significantly outperforms the baseline, producing jokes that people rate as funny in 16% of cases. The authors use only a large amount of unannotated data, which suggests that generating a joke does not always require deep semantic understanding.
Scientists from the University of Washington created a system that comes up with risqué jokes following the "that's what she said" (TWSS) pattern, a well-known family of jokes that regained popularity thanks to the TV series The Office. The TWSS task has two distinctive characteristics: first, the use of nouns that serve as euphemisms for sexually explicit nouns, and second, structural ambiguity. To solve it, the authors built the Double Entendre via Noun Transfer (DEviaNT) system. In 72% of cases DEviaNT knew when to say "that's what she said", a strong result for this type of natural-language program.
The authors of another article present a neural-network model for joke generation. It can produce a short joke on a user-specified topic: an encoder represents the topic information, and an RNN decoder generates the joke. The model is trained on short Conan O'Brien jokes preprocessed with a POS tagger. Quality was judged by five English speakers. On average, the model outperforms the probabilistic fixed-structure approach from the University of Edinburgh described above.
Researchers from Microsoft have also tried to teach a computer to joke. Using The New Yorker cartoon caption contest as training data, they developed an algorithm that picks the funniest captions from the thousands submitted by readers.
As all these examples show, teaching a machine to joke is not easy. Moreover, the task has no universal quality metric, since everyone may perceive the same joke differently, and the very phrase "come up with a funny joke" is hardly concrete.
In our experiment we decided to simplify the task a little by adding context: an image. The system had to come up with a funny caption for it. On the other hand, this also complicated things, since a second modality was added and the algorithm had to learn to match text with a picture.
The task of creating a funny picture caption can be reduced either to choosing a suitable one from an existing database or to generating a new one. We tried both approaches.
We relied on a database provided by iFunny. It contained 17,000 memes, each of which we split into two components: a picture and a caption. We used only memes in which the text was located strictly above the picture.
We tried two approaches:
- caption generation (with a Markov chain in one case and a recurrent neural network in the other);
- selection of the most suitable caption for the image from the database, based on the visual component. In the first variant we searched for a caption inside clusters built over the memes. In the second, based on Word2VisualVec and dubbed Membedding here, we tried to map images and text into a single vector space in which a relevant caption lies close to its image.
Each approach is described in more detail below.
Database analysis
Any machine learning study begins with data analysis. First of all, we wanted to understand what kinds of images the database contained. Using a classification network trained on https://github.com/openimages/dataset , we obtained for each image a vector of per-category scores and clustered the images based on these vectors. Five large groups emerged:
- People
- Food
- Animals
- Cars
- Animation
The clustering results were later used to build the baseline solution.
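A minimal sketch of this clustering step, assuming the per-category score vectors are already extracted and using scikit-learn (the variable names, the file name, and the library choice are ours, for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# category_scores: (n_images, n_categories) matrix of per-category
# ratings produced by the classification network for each meme image
category_scores = np.load("category_scores.npy")  # hypothetical file

# Cluster the score vectors into 5 groups
# (people, food, animals, cars, animation)
kmeans = KMeans(n_clusters=5, random_state=0).fit(category_scores)

# kmeans.labels_[i] is the cluster id of the i-th image; the
# centroids are reused below for nearest-cluster lookup
labels, centroids = kmeans.labels_, kmeans.cluster_centers_
```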
To assess the quality of the experiments, we collected a test set of 50 images covering the main categories. Quality was judged by a panel of "experts" who decided whether each caption was funny or not.
Search by cluster
This approach finds the cluster closest to the input picture and picks a caption from it at random. The image descriptor was computed with the categorization neural network, and we used the 5 clusters obtained earlier with k-means: people, food, animals, animation, cars.
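Conceptually, the lookup might look like this (a sketch; `describe` and `captions_by_cluster` are hypothetical helpers standing in for the categorization network and the per-cluster caption lists):

```python
import random

def caption_by_cluster(image, kmeans, captions_by_cluster, describe):
    """Pick a random caption from the cluster nearest to the image."""
    # describe(image) -> per-category score vector, the same features
    # the k-means clusters were built on
    descriptor = describe(image).reshape(1, -1)
    cluster_id = int(kmeans.predict(descriptor)[0])  # nearest centroid
    return random.choice(captions_by_cluster[cluster_id])
```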
Examples of the results are shown below (the images themselves are omitted). Since the clusters were quite large and their contents could still differ greatly in meaning, the ratio of captions suitable for a picture to unsuitable ones was roughly 1 to 5. One might think this was because there were only 5 clusters, but in fact, even when the cluster was determined correctly, it still contained a large number of unsuitable captions.
- Me buying food vs Me buying textbooks
- Boss: It says here that you love science Guy: Ya, I love to experiment Boss: What do you experiment with? Guy: Mostly just drugs and alcohol
- Cop: Did you get a good look at the suspect? Guy: Yes Cop: Was it a man or a woman? Guy: I don't know them
- hillary: why didn't you tell me they were reopening the investigation? obama: bitch we emailed you
- "Ma'am do you have a permit for this business?" Girl: does it look like I'm selling fucking donuts?!
- For a couple of days because their queen was trapped inside the car.
- I found the guy in those problems with all the watermelons ...
- So that's what those orange cones were for
Search by visual similarity
Our attempt at clustering suggested we should narrow the search space further. While the clusters remained very diverse inside, searching for the database picture most similar to the incoming one might yield better results. In this experiment we again used the neural network trained on 7,880 categories. First, we ran all images through the network and saved both the top-5 categories by score and the values of the penultimate layer (which holds both visual and category information). When searching for a caption, we computed the top-5 categories of the query picture and looked through the entire database for images with the most similar categories. Of these we took the 10 nearest and randomly chose a caption from that set. We also ran an experiment searching by the penultimate-layer values; the results of the two methods were similar. On average there were 1-2 successful captions for every 5 unsuccessful ones. This may be because the captions of visually similar photos of people depend heavily on the emotions of the people in the photo and on the situation itself. Examples are given below (captions only).
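A sketch of this nearest-neighbor search over penultimate-layer descriptors (the analogous search over top-5 category sets is similar; the function and variable names here are our own):

```python
import random
import numpy as np

def caption_by_similarity(query_vec, db_vecs, db_captions, k=10):
    """Take the k visually closest database images and pick a
    random caption among them."""
    # Cosine similarity between the query descriptor (dim 2048)
    # and the descriptors of every image in the database
    db_norm = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    q_norm = query_vec / np.linalg.norm(query_vec)
    sims = db_norm @ q_norm
    nearest = np.argsort(sims)[-k:]  # indices of the k most similar images
    return db_captions[random.choice(nearest)]
```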
- Me buying food vs Me buying textbooks
- Don't Act Like You Know Politics If You Don't Know Who This Is Q
- when u tell a joke and no one else laughs
- When good looking people have no sense of humor #whatawaste
- Assholes, meet your king.
- Free my boy
- I guess they didn't read the license plate
- When someone starts telling you how to drive from the backseat
Membedding, or finding the most appropriate caption by mapping the image descriptor into the vector space of text descriptors
The goal of Membedding is to construct a space in which the vectors we care about are "close" to each other. Let's try the approach from the Word2VisualVec article.
We have pictures and captions for them. We want to find text that is “close” to the image. In order to solve this problem, we need:
- construct a vector describing the image;
- construct a vector describing the text;
- construct a vector space with the desired properties (the text vector is “close” to the image vector).
To build a vector describing the image, we use the neural network pre-trained on 6,000+ classes from https://github.com/openimages/dataset . As the vector we take the output of the network's penultimate layer, of dimension 2048.
Two approaches were used to vectorize the text: Bag of Words and Word2Vec. Word2Vec was trained on the words of all the image captions. A caption was transformed as follows: each word was translated into a vector with Word2Vec, and those vectors were averaged into a single mean vector. An image was then fed into a neural network trained to predict this mean vector as its output. To "embed" image descriptors into the vector space of text, we used a three-layer fully connected neural network.
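A minimal sketch of this projection network in PyTorch (the hidden-layer sizes, the Word2Vec dimension of 300, and the MSE loss are our assumptions; only the 2048-dim input is stated above):

```python
import torch
import torch.nn as nn

class Membedding(nn.Module):
    """Three-layer fully connected net mapping a 2048-dim image
    descriptor into the Word2Vec space of caption vectors."""
    def __init__(self, text_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2048, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, text_dim),
        )

    def forward(self, image_descriptor):
        return self.net(image_descriptor)

# Training pairs: (image descriptor, mean Word2Vec vector of its caption)
model = Membedding()
loss_fn = nn.MSELoss()  # one plausible regression loss for this setup
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```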
Using the trained model, we compute vectors for the entire caption database.
Then, for a query image, we obtain its descriptor from the convolutional network, project it into the text space, and look for the caption vectors closest in cosine distance. We can take the single closest caption or pick randomly from the n closest.
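Continuing the sketch above, retrieval might look like this (`caption_vecs` is assumed to hold the precomputed mean Word2Vec vectors of all captions):

```python
import numpy as np
import torch

def find_captions(image_descriptor, caption_vecs, captions, n=5):
    """Project an image descriptor into the caption space and
    return the n captions closest by cosine distance."""
    with torch.no_grad():
        query = model(torch.from_numpy(image_descriptor).float()).numpy()
    c = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    order = np.argsort(c @ q)[::-1]  # most similar first
    return [captions[i] for i in order[:n]]  # take the best or sample from these
```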
Good examples | Bad examples
---|---
How do you show up to your ex's funeral | Being a teacher in 2018 summed up in one image.
You shall not pass me | Me: Be careful closing the door Passenger:
For the Bag of Words representation of the text we proceed as follows: count the frequency of three-letter combinations (character trigrams) in the captions, discard those occurring fewer than three times, and build a dictionary from the remaining combinations. To convert a text into a vector, we count the occurrences of each dictionary trigram in it, obtaining a vector of dimension 5322.
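A sketch of this character-trigram Bag of Words (assuming `captions` is the list of all caption strings):

```python
from collections import Counter

def char_trigrams(text):
    """All overlapping three-letter combinations in a caption."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

# Dictionary: trigrams occurring at least 3 times across all captions
counts = Counter(t for caption in captions for t in char_trigrams(caption))
kept = sorted(t for t, c in counts.items() if c >= 3)
vocab = {t: i for i, t in enumerate(kept)}  # ~5322 entries on our corpus

def bow_vector(text):
    """Count occurrences of dictionary trigrams in the text."""
    vec = [0] * len(vocab)
    for t in char_trigrams(text):
        if t in vocab:
            vec[vocab[t]] += 1
    return vec
```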
Result (the 5 "closest" captions; the images themselves are omitted):
First image:
- When she sends you
- when ur enjoying the warm weather in december but deep down u know this because of global warming
- The stress of not winning an oscar beginning to take its toll on Leo
- Dear God, please make our the next American president as strong as this yellow button. Amen.

Second image:
- My laptop is set up incorrect password attempts.
- My cat isn't thrilled with his new bird saving bib ...
- jokebud tell your cat "he's a fucking pussy"
- This cat looks just like Kylo Ren from Star Wars

Third image:
- Ugly guys winning bruh QC
- Single mom dresses as dad so her son wouldn't miss "Donuts With Dad" day at school
- Steak man gotta relax ....
- My friend went to prom with two dates. It didn't go as planned ...
For similar images, the captions are almost the same:
First image:
- My girlfriend can take beautiful photos of our cat. I seemingly can't ...
- My laptop is set up incorrect password attempts.
- This cat looks just like Kylo Ren from Star Wars

Second image:
- Cats constantly look at you like you just asked them for a ride to the airport
- My laptop is set up incorrect password attempts.
- Here's my cat, sitting on the best wedding gift we received a picture of a face on it ...
As a result, the ratio of successful examples to bad ones turned out to be approximately 1 to 10. This is most likely explained by the small number of universal captions, as well as by the large share of memes in the training sample whose captions only make sense if the viewer has some prior knowledge.
Caption generation: the WordRNN approach
This method is based on a two-layer recurrent neural network in which each layer is an LSTM. The key property of such networks is their ability to extrapolate sequences in which each next value depends on the previous ones, and a caption is exactly such a sequence.
The network was trained to predict each next word in the text, with the entire corpus of captions as the training sample. We assumed such a network could learn to generate captions that were somewhat meaningful, or at least funny in their absurdity.
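A sketch of such a model in PyTorch (the embedding and hidden sizes are our guesses; the text states only that it is a two-layer LSTM predicting the next word):

```python
import torch
import torch.nn as nn

class WordRNN(nn.Module):
    """Two-layer LSTM language model over caption words."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        x, state = self.lstm(self.embed(tokens), state)
        return self.out(x), state  # logits over the next word

def generate(model, first_word_id, id2word, max_len=20):
    """Seed with the first word and sample the rest, as in the experiment."""
    tokens = torch.tensor([[first_word_id]])
    state, words = None, [id2word[first_word_id]]
    for _ in range(max_len - 1):
        logits, state = model(tokens, state)
        probs = torch.softmax(logits[0, -1], dim=-1)
        tokens = torch.multinomial(probs, 1).unsqueeze(0)
        words.append(id2word[tokens.item()])
    return " ".join(words)
```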
Only the first word was given as a seed; the rest was generated. The results were as follows:
Trump: The Trump Cats

Obama: obama LAUGHING dropping FAVORITE 4rd FAVORITE 4rd fucking long

Asian: Asian RR II

Cat: cat

Car: Car Crispy "Emma: please" BUS 89% Starter be disappointed my pizza penises? Ppl

Contrary to expectations, the resulting captions were mostly just collections of words, although in places the sentence structure was imitated rather well and some fragments were meaningful.
Caption generation using Markov chains
Markov chains are a popular approach to natural language modeling. To build one, the text corpus is split into tokens, for example words. Groups of consecutive tokens serve as states, and the probability of transitioning from each state to the next word in the corpus is computed. During generation, the next word is sampled from this probability distribution.
This library was used for the implementation, and captions cleared of dialogues served as the training corpus. Each new line is a separate caption.
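For illustration, here is a minimal self-contained sketch of the technique itself (our experiments used a ready-made library rather than this code):

```python
import random
from collections import defaultdict

def build_chain(captions, state_size=2):
    """Map each state (a tuple of state_size consecutive words) to
    the words that follow it in the corpus; repeats act as weights."""
    chain = defaultdict(list)
    for caption in captions:  # one caption per line of the corpus
        words = caption.split()
        for i in range(len(words) - state_size):
            chain[tuple(words[i:i + state_size])].append(words[i + state_size])
    return chain

def generate(chain, state_size=2, max_len=15):
    """Start from a random state and sample word by word."""
    state = random.choice(list(chain))
    words = list(state)
    while len(words) < max_len:
        followers = chain.get(tuple(words[-state_size:]))
        if not followers:
            break
        # sampling proportional to frequency in the corpus
        words.append(random.choice(followers))
    return " ".join(words)
```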
Result (state: 2 words):

when your homies told you m 90 to

dwayne johnson & the rock are twins. it is a patriotic flag tan.

tryna get your joke

If you are not ready to go to work out please don't be sewing supplies ...

justin hanging with his legos

It is 9h and calling weed

texing a girl that can save meek

Result (state: 3 words):

for the answers for the homework

when u ugly so u gotta get creative

my dog is gonna die

when you hear a song

your girl goes out and you actually cleaned

chuck voted for prom queen 150 times and you decide to start making healthier choices.

there are more on the stove

when you see your passcode

With a three-word state the generated text is more meaningful than with two, but it is still hardly suitable for direct use. It could perhaps be used to generate captions that a human moderator then filters.
Instead of a conclusion
Teaching an algorithm to write jokes is an incredibly difficult but very interesting task. Solving it would make intelligent assistants more "human". Imagine, for example, the robot from the film Interstellar, with an adjustable humor level and jokes that are unpredictable, unlike those of today's assistants.
After all the experiments described above, we can draw the following conclusions:
- The caption-generation approach requires very complex and time-consuming work on the text corpus, the training procedure, and the model architecture; it is also very hard to predict its results.
- Selecting a caption from an existing database gives more predictable results, but it comes with its own difficulties:
- memes whose meaning can only be understood with prior knowledge: they are hard to separate from the rest, and if they get into the database, the quality of the jokes drops;
- memes that require understanding what is happening in the picture (the action, the situation): these, too, reduce quality when they get into the database.
- From an engineering standpoint, a suitable solution at this stage seems to be careful curation of phrases by an editorial team for the most popular categories: selfies (people usually test the system on themselves or on photos of friends and acquaintances), celebrity photos (Trump, Putin, Kim Kardashian, etc.), pets, cars, food, and nature. One could also add an "everything else" category with prepared jokes for cases where the system cannot recognize what is in the picture.
In short, artificial intelligence today cannot generate jokes (though not every human can either), but it is quite capable of choosing a suitable one. We will keep following developments in this area and taking part in them!