Among social networks, Twitter is particularly well suited for text mining: the hard limit on message length forces users to pack in only the essentials.
Can you guess which technology is pictured in this word cloud?
Using the Twitter API, you can extract and analyze a wide variety of information. This article shows how to do it with the R programming language.
Writing the code does not take much time; the difficulties come from the changes and tightening of the Twitter API. The company seems to have become seriously concerned about security after being hauled before the US Congress during the investigation into the influence of “Russian hackers” on the 2016 US elections.
Why would anyone need to extract data from Twitter commercially? For example, it helps to make more accurate predictions about the outcomes of sporting events. But I am sure there are other use cases.
To start, you obviously need a Twitter account with a verified phone number. Then you need to create an application; this step grants access to the API.
Go to the developer page and click the Create an app button. Next comes a page where you need to fill in information about the application. Among the required fields is the callback URL, which for working from R should be set to:
http://127.0.0.1:1410
The rest of the fields are optional: the address of the terms-of-service page, the name of the organization, and so on.
When creating a developer account, select one of three options.
I chose Premium and had to wait about a week for approval. I cannot say whether they grant it to everyone, but it is worth trying in any case, and Standard isn't going anywhere.
After you have created the application, a set of credentials will appear in the Keys and tokens tab. The names below are paired with the corresponding R variables:
Consumer API keys: api_key, api_secret
Access token & access token secret: access_token, access_token_secret
Install the necessary packages.
install.packages("rtweet") install.packages("tm") install.packages("wordcloud")
This part of the code will look like this.
library("rtweet") api_key <- "" api_secret <- "" access_token <- "" access_token_secret <- "" appname="" setup_twitter_oauth ( api_key, api_secret, access_token, access_token_secret)
After authentication, R will offer to save the OAuth credentials to disk for later use.
[1] "Using direct authentication" Use a local file to cache OAuth access credentials between R sessions? 1: Yes 2: No
Both options are acceptable; I chose the first.
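Note that setup_twitter_oauth() comes from the twitteR package; rtweet has its own authentication function, create_token(), which also appears in the full listing at the end of this article. A minimal sketch with the same credentials (the app name is whatever you entered on the developer page):

appname <- "my_app"
# create_token() builds an rtweet OAuth token and caches it for future sessions
twitter_token <- create_token(app = appname,
                              consumer_key = api_key,
                              consumer_secret = api_secret,
                              access_token = access_token,
                              access_secret = access_token_secret)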
tweets <- search_tweets("hadoop", include_rts=FALSE, n=600)
The include_rts argument controls whether retweets are included in or excluded from the search. The output is a table with many fields containing the details of each record. Here are the first 20:
> head(names(tweets), n=20)
 [1] "user_id"              "status_id"            "created_at"
 [4] "screen_name"          "text"                 "source"
 [7] "display_text_width"   "reply_to_status_id"   "reply_to_user_id"
[10] "reply_to_screen_name" "is_quote"             "is_retweet"
[13] "favorite_count"       "retweet_count"        "hashtags"
[16] "symbols"              "urls_url"             "urls_t.co"
[19] "urls_expanded_url"    "media_url"
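For a quick look at the content itself, you can pull out a few of these columns (a small sketch; the column names are taken from the listing above):

# subset the result table to the columns of interest
tweets_view <- tweets[, c("created_at", "screen_name", "text", "retweet_count")]
head(tweets_view, n = 5)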
You can make a more complex search string.
search_string <- paste0(c("data mining","#bigdata"), collapse = "+")
search_tweets(search_string, include_rts=FALSE, n=100)
Search results can be saved in a text file.
write.table(tweets$text, file="datamine.txt")
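If you need the saved texts in a later session, they can be read back in (a sketch; by default write.table stores the character vector as a single column named x):

# read the saved tweet texts back into a character vector
saved <- read.table("datamine.txt", stringsAsFactors = FALSE)
tweets_text <- saved$x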
Next we merge the results into a text corpus, filter out stop words and punctuation, and convert everything to lower case.
There is another search function, searchTwitter, which requires the twitteR library. In some ways it is more convenient than search_tweets, and in some ways inferior to it. A plus: it has a time filter.
tweets <- searchTwitter("hadoop", since="2017-09-01", n=500)
text <- sapply(tweets, function(x) x$getText())
A minus: the output is not a table but a list of objects of type status. To use it in our example, you need to extract the text field from the output, which is what sapply does in the second line.
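Alternatively, the twitteR package provides twListToDF(), which converts the list of status objects into a data frame, so the text can be taken as an ordinary column:

library("twitteR")
# one row per tweet; the tweet body ends up in the "text" column
tweets_df <- twListToDF(tweets)
text <- tweets_df$text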
corpus <- Corpus(VectorSource(tweets$text))
clearCorpus <- tm_map(corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
tdm <- TermDocumentMatrix(clearCorpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = c("com", "https", "hadoop", stopwords("english")),
                                         removeNumbers = TRUE,
                                         tolower = TRUE))
In the second line, the tm_map call is needed to re-encode any emoji characters; otherwise, the conversion to lower case with tolower would end with an error.
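Before plotting, it is worth sanity-checking the resulting term-document matrix. For example, tm's findFreqTerms() lists the terms occurring at least a given number of times (the threshold of 10 here is arbitrary):

# terms appearing at least 10 times across the corpus
findFreqTerms(tdm, lowfreq = 10)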
As far as I know, word clouds first appeared on the Flickr photo-hosting site and have since gained popularity. For this task we need the wordcloud library.
m <- as.matrix(tdm)
word_freqs <- sort(rowSums(m), decreasing=TRUE)
dm <- data.frame(word=names(word_freqs), freq=word_freqs)
wordcloud(dm$word, dm$freq, scale=c(3, .5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
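If you want the cloud in a file rather than in the plot window, wrap the call in a graphics device (base R; the file name and dimensions are arbitrary):

# render the word cloud into a PNG file instead of the screen
png("wordcloud.png", width = 800, height = 600)
wordcloud(dm$word, dm$freq, scale=c(3, .5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
dev.off()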
The search_tweets function also allows you to set a language as a parameter.
search_tweets(search_string, include_rts=FALSE, n=100, lang="ru")
However, the NLP packages for R are poorly Russified; in particular, I did not find a ready-made list of Russian stop words, so I could not build a word cloud from a Russian-language search. I would be glad if someone suggests a better solution in the comments.
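One possible workaround, which I have not verified on this data: the stopwords package ships Snowball stop-word lists for Russian, and its output can be plugged straight into the TermDocumentMatrix call:

install.packages("stopwords")
# Snowball stop-word list for Russian (namespaced to avoid masking tm::stopwords)
ru_stop <- stopwords::stopwords("ru", source = "snowball")
tdm <- TermDocumentMatrix(clearCorpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = ru_stop,
                                         removeNumbers = TRUE,
                                         tolower = TRUE))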
Well, here is the complete listing:
library("rtweet") library("tm") library("wordcloud") api_key <- "" api_secret <- "" access_token <- "" access_token_secret <- "" appname="" setup_twitter_oauth ( api_key, api_secret, access_token, access_token_secret) oauth_callback <- "http://127.0.0.1:1410" setup_twitter_oauth (api_key, api_secret, access_token, access_token_secret) appname="my_app" twitter_token <- create_token(app = appname, consumer_key = api_key, consumer_secret = api_secret) tweets <- search_tweets("devops", include_rts=FALSE, n=600) corpus <- Corpus(VectorSource(tweets$text)) clearCorpus <- tm_map(corpus, function(x) iconv(enc2utf8(x), sub = "byte")) tdm <- TermDocumentMatrix(clearCorpus, control = list(removePunctuation = TRUE, stopwords = c("com", "https", "drupal", stopwords("english")), removeNumbers = TRUE, tolower = TRUE)) m <- as.matrix(tdm) word_freqs <- sort(rowSums(m), decreasing=TRUE) dm <- data.frame(word=names(word_freqs), freq=word_freqs) wordcloud(dm$word, dm$freq, scale=c(3, .5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
Links:
https://stats.seandolinar.com/collecting-twitter-data-getting-started/
https://opensourceforu.com/2018/07/using-r-to-mine-and-analyse-popular-sentiments/
http://dkhramov.dp.ua/images/edu/Stu.WebMining/ch17_twitter.pdf
http://opensourceforu.com/2018/02/explore-twitter-data-using-r/
https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
P.S. A hint: the keyword of the cloud in the title image is not used in the program; it relates to my previous article.
Source: https://habr.com/ru/post/426657/