The speech that US President Donald Trump gave to the United Nations General Assembly on Sept. 19, 2017 was fascinating for several reasons. For me personally, it was interesting because it included surprisingly complex sentences and statements, uncharacteristic of Trump’s previous talks. It was also intriguing because the speech was undiplomatic and fierce to the point of steering the world to the brink of thermonuclear war. At least North Korea is genuinely pissed.

As can be expected, when a president addresses the UN, several countries are mentioned, some in a favorable and others in a negative context. I wanted to connect the sentiments of the speech to the countries mentioned, and see the emotional context in which each country appeared in the talk. This analysis of course does not reflect the USA’s opinion of these countries, as the text is too short for that, and some countries are mentioned only a couple of times.

library(magrittr)   # Pipes
library(tidyverse)  # Data wrangling and plotting
library(rvest)      # Web scraping
library(tidytext)   # Tidy text mining
library(forcats)    # Factor handling
library(wordcloud)  # comparison.cloud()
library(wordcloud2) # Interactive word clouds
library(rworldmap)  # World map data
library(stringr)    # String manipulation
library(ggrepel)    # Non-overlapping plot labels
# Get speech excerpt
url <- "http://www.politico.com/story/2017/09/19/trump-un-speech-2017-full-text-transcript-242879"
speech_excerpt <- 
        read_html(url) %>% # Download the whole page
        html_nodes("style~ p+ p , .lazy-load-slot+ p , .fixed-story-third-paragraph+ p , .story-related+ p , p~ p+ p") %>% # Select the required elements (by CSS selector)
        html_text() %>% # Extract the text
        .[-2] %>% # Remove some unrelated page text
        gsub("Mr.", "Mr", ., fixed = T) %>% # Make sure that dots in the text will not signify sentences
        gsub("Latin America","latin_america",.) %>% # Not to confuse with USA
        gsub("United States of America", "usa", .) %>% # USA has to be preserved as one expression
        gsub("United States", "usa", .) %>% 
        gsub("America", "usa", .) %>% 
        gsub("Britain", "uk", .) %>% # UK is mentioned 
        gsub("North Korea", "north_korea", .) %>% # North Korea should be preserved as one word for now
        data_frame(paragraph = .)

# Tokenize by sentence                        
speech_sentences <- 
        speech_excerpt %>% 
        unnest_tokens(sentence, paragraph, token = "sentences")
        
# Here comes a nasty manual stemming of country names, wrapped in a function so it is not repeated below. Sadly, I failed to get satisfactory results on country names with standard stemmers (I tried SnowballC, hunspell, and textstem). I also tried to create a custom dictionary with added country names, to no avail. What am I missing? Anyway, this works.
stem_country_names <- function(word) {
        word %>% 
                str_replace_all("'s$", "") %>% # Cut possessive 's
                if_else(. == "iranian", "iran", .) %>% 
                if_else(. %in% c("usans", "north koreans"), str_replace(., "ns$", ""), .) %>% 
                if_else(. %in% c("usan", "syrian", "african", "cuban", "venezuelan"), str_replace(., "n$", ""), .)
}

# Tokenize by word
speech_words <- 
        speech_excerpt %>% 
        unnest_tokens(word, paragraph, token = "words") %>% 
        mutate(word = gsub("_", " ", word)) %>% # Restore spaces in multi-word country names
        mutate(word = stem_country_names(word))
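
As an aside, the suffix rules above could be replaced with an explicit lookup table, which is arguably more transparent. Here is a minimal sketch using dplyr::recode(); the mapping is an assumption that covers only the variants appearing in this speech:

# Hypothetical lookup table: demonym/plural -> country name
country_lookup <- c("iranian"       = "iran",
                    "usan"          = "usa",
                    "usans"         = "usa",
                    "north koreans" = "north korea",
                    "syrian"        = "syria",
                    "african"       = "africa",
                    "cuban"         = "cuba",
                    "venezuelan"    = "venezuela")

speech_excerpt %>% 
        unnest_tokens(word, paragraph, token = "words") %>% 
        mutate(word = gsub("_", " ", word) %>% 
                       str_replace_all("'s$", "") %>% # Cut possessive 's
                       recode(!!!country_lookup))     # Unmatched words pass through unchanged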

Exploring the text

The following word cloud shows the words mentioned in the speech. Larger words were used more frequently than smaller ones.

speech_words %>% 
        anti_join(stop_words, by = "word") %>% 
        count(word, sort = TRUE) %>% 
        wordcloud2()
        # wordcloud2(figPath = "trump.png") # I wanted to make a word cloud in the shape of Trump's head, but the package has a known bug that prevented me from doing so.

Next, I looked at the most frequent emotional words in the speech, i.e. those used at least three times. It turns out that the majority of the frequent emotional words had a positive connotation (prosperity, support, strong, etc.), while the most frequent negative words were related to conflict (conflict, confront, etc.).
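
Sentiment is assigned by joining the tokenized words with the Bing lexicon, a tidy table that labels each word as positive or negative. A quick peek at its format:

# Each Bing lexicon entry is a word with a positive/negative label
get_sentiments("bing") %>% 
        head()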

# Check emotional words that were uttered at least 3 times
speech_words %>% 
        count(word) %>% 
        inner_join(get_sentiments("bing"), by = "word") %>% 
        filter(n >= 3) %>% 
        mutate(n = if_else(sentiment == "negative", -n, n)) %>% 
        ggplot() +
                aes(y = n, x = fct_reorder(word, n), fill = sentiment) +
                geom_col() +
                coord_flip() +
                labs(x = "Word", 
                     y = "Occurrence in speech",
                     title = "Most common words in Trump's Sept. 19, 2017 UN speech by sentiment")

To show the less frequent emotional words as well, the next word cloud displays all emotional words. It is actually a comparison cloud, which groups words by sentiment and colors each group differently.

speech_words %>%
        inner_join(get_sentiments("bing"), by = "word") %>% 
        count(word, sentiment, sort = TRUE) %>%
        spread(sentiment, n, fill = 0L) %>% # One column per sentiment
        as.data.frame() %>% 
        remove_rownames() %>% 
        column_to_rownames("word") %>% # comparison.cloud() expects words as row names
        comparison.cloud(colors = c("red", "blue"))

Let’s see which countries were mentioned the most. Obviously, the USA! Iran, Venezuela, and North Korea also came up several times. Apart from these, most countries were mentioned only a couple of times during the speech.

# Load map database
map_world <- 
        map_data(map="world") %>% 
        mutate(region = region %>% str_to_lower()) # Make country name lower case to match word

# Calculate mentions of a country, and join geodata
trump_countries <-
        speech_words %>% 
        count(word) %>% 
        right_join(map_world, by = c("word" = "region")) %>% # Match country coordinates to speech
        select(region = word, everything())

# Get country names with the approximate center coordinates of each country
country_names <- 
        trump_countries %>% 
        drop_na(n) %>%
        group_by(region) %>% 
        summarise(lat = mean(lat),
                  long = mean(long))
trump_countries %>% 
        ggplot() +
        aes(map_id = region, 
            x = long, 
            y = lat, 
            label = paste0(region %>% str_to_title(),": ", n)) +
        geom_map(aes(fill = log10(n)), # Log-transform the counts for the fill
                 map = trump_countries) +
        geom_label_repel(data = trump_countries %>% 
                                 drop_na(n) %>% 
                                 group_by(region) %>% 
                                 slice(1), 
                         alpha = .75) +
        scale_fill_gradient(low = "lightblue", 
                            high = "darkblue", 
                            na.value = "grey90") +
        labs(title = "Number of mentions by country", 
             x = "Longitude", 
             y = "Latitude") +
        theme_minimal() +
        theme(legend.position = "none")

Checking how the speech sentiment develops over time, and which countries are mentioned

Next, I wanted to see how the speech developed over time and what the sentiment of each sentence was. Moreover, I wanted to show which countries were mentioned in particular parts of the talk.

# Sentiment of each sentence
sentence_sentiment <-
        speech_sentences %>% 
        mutate(sentence_num = row_number()) %>% 
        unnest_tokens(word, sentence, "words") %>% 
        mutate(word = gsub("_", " ", word)) %>% 
        mutate(word = stem_country_names(word)) %>% # Reuse the manual country-name stemming from above
        left_join(get_sentiments("bing"), by = "word") %>%
        mutate(sentiment_score = case_when(sentiment == "positive" ~ 1,
                                           sentiment == "negative" ~ -1,
                                           is.na(sentiment) ~ NA_real_)) %>%
        group_by(sentence_num) %>%
        summarise(sum_sentiment = sum(sentiment_score, na.rm = T),
                  sentence = paste(word, collapse = " "))

# Which sentence has a country name
country_sentence <- 
        speech_sentences %>% 
        mutate(sentence_num = row_number()) %>% 
        unnest_tokens(word, sentence, "words") %>% 
        mutate(word = gsub("_", " ", word)) %>% 
        mutate(word = stem_country_names(word)) %>% # Stem country names here too, so demonyms (e.g. "Iranian") are matched
        right_join(country_names %>% select(region), by = c("word" = "region")) %>% 
        arrange(sentence_num)

# Sentiment for each country
country_sentiment <-         
        sentence_sentiment %>% 
        full_join(country_sentence, by = "sentence_num") %>% 
        select(region = word, sum_sentiment) %>% 
        drop_na() %>% 
        group_by(region) %>% 
        summarise(country_sentiment = sum(sum_sentiment, na.rm = T))

First, it is important to note that the sentiment analysis is based on the summarized sentiment of each sentence, which can be misleading. For example, in the middle of the speech, Israel and the US are mentioned in a very negative sentence, but that negative tone was created to condemn Iran in the following sentence. Analyzing sentences in isolation therefore distorts the picture. To mitigate this error, I calculated a rolling mean of the sentence sentiments, so each sentence now contains the “spillover” sentiment from the previous and following sentences.
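
To illustrate what the centered, window-3 rolling mean does, here is a toy example; each value becomes the average of itself and its two neighbors, and fill = 0 pads the endpoints:

# Rolling mean on a toy sentiment vector
zoo::rollmean(c(1, -2, 3, 0, -1), k = 3, fill = 0, align = "center")
#> [1] 0.0000000 0.6666667 0.3333333 0.6666667 0.0000000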

sentence_sentiment %>% 
        full_join(country_sentence, by = "sentence_num") %>% 
        mutate(roll_sentiment = zoo::rollmean(sum_sentiment, 3, fill = 0, align = "center")) %>% # Calculate a rolling mean with a window of 3
        mutate(sentiment_type = case_when(roll_sentiment > .5 ~ "positive",
                                          roll_sentiment < -.5 ~ "negative",
                                          TRUE ~ "neutral") %>% # Label sentence sentiments based on the rolling mean
                       fct_rev()
        ) %>% 
        ggplot() +
                aes(x = sentence_num, 
                    y = roll_sentiment, 
                    label = word %>% str_to_title()) +
                geom_hline(yintercept = 0, 
                           color = "grey", 
                           linetype = "dashed", 
                           size = 1.2) +
                geom_line(size = 1.2, 
                            color = "black") +
                geom_label_repel(aes(fill = sentiment_type), 
                                 alpha = .8, 
                                 segment.alpha = 0.5) +
                scale_fill_manual(values = c("green","grey","red")) +
                theme_minimal() +
                labs(x = "Sentence number", 
                     y = "Sentence sentiment", 
                     title = "The summarised sentiment of sentences, and the appearance of country names in the speech \nby sentiment in sentence order",
                     subtitle = "The dashed line signifies neutral sentence sentiment. \nCountry label colors show the direction of the sentiment (positive/negative)") 

As we can see, Trump started the speech with positive statements, mostly praising the USA. Then came his blacklist, with North Korea, China, Ukraine, Russia, and Israel mentioned in a negative context. As the USA is also mingled into these sentences, it unavoidably received some negative sentiment as well. The speech then took a positive turn, and several Middle Eastern and African countries were referenced in a generally favorable context. Later, South American countries, such as Cuba and especially Venezuela, were mentioned along with some harsh words, while European countries like Poland, France, and the UK received some kindness.

So, how about summarizing the country sentiments across the whole text and plotting them on a map, to see the emotional context in which each country was mentioned?

sentiment_map_data <- 
        trump_countries %>% 
        left_join(country_sentiment, by = "region")

sentiment_map_data %>% 
        mutate(country_sentiment = if_else(region == "usa", NA_real_, country_sentiment)) %>% # Exclude US
        ggplot() +
                aes(map_id = region, 
                    x = long, 
                    y = lat, 
                    label = paste0(region %>% str_to_title(), ": ", country_sentiment)) +
                geom_map(aes(fill = country_sentiment), 
                         map = trump_countries) +
                scale_fill_gradient(high = "green", 
                                    low = "red", 
                                    na.value = "grey90") +
                geom_label_repel(data = sentiment_map_data %>%
                                         drop_na(n) %>%
                                         group_by(region) %>%
                                         slice(1),
                                         alpha = .5
                                 ) +
                theme_minimal() +
                labs(title = "Sentiment of the sentences where countries were mentioned (USA excluded)", 
                     x = "Longitude", 
                     y = "Latitude")

Interestingly, Israel received the lowest sentiment score (it was mentioned as a victim of genocide and hatred), so the negative sentiment was not directed toward the country itself, in contrast to Cuba and Venezuela. China, Ukraine, and North Korea also obtained negative sentiments in sum. Most countries remained in the low positive range, while the positive endpoint of this scale was the USA, which collected the most positive sentiment.

Distinct emotions associated with each country

Let’s look into specific emotions using the NRC sentiment dictionary, which associates individual words with distinct emotions. The next plot shows the frequency of each emotion in the talk. It seems that the dominant emotion was trust, followed by fear and anticipation.
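
For reference, the NRC lexicon comes in the same tidy word-sentiment format as the Bing lexicon, but a single word can be linked to several emotions. A quick way to see the available categories:

# The ten NRC categories: eight distinct emotions plus positive/negative
get_sentiments("nrc") %>% 
        count(sentiment)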

speech_words %>%
        inner_join(get_sentiments("nrc"), by = "word") %>% # Use distinct emotion dictionary
        filter(!sentiment %in% c("positive","negative")) %>% # Only look for distinct emotions
        count(sentiment, sort = TRUE) %>% 
        ggplot() +
                aes(x = fct_reorder(sentiment %>% str_to_title, -n), 
                    y = n, 
                    label = n) +
                geom_col() +
                geom_label(vjust = 1) +
                theme_minimal() +
                labs(title = "The occurrence of words linked to distinct emotions in the speech", 
                     x = "Emotion", 
                     y = "Frequency")

We can also calculate the number of emotional words associated with distinct emotions by country. As we can see, the absolute number of emotional words is biased by the total number of mentions, so it might be more interesting to look at the relative proportions.

sentence_emotions <-
        speech_sentences %>% 
        mutate(sentence_num = row_number()) %>% 
        unnest_tokens(word, sentence, "words") %>% 
        mutate(word = gsub("_", " ", word)) %>% 
        mutate(word = stem_country_names(word)) %>% # Reuse the manual country-name stemming from above
        left_join(get_sentiments("nrc"), by = "word") %>%
        filter(!sentiment %in% c("positive","negative"))

country_emotions <-
        sentence_emotions %>% 
        group_by(sentence_num) %>% 
        count(sentiment) %>% 
        drop_na(sentiment) %>% 
        full_join(country_sentence, by = "sentence_num") %>% 
        select(sentence_num, region = word, emotion = sentiment, n) %>% 
        group_by(region, emotion) %>% 
        summarise(n = sum(n, na.rm = T)) %>% 
        drop_na(emotion, region) %>% 
        ungroup() %>% 
        right_join(modelr::data_grid(., region, emotion), by = c("region","emotion")) %>% # Expand to all region-emotion combinations
        mutate(n = if_else(is.na(n), 0L, n)) # Fill the missing combinations with zero counts

# Barplot for emotions for each country
country_emotions %>% 
        mutate(region = region %>% str_to_title()) %>% 
        ggplot() + 
                aes(x = fct_rev(emotion), y = n) +
                geom_col(position = "dodge") +
                facet_wrap(~region) +
                coord_flip() +
                theme_minimal() +
                labs(title = "Number of words associated with a distinct emotion",
                     y = "Number of words associated with distinct emotions",
                     x = NULL)

This next plot might be a bit difficult to read at first. It shows the proportion of words associated with each distinct emotion for every country. For example, you can see that Ukraine had a high proportion of anger, fear, and sadness words compared to the other countries. Russia was mentioned more with trust and anticipation, while Venezuela appeared in the sentences that proportionally contained the most disgust-related words.
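
For those who prefer numbers over areas: position = "fill" in the plot below normalizes the counts within each country, and the same proportions can be computed explicitly, for example like this:

# Explicit per-country emotion proportions (what position = "fill" computes)
country_emotions %>% 
        group_by(region) %>% 
        filter(sum(n) > 0) %>% # Skip countries without any emotion words
        mutate(prop = n / sum(n)) %>% 
        ungroup() %>% 
        arrange(region, desc(prop))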

# Stacked area plot
country_emotions %>% 
        mutate(region = region %>% str_to_title()) %>%         
        ggplot() + 
                aes(x = fct_rev(region), y = n, fill = emotion, group = emotion) +
                geom_area(position = "fill", color = "black") +
                coord_flip() +
                theme_minimal() +
                scale_y_continuous(labels = scales::percent_format()) +
                scale_fill_brewer(palette = "Accent") +
                labs(title = "Proportion of distinct emotional words associated with a country",
                     y = NULL,
                     x = NULL)

All in all, this notebook shows the emotional tone in which countries were mentioned in Trump’s UN speech. Obviously, the sentiment scores do not necessarily reflect the US’s opinion of these countries, especially for those with only a few mentions. However, the same method may be applied to longer texts to demonstrate how certain topics (in this case, countries) can be associated with sentiments.