April 8, 2020

Fake Text Detection: A Trump Case Study

Twitter bots and fake social media accounts made headlines back in 2016, when they were shown to have influenced the US election. This year, a new group of presidential candidates is fighting for the seat in the White House, and they too are actively using Twitter as a communication channel. It made me wonder: what does it take to credibly fake someone’s online presence, and how far has technology advanced in recognizing the culprits since the last election? Could machine learning help in detecting fake text?

Text classifier: real Trump or Twitter bot?

I decided to start this fun project by building a classifier to distinguish real Trump tweets from fakes. The following paragraphs walk through the steps I took, from downloading the data to preprocessing and classification.

I loaded a couple of thousand tweets from the following accounts: the actual Trump Twitter account (@realDonaldTrump), a parody Trump (@RealDonalDrumpf), and the best deep fake bot account I could find (@DeepDrumpf).

Embedded example tweets from @RealDonalDrumpf and @DeepDrumpf.
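For anyone who wants to reproduce the download step, a minimal sketch with tweepy could look like the snippet below. It is not necessarily the exact collection script used here, and the API credentials are placeholders you would have to supply yourself.

```python
# Hedged sketch: pulling recent tweets per account via tweepy (Twitter API v1.1).
# The credentials below are placeholders, not real keys.
import tweepy

API_KEY, API_SECRET = "...", "..."
ACCESS_TOKEN, ACCESS_SECRET = "...", "..."

auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

def fetch_tweets(handle, n=2000):
    """Collect up to n tweets (full text) for a given account."""
    return [status.full_text
            for status in tweepy.Cursor(api.user_timeline,
                                        screen_name=handle,
                                        tweet_mode="extended").items(n)]

accounts = ["realDonaldTrump", "RealDonalDrumpf", "DeepDrumpf"]
tweets = {handle: fetch_tweets(handle) for handle in accounts}
```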

Next, I removed stopwords, extra punctuation, mentions and links. I then visualised the collection of tweets for each account to reveal any oddities and the most commonly used words. I did not remove the retweet marker “rt”, as I found it could be an indicator of tweeting style; the bots might not be trained to retweet the same way Donald Trump does.
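As a rough illustration of this cleaning step (assuming the tweets dictionary from the sketch above), it could be done with a few regular expressions and sklearn’s built-in English stopword list:

```python
# Hedged sketch of the preprocessing described above: lowercase, strip links,
# mentions, punctuation and stopwords, but deliberately keep the "rt" marker.
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # links
    text = re.sub(r"@\w+", " ", text)           # mentions
    text = re.sub(r"[^a-z\s]", " ", text)       # punctuation and digits
    tokens = [tok for tok in text.split()
              if tok == "rt" or tok not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)

cleaned = {handle: [clean_tweet(t) for t in tws]
           for handle, tws in tweets.items()}
```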

Word clouds for @realDonaldTrump, @DeepDrumpf and @RealDonalDrumpf

The difference between the real Trump’s vocabulary and that of the fake text accounts is quite apparent. In fact, I looked at the top 100 words used by each of the three accounts and found that the real Trump has nothing in common with the parody and deep fake accounts. The fakes themselves, however, share a great deal of vocabulary. In particular (and not very surprisingly), at the top of the list we see GREAT, COUNTRY, MAKE and AMERICA.

10 most frequent words in common between the fake accounts.
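The overlap itself is straightforward to compute. Here is an illustrative sketch, reusing the cleaned tweets from above (the exact counting approach in my project may differ):

```python
# Hedged sketch: top-100 vocabularies per account and their pairwise overlap.
from collections import Counter

def top_words(tweet_list, n=100):
    counts = Counter(word for tweet in tweet_list for word in tweet.split())
    return {word for word, _ in counts.most_common(n)}

tops = {handle: top_words(tws) for handle, tws in cleaned.items()}

print(tops["realDonaldTrump"] & tops["DeepDrumpf"])       # real vs. deep fake
print(tops["realDonaldTrump"] & tops["RealDonalDrumpf"])  # real vs. parody
print(tops["DeepDrumpf"] & tops["RealDonalDrumpf"])       # the two fakes
```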

It is noteworthy that the DeepDrumpf bot was discontinued in 2017, so its tweets might capture the style, but not the current vocabulary, of Mr Trump.

Tweet text vectorization and classification

Before proceeding to classification, I tokenized and lemmatized the tweets using spaCy. In practical terms, this means splitting tweets into lists of word strings and normalizing them by reducing each inflected word to its base form: for example, “making” and “made” both become “make” and are analysed as one unit of vocabulary. I then vectorized the resulting tokens using Bag of Words (CountVectorizer in sklearn) and TF-IDF (TfidfVectorizer in sklearn). For the classifier, I chose LinearSVC, which tends to work well on NLP classification problems; high-dimensional text data and support vector machines are a match made in heaven.
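Put together, the pipeline could look roughly like the snippet below. This is an illustrative sketch rather than the exact code behind Chart 2; the real-vs-bot pairing, train/test split and default hyperparameters are assumptions.

```python
# Hedged sketch: spaCy lemmatization, then Bag of Words and TF-IDF vectorization,
# each fed into a LinearSVC classifier (here: real Trump vs. the DeepDrumpf bot).
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize_all(texts):
    return [" ".join(tok.lemma_ for tok in doc) for doc in nlp.pipe(texts)]

texts = lemmatize_all(cleaned["realDonaldTrump"]) + lemmatize_all(cleaned["DeepDrumpf"])
labels = [1] * len(cleaned["realDonaldTrump"]) + [0] * len(cleaned["DeepDrumpf"])

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

for vectorizer in (CountVectorizer(), TfidfVectorizer()):
    X_tr = vectorizer.fit_transform(X_train)
    X_te = vectorizer.transform(X_test)
    clf = LinearSVC().fit(X_tr, y_train)
    print(type(vectorizer).__name__, "accuracy:", clf.score(X_te, y_test))
```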

For the two vectorization methods, I got the following results using LinearSVC (Chart 2):

Chart 2: Tweet text classification with vectorization

The accuracy of predicting whether the tweets contain real or fake text is high in all four cases. The TF-IDF method gave slightly better results than Bag of Words vectorization. All in all, it is a relief that the algorithm was better at distinguishing the real account from the bot than at distinguishing the actual US president’s tweets from the human-led parody account. This goes to show that algorithmic text generation has not yet been perfected: it is not so easy to fool either machines or humans when it comes to assessing the authenticity of public figures’ written communication online.

Conclusion & next steps

Machine learning is tremendously helpful in classification tasks. When it comes to language, however, conducting linguistic and social media pattern analysis yields more comprehensive results than throwing a black-box solution at the problem. In my process, the differences between the accounts could already be spotted when analyzing the vocabulary. Others have also looked at alternative ways to spot a bot, such as its activity patterns: bots tend to tweet too often and at very specific times (like every full hour), while humans tweet around 10-15 times a day at random times.
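As an illustration of such an activity check, here is a hypothetical pandas sketch. The DataFrame df with one row per tweet and a created_at timestamp column is an assumption for illustration, not data used in the analysis above.

```python
# Hedged sketch of an activity-pattern check; `df` and its "created_at" column
# are assumptions for illustration, not data used in the article.
import pandas as pd

df["created_at"] = pd.to_datetime(df["created_at"])

# Tweets per day: bots often post far more than the 10-15 daily tweets typical of humans.
per_day = df.set_index("created_at").resample("D").size()
print(per_day.describe())

# Minute-of-hour histogram: a spike at minute 0 hints at scheduled, on-the-hour posting.
print(df["created_at"].dt.minute.value_counts().sort_index())
```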

As the next step, I would like to employ more advanced analysis to decode what it means to tweet like Donald Trump.

Stay tuned!

References & more

Kate Kuznecova is a Data Scientist at DAIN Studios Berlin. She can be frequently found stretching her mind with AI problems, and body with gymnastics.

Details

Title: Fake Text Detecting with Machine Learning: Trump Case Study
Author: DAIN Studios, Data & AI Strategy Consultancy
Updated on April 21, 2021