Fake News Detector

Do you know the truth? Try our app to find out!

Try our app here

Why a fake news detector?

Social media's influence

As millions of new accounts are created across multiple social media platforms, spreading false news has become easier than ever. A single piece of fake news can spread like wildfire through the wide reach of social media.

Conflicting news sources

Ever read two articles on the same topic and noticed different information? News agencies can be biased and sometimes fail to tell the whole truth. The public needs to know which news sources to trust.

Dangerously misleading

Sometimes false news can be dangerous, even fatal, to the people reading it. Fake news about viruses or terrorism can lead the general public to harm themselves and others.

Morally questionable

Ever read a news article and thought, "Wait, that doesn't sound right"? Sometimes news sources publish articles that make readers say to themselves: WHAAAAT?

Dataset

For our product, we collected our data from existing Kaggle datasets. Originally, each dataset comprised different fields describing each news article, including dates, titles, text, authors, languages, and a real vs. fake classification. The dates of the articles ranged from March 2015 to February 2018. In total, 51,233 news articles were incorporated into our final product, 26,645 of which were fake news, while 24,588 were real news. Including both text and titles, a total of 100,575,991 words were used!

Data Cleaning + Pre-Processing

Categorizing

To prepare for training and sampling, we first boiled our data down to the categories "Title + Text" (or "tt") and "Real/Fake", removing the categories "Date" and "Type", which we deemed less relevant to training our model.
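As a rough sketch of this step with pandas (assuming the merged Kaggle data sits in one DataFrame; the file name and the "label" column name are illustrative, not our exact setup):

    # Minimal sketch of the column selection step; names other than "tt" are illustrative
    import pandas as pd

    df = pd.read_csv("news_articles.csv")   # hypothetical combined Kaggle dataset

    # Merge title and text into a single "tt" column
    df["tt"] = df["title"].fillna("") + " " + df["text"].fillna("")

    # Keep only the columns used for training, dropping "date" and "type"
    df = df[["tt", "label"]]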

Tokenization + Lowercase

Next, we tokenized the text, splitting each article into individual words, and converted everything to lowercase.
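A possible implementation of this step with NLTK (the exact tokenizer we used may differ):

    # Rough sketch of tokenization + lowercasing; the tokenizer choice is an assumption
    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")   # tokenizer models (newer NLTK versions may need "punkt_tab")

    def tokenize_lower(text):
        # Split an article into word tokens and lowercase each one
        return [tok.lower() for tok in word_tokenize(text)]

    print(tokenize_lower("Scientists Confirm The Moon Is NOT Made Of Cheese."))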

Separation + Typos

The initial tokenization left some words combined, so we followed up by separating words joined at periods, data-type boundaries, and capital letters, fixing non-existent words, and expanding contractions.
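A few of these separation rules might look like the following regex-based sketch; the patterns and the contraction map are small illustrative samples, not our full clean-up pass:

    # Illustrative separation rules; patterns and contraction list are assumptions
    import re

    CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

    def separate(text):
        # Split words glued together at a period: "fast.don't" -> "fast. don't"
        text = re.sub(r"\.(?=[A-Za-z])", ". ", text)
        # Split words glued together at a capital letter: "fakeNews" -> "fake News"
        text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
        # Expand common contractions
        for short, full in CONTRACTIONS.items():
            text = text.replace(short, full)
        return text

    print(separate("fakeNews spreads fast.don't believe it"))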

Dropping Text

We now had an abundance of data and text, but not all of it was useful. The last step of our data cleaning was dropping stopwords, short words, text within parentheses, possessives, special characters, and links (using RegEx).
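A condensed sketch of this step, using NLTK's English stopword list and a few illustrative RegEx patterns (the exact patterns and short-word cutoff are assumptions):

    # Sketch of the dropping step; patterns and length cutoff are illustrative
    import re
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")
    STOPWORDS = set(stopwords.words("english"))

    def drop_noise(text):
        text = re.sub(r"http\S+|www\.\S+", " ", text)   # links
        text = re.sub(r"\([^)]*\)", " ", text)          # text within parentheses
        text = re.sub(r"'s\b", "", text)                # possessives
        text = re.sub(r"[^a-z\s]", " ", text)           # special characters
        words = [w for w in text.split()
                 if w not in STOPWORDS and len(w) > 2]  # stopwords + short words
        return " ".join(words)

    print(drop_noise("the senator's aide (reportedly) leaked it at http://example.com"))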

Artificial Intelligence Training Models

Logistic Regression

The sigmoid function works well with binary outputs, squeezing every input down to a value between 0 and 1, and it is differentiable, which makes it suitable for gradient-based training. The sigmoid is used on its own in simple models like logistic regression, and also inside the cells of more complicated neural nets like LSTMs. It is best suited to binary problems where there are only two cases.
σ(x) = 1 / (1 + e^(-x))
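A minimal sketch of how such a logistic regression classifier could be trained with scikit-learn is shown below; the TF-IDF features, hyperparameters, and the clean_text / labels variables (standing in for our cleaned "tt" column and real/fake labels, with 1 = fake) are assumptions, not our exact setup:

    # Sketch only: clean_text and labels are stand-ins for our cleaned data
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        clean_text, labels, test_size=0.2, random_state=42)

    vectorizer = TfidfVectorizer(max_features=50000)
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # A weighted sum of the TF-IDF features is squashed through the sigmoid,
    # giving a probability between 0 and 1 for each article.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train_vec, y_train)
    print("Test accuracy:", clf.score(X_test_vec, y_test))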

Support Vector Machine (SVM)

SVMs look for a hyperplane (a line in 2D, a plane in 3D, etc.) dividing two populations. The algorithm maximizes the distance between the dividing hyperplane and the closest data points, known as support vectors. These support vectors "guide" the hyperplane to divide the two populations as well as possible. In cases where the data is not linearly separable, we can use a kernel to project the data into a higher dimension, where a new hyperplane can divide the populations.

Image from https://towardsdatascience.com/svm-feature-selection-and-kernels-840781cc1a6c 
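A comparable sketch for the SVM, reusing the train/test split from the logistic regression example above; LinearSVC is one reasonable choice here, and a kernel SVC could be swapped in when the data is not linearly separable:

    # Sketch only: reuses X_train / y_train from the logistic regression example
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    svm_model = make_pipeline(TfidfVectorizer(max_features=50000),
                              LinearSVC(C=1.0))
    svm_model.fit(X_train, y_train)   # raw cleaned strings and real/fake labels
    print("Test accuracy:", svm_model.score(X_test, y_test))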

Naive Bayes

Naive Bayes asks the question: what is the probability that A happens, given B? In our case, it asks: "What is the probability that the article is fake, given its text?" By Bayes' theorem, P(fake | text) = P(text | fake) · P(fake) / P(text). The classifier learns which word probabilities contribute most to the final probability of the article being fake, so instead of adjusting weights and biases, it adjusts these probabilities.
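A hedged sketch of such a classifier over word counts, again reusing the split from above (MultinomialNB is an assumption about the exact variant):

    # Sketch only: reuses X_train / y_train from the earlier split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
    nb_model.fit(X_train, y_train)
    # Class probabilities (real vs. fake) for the first test article
    print(nb_model.predict_proba(X_test[:1]))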

LSTM

LSTMs solve the vanishing gradient problem by passing along a cell state (long-term memory) as well as the previous output (short-term memory). Each cell uses the cell state, the previous output, AND the new input to determine its output. The cell state is modified through a forget gate and an input gate: the forget gate tells the cell state what to forget and what to keep, while the input gate adds new information from the input. This is particularly useful when analyzing long sentences. A normal RNN might forget what happened at the beginning of the sentence, while an LSTM can use the entire sentence as context to calculate its output.
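A bare-bones PyTorch sketch of an LSTM classifier is shown below; the vocabulary size, embedding and hidden dimensions, and the sigmoid output layer are illustrative assumptions about a typical setup, not our exact architecture:

    # Sketch only: dimensions and output layer are assumptions
    import torch
    import torch.nn as nn

    class FakeNewsLSTM(nn.Module):
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, 1)

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) integer-encoded articles
            embedded = self.embed(token_ids)
            output, (hidden, cell) = self.lstm(embedded)  # cell = long-term memory
            # Classify from the final hidden state (short-term memory)
            return torch.sigmoid(self.fc(hidden[-1]))     # P(fake) per article

    model = FakeNewsLSTM(vocab_size=50000)
    dummy_batch = torch.randint(0, 50000, (8, 200))       # 8 articles, 200 tokens
    print(model(dummy_batch).shape)                       # torch.Size([8, 1])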

Evaluations and Calculations

For our evaluations, we used a confusion matrix to figure out how effective our model is. 

First of all, we used accuracy to measure how often the classifier was correct, and the error rate to measure how often it was incorrect. For our model, the accuracy on the testing set was 0.9897550111358575, or about 99%.

We also used recall to determine how often the model predicts true when the result is actually true. To calculate recall, we divided the number of true positives by the total number of actual true values (true positives plus false negatives). The recall on our testing set was 0.9893469198703103.

Lastly, we used precision to measure how often a predicted true value is actually correct. To calculate precision, we divided the number of true positives by the total number of predicted true values (true positives plus false positives). For our testing set, the precision was 0.9893469198703103.

Overall, our model was quite accurate: we achieved around 99% across all tests and calculations.
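These metrics can all be read off a confusion matrix; the sketch below assumes the scikit-learn model and test split from the logistic regression example, with labels encoded as 1 = fake and 0 = real:

    # Sketch only: assumes clf, X_test_vec, and y_test from earlier examples
    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score)

    predictions = clf.predict(X_test_vec)

    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    print("Accuracy  :", accuracy_score(y_test, predictions))   # (tp + tn) / total
    print("Error rate:", 1 - accuracy_score(y_test, predictions))
    print("Recall    :", recall_score(y_test, predictions))     # tp / (tp + fn)
    print("Precision :", precision_score(y_test, predictions))  # tp / (tp + fp)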

To visualize the performance of our models, we constructed Receiver Operating Characteristic (ROC) curves and calculated the Area Under the Curve (AUC) for each.





Each of our models produced two predicted probability distribution curves: one for 'Fake News' and one for 'Real News'. Using the two distribution curves, we then generated a confusion matrix and calculated the True Positive Rate (TPR) and False Positive Rate (FPR) at each probability threshold, or "cutoff".

These (FPR, TPR) pairs were then plotted to form a ROC curve. The area under the ROC curve is what we're focused on: a score of 1 means perfect classification, 0.5 means random classification, and 0 means inverted classification.
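A sketch of how such a curve could be generated with scikit-learn and matplotlib, again assuming the earlier model, test split, and 1 = fake encoding:

    # Sketch only: assumes clf, X_test_vec, and y_test from earlier examples
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, auc

    fake_probs = clf.predict_proba(X_test_vec)[:, 1]   # P(fake) per test article

    fpr, tpr, thresholds = roc_curve(y_test, fake_probs)
    roc_auc = auc(fpr, tpr)

    plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()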




 
The AUC for every one of our models was extremely high, between 0.9 and 1, indicating that our classification was very good. In other words, our algorithm separated 'Fake News' and 'Real News' very well on the datasets we used.

While our dataset and model were quite accurate, we still ran into some problems. For instance, our dataset was unfortunately limited by the lack of data available close to 2021, so we had to focus on news and events from 2015 to 2018. As a result, predictions on more recent news, such as articles from 2020, might not be as accurate as predictions on news from 2016. Another potential problem with our model is overfitting: our scores may be this high either because the model genuinely generalizes well or because our particular dataset makes real and fake news unusually easy to tell apart.

Let's hear from the team's experience at AI Camp!

"AI Camp introduced new concepts and broadened my horizons through Python exercises and machine learning lectures. Despite the fast-paced curriculum, I was able to pick up an understanding for training models and evaluating their performances along the way. My team communicated frequently and bounced ideas off of each other which was monumental in helping us overcome hitches in the process of designing our website and navigating multiple CoCalc platforms. I was proud to witness the hard work we poured into our final project come to fruition, and am grateful to have had the opportunity to build a foundation for my future projects!"

Audrey Lau
Data scientist, Mathematician

"Overall, I thought my experience at AI Camp was fantastic. At first, I thought the camp was a bit stressful because I was not too knowledgable about Python, in particular. However, the lectures and small group teachings really helped out in learning the fundamentals of both Python and AI. I even enjoyed our final product, the fake news detector, because I believe that it is a key issue which must be addressed. Some downsides of the camp, in my opinion, were that the fast pace can be a bit challenging at times and lectures were sometimes a bit too long. Aside from that, I loved AI Camp and I would recommend it for anyone trying to learn Python and AI."

Ishan Khillan
Mathematician, Web Developer

"My experience at AI Camp was very insightful. Through morning lectures, afternoon projects, and Friday game days, I was able to learn about complex topics in AI in a fast-paced environment alongside like-minded individuals. Although we faced some technical difficulties when putting our website together, we were ultimately able to establish a more organized system through communication to ensure that our project could be completed in time. Overall, I'm really grateful to have had the opportunity to be part of AI Camp because I was not only able to expand my knowledge in Python, but I was also able to implement machine learning and AI training models to large-scale products."

Bernice Lau
Data Scientist, Product Manager

"AI Camp is a great way to get into AI and computer science in general. There was plenty of hands-on experience as well as clear and concise lectures. The most challenging part was trying to wrap my head around each of the models and how they worked. However, it was very satisfying to finally fully understand what made each one tick."





Eric Wan
ML Engineer

"I had a very memorable experience at AI Camp. From the start, I was able to understand a lot of the content as I had previously learned Python and other programming languages, so keeping up with the fast pace wasn’t a huge problem for me. Although I did struggle in understanding the applications of the complex Python libraries such as PyTorch, LSTM, etc, but with the repeated explanations by our instructor I was able to figure out it along the way. I benefitted a lot from the individualized attention that was given to us in the camp and I’m pleased with the way our product turned out here at AI Camp! This camp has really helped me set a basis for future AI projects I would like to work on later in high school and through college!"

Datta Kansal
Data Scientist & Product Manager

AI Camp teaches AI to middle and high school students.

We make cool AI products in just three weeks!

Learn AI at AI Camp