Capstone Projects

Data Science Capstone Research Projects


Code Honesty

Xiao Wang, Zoe Li, Sizhu Chen, and Kevin Zhou

We recognized that code plagiarism is a serious issue in undergraduate programming courses and set out to tackle it. Unlike current string-matching-based algorithms, our algorithm compares the Abstract Syntax Trees (ASTs) of source files and computes a similarity score that captures deeper semantics of the code than the widely used anti-plagiarism system MOSS. We are also able to filter out the influence of template code, which reduces the false-positive rate. We have wrapped the algorithm into a functional website where you can upload your own code or select from our example code and receive an immediate result.
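As a rough illustration of why comparing syntax trees is more robust than comparing strings, the sketch below scores two Python programs by the sequence of their AST node types. It is not the project's algorithm: the helper names `node_types` and `ast_similarity` are hypothetical, the score is simply a `difflib` ratio, and template filtering is not shown.

```python
import ast
from difflib import SequenceMatcher


def node_types(source: str) -> list[str]:
    """Return the AST node type names of `source` in traversal order."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def ast_similarity(source_a: str, source_b: str) -> float:
    """Similarity in [0, 1] between the node-type sequences of two programs."""
    matcher = SequenceMatcher(
        None, node_types(source_a), node_types(source_b), autojunk=False
    )
    return matcher.ratio()


# Hypothetical submissions: `renamed` only changes identifiers, `different` changes structure.
original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def add_all(items):\n    acc = 0\n    for item in items:\n        acc += item\n    return acc\n"
different = "def greet(name):\n    print('hello', name)\n"

same_structure_score = ast_similarity(original, renamed)
different_structure_score = ast_similarity(original, different)
```

Because identifier names are not part of the node types, renaming every variable leaves the first score unchanged, while a plain string comparison of the same two files would report many differences.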

Superbowl Commercial

Will Bates and Chatson Frankenberg

The Super Bowl is annually the most-watched television program, which means companies are willing to spend up to $5.6 million for a mere 30 seconds of the nation’s attention. Alongside the financial evidence, the commercials have been described anecdotally as “ambitious”, “notorious”, and “must-have guests [to the main event]”. The cultural phenomenon of the Super Bowl commercial is well recognized. Thus, the cultural analytics question becomes how modern data analysis techniques can help us better decipher what makes these commercials so special and generally abnormal. To answer it, we analyzed quantitative differences between television commercials that air during the Super Bowl and those that air at any other time. As a branch of cultural analytics, this project seeks to determine the capability of data processing to validate or otherwise opine on the social consensus regarding the commercials, with a focus on their social dynamics. We analyzed a decade’s worth of Super Bowl commercials, comparing them to their counterparts in visual and auditory attributes. Using techniques such as shot detection, spectral transformations, and text analysis, this project offers a complete discussion of the concept of attention-getting in advertising from an analytical viewpoint.
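Shot detection, one of the techniques named above, can be sketched as follows. This is a minimal histogram-difference approach under stated assumptions, not necessarily the method the project used: frames are flat lists of 8-bit grayscale pixel values, and the function names, bin count, and threshold are hypothetical.

```python
def histogram(frame, bins=4, max_value=256):
    """Normalised intensity histogram of a frame given as a flat list of pixel values."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return [count / len(frame) for count in counts]


def detect_cuts(frames, threshold=0.5):
    """Return each index i at which frame i differs sharply from frame i - 1."""
    cuts = []
    previous = histogram(frames[0])
    for index in range(1, len(frames)):
        current = histogram(frames[index])
        # L1 distance between consecutive histograms; ranges from 0 to 2.
        distance = sum(abs(p - c) for p, c in zip(previous, current))
        if distance > threshold:
            cuts.append(index)
        previous = current
    return cuts


# Synthetic clip (made-up pixel values): three dark frames, then three bright frames.
dark = [10, 20, 30, 40]
bright = [200, 210, 220, 250]
clip = [dark, dark, dark, bright, bright, bright]
cuts = detect_cuts(clip)
```

Counting cuts per 30-second spot gives a simple pacing measure that could be compared between Super Bowl commercials and ordinary ones.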

Exploration of Graph Embedding Technique

Jeff Liao, Yuxin Zou, and Zheng Hao Tang

Due to the abundance of Android users and the open-source nature of the OS, there is a very large number of malicious apps. Malware detection systems can help prevent attackers from taking control of a user’s device. Our method expands on HinDroid to take advantage of the Heterogeneous Information Network, which is used to extract graph embeddings for both App and API nodes. We applied word2vec, node2vec, and metapath2vec to the network and showed empirically that they are able to capture longer chains of relationships between APIs. We show that these graph embedding techniques can achieve accuracies similar to HinDroid's.
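Embedding methods such as metapath2vec start from random walks over the network that follow a fixed pattern of node types. The sketch below generates such walks on a toy network containing only an app-to-API relation. The app and API names, the `metapath_walk` helper, and the walk length are all hypothetical, and the real network may include further relations between APIs.

```python
import random

# Hypothetical toy network: which apps call which APIs (names are made up).
app_to_apis = {
    "app_1": ["api_sms", "api_net"],
    "app_2": ["api_net", "api_file"],
    "app_3": ["api_file"],
}

# Invert the mapping so a walk can step from an API back to an app.
api_to_apps = {}
for app, apis in app_to_apis.items():
    for api in apis:
        api_to_apps.setdefault(api, []).append(app)


def metapath_walk(start_app, num_hops, rng):
    """Walk app -> API -> app -> ... for `num_hops` edges, alternating node types."""
    walk = [start_app]
    for hop in range(num_hops):
        current = walk[-1]
        # Even hops leave an app node, odd hops leave an API node.
        neighbours = app_to_apis[current] if hop % 2 == 0 else api_to_apps[current]
        walk.append(rng.choice(neighbours))
    return walk


rng = random.Random(0)  # fixed seed so the walks are reproducible
walks = [metapath_walk(app, 4, rng) for app in app_to_apis for _ in range(3)]
```

Each walk can then be treated as a sentence and passed to a skip-gram model such as word2vec, so that apps and APIs that co-occur on walks receive nearby vectors.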


Nicholas Smith

This project is a non-classical application of unsupervised learning that includes both generative art and natural clustering. The data set is a sample of roughly 3000 Minecraft skins. Each skin is a 64×64 RGBA image that is wrapped around a player’s character and can be seen in game.

Malware Category Detection

Karan Sunil, Nancy Vuong, and Kevin Elkin

With 20% of Android apps on the Google Play Store being malicious, detection of malware apps has become increasingly important. We performed a static code analysis to get a better understanding of which APIs in the code are responsible for the malicious behaviour. Using the relationships between APIs and apps, we created a heterogeneous information network. Different metapaths (kernels) in the network can help us predict the category of malware that an app belongs to. We considered five broad app categories: benign and four kinds of malware (trojan, ransom, backdoor, and adware). Different kernels were good at classifying different categories, so we decided to use a multi-kernel model to get the best of all kernels. In cyber-security, understanding your model is crucial. Imagine the ramifications of a hacker understanding your model better than you do. To better understand our model, we studied it on two levels: the app level and the API level. For the API level, we looked at the correlation between APIs and the classification result. This helped us recognise which APIs were important. Further, we developed a ranking algorithm to identify APIs that were unique to a specific category of malware. We then used APIs that were unique to benign apps to see if we could trick our model into predicting malware apps as benign. To understand how specific apps affected our classification output, we analysed the SVM weights of our model and looked into how apps cluster together using t-SNE.
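The idea of metapath kernels can be sketched with small matrices. In this sketch, `A` records which apps call which APIs and `B` is one relation between APIs; both matrices, the `matmul` and `transpose` helpers, and the unweighted average used to combine kernels are assumptions for illustration, not the project's actual data or multi-kernel model.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    """Swap the rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*x)]


# Hypothetical data: rows are 3 apps, columns are 4 APIs; 1 means the app calls the API.
A = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]
# Hypothetical symmetric API-API relation (for example, "used in the same code block").
B = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Metapath app-API-app: entry (i, j) counts the APIs that apps i and j share.
kernel_aa = matmul(A, transpose(A))
# Metapath app-API-API-app: counts pairs of APIs linked through relation B.
kernel_aba = matmul(matmul(A, B), transpose(A))

# One simple multi-kernel combination: an unweighted average of the two kernels.
combined = [
    [(p + q) / 2 for p, q in zip(row_aa, row_aba)]
    for row_aa, row_aba in zip(kernel_aa, kernel_aba)
]
```

A combined matrix like this could be passed to an SVM that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. Learning a separate weight per kernel, rather than averaging, is another option.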