AnDarwin: Scalable Detection of Semantically Similar Android Applications

Jonathan Crussell, Clint Gibler, and Hao Chen

The popularity and utility of smartphones rely on their vibrant
application markets; however, plagiarism threatens the long-term
health of these markets. We present a scalable approach to detecting
similar Android apps based on their semantic information. We implement
our approach in a tool called AnDarwin and evaluate it on 265,359 apps
collected from 17 markets including Google Play and numerous
third-party markets. In contrast to earlier approaches, AnDarwin has
four advantages: it avoids comparing apps pairwise, thus greatly
improving its scalability; it analyzes only the app code and does not
rely on other information --- such as the app's market, signature, or
description --- thus greatly increasing its reliability; it can detect
both full and partial app similarity; and it can automatically detect
library code and remove it from the similarity analysis. We present
two use cases for AnDarwin: finding similar apps by different
developers ("clones") and similar apps from the same developer
("rebranded"). In ten hours, AnDarwin detected at least 4,295 apps
that have been the victims of cloning and 36,106 apps that are
rebranded. By analyzing the clusters found by AnDarwin, we found 88
new variants of malware and identified 169 malicious apps based on
differences in the requested permissions. Our evaluation demonstrates
AnDarwin's ability to accurately detect similar apps on a large scale.