index.json · 1 lines (1 loc) · 73 KB
[{"authors":["admin"],"categories":null,"content":"I am a Lecturer (Assistant Professor) in Cybersecurity at the Department of Informatics of King\u0026rsquo;s College London (KCL), where I am also Deputy Head of the Cybersecurity Group (CYS), and Programme Leader of the MSc in Cyber Security. I also closely collaborate with UCL\u0026rsquo;s Systems Security Research Lab (S2Lab).\nMy research interests lie at the intersection of AI and cybersecurity, with particular focus on systems security, adversarial ML, malware analysis, and network security.\nI completed my Ph.D. in Computer Science as a member of WEBLab at the University of Modena (Italy) in March 2017, and I spent most of 2016 as a visiting research scholar at the University of Maryland, College Park (US). I then held a two-year postdoctoral position in the S2Lab in the UK (first at Royal Holloway, University of London, then at King\u0026rsquo;s College London).\nSpecial Issue **I am one of the guest editors of a [Special Issue on \"Offensive Machine Learning\"](https://dl.acm.org/pb-assets/dtrap/OffensiveMLSpecialIssue-1612112373120.pdf) as part of the ACM DTRAP journal. _Deadline: July 30th, 2021_.** -- Workshop **I am Program Co-Chair of the [1st ACM Workshop on Robust Malware Analysis (WoRMA)](https://worma.gitlab.io/2022/), co-located with ACM AsiaCCS.** -- I\u0026rsquo;m Hiring! I am recruiting Ph.D. students to join my research team at King\u0026rsquo;s College London. If you are interested, read this page. 
Please note that I do not have PostDoc positions at the moment, but keep an eye on this page for future announcements.\n","date":-62135596800,"expirydate":-62135596800,"kind":"taxonomy","lang":"en","lastmod":-62135596800,"objectID":"2525497d367e79493fd32b198b28f040","permalink":"/authors/admin/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/authors/admin/","section":"authors","summary":"I am a Lecturer (Assistant Professor) in Cybersecurity at the Department of Informatics of King\u0026rsquo;s College London (KCL), where I am also Deputy Head of the Cybersecurity Group (CYS), and Programme Leader of the MSc in Cyber Security. I also closely collaborate with UCL\u0026rsquo;s Systems Security Research Lab (S2Lab).\nMy research interests lie at the intersection of AI and cybersecurity, with particular focus on systems security, adversarial ML, malware analysis, and network security.","tags":null,"title":"Fabio Pierazzi","type":"authors"},{"authors":["Limin Yang","Zhi Chen","Jacopo Cortellazzi","Feargus Pendlebury","Kevin Tu","Fabio Pierazzi","Lorenzo Cavallaro","Gang Wang"],"categories":null,"content":"","date":1691622000,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1691622000,"objectID":"1f92774f903b2050edac3287f7577dca","permalink":"/publication/jigsaw/","publishdate":"2022-03-20T00:00:00Z","relpermalink":"/publication/jigsaw/","section":"publication","summary":"Malware classifiers are subject to training-time exploitation due to the need to regularly retrain using samples collected from the wild. Recent work has demonstrated the feasibility of backdoor attacks against malware classifiers, and yet the stealthiness of such attacks is not well understood. In this paper, we investigate this phenomenon under the clean-label setting (i.e., attackers do not have complete control over the training or labeling process). Empirically, we show that existing backdoor attacks in malware classifiers are still detectable by recent defenses such as MNTD. 
To improve stealthiness, we propose a new attack, Jigsaw Puzzle (JP), based on the key observation that malware authors have little to no incentive to protect any other authors' malware but their own. As such, Jigsaw Puzzle learns a trigger to complement the latent patterns of the malware author's samples, and activates the backdoor only when the trigger and the latent pattern are pieced together in a sample. We further focus on realizable triggers in the problem space (e.g., software code) using bytecode gadgets broadly harvested from benign software. Our evaluation confirms that Jigsaw Puzzle is effective as a backdoor, remains stealthy against state-of-the-art defenses, and is a threat in realistic settings that depart from reasoning about feature-space only attacks. We conclude by exploring promising approaches to improve backdoor defenses.\n","tags":[],"title":"Jigsaw Puzzle: Selective Backdoor Attack to Subvert Malware Classifiers","type":"publication"},{"authors":["Giovanni Apruzzese","Hyrum Anderson","Savino Dambra","David Freeman","Fabio Pierazzi","Kevin Alejandro Roundy"],"categories":null,"content":"","date":1675814400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1675814400,"objectID":"444e009c251f02efbca94ec0fc33d021","permalink":"/publication/realgradients/","publishdate":"2022-11-06T00:00:00Z","relpermalink":"/publication/realgradients/","section":"publication","summary":"","tags":[],"title":"Position: “Real Attackers Don’t Compute Gradients”: Bridging the Gap Between Adversarial ML Research and Practice","type":"publication"},{"authors":["Daniel Arp","Erwin Quiring","Feargus Pendlebury","Alexander Warnecke","Fabio Pierazzi","Christian Wressnegger","Lorenzo Cavallaro","Konrad 
Rieck"],"categories":null,"content":"","date":1660086000,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1660086000,"objectID":"d17fb8a1c745b80785142de9def0b921","permalink":"/publication/dodo/","publishdate":"2021-08-19T00:00:00+01:00","relpermalink":"/publication/dodo/","section":"publication","summary":"With the growing processing power of computing systems and the increasing availability of massive datasets, machine learning algorithms have led to major breakthroughs in many different areas. This development has influenced computer security, spawning a series of work on learning-based security systems, such as for malware detection, vulnerability discovery, and binary code analysis. Despite great potential, machine learning in security is prone to subtle pitfalls that undermine its performance and render learning-based systems potentially unsuitable for security tasks and practical deployment.\nIn this paper, we look at this problem with critical eyes. First, we identify common pitfalls in the design, implementation, and evaluation of learning-based security systems. We conduct a study of 30 papers from top-tier security conferences within the past 10 years, confirming that these pitfalls are widespread in the current security literature. In an empirical analysis, we further demonstrate how individual pitfalls can lead to unrealistic performance and interpretations, obstructing the understanding of the security problem at hand. As a remedy, we propose actionable recommendations to support researchers in avoiding or mitigating the pitfalls where possible. 
Furthermore, we identify open problems when applying machine learning in security and provide directions for further research.\n","tags":[],"title":"Dos and Don'ts of Machine Learning in Computer Security","type":"publication"},{"authors":["Federico Barbero","Feargus Pendlebury","Fabio Pierazzi","Lorenzo Cavallaro"],"categories":null,"content":"","date":1653001200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1653001200,"objectID":"16aa8391eafb3766489191b0e64b5f63","permalink":"/publication/transcending/","publishdate":"2021-08-19T00:00:00+01:00","relpermalink":"/publication/transcending/","section":"publication","summary":"Machine learning for malware classification shows encouraging results, but real deployments suffer from performance degradation as malware authors adapt their techniques to evade detection. This evolution of malware results in a phenomenon known as concept drift, as new examples become less and less like the original training examples. One promising method to cope with concept drift is classification with rejection in which examples that are likely to be misclassified are instead quarantined until they can be expertly analyzed.\nWe revisit Transcend, a recently proposed framework for performing rejection based on conformal prediction theory. In particular, we provide a formal treatment of Transcend, enabling us to refine conformal evaluation theory---its underlying statistical engine---and gain a better understanding of the theoretical reasons for its effectiveness. In the process, we develop two additional conformal evaluators that match or surpass the performance of the original while significantly decreasing the computational overhead. 
We evaluate our extension on a large dataset that removes sources of experimental bias present in the original evaluation.\nFinally, to aid practitioners, we determine the optimal operational settings for a Transcend deployment and show how it can be applied to many popular learning algorithms.\nThese insights support both old and new empirical findings, making Transcend a sound and practical solution, while shedding light on how rejection strategies may be further applied to the related problem of evasive adversarial inputs.\n","tags":[],"title":"Transcending Transcend: Revisiting Malware Classification in the Presence of Concept Drift","type":"publication"},{"authors":null,"categories":null,"content":" This article reports a brief summary of the main datasets we have released for malware research. I will try to keep this list updated with new entries, and use the \u0026ldquo;changelog\u0026rdquo; at the end to track major changes to this article.\nDatasets Available The timestamped malware datasets which we have released for research are the following:\n Tesseract dataset (apps from 2014 to 2016): malware downloaded from AndroZoo with extracted feature spaces for DREBIN and MaMaDroid. S\u0026amp;P20 APG dataset (apps from 2017 to 2018): malware downloaded from AndroZoo with extracted feature spaces for DREBIN and MaMaDroid. Please note that we are not releasing the goodware/malware directly, but instead only the SHAs of the apps we considered. To obtain the original apks, you can re-download them from AndroZoo.\nLoading Features We separated the dataset into three JSON files: X, Y, and meta. The following function is used to load the dataset with timestamps:\nfrom datetime import datetime import json import logging import time def load_features(fname, shas=False): \u0026quot;\u0026quot;\u0026quot;Load feature set. Args: fname (str): The common prefix for the dataset. 
(e.g., 'data/features/drebin' -\u0026gt; 'data/features/drebin-[X|Y|meta].json') shas (bool): Whether to include shas. In some versions of the dataset, shas were included to double-check alignment - these are _not_ features and _must_ be removed before training. Returns: Tuple[List[Dict], List, List]: The features, labels, and timestamps for the dataset. \u0026quot;\u0026quot;\u0026quot; logging.info('Loading features...') with open('{}-X.json'.format(fname), 'r') as f: X = json.load(f) # if not shas: # [o.pop('sha256') for o in X] logging.info('Loading labels...') with open('{}-Y.json'.format(fname), 'rt') as f: y = json.load(f) if 'apg' not in fname: y = [o[0] for o in y] logging.info('Loading timestamps...') with open('{}-meta.json'.format(fname), 'rt') as f: t = json.load(f) t = [o['dex_date'] for o in t] if 'apg' not in fname: t = [datetime.strptime(o if isinstance(o, str) else time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(o)), '%Y-%m-%dT%H:%M:%S') for o in t] else: t = [datetime.strptime(o if isinstance(o, str) else time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(o)), '%Y-%m-%d %H:%M:%S') for o in t] return X, y, t Please remember to remove any SHAs from the dataset and do not consider them as features.\nMemory Errors If you are a BSc/MSc student doing a dissertation, and you are relying on our datasets, but do not have access to a powerful server, you may want to consider \u0026ldquo;downsampling\u0026rdquo; strategies to reduce the size of the dataset to make it more manageable.\nChangelog 24/03/2022: v1.0 published ","date":1647907200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1647907200,"objectID":"b7cf78fc513d70634e912fac8d060c7a","permalink":"/post/tutorial/malwaredatasets/","publishdate":"2022-03-22T00:00:00Z","relpermalink":"/post/tutorial/malwaredatasets/","section":"post","summary":"This post covers some of the research datasets we have built for 
research.","tags":["tutorial","python","dataset","malware","loading","features","timestamps"],"title":"Malware Datasets with Timestamps","type":"post"},{"authors":["Giuseppina Andresini","Feargus Pendlebury","Fabio Pierazzi","Corrado Loglisci","Annalisa Appice","Lorenzo Cavallaro"],"categories":null,"content":"","date":1636502400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1636502400,"objectID":"80449904828b0526fb11d60070640707","permalink":"/publication/insomnia/","publishdate":"2021-09-29T00:00:00+01:00","relpermalink":"/publication/insomnia/","section":"publication","summary":"Despite decades of research in network traffic analysis and incredible advances in artificial intelligence, network intrusion detection systems based on machine learning (ML) have yet to prove their worth. One core obstacle is the existence of concept drift, an issue for all adversary-facing security systems. Additionally, specific challenges set intrusion detection apart from other ML-based security tasks, such as malware detection.\n\nIn this work, we offer a new perspective on these challenges. We propose INSOMNIA, a semi-supervised intrusion detector which continuously updates the underlying ML model as network traffic characteristics are affected by concept drift. We use active learning to reduce latency in the model updates, label estimation to reduce labeling overhead, and apply explainable AI to better interpret how the model reacts to the shifting distribution.\n\nTo evaluate INSOMNIA, we extend TESSERACT—a framework originally proposed for performing sound time-aware evaluations of ML-based malware detectors—to the network intrusion domain. 
Our evaluation shows that accounting for drifting scenarios is vital for effective intrusion detection systems.\n","tags":[],"title":"INSOMNIA: Towards Concept-Drift Robustness in Network Intrusion Detection","type":"publication"},{"authors":["Zeliang Kan","Feargus Pendlebury","Fabio Pierazzi","Lorenzo Cavallaro"],"categories":null,"content":"","date":1636502400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1636502400,"objectID":"4372d2c580058fb8a682a323cd57b5d4","permalink":"/publication/deplusplus/","publishdate":"2021-09-29T00:00:00+01:00","relpermalink":"/publication/deplusplus/","section":"publication","summary":"The evolution of malware has long plagued machine learning-based detection systems, as malware authors develop innovative strategies to evade detection and chase profits. This induces concept drift as the test distribution diverges from the training, causing performance decay that requires constant monitoring and adaptation.\n\nIn this work, we analyze the adaptation strategy used by DroidEvolver, a state-of-the-art learning system that self-updates using pseudo-labels to avoid the high overhead associated with obtaining a new ground truth. After removing sources of experimental bias present in the original evaluation, we identify a number of flaws in the generation and integration of these pseudo-labels, leading to a rapid onset of performance degradation as the model poisons itself. We propose DroidEvolver++, a more robust variant of DroidEvolver, to address these issues and highlight the role of pseudo-labels in addressing concept drift. 
We test the tolerance of the adaptation strategy versus different degrees of pseudo-label noise and propose the adoption of methods to ensure only high-quality pseudo-labels are used for updates.\n\nUltimately, we conclude that the use of pseudo-labeling remains a promising solution to limitations on labeling capacity, but great care must be taken when designing update mechanisms to avoid negative feedback loops and self-poisoning which have catastrophic effects on performance.\n","tags":[],"title":"Investigating Labelless Drift Adaptation for Malware Detection","type":"publication"},{"authors":null,"categories":null,"content":" This post is targeted mostly at BSc/MSc students doing their dissertations, or early career Ph.D. students who need support in understanding how to conduct a proper literature review in systems security, and in positioning their work within the state of the art.\nDisclaimer: Please note that this post represents my own personal opinion (which also evolves over time as I learn more :-) ), based on my experience so far as a researcher working in Systems Security \u0026amp; AI. My intent is solely to provide initial recommendations to early career research students tackling their first large project who want to understand better how to conduct a literature review. If you have any feedback, I\u0026rsquo;d be more than happy to challenge my views and integrate and revise my recommendations!\nIn these years as a faculty member and academic supervisor, I have seen students struggle more and more with understanding how to conduct a literature review. This is understandable: the field has been growing extensively, and more and more papers get published in reputable venues, so it is hard to create a \u0026ldquo;mental compass\u0026rdquo; to orient yourself in this sea of literature. Even within Computer Science, every discipline has different venues, and understanding how to look for related works requires some effort and experience. 
Indeed, this guide is meant mostly for Systems Security.\nDuring my undergraduate dissertation, I really struggled to understand what it meant to write a \u0026ldquo;Related Work\u0026rdquo; section. I got confused because\u0026mdash;to me\u0026mdash;it looked very similar to the \u0026ldquo;Introduction\u0026rdquo; section. Moreover, I felt intimidated about approaching and reading \u0026ldquo;research papers\u0026rdquo; written by experts, as I thought you needed to be a \u0026ldquo;genius\u0026rdquo; to read one of those.\nIn practice, a BSc/MSc in Computer Science should give you all the ingredients you need to first approach a research paper. You need to read within your area of expertise, and be patient and persistent: you will not understand everything after one read, and sometimes it takes several passes to really grasp a concept. Another important skill is to learn how to navigate the literature, so you can build the foundational knowledge you need to understand more advanced topics. Nevertheless, good research papers\u0026mdash;while technical\u0026mdash;are also written and structured in an extremely clear way, as they are meant to be understood by a technical audience.\nPublication Quality Determining the quality of publications is a hard problem that requires expertise and in-depth critical assessment of what you are reading. However, you may use some criteria to prioritize your literature review and identify papers which are more relevant to your research.\nPublication venues In academia, research articles are mostly published via peer review. In academic conferences, a program committee of experts reviews each paper independently and determines whether its originality, rigor and significance are sufficient for publication at a certain venue. 
In academic journals, there is a committee of \u0026ldquo;associate editors\u0026rdquo; who invite external experts and coordinate the review process.\nHence, publication venues can typically be used as initial proxies of quality.\nKeep in mind there could still be bad papers published in good venues (and vice versa), but indicatively, venues that are quite selective tend to contain work of higher quality. I find this especially useful when, as a student, you start reading research papers and feel \u0026ldquo;lost in the sea\u0026rdquo;.\nMy list is not intended to be exhaustive, just indicative.\nTwo highly recommended reads are the following:\n Influential Security Papers, by Konrad Rieck, TU Braunschweig System Security Circus, by Davide Balzarotti, S3@Eurecom Conferences Academic conferences typically have one edition per year in which authors meet to discuss and present the works published in that year\u0026rsquo;s proceedings. It has now become more frequent to have multiple submission deadlines throughout the year, but each conference still holds a yearly edition in which works are presented.\nIn Systems Security, there are typically four main academic conferences which are commonly considered top-tier conferences by the community:\n IEEE Symposium on Security \u0026amp; Privacy (S\u0026amp;P, Oakland) USENIX Security Symposium (Sec) ACM Conference on Computer and Communications Security (CCS) The Network and Distributed System Security (NDSS) Symposium Apart from the \u0026ldquo;top four\u0026rdquo;, other highly reputable venues in Systems Security include (but are not limited to):\n IEEE European Symposium on Security and Privacy (EuroS\u0026amp;P) Annual Computer Security Applications Conference (ACSAC) ACM ASIA Conference on Computer and Communications Security (ASIACCS) International Symposium on Research in Attacks, Intrusions and Defenses (RAID) Conference on Detection of Intrusions and Malware \u0026amp; Vulnerability Assessment (DIMVA) 
European Symposium on Research in Computer Security (ESORICS) ACM Conference on Data and Application Security and Privacy (CODASPY) There are also top workshops on specific sub-topics, such as ScAInet, AISec, WOOT, and DLS, which are often co-located with top conferences. Workshops typically publish preliminary results or interesting new ideas, which may not yet have a sufficiently large experimental evaluation or fully-fledged theory behind them.\nSometimes work in Systems Security can also be published in closely related venues more focused on Data Mining, ML, and AI. For example, The Web Conference (ACM), a first-tier conference in data mining, has a Security \u0026amp; Privacy track. Other top-tier venues in data mining, AI and ML include: ICML, ICLR, WSDM, NeurIPS, IJCAI, AAAI.\nJournals Journals in Systems Security are typically used for extensions or consolidations of the research results of a conference version, but they often also host original research works.\nHere is a rough list of reputable journals in Systems Security:\n ACM Transactions on Privacy and Security (TOPS; ex-TISSEC) IEEE Transactions on Information Forensics and Security (TIFS) IEEE Transactions on Dependable and Secure Computing (TDSC) Journal of Computer Security Elsevier Computers \u0026amp; Security Typically, ACM and IEEE journals with \u0026ldquo;Transactions\u0026rdquo; in their name tend to be the more prestigious ones, but there are also other highly reputable journals which have built a reputation for themselves.\nIn the field of AI and ML, Pattern Recognition and JMLR are also very good journals.\nNumber of citations As humans, we like to measure things. Many systems use the number of citations as a proxy for quality and impact. 
While this is not necessarily true, papers with a high number of citations may either be false positives (e.g., papers cited just because people searched for \u0026ldquo;computer worm\u0026rdquo; and cited the first result on Scholar) or genuinely influential papers that heavily shaped research in a certain field.\nTypically, you should also look out for \u0026ldquo;Test of Time Awards\u0026rdquo;, which are given to influential papers, typically 10 years after their publication.\nTechnical reports, white papers and pre-prints Industry frequently publishes technical reports or white papers on some technology. Sometimes academics do so as well (although they mostly call them \u0026ldquo;technical reports\u0026rdquo;). While technical reports can undoubtedly be informative, it is important to note that they are not peer-reviewed.\nHence, even when you find paper pre-prints on websites such as arXiv, be sure to double-check if they are published somewhere.\nSometimes the authors\u0026rsquo; names, reputation and affiliation are used as proxies for quality, but you always need to be more cautious if a work has not been peer-reviewed, and to critically assess the work itself. It is important to remember, as Richard Feynman said:\n \u0026ldquo;If [a theory] disagrees with experiments, it\u0026rsquo;s wrong. And that simple statement is the key to science. It doesn\u0026rsquo;t make a difference how beautiful your guess is, it doesn\u0026rsquo;t make a difference how smart you are, who made the guess, or what his name is: if it disagrees with the experiment, it\u0026rsquo;s wrong. That\u0026rsquo;s all there is to it.\u0026rdquo;\nRichard Feynman\n There can be a variety of valid reasons why a work isn\u0026rsquo;t published yet (maybe it is currently under review). 
Nevertheless, it is important that you keep in mind that accepted work has been peer-reviewed by experts, and hence you should generally hold it in higher regard than technical reports.\nPublication years Keep in mind that research is incremental: findings at time t may be disproved at time t+1. We always build on the shoulders of giants, but it is a never-ending process. While it is fundamental not to forget prior research, it is important to check that your references are up to date with the current state of the art. Even 5 years in the field of computer security is a long time.\nHow to find related works I try to summarize some useful tips and tricks to find relevant works in your field:\n Paper search engines: it may sound trivial, but I think Google Scholar is your best search engine at the moment for finding research papers. If you do not have access to Scholar, dblp is also a good place to look into. Read \u0026ldquo;Related Work\u0026rdquo; sections: If you find one paper that is very close to what you are doing, look at the papers they cite. This is often a good source of information, and provides you interesting references to look into. Look at the \u0026ldquo;Cited by\u0026rdquo; field on Google Scholar: If you click on the \u0026ldquo;Cited by\u0026rdquo; field on Google Scholar, you will also find a list of papers that cited that one, typically sorted by citation number. It can be useful to track more recent advancements of the state of the art. Surveys and Systematization of Knowledge (SoK) papers are also a great place to understand the state of the art in a more condensed way. Sometimes Reading Lists are also published by prominent researchers in a field, such as the Adversarial Machine Learning Reading List from Nicholas Carlini. 
Search researchers\u0026rsquo; profiles: If a researcher has published a top paper on a certain systems security topic, looking at their research profile (e.g., personal home page, dblp, or Google Scholar) may reveal other interesting works on the topic. Sessions of recent security conferences: Conferences are held once per year, and their presentation program is typically divided into \u0026ldquo;sessions\u0026rdquo; on a certain topic. For example, if you are doing research in Web Security, a conference may have a \u0026ldquo;Web Security\u0026rdquo; session in its program, which can be useful to find relevant works in that area at the moment. Be creative in your keyword search: Keywords and trends change over time, even for the same topic. Unfortunately\u0026mdash;to the best of my knowledge\u0026mdash;most paper search engines rely mostly on paper titles, not much on their content. Hence, you need to be smart in your keyword search, since some papers may be using different terminology. For example, back in 2016, when I started doing research in Android malware analysis, I searched for \u0026ldquo;Android clustering\u0026rdquo; on Google Scholar, or \u0026ldquo;mobile malware clustering\u0026rdquo;, and nothing came out of my searches. I initially thought no one else had done it, until I found out that people were mostly framing it as \u0026ldquo;malware family identification\u0026rdquo;! \u0026ldquo;There is no work on this topic\u0026rdquo; Students often come back to me saying they did not find anything on a certain topic. Well, there can be two situations:\n You had a novel idea for opening a new research field that no one else had before. Amazing! You did not \u0026ldquo;yet\u0026rdquo; find anything related. Unfortunately, it is often the second case.\nSometimes you need to be creative in your literature review as well. Let me give an example. Maybe you are doing research on identifying a new type of network attack that came out earlier this year. 
But in most cases, even new attacks rely on components previously proposed in older attacks. For example, if you want to identify attacker communications within internal networks, there has been a lot of research on \u0026ldquo;computer worms\u0026rdquo;. Or maybe even research on graphs and social networks may be relevant with respect to the algorithms you are using.\nAdvice on reading research papers When you first approach a literature review, you will find tens or hundreds of papers related to your work. It can be quite intimidating, and you may feel overwhelmed by the sheer number of papers that pile up in your \u0026ldquo;to-read\u0026rdquo; folder. Indeed, reading thoroughly all of the papers published in a field would be quite a challenging task, if not impossible.\nI have the following advice:\n Be kind to yourself. First of all, accept that you cannot possibly read everything thoroughly. Moreover, understanding is an incremental process. You will not understand 100% of a paper on a first read. At least, I don\u0026rsquo;t. It takes time. The more you read about a certain topic, or re-read certain sections, the more you understand about it. I myself still go back and re-read some seminal papers, and always learn something new from a refreshed perspective. You will continuously grow as a researcher and as a professional, and so will your ability to capture connections and meanings in what you read. Read abstracts before choosing which papers to dive into. Abstracts are there for a reason: to give you an idea of what the paper is contributing to the state of the art. Reading abstracts from many papers should give you an idea of the relevant challenges people are tackling, and a small intuition on how they managed to overcome prior works. Of course, abstracts are not sufficient to fully understand a work, but even reading one hundred abstracts becomes a manageable task, versus reading one hundred papers in detail. 
While you are still looking at the literature at a high level, it is also good practice to skim papers by reading the introduction and conclusions, and going over the main results and figures. This is just to identify the works which are most relevant to your research, and which you should read more thoroughly. Talk to colleagues/supervisors. People will have advice on which papers to read on a certain topic, especially if they\u0026rsquo;re experts in that sub-area. You can also ask for feedback on whether a certain paper you found could be considered relevant for your project. Try to (re)implement small ideas. It is often the case that you may get stuck in \u0026ldquo;reading mode\u0026rdquo;, blocked by the sheer amount of information. If you try to implement or replicate someone else\u0026rsquo;s approach, see it as a great learning opportunity which will also benefit your literature review: you will start understanding better how other people have solved certain issues, and what important questions still need answers. You may also identify missing bits in the literature. Reading groups are an option where each person reads and summarizes a recent paper to the other members of the reading group. This helps with scaling, and improves networking and presentation skills. Positioning your work Understanding how to position your work within the state of the art is hard.\nIn your research dissertation/paper, you need to do more than provide a \u0026ldquo;summary\u0026rdquo; of the related work. Positioning implies that you \u0026ldquo;compare and contrast\u0026rdquo; your proposed approach with existing literature.\nHere is some advice I always refer to as a starting point:\n [\u0026hellip;] \u0026ldquo;At a high level what are the differences in what you are doing, and what others have done? Keep this at a high level, you can refer to a future section where specific details and differences will be given. 
But it is important for the reader to know at a high level, what is new about this work compared to other work in the area.\u0026rdquo;\nJim Kurose (UMass), Writing a Good Introduction\n In the previous quote, the author is talking about the Introduction section, where you should just give a high-level intuition of what you are doing that is novel. The Related Work section is dedicated to diving deeper into that comparison, and answering many questions in more detail, such as:\n Is your approach relying on previous techniques? Are you solving a new problem? Are there other works solving similar problems? What makes your approach better than the state of the art? Is it faster? Is it more efficient? Is it more reliable? Hence, it is not sufficient to just \u0026ldquo;list\u0026rdquo; the existing literature; you really need to explicitly write how each work is related to yours, and where you are improving the state of the art.\nThe \u0026ldquo;Related Work\u0026rdquo; section needs to be easily accessible to readers, as there is a risk of listing and comparing against a huge amount of papers and losing the \u0026ldquo;big picture\u0026rdquo;. So, I often find it useful, for clarity, to identify two/three macro-areas of related research and summarize my contributions with respect to the works in each of those areas.\nFinal Considerations I hope you have found this post useful.\nThis is not meant to be an exhaustive list of relevant publication venues and tips for conducting a literature review.
In general, you should see this just as initial advice to help you navigate more confidently within the state of the art of Systems Security, upon venturing into it for the first time.\nI would be very happy to hear about your experiences, and whether you have any other relevant advice on conducting literature reviews.\nChangelog v1.0: Nov 16, 2021 v1.0.1: Nov 17, 2021 - Fixed System Security Circus link; added Carlini adversarial reading list ","date":1635462000,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1635462000,"objectID":"3522dc6d3be05697ed95175a8710798b","permalink":"/post/tutorial/literaturereview/","publishdate":"2021-10-29T00:00:00+01:00","relpermalink":"/post/tutorial/literaturereview/","section":"post","summary":"Advice on how to do a literature review and survey related work in the field of computer security.","tags":["tutorial","writing","related work","literature review","systems security"],"title":"How to Review Literature in Systems Security","type":"post"},{"authors":["Raphael Labaca-Castro","Luis Mu\u0026#241;oz-Gonz\u0026aacute;lez","Feargus Pendlebury","Gabi Dreo Rodosek","Fabio Pierazzi","Lorenzo Cavallaro"],"categories":null,"content":"","date":1613433600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1613433600,"objectID":"fe8dbffd2372680088ea22db9f3ee279","permalink":"/publication/gameup/","publishdate":"2021-02-16T00:00:00Z","relpermalink":"/publication/gameup/","section":"publication","summary":"Machine learning classification models are vulnerable to adversarial examples—effective input-specific perturbations that can manipulate the model’s output. Universal Adversarial Perturbations (UAPs), which identify noisy patterns that generalize across the input space, allow the attacker to greatly scale up the generation of these adversarial examples. 
Although UAPs have been explored in application domains beyond computer vision, little is known about their properties and implications in the specific context of realizable attacks, such as malware, where attackers must reason about satisfying challenging problem-space constraints.\nIn this paper, we explore the challenges and strengths of UAPs in the context of malware classification. We generate sequences of problem-space transformations that induce UAPs in the corresponding feature-space embedding and evaluate their effectiveness across threat models that consider a varying degree of realistic attacker knowledge. Additionally, we propose adversarial training-based mitigations using knowledge derived from the problem-space transformations, and compare against alternative feature-space defenses. Our experiments limit the effectiveness of a white box Android evasion attack to ~20% at the cost of ~3% TPR at 1% FPR. We additionally show how our method can be adapted to more restrictive application domains such as Windows malware.\nWe observe that while adversarial training in the feature space must deal with large and often unconstrained regions, UAPs in the problem space identify specific vulnerabilities that allow us to harden a classifier more effectively, shifting the challenges and associated cost of identifying new universal adversarial transformations back to the attacker.\n","tags":[],"title":"Universal Adversarial Perturbations for Malware","type":"publication"},{"authors":["Fabio Pierazzi","Stefano Cristalli","Danilo Bruschi","Michele Colajanni","Mirco Marchetti","Andrea Lanzi"],"categories":null,"content":"","date":1609459200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1609459200,"objectID":"eb0c6fd3601a6b5e540e523a05128837","permalink":"/publication/glyph/","publishdate":"2021-01-01T00:00:00Z","relpermalink":"/publication/glyph/","section":"publication","summary":"Heap spraying is probably the most simple and effective memory 
corruption attack, which fills the memory with malicious payloads and then jumps at a random location in hopes of starting the attacker’s routines. To counter this threat, Graffiti has been recently proposed as the first OS-agnostic framework for monitoring memory allocations of arbitrary applications at runtime; however, the main contributions of Graffiti are on the monitoring system, and its detection engine only considers simple heuristics which are tailored to certain attack vectors and are easily evaded. In this paper, we aim to overcome this limitation and propose Glyph as the first ML-based heap spraying detection system, which is designed to be effective, efficient, and resilient to evasive attackers. Glyph relies on the information monitored by Graffiti, and we investigate the effectiveness of different feature spaces based on information entropy and memory n-grams, and discuss the several engineering challenges we have faced to make Glyph efficient with an overhead compatible with that of Graffiti. To evaluate Glyph, we build a representative dataset with several variants of heap spraying attacks, and assess Glyph's resilience against evasive attackers through selective hold-out experiments. Results show that Glyph achieves high accuracy in detecting spraying and is able to generalize well, outperforming the state-of-the-art approach for heap spraying detection, Nozzle. 
Finally, we thoroughly discuss the trade-offs between detection performance and runtime overhead of Glyph's different configurations.\n","tags":[],"title":"GLYPH: Efficient ML-based Detection of Heap Spraying Attacks","type":"publication"},{"authors":["Fabio Pierazzi\u0026#42;","Feargus Pendlebury\u0026#42;","Jacopo Cortellazzi","Lorenzo Cavallaro"],"categories":null,"content":"","date":1582502400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1582502400,"objectID":"ee337cd954f6b06036c4d68727cd8713","permalink":"/publication/apg/","publishdate":"2020-02-24T00:00:00Z","relpermalink":"/publication/apg/","section":"publication","summary":"Recent research efforts on adversarial ML have investigated problem-space attacks, focusing on the generation of real evasive objects in domains where, unlike images, there is no clear inverse mapping to the feature space (e.g., software). However, the design, comparison, and real-world implications of problem-space attacks remain underexplored.\nThis paper makes two major contributions. First, we propose a general formalization for adversarial ML evasion attacks in the problem-space, which includes the definition of a comprehensive set of constraints on available transformations, preserved semantics, absent artifacts, and plausibility. We shed light on the relationship between feature space and problem space, and we introduce the concept of side-effect features as the by-product of the inverse feature-mapping problem. This enables us to define and prove necessary and sufficient conditions for the existence of problem-space attacks. We further demonstrate the expressive power of our formalization by using it to describe several attacks from related literature across different domains.\nSecond, building on our general formalization, we propose a novel problem-space attack on Android malware that overcomes past limitations in terms of semantics and artifacts. 
Experiments on a dataset with 170K Android apps from 2017 and 2018 show the practical feasibility of evading a state-of-the-art malware classifier, DREBIN, along with its hardened version, Sec-SVM. Our results demonstrate that “adversarial-malware as a service” is a realistic threat, as we automatically generate thousands of realistic and inconspicuous adversarial applications at scale, where on average it takes only a few minutes to generate an adversarial app. Yet, out of the 1300+ papers on adversarial ML published in the past six years, roughly 35 focus on malware---and many remain only in the feature space.\nOur formalization of problem-space attacks paves the way to more principled research in this domain. We responsibly release the code and dataset of our novel attack to other researchers, to encourage future work on defenses in the problem space.\n","tags":[],"title":"Intriguing Properties of Adversarial ML Attacks in the Problem Space","type":"publication"},{"authors":["Fabio Pierazzi","Ghita Mezzour","Qian Han","Michele Colajanni","V.S. 
Subrahmanian"],"categories":null,"content":"","date":1580860800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1580860800,"objectID":"1e1e830d1e7c4086eb79b0b8ea768282","permalink":"/publication/spyware/","publishdate":"2020-02-05T00:00:00Z","relpermalink":"/publication/spyware/","section":"publication","summary":"","tags":[],"title":"A Data-Driven Characterization of Modern Android Spyware","type":"publication"},{"authors":["Fabio Pierazzi\u0026#42;","Feargus Pendlebury\u0026#42;","Jacopo Cortellazzi","Lorenzo Cavallaro"],"categories":null,"content":"","date":1565823600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1565823600,"objectID":"878465f174c3731151accd4ad643ea3a","permalink":"/publication/apg-poster/","publishdate":"2019-08-15T00:00:00+01:00","relpermalink":"/publication/apg-poster/","section":"publication","summary":"","tags":[],"title":"POSTER: Realistic Adversarial ML Attacks in the Problem-Space","type":"publication"},{"authors":["Feargus Pendlebury\u0026#42;","Fabio Pierazzi\u0026#42;","Roberto Jordaney","Johannes Kinder","Lorenzo Cavallaro"],"categories":null,"content":"","date":1565823600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1565823600,"objectID":"d058c440ff7d328c5ac89cd12ff838a8","permalink":"/publication/tesseract-usenix/","publishdate":"2019-08-15T00:00:00+01:00","relpermalink":"/publication/tesseract-usenix/","section":"publication","summary":"Is Android malware classification a solved problem? Published F1 scores of up to 0.99 appear to leave very little room for improvement. In this paper, we argue that results are commonly inflated due to two pervasive sources of experimental bias: \"spatial bias\" caused by distributions of training and testing data that are not representative of a real-world deployment; and \"temporal bias\" caused by incorrect time splits of training and testing sets, leading to impossible configurations. 
We propose a set of space and time constraints for experiment design that eliminates both sources of bias. We introduce a new metric that summarizes the expected robustness of a classifier in a real-world setting, and we present an algorithm to tune its performance. Finally, we demonstrate how this allows us to evaluate mitigation strategies for time decay such as active learning. We have implemented our solutions in TESSERACT, an open source evaluation framework for comparing malware classifiers in a realistic setting. We used TESSERACT to evaluate three Android malware classifiers from the literature on a dataset of 129K applications spanning over three years. Our evaluation confirms that earlier published results are biased, while also revealing counter-intuitive performance and showing that appropriate tuning can lead to significant improvements.\n","tags":[],"title":"TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time","type":"publication"},{"authors":null,"categories":null,"content":"","date":1563537826,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1563537826,"objectID":"05eb04e0bf2913f1704f64ab9bed2672","permalink":"/project/conceptdrift/","publishdate":"2019-07-19T13:03:46+01:00","relpermalink":"/project/conceptdrift/","section":"project","summary":"","tags":[],"title":"Concept Drift Detection and Remediation","type":"project"},{"authors":null,"categories":null,"content":"","date":1563537826,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1563537826,"objectID":"6094fa21b49482659defc54918e99ed9","permalink":"/project/malware/","publishdate":"2019-07-19T13:03:46+01:00","relpermalink":"/project/malware/","section":"project","summary":"","tags":[],"title":"Malware 
Analysis","type":"project"},{"authors":null,"categories":null,"content":"","date":1563537826,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1563537826,"objectID":"fae44655346f82c4e1f8c97247b8d851","permalink":"/project/network/","publishdate":"2019-07-19T13:03:46+01:00","relpermalink":"/project/network/","section":"project","summary":"","tags":[],"title":"Network Security Analytics","type":"project"},{"authors":null,"categories":null,"content":" In the past few years, I have been frequently asked how to easily parallelize in Python. It seems that people are often confused by the documentation online and are not sure which solution to go for. So, I decided to write a blog post about a minimum working example of multiprocessing, which should be pretty easy and straightforward to use in a lot of situations.\nRequirements We need to import a couple of libraries: multiprocessing and (optionally) tqdm. In Python3 (stable at 3.7 at the time of writing), the multiprocessing library is natively included; this library allows you to define a function that you can spawn with different parameters on separate processes (it does not perform multithreading).\nimport multiprocessing The second library I would recommend is tqdm, which is not required but has made my coding life so much easier since I started using it. This library creates a smart progress bar when it is used to wrap an iterable object (e.g., a list in a for loop). To install it, just run pip3 install tqdm --upgrade. We will see how to use tqdm in its very basic form, but I recommend having a look at its official documentation for more advanced usage.\nfrom tqdm import tqdm Function to parallelize First, let us define the function that you would like to parallelize. 
You often have to rethink the logic of your program, but if you have some independent CPU-intensive operations (e.g., training, parsing of long files, preprocessing), chances are there is a way to rethink your code to create a single, independent function that will run on a separate process.\nLet us define here a function, namely function_to_parallelize, that just takes two parameters as input and sums them. In this function you can actually do any sort of processing, as long as it depends solely on the input parameters. Note that you must use a single input parameter for this function. This is not a problem, as this single parameter can be a dictionary containing many parameters that you can unpack. For the purpose of this tutorial, I am using param1 and param2 as parameter names, but of course you can use any name and number of parameters (as long as you embed them within a single params dictionary).\nOf course, parameter names need to match the names in the list you pass to the multiprocessing library (as we will see in a bit).\ndef function_to_parallelize(params): param1 = params['param1'] param2 = params['param2'] # I recommend using a dictionary # so you can create richer results # to return to the main process result = {} # Do your processing with param1 and param2. # Here, you can do any sort of processing, and # it will be executed independently on a separate process # on a separate (virtual) core of your CPU result['sum'] = param1+param2 return result Parallelization To perform the actual parallelization, we need to define the number of (virtual) cores that we want to use, and we need to define a-priori the list of parameter combinations that we are going to spawn in parallel.\nYou can check how many (virtual) cores you have by using the command-line tools htop or glances. 
I would recommend always leaving a couple of cores free, to avoid your machine going into thrashing during the experiments.\n# This chooses all cores except 2, unless there are only two or fewer cores. NCPU = multiprocessing.cpu_count() - 2 if multiprocessing.cpu_count() \u0026gt; 2 else 1 Then, you need to define the parameters that each process will receive as input. You basically have to create a list of parameters. There can be any number of them, and I usually prefer to define them as a generator expression over dictionaries for brevity.\n# Create a Python generator expression over parameter dictionaries. # You need to identify what you would like to parallelize on, and then build # the parameters dictionary X_values = [10,22,35,4,532,12,42,53,23] params = ({ 'param1': x, 'param2': 15 } for x in X_values) Note: I recommend not passing big structures as parameters (e.g., large matrices) because it can create a lot of inter-process communication that can cause severe delays or memory errors (e.g., using all the RAM in the machine). When possible, use Python generators instead of Python lists (as shown above); in this way, the parameters list will not be pre-allocated all at once, but instead it will be generated lazily while iterating through the loop.\nNow we are ready to start the actual multiprocessing. We create a Pool of processes that is managed by the multiprocessing library itself. Then, Python allows us to iterate through the Pool and get back the results. The function p.imap returns the results in order.\nwith multiprocessing.Pool(processes=NCPU) as p: # If you use \u0026quot;generators\u0026quot; for the params list, # tqdm will not know the total length of the array # (required for the progress bar), # so you have to specify it explicitly, # or derive it from your params. 
MAX_COUNT = len(X_values) for res in tqdm(p.imap(function_to_parallelize, params),total=MAX_COUNT): if res is not None: print(res['sum']) Troubleshooting Machine becomes slow and unresponsive after using multiprocessing\nI have noticed that sometimes the multiprocessing library, for very large batches of experiments, leaves some operating system resources in use\u0026mdash;hence, the virtual machine in which you may be running your experiments could become slow and unresponsive after many days of experiments. It may be a problem of the infrastructure I am using for the experiments, but I have also seen this happen to colleagues who use other infrastructures. My best recommendation so far is to restart the VM if it becomes very slow after using the multiprocessing library. I know this is a kind of \u0026ldquo;Have you tried turning it off and on again?\u0026rdquo; solution, but it has worked so far. I will dig deeper into this, or if you have any insight please do not hesitate to leave a comment below.\n Each iteration is too slow/Too much RAM is being used\nDouble check that you are using a Python generator (and not a list) for the list of parameters (the object named params in the above code). Also, if you are passing big data structures (e.g., huge feature matrices), remember that that memory will be consumed by each spawned process. So, I recommend slicing (if possible) before passing big matrices as parameters if you do not need all the rows/columns for your processing.\n Final Considerations I hope you have found this tutorial useful. Many people go for complex solutions for parallelizing Python code, but in most cases I have found that this easy solution is just as effective. I have put the full code of this minimum working example on GitHub, here. 
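For convenience, here is a minimal self-contained sketch that assembles the snippets above into one runnable script (tqdm is omitted here to keep the example standard-library only; variable names follow the post):

```python
# Minimal end-to-end sketch assembled from the snippets in this post
# (tqdm omitted to keep the example standard-library only).
import multiprocessing

def function_to_parallelize(params):
    # Any CPU-intensive processing can go here, as long as it
    # depends only on the single params dictionary.
    result = {}
    result['sum'] = params['param1'] + params['param2']
    return result

def main():
    # Leave a couple of cores free, as recommended above.
    ncpu = multiprocessing.cpu_count() - 2 if multiprocessing.cpu_count() > 2 else 1
    x_values = [10, 22, 35, 4, 532, 12, 42, 53, 23]
    # Generator expression: parameter dictionaries are produced lazily.
    params = ({'param1': x, 'param2': 15} for x in x_values)
    with multiprocessing.Pool(processes=ncpu) as p:
        # p.imap returns results in input order.
        for res in p.imap(function_to_parallelize, params):
            print(res['sum'])

if __name__ == '__main__':
    main()
```

If the order of the results does not matter, Pool.imap_unordered can be used in place of p.imap to receive each result as soon as it is ready.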
If you have any suggestions to improve this tutorial, or if any parts are not clear, just leave a comment!\nI would be happy to hear if you have other suggestions for performing multiprocessing in a more effective way.\n","date":1551484800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1551484800,"objectID":"fe6434a06df4dcb861e98cacdb73529c","permalink":"/post/tutorial/multiprocessing/","publishdate":"2019-03-02T00:00:00Z","relpermalink":"/post/tutorial/multiprocessing/","section":"post","summary":"This post gives an introduction to the use of Python multiprocessing.","tags":["tutorial","python","multiprocessing"],"title":"Introduction to Python multiprocessing","type":"post"},{"authors":["Chongyang Bai","Qian Han","Ghita Mezzour","Fabio Pierazzi","V.S. Subrahmanian"],"categories":null,"content":"","date":1546300800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1546300800,"objectID":"4a3e8fd2efbbf5739228e23688d528b9","permalink":"/publication/dbank/","publishdate":"2019-01-01T00:00:00Z","relpermalink":"/publication/dbank/","section":"publication","summary":"","tags":[],"title":"DBank: Predictive Behavioral Analysis of Recent Android Banking Trojans","type":"publication"},{"authors":["Feargus Pendlebury\u0026#42;","Fabio Pierazzi\u0026#42;","Roberto Jordaney","Johannes Kinder","Lorenzo Cavallaro"],"categories":null,"content":"","date":1539990000,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1539990000,"objectID":"916cbc5e77cac472d47801c5357102df","permalink":"/publication/tesseract-poster/","publishdate":"2018-10-20T00:00:00+01:00","relpermalink":"/publication/tesseract-poster/","section":"publication","summary":"","tags":[],"title":"POSTER: Enabling Fair ML Evaluations for Security","type":"publication"},{"authors":["Giovanni Apruzzese","Fabio Pierazzi","Michele Colajanni","Mirco 
Marchetti"],"categories":null,"content":"","date":1506812400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1506812400,"objectID":"a7a23ab8f90beba0e64d15f07dbabb88","permalink":"/publication/pivoting/","publishdate":"2017-10-01T00:00:00+01:00","relpermalink":"/publication/pivoting/","section":"publication","summary":"Several advanced cyber attacks adopt the technique of “pivoting” through which attackers create a command propagation\ntunnel through two or more hosts in order to reach their final target. Identifying such malicious activities is one of the most tough\nresearch problems because of several challenges: command propagation is a rare event that cannot be detected through signatures,\nthe huge amount of internal communications facilitates attackers evasion, timely pivoting discovery is computationally demanding. This\npaper describes the first pivoting detection algorithm that is based on network flows analyses, does not rely on any a-priori assumption\non protocols and hosts, and leverages an original problem formalization in terms of temporal graph analytics. We also introduce a\nprioritization algorithm that ranks the detected paths on the basis of a threat score thus letting security analysts investigate just the\nmost suspicious pivoting tunnels. 
Feasibility and effectiveness of our proposal are assessed through a broad set of experiments that\ndemonstrate its higher accuracy and performance against related algorithms.\n","tags":[],"title":"Detection and Threat Prioritization of Pivoting Attacks in Large Networks","type":"publication"},{"authors":["Fabio Pierazzi","Giovanni Apruzzese","Michele Colajanni","Alessandro Guido","Mirco Marchetti"],"categories":null,"content":"","date":1496358000,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1496358000,"objectID":"068a5739892eb01fba45ddadcb43745b","permalink":"/publication/cycon17/","publishdate":"2017-06-02T00:00:00+01:00","relpermalink":"/publication/cycon17/","section":"publication","summary":"","tags":[],"title":"Scalable Architecture for Online Prioritisation of Cyber Threats","type":"publication"},{"authors":["Tanmoy Chakraborty","Fabio Pierazzi","V.S. Subrahmanian"],"categories":null,"content":"","date":1496271600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1496271600,"objectID":"c493d66303ad66b61323e5f924c8c3be","permalink":"/publication/tdsc17/","publishdate":"2017-06-01T00:00:00+01:00","relpermalink":"/publication/tdsc17/","section":"publication","summary":"As the most widely used mobile platform, Android is also the biggest target for mobile malware. Given the increasing number of Android malware variants, detecting malware families is crucial so that security analysts can identify situations where signatures of a known malware family can be adapted as opposed to manually inspecting behavior of all samples. We present EC2 (Ensemble Clustering and Classification), a novel algorithm for discovering Android malware families of varying sizes – ranging from very large to very small families (even if previously unseen). 
We present a performance comparison of several traditional classification and clustering algorithms for Android malware family identification on DREBIN, the largest public Android malware dataset with labeled families. We use the output of both supervised classifiers and unsupervised clustering to design EC2. Experimental results on both the DREBIN and the more recent Koodous malware datasets show that EC2 accurately detects both small and large families, outperforming several comparative baselines. Furthermore, we show how to automatically characterize and explain unique behaviors of specific malware families, such as FakeInstaller, MobileTx, Geinimi. In short, EC2 presents an early warning system for emerging new malware families, as well as a robust predictor of the family (when it is not new) to which a new malware sample belongs, and the design of novel strategies for data-driven understanding of malware behaviors\n","tags":[],"title":"EC2: Ensemble Clustering and Classification for Predicting Android Malware Families","type":"publication"},{"authors":["Sushil Jajodia","Noseong Park","Fabio Pierazzi","Andrea Pugliese","Edoardo Serra","Gerardo I. Simari","V.S. Subrahmanian"],"categories":null,"content":"","date":1491001200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1491001200,"objectID":"4ac5053766f515179d3546b6431f6c41","permalink":"/publication/tifs17/","publishdate":"2017-04-01T00:00:00+01:00","relpermalink":"/publication/tifs17/","section":"publication","summary":"Malicious attackers often scan nodes in a network in order to identify vulnerabilities that they may exploit as they traverse the network. In this paper, we propose that the system generate a mix of true and false answers in response to scan requests. If the attacker believes that all scan results are true, then he will be on a wrong path. If he believes some scan results are faked, he would have to expend time and effort in order to separate fact from fiction. 
We propose a Probabilistic Logic of Deception (PLD-Logic) and show that various computations are NP-hard. We model the attacker’s state and show the effects of faked scan results. We then show how the defender can generate fake scan results in different states that minimize the damage that the attacker can produce. We develop a Naive-PLD algorithm and a Fast-PLD heuristic algorithm for the defender to use and show experimentally that the latter performs well in a fraction of the run-time of the former. We ran detailed experiments to assess the performance of these algorithms and further show that by running Fast-PLD offline and storing the results, we can very efficiently answer run-time scan requests.\n","tags":[],"title":"A Probabilistic Logic of Cyber Deception","type":"publication"},{"authors":["Fabio Pierazzi"],"categories":null,"content":"","date":1490569200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1490569200,"objectID":"6aad632eb6eb4d81dc799d4ff1bf17a6","permalink":"/publication/phd/","publishdate":"2017-03-27T00:00:00+01:00","relpermalink":"/publication/phd/","section":"publication","summary":"","tags":[],"title":"Security analytics for prevention and detection of advanced cyberattacks","type":"publication"},{"authors":["Mirco Marchetti","Fabio Pierazzi","Michele Colajanni","Alessandro Guido"],"categories":null,"content":"","date":1465945200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1465945200,"objectID":"09a894354ac1a9532736731f74cdae59","permalink":"/publication/comnet16/","publishdate":"2016-06-15T00:00:00+01:00","relpermalink":"/publication/comnet16/","section":"publication","summary":"Several advanced cyber attacks adopt the technique of “pivoting” through which attackers create a command propagation\ntunnel through two or more hosts in order to reach their final target. 
Identifying such malicious activities is one of the most tough\nresearch problems because of several challenges: command propagation is a rare event that cannot be detected through signatures,\nthe huge amount of internal communications facilitates attackers evasion, timely pivoting discovery is computationally demanding. This\npaper describes the first pivoting detection algorithm that is based on network flows analyses, does not rely on any a-priori assumption\non protocols and hosts, and leverages an original problem formalization in terms of temporal graph analytics. We also introduce a\nprioritization algorithm that ranks the detected paths on the basis of a threat score thus letting security analysts investigate just the\nmost suspicious pivoting tunnels. Feasibility and effectiveness of our proposal are assessed through a broad set of experiments that\ndemonstrate its higher accuracy and performance against related algorithms.\n","tags":[],"title":"Analysis of High Volumes of Network Traffic for Advanced Persistent Threat Detection","type":"publication"},{"authors":["Mirco Marchetti","Fabio Pierazzi","Alessandro Guido","Michele Colajanni"],"categories":null,"content":"","date":1464822000,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1464822000,"objectID":"98ba1bc2b10aa2fb50f2870337e6481d","permalink":"/publication/cycon16/","publishdate":"2016-06-02T00:00:00+01:00","relpermalink":"/publication/cycon16/","section":"publication","summary":"Advanced Persistent Threats (APTs) represent the most challenging threats to the security and safety of the cyber landscape. APTs are human-driven attacks backed by complex strategies that combine multidisciplinary skills in information technology, intelligence, and psychology. Defending large organisations with tens of thousands of hosts requires similar multi-factor approaches. 
We propose a novel framework that combines different techniques based on big data analytics and security intelligence to support human analysts in prioritising the hosts that are most likely to be compromised. We show that the collection and integration of internal and external indicators represents a step forward with respect to the state of the art in the field of early detection and mitigation of APT activities.\n","tags":[],"title":"Countering Advanced Persistent Threats through security intelligence and big data analytics","type":"publication"},{"authors":["Fabio Pierazzi","Sara Casolari","Michele Colajanni","Mirco Marchetti"],"categories":null,"content":"","date":1454284800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1454284800,"objectID":"f0c4d825bf41be773bdcc3926d486db7","permalink":"/publication/cose16/","publishdate":"2016-02-01T00:00:00Z","relpermalink":"/publication/cose16/","section":"publication","summary":"The huge number of alerts generated by network-based defense systems prevents detailed manual inspections of security events. Existing proposals for automatic alerts analysis work well in relatively stable and homogeneous environments, but in modern networks, that are characterized by extremely complex and dynamic behaviors, understanding which approaches can be effective requires exploratory data analysis and descriptive modeling. We propose a novel framework for automatically investigating temporal trends and patterns of security alerts with the goal of understanding whether and which anomaly detection approaches can be adopted for identifying relevant security events. 
Several examples referring to a large real network show that, despite the high intrinsic dynamism of the system, the proposed framework is able to extract relevant descriptive statistics that make it possible to determine the effectiveness of popular anomaly detection approaches on different alert groups.\n","tags":[],"title":"Exploratory Security Analytics for Anomaly Detection","type":"publication"},{"authors":["Fabio Pierazzi","Andrea Balboni","Alessandro Guido","Mirco Marchetti"],"categories":null,"content":"","date":1434322800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1434322800,"objectID":"c64324b67aaa9720b9031546bdbcd34f","permalink":"/publication/ncca/","publishdate":"2015-06-15T00:00:00+01:00","relpermalink":"/publication/ncca/","section":"publication","summary":"The cloud computing paradigm has become very popular, and its adoption is constantly increasing. Hence, network activities and security alerts related to cloud services are also increasing, and are likely to become even more relevant in the upcoming years. In this paper, we propose the first characterization of real security alerts related to cloud activities and generated by a network sensor at the edge of a large network environment over several months. Results show that the characteristics of cloud security alerts differ from those that are not related to cloud activities. Moreover, alerts related to different cloud providers exhibit peculiar and different behaviors that can be identified through temporal analyses. 
The methods and results proposed in this paper are useful as a basis for the design of novel algorithms for the automatic analysis of cloud security alerts, which can be aimed at forecasting, prioritization, anomaly and state-change detection.\n","tags":[],"title":"The Network Perspective of Cloud Security","type":"publication"},{"authors":["Luca Ferretti","Fabio Pierazzi","Michele Colajanni","Mirco Marchetti"],"categories":null,"content":"","date":1417392000,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1417392000,"objectID":"d24a3c5cc76d8a26ccd390b2c41c50f0","permalink":"/publication/tcc_mutedb/","publishdate":"2014-12-01T00:00:00Z","relpermalink":"/publication/tcc_mutedb/","section":"publication","summary":"The success of the cloud database paradigm is strictly related to strong guarantees in terms of service availability, scalability and security, but also of data confidentiality. Any cloud provider assures the security and availability of its platform, while the implementation of scalable solutions to guarantee confidentiality of the information stored in cloud databases is an open problem left to the tenant. Existing solutions address some preliminary issues through SQL operations on encrypted data. We propose the first complete architecture that combines data encryption, key management, authentication and authorization solutions, and that addresses the issues related to typical threat scenarios for cloud database services. Formal models describe the proposed solutions for enforcing access control and for guaranteeing confidentiality of data and metadata. 
Experimental evaluations based on standard benchmarks and real Internet scenarios show that the proposed architecture also satisfies scalability and performance requirements.\n","tags":[],"title":"Scalable Architecture for Multi-User Encrypted SQL Operations on Cloud Database Services","type":"publication"},{"authors":["Marcello Missiroli","Fabio Pierazzi","Michele Colajanni"],"categories":null,"content":"","date":1405810800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1405810800,"objectID":"3a541c0cabcea4864810a01380a0f1bc","permalink":"/publication/lsauc/","publishdate":"2014-07-20T00:00:00+01:00","relpermalink":"/publication/lsauc/","section":"publication","summary":"Location-based services relying on in-vehicle devices are becoming so common that it is likely that, in the near future, devices of some sort will be installed on new vehicles by default. The pressure for a rapid adoption of these devices and services is not yet counterbalanced by an adequate awareness of system security and data privacy issues. For example, service providers might collect, process and sell data belonging to cars, drivers and locations to a plethora of organizations that may be interested in acquiring such personal information. We propose a comprehensive scenario describing the entire process of data gathering, management and transmission related to in-vehicle devices, and for each phase we point out the most critical security and privacy threats. 
By referring to this scenario, we can outline issues and challenges that should be addressed by the academic and industry communities for the proper adoption of in-vehicle devices and related services.\n","tags":[],"title":"Security and privacy of location-based services for in-vehicle device systems","type":"publication"},{"authors":null,"categories":null,"content":"","date":1403611426,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1403611426,"objectID":"d24da91f3533ebb8160986467de122fb","permalink":"/project/confidentiality/","publishdate":"2014-06-24T13:03:46+01:00","relpermalink":"/project/confidentiality/","section":"project","summary":"","tags":[],"title":"Confidentiality Architectures","type":"project"},{"authors":["Luca Ferretti","Fabio Pierazzi","Michele Colajanni","Mirco Marchetti","Marcello Missiroli"],"categories":null,"content":"","date":1403218800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1403218800,"objectID":"4cdb7bf89cc03e0324c1ffaf605e965f","permalink":"/publication/iscc/","publishdate":"2014-06-20T00:00:00+01:00","relpermalink":"/publication/iscc/","section":"publication","summary":"Cloud services represent an unprecedented opportunity, but their adoption is hindered by confidentiality and integrity issues related to the risks of outsourcing private data to cloud providers. This paper focuses on integrity and proposes an innovative solution that allows cloud tenants to detect unauthorized modifications to outsourced data while minimizing storage and network overheads. Our approach is based on encrypted Bloom filters, and is designed to allow efficient integrity verification for databases stored in the cloud. 
We assess the effectiveness of the proposal as well as its performance improvements with respect to existing solutions by evaluating storage and network costs.\n","tags":[],"title":"Efficient detection of unauthorized data modification in cloud databases","type":"publication"},{"authors":["Luca Ferretti","Fabio Pierazzi","Michele Colajanni","Mirco Marchetti"],"categories":null,"content":"","date":1396306800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1396306800,"objectID":"bfc30e5083c1777ed7cc8b7f4b483739","permalink":"/publication/tcc_onion/","publishdate":"2014-04-01T00:00:00+01:00","relpermalink":"/publication/tcc_onion/","section":"publication","summary":"The cloud database as a service is a novel paradigm that can support several Internet-based applications, but its adoption requires solving information confidentiality problems. We propose a novel architecture for adaptive encryption of public cloud databases that offers an interesting alternative to the tradeoff between the required data confidentiality level and the flexibility of the cloud database structures at design time. We demonstrate the feasibility and performance of the proposed solution through a software prototype. 
Moreover, we propose an original cost model that is aimed at evaluating cloud database services in plain and encrypted instances and that takes into account the variability of cloud prices and tenant workloads over a medium-term period.\n","tags":[],"title":"Performance and Cost Evaluation of an Adaptive Encryption Architecture for Cloud Databases","type":"publication"},{"authors":["Luca Ferretti","Fabio Pierazzi","Michele Colajanni","Mirco Marchetti"],"categories":null,"content":"","date":1377298800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1377298800,"objectID":"4d8752ad63900f40bab8741bac5b46f5","permalink":"/publication/securware/","publishdate":"2013-08-24T00:00:00+01:00","relpermalink":"/publication/securware/","section":"publication","summary":"The users' perception that the confidentiality of their data is endangered by internal and external attacks is limiting the diffusion of public cloud database services. In this context, the use of cryptography is complicated by high computational costs and restrictions on supported SQL operations over encrypted data. In this paper, we propose an architecture that takes advantage of adaptive encryption mechanisms to guarantee at runtime the best level of data confidentiality for any type of SQL operation. We demonstrate through a large set of experiments that these encryption schemes represent a feasible solution for achieving data confidentiality in public cloud databases, even from a performance point of view.\n","tags":[],"title":"Security and confidentiality solutions for public cloud database services","type":"publication"}]