This modern fork requires Python 3.12 or higher. You can install the dependencies using either pip or conda.
This is the recommended approach to ensure you get the exact frozen versions that are guaranteed to work.
# 1. Create a virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
# 2. Install the frozen dependencies
pip install -r requirements.frozen.txtIf you prefer Anaconda or Miniconda, an environment.yml file is provided.
# 1. Create the environment
conda env create -f environment.yml
# 2. Activate the environment
conda activate CVEfixesBecause of limitations in GitHub storage, we provide a compressed SQL dump of the CVEfixes vulnerability dataset via Zenodo with DOI: 10.5281/zenodo.4476563. The following script can be used to convert this compressed SQL dump into an SQLite3 database:
$ sh Code/create_CVEfixes_from_dump.shIt is also possible to do this by hand (Windows users can use
sqlite3.exe instead of sqlite3)
$ gzcat Data/CVEfixes.sql.gz | sqlite3 Data/CVEfixes.dbThe overall structure of the database is as shown in ER diagram. You can use any SQLite tool to view and query the database. DB Browser for SQLite is an open source visual explorer for SQLite databases that is available for Mac, Windows and Linux.
We provide a Jupyter Notebook with example queries to extract the data at different abstraction levels, code to generate statistics, as well as the code to replicate all tables and plots presented in the paper based on the CVEfixes data.
Some example queries to extract the part of CVEfixes database are as follows:
- a query to extract all the method_level vulnerability data of C programming language.
SQL_QUERY = "SELECT m.method_change_id, m.name, m.code, m.before_change, f.programming_language
from file_change f, method_change m
WHERE m.file_change_id=f.file_change_id
AND f.programming_language='C';"- another example as a query to extract all the code patches of Java programming language that have added/removed only a single statement to fix vulnerabilities.
SQL_QUERY = "SELECT cv.cve_id, f.filename, f.num_lines_added, f.num_lines_deleted, f.code_before, f.code_after, cc.cwe_id
FROM file_change f, commits c, fixes fx, cve cv, cwe_classification cc
WHERE f.hash = c.hash
AND c.hash = fx.hash
AND fx.cve_id = cv.cve_id
AND cv.cve_id = cc.cve_id
AND f.num_lines_added<=1
AND f.num_lines_deleted<=1
AND f.programming_language='Java';;"Create a .CVEfixes.ini configuration file in the project root
(where this INSTALL.md file lives) defines the following variables
to customize paths in the [CVEfixes] section:
-
database_path: directory that should contain the CVEfixes database file (and some temporary files during extraction). -
sample_limit: The number of samples to be extracted, sample_limit = 0 is interpreted as unlimited samples, this is discussed in more detail below.
The repository contains a file example.CVEfixes.ini.
We observe that more than 98% repositories of CVEfixes are hosted on
GitHub. The GitHub API
allows us to retrieve meta information of public repositories. Without
configuration of a token, this API is rate limited and only allows
gathering the information for approximately 25 repositories. Thus, we
recommend the user to obtain a GitHub token and configure a [GitHub]
section of the .CVEfixes.ini configuration file which has two
variables user and token for your GitHub username and access token
respectively. Change the values of these variables to set up the token.
It is recommended to not disclose the token information to prevent its
misuse.
The following script recreates the CVEfixes database file from scratch
up for a time-frame from 2002 up to the latest available CVE records. Be
warned that depending on available internet speed and processing power,
the process can take quite some time (with a modern laptop, around 8-10
hours on T1, and up to 3 days on a simple home connection). Note that
running this script will overwrite an existing CVEfixes.db file in
database_path, but that it is possible to restore the database from
the SQL dump on Zenodo, as mentioned above. Before running the
extraction, see also the above remark on setting up a GitHub
token.
$ sh Code/create_CVEfixes_from_scratch.shAs mentioned, the complete extraction of the CVEfixes dataset could
take up to 3 days depending on the internet connection. To enable simple
demonstrations, the .CVEfixes.ini configuration file has a
sample_limit variable that is used to define the number of commits for
CVEs to collect. Setting sample_limit = 0 is interpreted as unlimited,
i.e. collect all available data. When the sample_limit is set to a
value different than 0, only that many commits are collected from CVEs
for the current year. We tested sample_limit = 25, which took
approximately 12 minutes to gather. However, this can take much longer
if any of the CVEs corresponds to major projects. Therefore, whenever
sample_limit is set to a value different than 0, we filter out five
major projects to minimize the time to collect a sample.