The Debsources Dataset
======================

The Debsources Dataset is the database underpinning the most notable instance
of the Debsources platform.

Debsources
----------

[Debsources][1] is a platform that provides Web access to the source code of
the [Debian][2] operating system. Debsources lets users browse Debian source
packages and renders the source code files they contain on the Web. It indexes
all Debian source code and supports searching it by various means (defined
symbols, checksums, regular expressions, etc.).

[1]: http://sources.debian.net
[2]: http://www.debian.org

A notable live instance of Debsources is available at
<http://sources.debian.net>, providing access to both current and historical
Debian releases dating back to 1998.

Debsources is Free Software, distributed under the terms of the GNU Affero
General Public License.

Dataset
-------

The Debsources Dataset is the database underpinning the Debsources instance
running at <http://sources.debian.net>.

The dataset contains both Debian metadata (e.g., which software packages are
available in which release, which source code files belong to which package,
release dates, etc.) and source code information obtained by running popular
indexing and measurement tools on Debian source packages. In particular, the
source code of all available packages has been subject to:

- SHA256 checksum computation on each source file
- [ctags][3] indexing
- [sloccount][4] measurement
- disk usage measurement

[3]: http://ctags.sourceforge.net/
[4]: http://www.dwheeler.com/sloccount/

Please note that the actual source code of the packages (about 700 GB in
total) is not included in the dataset; only metadata and derived information
are.

The Debsources Dataset is made available under the terms of the Creative
Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA); see
the file LICENSE for more information.

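
As an illustration of the first processing step listed above (per-file SHA256
checksums), the following is a minimal sketch of how comparable checksums could
be computed locally, for instance before looking a file up in the dataset. It
is not the code Debsources itself runs; it only assumes the `sha256sum` tool
from GNU coreutils, and the function names are hypothetical.

```shell
# Hypothetical helpers, not part of Debsources.

# Print the SHA256 checksum of a single file, without the file name.
file_sha256() {
    sha256sum "$1" | cut -d ' ' -f 1
}

# Print "<checksum>  <path>" for every regular file below a directory,
# e.g. an unpacked Debian source package.
checksum_tree() {
    find "$1" -type f -exec sha256sum {} +
}
```

For example, `file_sha256 path/to/file.c` prints a single hexadecimal digest,
while `checksum_tree path/to/unpacked-package` prints one line per file.
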
Version
-------

The Debsources Dataset is distributed as a database snapshot taken at a
specific point in time on the machine hosting <http://sources.debian.net>. The
timestamp of the snapshot is encoded in the file name, as the number of seconds
since the UNIX epoch. For example, the dataset distributed as
`debsources.1423576120.xz` corresponds to a database snapshot taken on:

    $ date -R -u -d @1423576120
    Tue, 10 Feb 2015 13:48:40 +0000

How to use
----------

The Debsources Dataset comes as a textual dump of a [PostgreSQL][5] database,
compressed with xz. The dump was produced with Postgres 9.3, but it should be
compatible with any version of Postgres >= 9.1.

[5]: http://www.postgresql.org/

To use the dataset, first install Postgres, then create a dedicated database,
and finally import the dataset into it. For the last two steps you can proceed
as follows, acting as a user with suitable Postgres permissions:

    $ createdb debsources
    $ xzcat debsources.1423576120.xz | psql debsources

On a modern high-end laptop equipped with a fast SSD, the import takes about
3.5 hours. The freshly imported database takes about 80 GB of disk space, 50 GB
of which are used by indexes.

Database schema
---------------

A database schema, obtained by running [postgresql-autodoc][6] on a freshly
restored dataset, is available in the files dbschema.html and dbschema.pdf.

[6]: http://autodoc.projects.pgfoundry.org/

Table sizes
-----------

To give an idea of the breadth of the dataset, here are the sizes of some of
its tables:

- checksums: ~37 M (million) tuples
- ctags: ~370 M
- files: ~37 M
- package_names: ~30 K (thousand) tuples
- packages: ~87 K
- sloccounts: ~310 K
- suites: ~123 K
- suites_info: 18 tuples (the number of indexed Debian releases)

Tips & tricks
-------------

- Given the Debian-specific [semantics][7] of package version ordering, SQL
  queries on the Debsources Dataset that sort by version might benefit from
  the [debversion][8] Postgres extension.

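
As a further tip, the snapshot date can be recovered from a dataset file name
in a script, following the naming convention described in the Version section.
The sketch below assumes file names of the form `debsources.<seconds>.xz` and
GNU `date` (for the `-d @<seconds>` syntax); the function name `snapshot_date`
is hypothetical.

```shell
# Hypothetical helper: print the snapshot date encoded in a dataset file name.
# Assumes names of the form debsources.<seconds-since-epoch>.xz and GNU date.
snapshot_date() {
    name=$(basename "$1")
    ts=${name#debsources.}   # strip the leading "debsources."
    ts=${ts%.xz}             # strip the trailing ".xz"
    case "$ts" in
        ''|*[!0-9]*)
            echo "unexpected file name: $name" >&2
            return 1
            ;;
    esac
    date -R -u -d "@$ts"
}

snapshot_date debsources.1423576120.xz
# prints: Tue, 10 Feb 2015 13:48:40 +0000
```

The function rejects names whose middle component is not purely numeric, so a
renamed or unrelated file yields an error instead of a misleading date.
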
[7]: https://www.debian.org/doc/debian-policy/ch-controlfields.html#s-f-Version
[8]: https://tracker.debian.org/pkg/postgresql-debversion

References
==========

* Matthieu Caneill, Stefano Zacchiroli. Debsources: Live and Historical Views
  on Macro-Level Software Evolution. In Proceedings of ESEM 2014: 8th
  International Symposium on Empirical Software Engineering and Measurement,
  September 18-19, 2014, Torino, Italy. ISBN 978-1-4503-2774-9, ACM 2014.
  DOI 10.1145/2652524.2652528