6 changes: 6 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
disorder.o
md5.o
tmp.foo
wikiq
wikiq.o

3 changes: 1 addition & 2 deletions Makefile
Original file line number Diff line number Diff line change
@@ -1,13 +1,12 @@
CXXFLAGS = -O3
CFLAGS = $(CXXFLAGS)
OBJECTS = wikiq.o md5.o disorder.o
OBJECTS = wikiq.o md5.o

all: wikiq

wikiq: $(OBJECTS)
$(CXX) $(CXXFLAGS) $(OBJECTS) -lpcrecpp -lpcre -lexpat -o wikiq

disorder.o: disorder.h
md5.o: md5.h

clean:
46 changes: 0 additions & 46 deletions README

This file was deleted.

45 changes: 45 additions & 0 deletions README.OLD
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
wikiq: a simple and fast stream-based MediaWiki XML dump parser

authors: Erik Garrison <erik@hypervolu.me>
Benjamin Mako Hill <mako@atdot.cc>

overview:

wikiq is written in C++ using expat. It is designed to enable
researchers to rapidly extract revision histories (minus text and
comments) from large XML datasets.

use:

To use, first make sure you have libexpat and libpcrecpp installed (e.g.
via packages libexpat1 and libpcre3-dev on Debian or Ubuntu), then:

% make
% ./wikiq -h # prints usage
% 7za e -so hugewikidatadump.xml | ./wikiq >hugewikidatadump.tsv


features:

In addition to parsing MediaWiki XML data dumps into a tab-separated
tabular format, wikiq can match Perl-compatible regular expressions
against revision content, can extract article diffs, and can match
regexes against the additions and deletions between revisions. Any
number of regular expressions may be supplied on the command line, and
may be tagged using the '-n' and '-N' options.

MD5 checksums of revisions are used at runtime.

output:

wikiq generates these fields for each revision:

title, articleid, revid, timestamp, anon, editor, editorid, minor,
text_length, text_md5, reversion, additions_size, deletions_size
.... and additional fields for each regex executed against content or
added/deleted diffs

Boolean fields are TRUE/FALSE except in the case of reversion, which is blank
unless the revision is a revert to a previous revision, in which case it
contains the revision ID of the revision that was reverted to.
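
As a rough illustration of the output format described above, the following
Python sketch reads wikiq's tab-separated output and converts the TRUE/FALSE
fields and the blank-or-ID reversion field into typed values. It assumes the
output has no header line and that the columns appear in the order listed
above; the function names and the sample row are hypothetical and not part of
wikiq.

```python
import csv
import io

# Column names in the order the README lists them. Columns for any
# regexes would follow these (assumption: no header line in the output).
FIELDS = [
    "title", "articleid", "revid", "timestamp", "anon", "editor",
    "editorid", "minor", "text_length", "text_md5", "reversion",
    "additions_size", "deletions_size",
]


def parse_row(row):
    """Convert one raw row (a dict of strings) into typed values."""
    out = dict(row)
    # Boolean fields are written as the literal strings TRUE or FALSE.
    for key in ("anon", "minor"):
        out[key] = row[key] == "TRUE"
    # reversion is blank unless the revision is a revert; otherwise it
    # holds the ID of the revision that was reverted to.
    out["reversion"] = row["reversion"] or None
    return out


def read_wikiq(stream, fieldnames=FIELDS):
    """Read wikiq-style TSV rows from a text stream into a list of dicts."""
    reader = csv.DictReader(
        stream, fieldnames=fieldnames, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [parse_row(r) for r in reader]


# Hypothetical sample row, made up for illustration only.
sample = "Foo\t1\t10\t2011-01-01T00:00:00Z\tFALSE\tAlice\t42\tTRUE\t120\tabc\t\t5\t0\n"
rows = read_wikiq(io.StringIO(sample))
```

Keeping reversion as `None` for non-reverts makes it easy to filter reverts
with a simple truthiness check.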

22 changes: 22 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
The C++ version of `wikiq` in this repository has not been updated since ~2011
and has a number of critical limitations. The repository is being kept here for
historical and archival purposes. Please don't rely on it!

**An improved version of a very similar stream-based XML parser for MediaWiki by
the same authors can be found here:**

> **[https://code.communitydata.cc/mediawiki\_dump\_tools.git](https://code.communitydata.cc/mediawiki_dump_tools.git)**

These new tools are maintained by some of the same authors (now based in the
[Community Data Science Collective](https://communitydata.cc)) and the new tool
relies on many of the same libraries including the `expat` non-validating XML
parser.

This new version has a very similar interface, is written in Python, and
leverages [Python MediaWiki Utilities](https://github.com/mediawiki-utilities)
for XML dump parsing and several other tasks. The two tools have been
benchmarked and the new tool's performance measures are generally within 90% of
the C++ version of the tool in this repository.

>> —[Benjamin Mako Hill](https://mako.cc/)

192 changes: 0 additions & 192 deletions disorder.c

This file was deleted.
