How do you "cut" a wikipedia database dump with dd?

7 replies [Last post]
Other_Cody
Offline
Joined: 12/20/2023

The "How to use multistream?" section at

https://en.wikipedia.org/wiki/Wikipedia_talk:Database_download#How_to_use_multistream?

says:
" For multistream, you can get an index file, pages-articles-multistream-index.txt.bz2. The first field of this index is the number of bytes to seek into the compressed archive pages-articles-multistream.xml.bz2, the second is the article ID, the third the article title.

Cut a small part out of the archive with dd using the byte offset as found in the index. You could then either bzip2 decompress it or use bzip2recover, and search the first file for the article ID.

See https://docs.python.org/3/library/bz2.html#bz2.BZ2Decompressor for info about such multistream files and about how to decompress them with python; see also https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dumps/+/ariel/toys/bz2multistream/README.txt and related files for an old working toy. "

I have the index and the multistream, and I can make a live usb flash drive with https://trisquel.info/en/wiki/how-create-liveusb

lsblk

umount /dev/sdX*

sudo dd if=/path/to/image.iso of=/dev/sdX bs=8M;sync

, but I do not know how to use dd well enough to

"Cut a small part out of the archive with dd using the byte offset as found in the index."

and then

"You could then either bzip2 decompress it or use bzip2recover, and search the first file for the article ID."

Is there any video or more information on Wikipedia about how to do this, so I can read Wikipedia pages, or at least their text, off-line?

Thank you for your time.

Other Cody (talk) 22:46, 4 December 2023 (UTC)

There is no answer there, though, as to how to

"Cut a small part out of the archive with dd using the byte offset as found in the index."

or about

"You could then either bzip2 decompress it or use bzip2recover, and search the first file for the article ID. "

so does anyone here know how to do that with dd, if it can still be done?

https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia

I have both the

enwiki-20211020-pages-articles-multistream.xml.bz2

and the

enwiki-20211020-pages-articles-multistream-index.txt.bz2

files.

And both the

enwiki-20211020-pages-articles-multistream.xml.bz2.torrent

and the

enwiki-20211020-pages-articles-multistream-index.txt.bz2.torrent

files.

https://www.litika.com/torrents/enwiki-20211020-pages-articles-multistream.xml.bz2.torrent

https://www.litika.com/torrents/enwiki-20211020-pages-articles-multistream-index.txt.bz2.torrent

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

info dd says:
‘if=FILE’
Read from FILE instead of standard input.
(...)
‘skip=N’
Skip N ‘ibs’-byte blocks in the input file before copying. If
‘iflag=skip_bytes’ is specified, N is interpreted as a byte count
rather than a block count.
(...)
‘count=N’
Copy N ‘ibs’-byte blocks from the input file, instead of everything
until the end of the file. If ‘iflag=count_bytes’ is specified, N
is interpreted as a byte count rather than a block count.

Read https://alchemy.pub/wikipedia and you should understand which numbers (consecutive values in the first column of the index) should be substituted for M and N in the command below:
$ dd if=enwiki-20211020-pages-articles-multistream.xml.bz2 iflag=skip_bytes,count_bytes skip=M count=$(expr N - M) | bunzip2
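
For reference, here is a minimal Python sketch of the same operation, using the bz2 module that the quoted section points to. The file name is the one from this thread; 597 and 666522 are only example values for M and N, which must be two consecutive values from the first column of the index:

import bz2

# Example values only: M and N must be two consecutive offsets
# from the first column of the decompressed index.
M = 597
N = 666522

with open("enwiki-20211020-pages-articles-multistream.xml.bz2", "rb") as dump:
    dump.seek(M)                # equivalent to dd's skip=M with iflag=skip_bytes
    stream = dump.read(N - M)   # equivalent to dd's count=N-M with iflag=count_bytes

xml_text = bz2.decompress(stream).decode("utf-8")  # equivalent to piping into bunzip2
print(xml_text[:500])           # show the beginning of the decompressed XML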

If that works, please improve the (indeed unclear) section "How to use multistream?" of https://en.wikipedia.org/wiki/Wikipedia:Database_download

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

Does it work?

Other_Cody
Offline
Joined: 12/20/2023

I used Engrampa Archive Manager to unzip/decompress the enwiki-20211020-pages-articles-multistream-index.txt.bz2

file to

enwiki-20211020-pages-articles-multistream-index.txt

then I could use the program less to see

597:10:AccessibleComputing
597:12:Anarchism

and more lines. I then tried pasting the Python code below into the terminal, but the

import csv

line may have just changed the mouse pointer:

import csv

def search_index(search_term, index_filename):
    # The index has lines like "offset:article_id:title".
    # Return the byte offset of the bz2 stream containing search_term
    # and the length of that stream (distance to the next stream's offset).
    byte_flag = False
    data_length = start_byte = 0
    index_file = open(index_filename, 'r')
    csv_reader = csv.reader(index_file, delimiter=':')
    for line in csv_reader:
        if not byte_flag and search_term == line[2]:
            start_byte = int(line[0])
            byte_flag = True
        elif byte_flag and int(line[0]) != start_byte:
            data_length = int(line[0]) - start_byte
            break
    index_file.close()
    return start_byte, data_length
import-im6.q16: attempt to perform an operation not allowed by the security policy `PS' @ error/constitute.c/IsCoderAuthorized/421.
bash: syntax error near unexpected token `('
Could not find command-not-found database. Run 'sudo apt update' to populate it.
byte_flag: command not found
Could not find command-not-found database. Run 'sudo apt update' to populate it.
data_length: command not found
bash: syntax error near unexpected token `('
bash: syntax error near unexpected token `('
bash: syntax error near unexpected token `if'
bash: syntax error near unexpected token `('
Could not find command-not-found database. Run 'sudo apt update' to populate it.
byte_flag: command not found
bash: syntax error near unexpected token `elif'
bash: syntax error near unexpected token `('
bash: break: only meaningful in a `for', `while', or `until' loop
bash: syntax error near unexpected token `return'

After some time the pointer turned into a cross-hair-like thing; after I clicked, I then saw a blank csv file.

Then I had problems with

dd if=enwiki-20211020-pages-articles-multistream.xml.bz2 iflag=skip_bytes,count_bytes skip=M count=$(expr N - M) | bunzip2

as I may not have had

csv.reader()

working to see how many bytes are needed.

dd if=enwiki-20211020-pages-articles-multistream.xml.bz2 iflag=skip_bytes,count_bytes skip=10 count=$(expr 597 - 10) | bunzip2
1+1 records in
1+1 records out
bunzip2: (stdin) is not a bzip2 file.
587 bytes copied, 0.00438569 s, 134 kB/s

I copied the Wikipedia compressed file to another device before putting it on the computer I'm testing this on.

Though I also think I may not be doing this correctly.

I'm testing these 2 files on my computer now, not another device.

I could uncompress the smaller one with the Engrampa Archive Manager, though I have not yet found out how to "cut" into the larger one.

dd if=enwiki-20211020-pages-articles-multistream.xml.bz2 iflag=skip_bytes,count_bytes skip=597 count=$(expr 10 - 597) | bunzip2
dd: invalid number: ‘-587’

bunzip2: Compressed file ends unexpectedly;
perhaps it is corrupted? *Possible* reason follows.
bunzip2: Inappropriate ioctl for device
Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

I got odd text in the terminal after this command; I am likely doing something wrong:

dd if=enwiki-20211020-pages-articles-multistream.xml.bz2 iflag=skip_bytes,count_bytes skip=10 count=$(expr 5970 - 10)

Other_Cody
Offline
Joined: 12/20/2023

bzip2recover enwiki-20211020-pages-articles-multistream.xml.bz2
bzip2recover 1.0.8: extracts blocks from damaged .bz2 files.
bzip2recover: searching for block boundaries ...
block 1 runs from 80 to 4691
block 2 runs from 4856 to 1941148
block 3 runs from 1941197 to 3873636
block 4 runs from 3873685 to 5332089
block 5 runs from 5332256 to 7335497
block 6 runs from 7335546 to 9290647
block 7 runs from 9290696 to 11273049
block 8 runs from 11273098 to 13250379
block 9 runs from 13250428 to 15161641
block 10 runs from 15161690 to 16438240
block 11 runs from 16438408 to 18393502
block 12 runs from 18393551 to 20376720
block 13 runs from 20376769 to 22373708
block 14 runs from 22373757 to 24300728
block 15 runs from 24300777 to 26358314
block 16 runs from 26358363 to 28124433
block 17 runs from 28124600 to 30169218
block 18 runs from 30169267 to 32106901
block 19 runs from 32106950 to 34082802
block 20 runs from 34082851 to 35672702
block 21 runs from 35672864 to 37504204
block 22 runs from 37504253 to 39470534
block 23 runs from 39470583 to 41513427
block 24 runs from 41513476 to 43492474
block 25 runs from 43492523 to 44586453
block 26 runs from 44586616 to 46536260
block 27 runs from 46536309 to 48574914
block 28 runs from 48574963 to 50633049
block 29 runs from 50633098 to 51966810
block 30 runs from 51966976 to 54047569
block 31 runs from 54047618 to 56038311
block 32 runs from 56038360 to 58025409
block 33 runs from 58025458 to 59155797
block 34 runs from 59155960 to 61214301
block 35 runs from 61214350 to 63159390
block 36 runs from 63159439 to 65216452
block 37 runs from 65216501 to 65456085
block 38 runs from 65456248 to 67500889
block 39 runs from 67500938 to 69513634
block 40 runs from 69513683 to 70390493
block 41 runs from 70390656 to 72328684
block 42 runs from 72328733 to 74259102
block 43 runs from 74259151 to 76281464
block 44 runs from 76281513 to 78250331
block 45 runs from 78250380 to 78933702
block 46 runs from 78933864 to 80759485
block 47 runs from 80759534 to 82706968
block 48 runs from 82707017 to 84715469
^C

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

As I wrote, N and M, in my command line (which you can directly execute in a terminal; it is not Python code), should be "consecutive, in the *first* column of the index". Here you replaced N and M with numbers from the first and *second* columns. So, assuming you want the page on AccessibleComputing, M would be 597 and N would be the next different (and necessarily greater) number in the *first* column. It may be up to 100 lines below.

Indeed, "the multistream dump file contains multiple bz2 'streams' (bz2 header, body, footer) concatenated together into one file" and "each separate 'stream' (or really, file) in the multistream dump contains 100 pages, except possibly the last one", all according to https://en.wikipedia.org/wiki/Wikipedia:Database_download (which may feel a little cryptic, but, as I wrote, https://alchemy.pub/wikipedia#c is clearer).

We are here focusing on decompressing the 100 pages including the relevant one. To get rid of the 99 other pages, we can then use the second column of the index. It is not a number of bytes. Indeed, https://en.wikipedia.org/wiki/Wikipedia:Database_download specifies that "the second is the article ID". It certainly occurs in the uncompressed XML.
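
To make the "consecutive offsets in the first column" rule concrete, here is a small Python sketch that reads the compressed index, finds the stream containing a given title, and prints a ready-to-run dd command. The file names and the title are only examples from this thread, and the function name is made up for illustration:

import bz2

def stream_bounds(index_path, title):
    # Return (M, N): the first-column offset of the stream containing `title`
    # and the next distinct first-column offset (the start of the next stream).
    start = end = None
    with bz2.open(index_path, "rt", encoding="utf-8") as index:
        for line in index:
            offset, _, rest = line.partition(":")
            _, _, name = rest.partition(":")      # titles may themselves contain colons
            offset = int(offset)
            if start is None:
                if name.rstrip("\n") == title:
                    start = offset
            elif offset != start:                 # next different offset, up to 100 lines later
                end = offset
                break
    return start, end

M, N = stream_bounds("enwiki-20211020-pages-articles-multistream-index.txt.bz2",
                     "AccessibleComputing")
print("dd if=enwiki-20211020-pages-articles-multistream.xml.bz2 "
      "iflag=skip_bytes,count_bytes skip=%d count=%d | bunzip2" % (M, N - M))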

Other_Cody
Offline
Joined: 12/20/2023

Thank you, Magic Banana, something like

dd if=enwiki-20211020-pages-articles-multistream.xml.bz2 iflag=skip_bytes,count_bytes skip=597 count=$(expr 666522 - 597) | bunzip2 > text

does work much better, and now I can read the Wikipedia text off-line. I have not used the second column of the index yet, though I now see the mistake I made the first time.

Magic Banana

I am a member!

I am a translator!

Offline
Joined: 07/24/2010

In the XML, the tag "id" contains the id found in the second column of the index. I guess you will parse the XML, using an existing library/class. At that point, you will be able to filter the queried article(s).
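
For example, here is a rough Python sketch of that filtering step; the offsets and the article id are the examples seen earlier in the thread (597, 666522, and id 12 for Anarchism), and the function name is only illustrative:

import bz2
import xml.etree.ElementTree as ET

def extract_page(multistream_path, offset, length, article_id):
    # Decompress one bz2 stream of the multistream dump and return the <page>
    # element whose direct child <id> matches article_id (the index's second column).
    with open(multistream_path, "rb") as dump:
        dump.seek(offset)
        xml_text = bz2.decompress(dump.read(length)).decode("utf-8")
    # A stream holds up to 100 <page> elements with no enclosing root,
    # so wrap them before parsing.
    root = ET.fromstring("<root>" + xml_text + "</root>")
    for page in root.iter("page"):
        if page.findtext("id") == str(article_id):
            return ET.tostring(page, encoding="unicode")
    return None

# Example: offsets 597 and 666522 from the index's first column, article id 12 (Anarchism).
print(extract_page("enwiki-20211020-pages-articles-multistream.xml.bz2",
                   597, 666522 - 597, 12))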

If the query is based on the name(s) of the page(s) in the third column of the index, you can use this simple Shell script:
#!/bin/sh
if [ -z "$3" ]
then
    printf "Usage: $0 multistream.xml.bz2 multistream-index.txt page...\n"
    exit
fi
multistream="$1"
index="$2"
shift 2
for page in "$@"
do
    printf "$page\n"
done | awk -F : '
FNR == 1 {
    ++argind }
argind == 1 {
    pages[$0] }
argind == 2 {
    if (m) {
        if ($1 - m) {
            print m " count=" $1 - m
            m = 0 } }
    else {
        n = $1
        sub(/[^:]*:[^:]*:/, "")
        if ($0 in pages)
            m = n } }' - "$index" | while read range
do
    dd if="$multistream" iflag=skip_bytes,count_bytes skip=$range | bunzip2
done

It retrieves all the page(s) with the given name(s), plus the other pages in the same stream(s). Notice that the same name may appear several times, as far as I understand, hence the need to read the whole index (only once for all the page names, though, but the index is huge). Some names include colons, hence my use of AWK's sub function rather than directly testing "$3 in pages".