This project can currently dump and (partially) normalize white pages from Deutsche Telekom's CDs and DVDs.

_=_=_=_=_=_=_=
How to extract
-=-=-=-=-=-=-=

Wherever possible, I tried to make everything work just by typing "sh makecolumns.sh <path to white pages>", where the white pages path is what usually comes on your CD as the "white" directory. You will need a working C compiler and make, plus around 4 GB of free hard disc space for the results. All necessary tools will be built automatically by the shell script.

The scripts also use a tiny helper tool called "el" that you will need to fetch, compile and copy into one of your PATH directories. Find it at https://erdgeist.org/arts/software/el. Check out the sources with "git clone git://erdgeist.org/el" or "cvs -d :pserver:anoncvs@cvs.erdgeist.org:/home/cvsroot co el".

The script will create a work/ directory and, below that, a dump directory named after your white pages directory.

_=_=_=_=_=_=_=_=_=_=_=_=_
File format documentation
-=-=-=-=-=-=-=-=-=-=-=-=-

The on-disc data currently comes in four flavours (see https://erdgeist.org/posts/2008/datenmessie.html):

version 1) Teleauskunft 1188 from 1992, (April-June)
version 2) Teleauskunft 1188 Telefon-Teilnehmer, Oktober 1995 / Telefon-Teilnehmer Gesamtausgabe from 1995/1996
version 3) Telefonbuch für Deutschland, Version 1.0 1996 through DasTelefonbuch, Deutschland, Herbst 2003
version 4) DasTelefonbuch, Map&Route, Frühjahr 2004 until now

version 1
=========

Notes: Strings are encoded in cp437; those inside records are stored in a 7-bit packed encoding. Only the .001 files on each CD are interesting.

Each file consists of a standard header and a number of pages, with pages starting at 0x800 and spaced at 0x2000 steps.

The header's important values are (uint16_t*)0x40 number of pages, (uint32_t*)0x42 total number of records in file and a \0 separated list of gasse, city, zip and prefix, starting at 0xe8.
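The header fields above can be read with a few lines of Python. This is only a sketch: the byte order is not stated in this document, so little-endian is assumed here, and the function names (parse_v1_header, page_offsets) are made up for illustration. It also assumes the string list holds exactly the four values named above.

```python
import struct

PAGE_BASE = 0x800    # first page starts here
PAGE_STEP = 0x2000   # distance between two page starts

def parse_v1_header(data: bytes):
    """Return (page_count, record_count, strings) from a version 1 header."""
    # Assumption: integers are little-endian; the byte order is not documented.
    (page_count,) = struct.unpack_from("<H", data, 0x40)
    (record_count,) = struct.unpack_from("<I", data, 0x42)
    # The \0 separated list starts at 0xe8; take the first four strings.
    raw = data[0xE8:PAGE_BASE].split(b"\0")[:4]
    strings = [s.decode("cp437") for s in raw]
    return page_count, record_count, strings

def page_offsets(page_count: int):
    """File offsets of all pages, following the 0x800 + n * 0x2000 rule."""
    return [PAGE_BASE + i * PAGE_STEP for i in range(page_count)]
```

page_offsets(3) yields 0x800, 0x2800 and 0x4800, which is where a reader would seek to before parsing each page.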

Each page can either be a "normal" one, with phone entries, or a "blob" one, with multi-line records that are referenced from "normal" pages inside the same file. It starts with a flag at (uint8_t*)0x00, the size of the blob's contents (i.e. if != 0, this is a blob page) at (uint16_t*)0x02, a count of records in that page at (uint16_t*)0x04 and, for each record, an offset into this page's records, relative to where they start: at 0x0e. Should this offset happen to be >0x1fff, it refers to a "blob" page that should be substituted here.

Each record starts with an entry count at (uint16_t*)0x00. If a record consists of multiple entries, think of the later lines as continuations of the lines above. They are usually used to describe multiple extensions of a larger subscriber. If the page's flag is zero, each record also has a "prefix" offset at (uint16_t*)0x02; if it is non-zero, the first entry is preceded by a shared prefix (think multiple instances of the same or similar family names, which are then "compressed" by this hack). This prefix is unpacked the same way as records are. "Blob" pages are not compressed this way.

The entries are then packed into a 7 bit stream, with the 0000001b separating the columns: "Nachname", "Vorname", "Adresszusatz", "Ortszusatz", "Zustellamt oder PLZ Ost", "Strassenname", "Hausnummer", "Namenszusatz", "Verweise", "Vorwahl", "Rufnummer". An entry ends with a 0000000b.
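The unpacking described above could look roughly like the following sketch. Two things are assumptions, not facts from this document: the 7-bit symbols are taken MSB-first from the byte stream, and the symbols are treated as plain ASCII code points (how umlauts are represented is not covered here). The function name unpack_7bit is hypothetical.

```python
def unpack_7bit(data: bytes):
    """Split one 7-bit packed entry into its columns.

    Assumption: symbols are packed MSB-first without padding between them.
    Returns (columns, number_of_symbols_consumed).
    """
    columns, current = [], bytearray()
    acc = nbits = consumed = 0
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            sym = (acc >> nbits) & 0x7F
            acc &= (1 << nbits) - 1      # keep only the bits not yet used
            consumed += 1
            if sym == 0:                 # 0000000b ends the entry
                columns.append(current.decode("ascii"))
                return columns, consumed
            if sym == 1:                 # 0000001b separates two columns
                columns.append(current.decode("ascii"))
                current = bytearray()
            else:
                current.append(sym)
    # Stream ended without a terminator; return what was collected so far.
    columns.append(current.decode("ascii"))
    return columns, consumed
```

The columns come back in the order listed above, so the first element would be the "Nachname" and the last one the "Rufnummer" of a complete entry.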

version 2
=========

Each database file (atb?dd00) is composed of several continuous chunks of PKware packed blocks. Each decompressed block is 8192 bytes. The number of chunks (excluding the first) is at (uint16_t*)0x14. The start offsets of all but the first chunk are in turn encoded in a 19-bit packed list of integers at 0x20. The first chunk starts at the end of this table, the rest are relative to 0x800. The number of blocks per chunk is at (uint16_t*)0x1c; for the first chunk, the number minus one is at (uint16_t*)0x1e. The example tool "blast" from the zlib archive can decompress these blocks.

A corresponding index file atb?di00 consists of a list of uint32_t offsets, one for each record, relative to the continuous stream of all decompressed 8192 byte blocks. These offsets start at file position 0x8, with (uint32_t*)0x0 being the number of records in the index (and thus in the database).
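As an illustration of the index layout, the following sketch reads the record count and the offsets and maps an offset to a block number and a position inside that block. Little-endian byte order is an assumption, and read_index and locate are hypothetical names, not part of the tools in this repository.

```python
import struct

BLOCK_SIZE = 8192  # size of each decompressed block

def read_index(data: bytes):
    """Return the list of record offsets stored in an atb?di00 index file."""
    # Assumption: little-endian uint32_t values.
    (count,) = struct.unpack_from("<I", data, 0x0)
    return list(struct.unpack_from("<%dI" % count, data, 0x8))

def locate(offset: int):
    """Map a stream offset to (block number, offset inside that block)."""
    return divmod(offset, BLOCK_SIZE)
```

For a record offset of 8200, locate returns block 1 and position 8 inside it.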

The CDs do have different database layouts for each region (NO, W, S). The reason is that prefixes for north-east Germany have more than 5 digits. A record is arranged around the record offset, with (most) flags, counts and offsets after the offset and strings before the offset.

As in version 1, each record can have multiple entries. The columns present in the first entry (that is always there) are described by the bits set in a flag word and refer to the string table. Columns present in continuation entries are directly encoded by tuples described below.

The number of entry parts is at (uint16_t*)0x0 and is guaranteed to be at least 1; the number of flag bytes whose bits describe the first entry is at (uint16_t*)0x2, both uint16_t relative to the record offset. Each entry is described by a tuple of uint16_t (column_id, offset) stored in a single (uint32_t*) starting at (uint16_t*)0x4. The fact that they belong together in a uint32_t is important, because they cannot span a block boundary and are pushed to the beginning of the next block, with all further tuples pushed accordingly. The offsets in each entry's tuples refer to a position before the record, adjusted by the number of flag bytes for the first entry (i.e. if there are 2 flag bytes signalled at (uint16_t*)0x2, the string table ends two bytes before the record offset).

All strings are encoded in cp437.

The first entry is parsed from the end: all strings with dynamic length end with their length byte, so they can be consumed from the end; fixed length strings do not have a length byte but can be \0-terminated. The exact layout of the first entry depends on the database layout. Region NO differs slightly. All columns we describe now are in reverse order, i.e. PPPPPZZZZZC: In all layouts the entry ends with a continuation indicator, which is a fixed 1 character string "1", preceded by a fixed 5 character string containing a zip code. In layouts S and W, we find a fixed 5 character string containing a phone prefix (Vorwahl) which actually is not used except in the UI (the correct Vorwahl is encoded with bit 0x0020 in S/W and 0x0010 in NO).

The next columns depend on the bits set in the first entry's flag word:
If the bit 0x0080 is set, a fixed 5 character string is present that overrides the first fixed zip code.
If the bit 0x0040 is set, a fixed 1 character string "X" is present, whose meaning is unclear.
If the bit 0x0020 is set, in NO a dynamic string containing the Rufnummer is present; in S/W a fixed 5 character string containing the Vorwahl is present.
If the bit 0x0010 is set, a dynamic string is present, representing the Vorwahl in NO and the Rufnummer in S/W.
The next 8 bits indicate dynamic strings: 0x0008 => Vorname, 0x0004 => Strasse, 0x0002 => Ort, 0x0001 => Hausnummer, 0x8000 => Zusaetze, 0x4000 => Ortszusatz, 0x2000 => S/W: Adresszusatz, NO: Verweise, 0x1000 => NO: Adresszusatz, S/W: Verweise.
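The bit assignments above can be written down as a small lookup table. The sketch below only translates a flag word into column names, in the order the bits are listed above; it does not parse the strings themselves. The function name columns_for_flags and the labels "Postleitzahl (override)" and "X" are chosen here for illustration.

```python
def columns_for_flags(flags: int, layout: str):
    """List the optional columns signalled by a first entry's flag word.

    layout is "NO", "S" or "W"; S and W share the same bit meanings.
    """
    no = layout == "NO"
    table = [
        (0x0080, "Postleitzahl (override)"),
        (0x0040, "X"),
        (0x0020, "Rufnummer" if no else "Vorwahl"),
        (0x0010, "Vorwahl" if no else "Rufnummer"),
        (0x0008, "Vorname"),
        (0x0004, "Strasse"),
        (0x0002, "Ort"),
        (0x0001, "Hausnummer"),
        (0x8000, "Zusaetze"),
        (0x4000, "Ortszusatz"),
        (0x2000, "Verweise" if no else "Adresszusatz"),
        (0x1000, "Adresszusatz" if no else "Verweise"),
    ]
    return [name for bit, name in table if flags & bit]
```

For example, a flag word of 0x0038 in the S layout signals Vorwahl, Rufnummer and Vorname.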

From now on we only have dynamic strings: A column representing the Ort is always present. In case of the NO layout there is the unused phone prefix (Vorwahl). Finally there is the Name column.

Now we look at the continuation entry parts: A column_id of 0x4003 denotes the start of an entry in the sense of a new line in the telephone book. This is actually always the column_id for the first entry. For all but the first entry the corresponding string in the string table is the one byte string "2", and as stated above for the first entry this continuation indicator is always "1". All other entries represent columns for the current entry. The exact mapping from column id to column varies between NO, S and W and can be looked up in the source code ;) String length is implicit from the previous string's offset.

version 3
=========

All strings are iso8859-1. Some of the files on disc are obfuscated while others are plain. They consist of chunks of lha compressed data (as readable by the lha command). The obfuscation is a simple XOR of the first 32 (for streets: 34) bytes with a static 4 byte key that changes with every CD. Since lha headers do have a static signature (i.e. -lh5-), it's easy to derive the current key by xoring the static value with the bytes found in the file. Conveniently, un-obfuscated files yield an all-zero key and don't need special treatment. The path name of each chunk is identical, so it must be re-written and the header checksum must be re-calculated.
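The key derivation can be sketched as follows. It relies on the lha method id sitting at byte offset 2 of a header, and it assumes (this is not stated above) that the 4 byte key repeats cyclically, aligned to the start of the chunk, and that the method id is -lh5-. The function names are hypothetical, and the sketch does not rewrite the path name or recompute the header checksum.

```python
SIGNATURE = b"-lh5-"  # lha method id, assumed to be lh5 for every chunk
SIG_OFFSET = 2        # the method id starts at byte 2 of an lha header

def derive_key(chunk: bytes) -> bytes:
    """Recover the 4 byte XOR key from the start of a chunk."""
    key = bytearray(4)
    for i in range(4):
        pos = SIG_OFFSET + i
        # Assumption: the key byte used at position pos is key[pos % 4].
        key[pos % 4] = chunk[pos] ^ SIGNATURE[i]
    return bytes(key)

def deobfuscate(chunk: bytes, length: int = 32) -> bytes:
    """Undo the XOR on the first `length` bytes (use 34 for streets)."""
    key = derive_key(chunk)
    head = bytes(b ^ key[i % 4] for i, b in enumerate(chunk[:length]))
    return head + chunk[length:]
```

A plain chunk yields the key 00 00 00 00, so deobfuscate leaves it unchanged and both kinds of files can go through the same code path.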

After decompressing all files (quite a lot, it's split at 3k entries per file) from dat/teiln.dat (case insensitive), we get three kinds of files: one with all Vorname columns, one with all Nachname columns and one with the rest of the columns. Before 2000_Q1 files {0,3,6,...} were Nachname, {1,4,7,...} were Vorname and {2,5,8,...} were all the other tables. After 2000_Q1 (incl.) {0,3,6,...} were all other columns, while {1,4,7,...} were Nachname and {2,5,8,...} Vorname columns. The first char in each entry in the Nachname column contains a continuation flag, where "1" means single line entry, "3" is the first and "2" are the remaining lines in a continuation.
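The file numbering rule and the continuation flag can be expressed as a short helper. The release label format ("2000_Q1" style strings that sort chronologically) and the names column_kind and CONTINUATION are assumptions made for this sketch.

```python
# Meaning of the first character of each Nachname entry.
CONTINUATION = {"1": "single", "3": "first", "2": "continuation"}

def column_kind(file_index: int, release: str) -> str:
    """Tell which kind of columns the n-th decompressed file holds.

    Assumption: release is a label like "1999_Q4" or "2000_Q1", so a plain
    string comparison orders the releases chronologically.
    """
    if release < "2000_Q1":
        kinds = ("Nachname", "Vorname", "other")
    else:
        kinds = ("other", "Nachname", "Vorname")
    return kinds[file_index % 3]
```

With this, file 4 of a 1999_Q4 disc holds Vorname columns, while file 4 of a 2000_Q1 disc holds Nachname columns.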

If there is a dat/strassen.dat (case insensitive), the concatenated decompressed chunks are a list of all street names referenced (hex, 0-based) in the street/hnr column.

If there is a dat/karto.dat (case insensitive), the concatenated decompressed chunks are a sorted list of all zip/streetname(/hnr) combinations on that CD, each line finished by the geo coordinates of that address.

version 4
=========

All strings are iso8859-1. All relevant information is in the files phonebook.db and streets.tl. If you need geo-coordinates for all your addresses, you'll also need the zip-streets-hn-geo.tl or zip-streets-hn.tl files, respectively. The files consist of zlib compressed chunks, with the street name file just being a \0 separated list of streets, each referenced (dec, 0-based) from the street/hnr column. For the zip-streets*-geo.tl files, the concatenated decompressed chunks are a sorted list of all zip/streetname(/hnr) combinations on that CD, each line finished by the geo coordinates of that address (since they are sorted, you can do a binary search for each address in that file).
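Resolving a street reference then boils down to an index into the \0 separated list. The sketch below assumes the street index has already been split off from the house number (how the two are joined in the column is not covered here) and that the decompressed chunks have been concatenated; street_by_index is a hypothetical name.

```python
def street_by_index(decompressed: bytes, index: str, base: int = 10) -> str:
    """Look up a street name in the concatenated, decompressed street list.

    index is the 0-based reference as text; version 4 uses decimal numbers.
    """
    names = decompressed.split(b"\0")
    return names[int(index, base)].decode("iso8859-1")
```

Splitting the list once and keeping it in memory would be the obvious choice when resolving many entries; the function above re-splits on every call only to stay short.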

The phonebook.db decompresses into a lot of chunks with around 3000 entries each, sorted into 11 columns, the first being a binary flag byte and the rest being Nachname, Vorname, Namenszusatz+Addresszusatz, Verweise, Strassenindex_Hausnummer, Vorwahl, Postleitzahl, Ort, Rufnummer and Email+Webadresse, respectively.

The flag byte's lower nibble is 0 for a single line entry, 1 for the start of a continuation record and 2 for a trailing line of a continuation. In the flag's top nibble, 0x80 is set for a business record (as opposed to a natural person), 0x40 is set if the number must not be included in reverse query results, and 0x20 is used but its meaning is unknown.
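A minimal sketch of decoding this flag byte, with the dictionary keys and the function name decode_v4_flag chosen here for illustration:

```python
LINE_KINDS = {0: "single", 1: "first", 2: "continuation"}

def decode_v4_flag(flag: int) -> dict:
    """Split a version 4 flag byte into its documented parts."""
    return {
        # Lower nibble: position of this line within a record.
        "line": LINE_KINDS.get(flag & 0x0F, "unknown"),
        "business": bool(flag & 0x80),
        "no_reverse_lookup": bool(flag & 0x40),
        "unknown_0x20": bool(flag & 0x20),  # used on disc, meaning unknown
    }
```

A flag byte of 0x81 would therefore describe the first line of a multi-line business record.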