DETAILS


WEB PAGES



▸ 40.716.971 Web Pages.


▸ Uncompressed size: approximately 500 GB (backup of the MariaDB database).

LANGUAGE IDENTIFIERS


▸ All 2-letter language identifiers in the dataset conform to the ISO 639-1 language ID list.


▸ The HULTIG-C dataset covers all European languages, as well as some of the predominant languages of the Asian continent.
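
As a minimal sketch of what checking an identifier could look like, the Python snippet below tests that a value has the two-letter shape of an ISO 639-1 code and belongs to a known set. The set shown is only an illustrative sample (the ISO 639-1 codes for the main languages listed in this document), not the full ISO 639-1 list. The function name is hypothetical, and lowercase storage of identifiers is an assumption, not something the dataset description states.

```python
import re

# Illustrative sample of ISO 639-1 codes (assumption: NOT the full list).
# en=English, pt=Portuguese, cy=Welsh, de=German, ru=Russian, fr=French, es=Spanish
SAMPLE_ISO_639_1 = {"en", "pt", "cy", "de", "ru", "fr", "es"}

# Assumption: identifiers are stored as two lowercase ASCII letters.
_TWO_LETTERS = re.compile(r"[a-z]{2}")


def is_known_language_id(code: str) -> bool:
    """Return True if `code` is two lowercase letters and in the sample set."""
    return bool(_TWO_LETTERS.fullmatch(code)) and code in SAMPLE_ISO_639_1
```

A full check would replace the sample set with the complete ISO 639-1 list.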

RECORD COUNTS IN MAIN LANGUAGES:


▸ English: 4.077.091

▸ Portuguese: 1.586.563

▸ Welsh: 1.100.927

▸ German: 541.280

▸ Russian: 520.704

▸ French: 445.138

▸ Spanish: 419.023

▸ ...: ...
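
The counts above use "." as a thousands separator, so they should be read as integers rather than decimals. The following Python sketch parses the listed figures under that assumption and sums them. The helper name is hypothetical, and the total covers only the seven languages shown; the elided entries are not included.

```python
def parse_count(text: str) -> int:
    """Parse an integer written with '.' as the thousands separator."""
    return int(text.replace(".", ""))


# Figures copied from the list above; elided languages are left out.
counts = {
    "English": parse_count("4.077.091"),
    "Portuguese": parse_count("1.586.563"),
    "Welsh": parse_count("1.100.927"),
    "German": parse_count("541.280"),
    "Russian": parse_count("520.704"),
    "French": parse_count("445.138"),
    "Spanish": parse_count("419.023"),
}

# Sum over the listed languages only.
listed_total = sum(counts.values())
```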