Hultig-Corpus

Welcome to HULTIG-C

What is a Corpus?

The word "corpus", derived from the Latin for "body", can refer loosely to any body of written or spoken text. In modern linguistics, the term commonly refers to a large collection of texts that represents a sample of a particular variety or use of one or more languages and that is stored as raw text in machine-readable form.

Hultig-Corpus is a multilingual corpus created to support research on information retrieval and related human language technologies. The project is part of the Center of Human Language Technology and Bioinformatics (HULTIG), founded in 2003 by Gael Harry Dias, at the time a professor in the Department of Informatics of the University of Beira Interior (UBI), Covilhã, Portugal. It is currently directed by Dr. João Paulo Cordeiro, a faculty member of the Department of Informatics at UBI.

• • •

About Hultig-C

The Center of Human Language Technology and Bioinformatics (HULTIG) is a research group of the Department of Informatics of the University of Beira Interior. Over time, we have worked on a variety of topics in the automatic processing of human language, with a particular focus on their applications. Among the various subdomains, we have devoted special attention to the following:

✔ Information retrieval;
✔ Text mining and extraction;
✔ Automatic summarization;
✔ Automatic plagiarism detection;
✔ Sentiment analysis in text;
✔ Lexical semantics;
✔ Textual alignment and similarity;
✔ Aesthetic characterization of text.

Hultig-C is a corpus whose development began in January 2017. It consists of thousands of words in different languages, collected from raw texts (of different kinds and of different linguistic and sophistication levels) obtained from web pages and indexed with OpenWebSpider. Hultig-C is developed and maintained at UBI by the Center of Human Language Technology and Bioinformatics (HULTIG) of the Department of Informatics. The corpus is the result of ongoing work to support the automatic processing of human language: it is gradually extended and improved in all its dimensions in order to provide a high-level resource for research in computational linguistics and for the development of applications and language technologies.

Although our main concern is applications and technology, we also consider the more theoretical and conceptual aspects of the study of human language, in particular computational linguistics. Additional information about the Hultig dataset is available on the Hultig web site.

• • •

Hultig-C Details

Web Pages:

• 4,943,857 web pages.

• 100 GB uncompressed.

Language identifiers:

• All 2-letter language identifiers in the data set conform to the ISO 639 language code list.

• The Hultig-C data set covers all European languages as well as some of the predominant languages of Asia.
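As a minimal sketch of how a 2-letter language identifier could be checked, the Python snippet below tests the format and looks the code up in a small set. The set `SAMPLE_CODES` and both function names are hypothetical, chosen for illustration; the set is only a subset of the two-letter ISO 639 codes and is not the list of languages actually present in Hultig-C.

```python
import re

# Illustrative subset of two-letter ISO 639 codes (assumption: not the
# actual Hultig-C language list; the standard defines many more codes).
SAMPLE_CODES = {"pt", "en", "fr", "de", "es", "it", "zh", "ja"}


def is_two_letter_id(code):
    """Return True if code is exactly two lowercase ASCII letters."""
    return re.fullmatch(r"[a-z]{2}", code) is not None


def is_known_language(code, known=SAMPLE_CODES):
    """Return True if code is well formed and present in the known set."""
    return is_two_letter_id(code) and code in known


print(is_known_language("pt"))
print(is_known_language("por"))
```

Separating the format check from the lookup makes it possible to tell a malformed identifier apart from a well-formed one that is simply not in the set.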

Web Graph - Entire Data Set:

• Unique URLs: 3,914,526

• • •

How to get it

The Hultig-C data sets are distributed by the Center of Human Language Technology and Bioinformatics (HULTIG) for academic and research purposes only.

Hultig-C is open source, which makes it easier to obtain a Hultig-C data set.

• • •

Online Services

Hultig-C provides a set of services for the automatic processing of human language that identify patterns in collections of information stored in a disorganized manner, thus enabling a set of operations commonly required in NLP.

This platform is under construction.

Please come back soon!

• • •

Indexing With OpenWebSpider

Indexing is the most efficient way to organize and find a file in a database. Its objective is to decentralize the production of information and to distribute it widely and rapidly.

The indexing of Hultig-C began in January 2017, using the features of OpenWebSpider.

OpenWebSpider is a web spider (also known as a crawler or web robot) and a search engine: a program that navigates web sites autonomously, reading their pages and other information to create entries for a search engine index.

These programs are called spiders because they visit many sites in parallel, spanning a large area of the web. Starting from a URL, they expand their reading through the subpages and hyperlinks found there, creating a database that enables a later search for expressions on the sites visited. That is, they visit web sites, follow the links on their pages, and record data from each visited page to facilitate indexing in a database and inclusion in the search engine.
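The general crawling idea described above (start from a URL, follow the links found on each page, record what was visited) can be sketched in Python as follows. This is an illustrative sketch only and not OpenWebSpider's implementation: it runs sequentially rather than in parallel, the `fetch` callable and the in-memory `PAGES` dictionary are hypothetical stand-ins for real HTTP requests, and the `example.org` URLs are placeholders.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url; returns {url: html}.

    `fetch` is a hypothetical callable that returns the HTML of a URL,
    or None when the page cannot be retrieved.
    """
    visited = {}
    seen = {start_url}          # URLs already queued, to avoid repeats
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue            # skip pages that could not be retrieved
        visited[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Toy "web" held in memory so the sketch runs without network access.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a>',
    "http://example.org/b": '<a href="/">home</a> <a href="/missing">x</a>',
}

store = crawl("http://example.org/", PAGES.get)
print(sorted(store))
```

The `seen` set keeps each URL from being queued more than once, which is what lets the crawl terminate on sites whose pages link back to one another.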

This enables automatic retrieval of web data and updating of the database, and facilitates the indexing of downloaded content, thereby promoting faster searches.

With OpenWebSpider it is possible, for example, to index a site and find out how many times and in which places a given term appears. A rudimentary alternative to this mechanism would be to search for the term manually, page by page, which would be exhausting and yield few useful results.
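To illustrate what "how many times and in which places a term appears" means in practice, here is a minimal sketch of an inverted index in Python. It is not OpenWebSpider's data model (which the text above says is backed by MySQL): the functions `build_index` and `lookup`, the page names, and the sample texts are all hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased term to {page: [word positions]}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        words = re.finditer(r"\w+", text.lower())
        for position, match in enumerate(words):
            index[match.group()].setdefault(url, []).append(position)
    return index


def lookup(index, term):
    """Return (total occurrences, {page: positions}) for a term."""
    postings = index.get(term.lower(), {})
    total = sum(len(positions) for positions in postings.values())
    return total, postings


# Hypothetical sample pages, used only to exercise the sketch.
pages = {
    "page1": "A corpus is a body of text.",
    "page2": "The corpus grows; the corpus is indexed.",
}
index = build_index(pages)
total, postings = lookup(index, "corpus")
print(total, postings)
```

Once the index is built, a query is a single dictionary lookup, whereas the manual alternative has to reread every page for every term.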

OpenWebSpider is released under the GNU General Public License (GPL) and relies entirely on free software (gcc, MySQL, Apache, and PHP). It is tested on Windows and Linux. It can often be compiled on other platforms, but these are not officially supported.

For more information, visit the web page www.openwebspider.org.

• • •

Frequently Asked Questions

Who can use Hultig-C? Hultig-C supports education, research, and technology development related to computational linguistics by sharing resources such as data, tools, and patterns. It is therefore aimed at everyone with an interest in related areas who develops or intends to develop multilingual programs and consequently needs raw material to support that work.

How can I get Hultig-C? The Hultig-C data sets are distributed by the Center of Human Language Technology and Bioinformatics (HULTIG); contact us to obtain them.

What is OpenWebSpider? OpenWebSpider is a program that can be used to create a search service: it visits web sites, reads their pages, and creates index entries for a search engine.

What license does OpenWebSpider use? OpenWebSpider is released under the GNU General Public License (GPL) and relies entirely on free software (gcc, MySQL, Apache, and PHP).

Where can I get more information about Hultig-C? You can send a message to the Hultig team.

Contact us

University of Beira Interior
Department of Informatics
Rua Marquês d'Ávila e Bolama
6201-001 Covilhã, Portugal

☎ Telephone: +351 275 242 081 (ext.: 1601)
📠 Fax: +351 275 319 899
✉ Hultig: hultig@di.ubi.pt
✉ Hultig-C: hultig-corpus@di.ubi.pt