To construct a web corpus, one needs data, and a common way to collect it is by crawling. Web crawling is the process of finding large numbers of web pages by extracting and following hyperlinks from documents that have already been downloaded. Since we use the web as a text source for HULTIG-C, the Web Spider component builds our document collections, so an effective tool for downloading web pages is essential; the web crawler is therefore a critical component of every web-based corpus development project. Many freely available and robust crawlers, such as Apache Nutch, crawler4j, Scrapy, YaCy, WebSPHINX, and JSpider, could be used for similar projects. However, many of them are complex, require extra tools to function effectively, and include unnecessary features while lacking some features essential to web corpus construction.
As a result, our research group built a high-performance web crawler, HultigCrawler, and used it in the HULTIG-C construction process. HultigCrawler is an open-source program developed by the same team; it is publicly available and can be used to build web page corpora and for similar purposes. HultigCrawler crawls websites continuously and, in our case, has been indexing since January 2019.
HultigCrawler is a Python-based web crawler that crawls given websites and extracts specified data from their pages. The Python library Scrapy handles the website crawling, while Beautiful Soup is used to parse the HTML of the downloaded pages. In the first step, we give the base URL of the website of interest as input to HultigCrawler. The crawler starts by making a request to that URL, obtains the response object, loops through its elements and yields a Python dict with the extracted items, and finally looks for a link to the next page and schedules another request following the same process. It repeats this cycle until there are no more web pages left to explore. The crawled data falls into the categories URL, Tags, Title, and Text, and is saved into a MariaDB database through configured item pipelines.
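As a rough illustration of this crawl loop, the following is a minimal Scrapy spider sketch, not HultigCrawler's actual code: the spider name, the item fields, and the use of meta keywords for the Tags category are our assumptions based on the description above.

```python
import scrapy
from bs4 import BeautifulSoup


class SiteSpider(scrapy.Spider):
    """Hypothetical spider illustrating the crawl loop described above."""
    name = "site_spider"

    def __init__(self, base_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The base URL of the website of interest is given as input.
        self.start_urls = [base_url] if base_url else []

    def parse(self, response):
        # Parse the downloaded HTML with Beautiful Soup.
        soup = BeautifulSoup(response.text, "html.parser")

        # Yield one item (a plain dict) per page: URL, Tags, Title, Text.
        yield {
            "url": response.url,
            "tags": [m.get("content", "")
                     for m in soup.find_all("meta", attrs={"name": "keywords"})],
            "title": soup.title.string if soup.title else "",
            "text": soup.get_text(separator=" ", strip=True),
        }

        # Schedule a request for each link found on the page; Scrapy's
        # built-in duplicate filter skips pages already seen, so the crawl
        # ends once no unexplored pages remain.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Similarly, the yielded items could reach MariaDB through a Scrapy item pipeline along these lines; the table name, column layout, and connection parameters below are placeholders, not the project's real configuration.

```python
import pymysql


class MariaDBPipeline:
    """Hypothetical item pipeline storing crawled items in MariaDB."""

    def open_spider(self, spider):
        # Placeholder credentials; a real deployment would read these
        # from the project settings.
        self.conn = pymysql.connect(host="localhost", user="crawler",
                                    password="secret", database="hultig")
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # Persist one row per crawled page.
        self.cur.execute(
            "INSERT INTO pages (url, tags, title, text) VALUES (%s, %s, %s, %s)",
            (item["url"], ",".join(item["tags"]), item["title"], item["text"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```

Enabling such a pipeline in Scrapy's ITEM_PIPELINES setting would route every dict yielded by the spider through process_item, which matches the "configured pipelines" mechanism described above.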
For more information, see the HultigCrawler web page.