This project represents the first opportunity to explore how Big Data technologies can enhance our work with text corpora and will extend the work done in the Language Technology Group project NTAP by:
- developing an improved methodology, and associated tools, for creating large-scale topically-focussed blog corpora comprising text, link and date data. Improvements will be both in the data extraction techniques and, through using Hadoop, in speed of execution: we expect to reduce the extraction run time from c. 30 days to c. 3 days.
- delivering three "climate change" corpora, for English, French and Norwegian (reflecting the interests of the main collaborators, Dag Elgesem and Kjersti Fløttum, at the University of Bergen (UiB)).
So far we have generated climate change blog corpora totalling about 6 billion words from about 13 million blog posts. For details of the corpora and the methods used to generate them, see Salway et al. (2016), "Topically-focused Blog Corpora for Multiple Languages", Proc. of the 10th Web as Corpus Workshop (WAC-X).
The main challenges are to:
- identify relevant blogs, distinguishing them from non-blogs and from blogs about other topics.
- extract text, link and date data from HTML.
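To illustrate the second challenge, the sketch below shows one minimal way to pull visible text and outgoing links out of blog-post HTML with Python's standard library. It is a simplified assumption for illustration only, not the project's actual extraction code, which must also handle dates and the varied markup of real blog platforms.

```python
# Illustrative sketch: extract visible text and outgoing links from an HTML
# fragment using only the standard library. The real pipeline's selectors
# and date-extraction heuristics are not reproduced here.
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collects visible text and href targets from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip = 0  # depth inside <script>/<style>, whose text we ignore

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (plain text, list of link targets) for one HTML fragment."""
    parser = PostExtractor()
    parser.feed(html)
    return " ".join(parser.text_parts), parser.links


text, links = extract('<p>Climate <a href="http://example.org">post</a></p>')
# text  -> "Climate post"
# links -> ["http://example.org"]
```

In practice, per-platform rules are needed to separate post content from navigation, comments and boilerplate, which is where most of the methodological work lies.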
A strong motivation:
Since the 1990s blogs have emerged as an important medium in which users can easily create and share content on the Internet. The emergence of the blogosphere has brought changes to the online public sphere, to the role of the mainstream media, to the production, contestation and dissemination of scientific knowledge, and to political deliberation. As a site for large-scale discourses about socially-relevant issues, the blogosphere has received considerable attention from social scientists during the last decade (Rettberg, 2013; Bruns and Jacobs, 2006).
Despite the great interest in the content of the blogosphere, there is a lack of commonly available large-scale blog corpora to support empirical research. Most blog corpora created for social science research have been relatively small, because they concentrated on what were perceived to be the most important blogs for certain research questions (e.g. Adamic and Glance, 2005; Song et al., 2007; Sharman, 2014). Larger blog corpora have been created, but these were either not focused on particular topics or not designed to support social science research (e.g. Glance et al., 2004; Bansal and Koudas, 2007; Kehoe and Gee, 2012; Meinel et al., 2015).
The project starts with HTML data already harvested during NTAP, i.e. the HTML content of English, French and Norwegian blogs. Work in NTAP also analysed the material to determine methods for extracting text, date and link data. CBDA's task is to implement these extraction techniques in Hadoop, primarily by writing and wrapping a Python program, and to use other Hadoop functionality to analyse the extracted data, e.g. to count the frequencies of climate change terms, to analyse the distribution of blog post dates, and to map links between blog posts.
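One analysis mentioned above, counting the frequencies of climate change terms, maps naturally onto a map-reduce job in which a wrapped Python program serves as mapper and reducer (the pattern used by Hadoop Streaming). The sketch below is a minimal, hedged illustration of that shape; the term list and the invocation details are assumptions, not the project's actual configuration.

```python
# Sketch of a Hadoop-Streaming-style term-frequency count in Python.
# The mapper and reducer are plain functions over line iterators, so the
# same file can be piped as a streaming mapper/reducer over stdin/stdout.
# CLIMATE_TERMS is an assumed example list, not the project's real lexicon.
import sys
from itertools import groupby

CLIMATE_TERMS = {"climate", "warming", "co2"}


def mapper(lines):
    """Emit 'term<TAB>1' for every occurrence of a tracked term."""
    for line in lines:
        for token in line.lower().split():
            if token in CLIMATE_TERMS:
                yield f"{token}\t1"


def reducer(pairs):
    """Sum counts per term; input must be sorted by key, as Hadoop guarantees."""
    keyed = (line.split("\t") for line in pairs)
    for term, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{term}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Run as `script.py map` or `script.py reduce`, streaming stdin to stdout.
    step = mapper if (len(sys.argv) < 2 or sys.argv[1] == "map") else reducer
    for out in step(line.rstrip("\n") for line in sys.stdin):
        print(out)
```

The same mapper/reducer skeleton extends directly to the other analyses mentioned (emitting a post's date, or a source-target link pair, as the key instead of a term), which is one reason the Hadoop approach is attractive here.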
The tools developed will be directly reusable for creating other blog corpora, and may be adapted to harvest other web and social media material.
The corpora will be made available to researchers on request. They will also be available for analysis through the web front-end of other related projects of the Language Technology Group at Uni Research Computing, such as Corpuscle and CLARINO.