GERMANY: Using text mining methods to analyse public communication of our auditees
The German SAI performed data-driven audits on publicly accessible datasets to explore new data-analytic methods and tools. In one of these audit exercises, we analysed federal communication activities and their impact on the public image of digital legislation. The purpose of our audit work was to provide insights into the contents and topics communicated by federal government departments and agencies, and to explore whether and how these influenced the public image of the new legislation.
The analytical approach presented several challenges. First, communication is published primarily in writing, so the data is available only as unstructured, non-numerical text. Second, while the data is publicly available and accessible, it is spread widely across various websites, online presences, and platforms.
For data collection, we applied several methods. We used web scraping to extract relevant texts from the websites of government bodies responsible for implementing the new legislation. We automatically accessed relevant press releases and information items on these websites and stored their contents as plain text together with the corresponding metadata (i.e. title, publication date and type, and government body). Furthermore, we used APIs to collect posts from social media platforms. Lastly, as we were also interested in the public image of the new law, we collected texts from news articles and related social media posts by citizens.
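To illustrate the scraping step, the following sketch collects press releases and their metadata with the Python libraries requests and BeautifulSoup. The URL, CSS selectors and output file are purely hypothetical placeholders and not the tooling actually used in the audit; real ministry websites need their own selectors and should be scraped in line with their terms of use.

```python
# Minimal web-scraping sketch; URL and selectors are illustrative assumptions.
import csv

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.example-ministry.de/press-releases"  # hypothetical

response = requests.get(BASE_URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

records = []
for item in soup.select("article.press-release"):  # hypothetical selector
    records.append({
        "title": item.select_one("h2").get_text(strip=True),
        "date": item.select_one("time")["datetime"],
        "type": "press release",
        "body": item.select_one("div.text").get_text(" ", strip=True),
    })

# Store plain text plus metadata for later corpus construction.
with open("press_releases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "date", "type", "body"])
    writer.writeheader()
    writer.writerows(records)
```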
Using this data, we constructed a corpus of documents that formed the basis for further analysis. We began by calculating basic descriptive statistics to gain an overview of the collected documents. For example, we calculated readability scores to assess whether the communication was broadly understandable for citizens, which is of major importance for a topic such as digitalisation. Figure 1 shows a distributional plot of the calculated readability scores grouped by four federal ministries. It can be seen, for example, that ministry 4 has more concentrated score values than ministry 1, but also some high and low outliers. As a next step, we used natural language processing (NLP) to tag the data by assigning each word its grammatical category (part-of-speech tagging). This enabled us to study which nouns and adjectives were used, and whether these provided evidence of uniform communication. Moreover, we used named entity recognition to automatically identify entities such as government bodies or natural persons named in our documents. This provided information on whether communication was aimed at specific target groups, e.g. citizens.
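The readability and tagging steps could look roughly as follows. This sketch assumes spaCy's German model and a simplified German Flesch-style formula (Amstad variant) with a crude vowel-group syllable count; both the library choice and the heuristic are illustrative assumptions rather than the exact implementation used in the audit.

```python
# Readability and NLP tagging sketch (assumed tooling: spaCy German model).
# Requires: pip install spacy && python -m spacy download de_core_news_sm
import re
from collections import Counter

import spacy

nlp = spacy.load("de_core_news_sm")


def readability_amstad(doc: "spacy.tokens.Doc") -> float:
    """German Flesch reading ease (Amstad): 180 - ASL - 58.5 * ASW,
    where ASL = words per sentence and ASW = syllables per word.
    Syllables are roughly approximated by counting vowel groups."""
    words = [tok.text.lower() for tok in doc if tok.is_alpha]
    sentences = max(1, len(list(doc.sents)))
    syllables = sum(max(1, len(re.findall(r"[aeiouyäöü]+", w))) for w in words)
    return 180 - len(words) / sentences - 58.5 * syllables / max(1, len(words))


text = ("Das Bundesministerium informiert Bürgerinnen und Bürger "
        "über die Umsetzung des neuen Gesetzes.")
doc = nlp(text)

print("Readability:", round(readability_amstad(doc), 1))

# Part-of-speech tagging: which nouns and adjectives dominate the communication?
nouns = Counter(tok.lemma_ for tok in doc if tok.pos_ == "NOUN")
adjectives = Counter(tok.lemma_ for tok in doc if tok.pos_ == "ADJ")
print("Nouns:", nouns.most_common(5), "Adjectives:", adjectives.most_common(5))

# Named entity recognition: government bodies and persons mentioned in the text.
print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])
```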
Lastly, we vectorised our corpus to construct a document-term matrix (DTM). A DTM records the word frequencies in a collection of documents. We used this DTM for several different techniques. For example, we applied dictionary-based approaches to calculate sentiment scores for documents and social media posts, to find out whether they conveyed positive or negative sentiment. We also applied topic modelling to identify homogeneous topics, and calculated similarity scores to detect similar documents and reposts. Figure 2 is a heatmap of document-to-document similarity scores; similar documents are displayed in red. Furthermore, we conducted basic cluster analyses to identify clusters within the communication of the government bodies.
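A condensed sketch of these DTM-based techniques with scikit-learn is shown below. The toy documents, the two-word sentiment dictionary and the parameter choices are illustrative assumptions only; they stand in for the audit's actual corpus and lexica.

```python
# DTM, dictionary-based sentiment, topic modelling, cosine similarity and
# clustering, sketched with scikit-learn on a tiny illustrative corpus.
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Das neue Gesetz vereinfacht digitale Verwaltungsleistungen.",
    "Bürger kritisieren die langsame Umsetzung des Gesetzes.",
    "Die Verwaltung digitalisiert ihre Leistungen für Bürger.",
]

# Document-term matrix: rows are documents, columns are word frequencies.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(documents)

# Dictionary-based sentiment: net count of positive minus negative terms
# (hypothetical two-word lexicon for illustration).
positive, negative = {"vereinfacht"}, {"kritisieren", "langsame"}
sentiment = [
    sum(tokens.count(w) for w in positive) - sum(tokens.count(w) for w in negative)
    for tokens in (doc.lower().split() for doc in documents)
]

# Topic modelling (LDA) to identify coherent topics in the corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_distribution = lda.fit_transform(dtm)

# Cosine similarity between documents, the basis of the heatmap in Figure 2.
similarity = cosine_similarity(dtm)

# Basic clustering of documents based on their term vectors.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(dtm)

print("Sentiment:", sentiment)
print("Clusters:", clusters)
print("Similarity matrix:\n", similarity.round(2))
```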
When combining the results of these different methods and approaches applied to unstructured data, we gained thorough insights into our auditees’ public communication practices and their impact.
Figure 1. Readability scores for different federal ministries
Figure 2. Heatmap of cosine similarities between documents of a federal ministry