Big Data and Sustainable Development in the Age of Inequality

By Clarissa Ai Ling Lee

Cloud data

In a solutions-driven world, recommendations to effect change are only deemed trustworthy when solidly backed by impeccable data. The strength of a recommendation is judged by the rigour of the analyses and inferences based on evidence gathered from the ground - evidence that anchors the recommendation despite the chaos and imperfections of reality. Data - a term that stands for informational currency and knowledge capital - has long been a subject of multidisciplinary interest and interrogation of an ontological and epistemological nature. Data form the basis of arguments, comparisons and justifications.

On the other side, we have sustainable development - an oxymoronic juxtaposition of two words that has burned through too many trees and millions of digital bytes as scholars fretted over the significance and utility of placing these two words side by side, including whether proposals to deploy the phrase for the benefit of mankind might not actually add to mankind's troubles. Yet sustainable development is a term so powerful that governments around the world are forming committees to look into ways of mainstreaming the Sustainable Development Goals into their national development plans - a form of positive peer pressure, one might say.

Sustainable development, being the umbrella under which a multiplicity of interests could either collide or align, requires the management of aggregated and disaggregated data points developed out of different disciplinary interests, epistemic scales, collection instruments, computational tools, methodologies, formats (qualitative and quantitative) and motivations for collection. Although one might form a hypothesis, or a series of hypotheses, regarding solutions to specific problems outlined under a sustainable development goal, sufficient information would have to be obtained before one could confirm or disconfirm that hypothesis - if either were possible. In most cases, one would encounter missing data or imperfect information, and might have to extrapolate from the available data and then deploy an extant model for analysing them, or even extend the existing model, if not produce a new model altogether, as a result of the analyses.

However, what happens when one is faced with conditions of data poverty - whether due to a lack of resources to collect and collate the sought data, or due to insufficient priority being placed on their collection? One might end up having to generate hypothetical data based on assumptions and the deployment of certain models. In choosing to apply a model that grew out of particular data sources to data with similar technical characteristics but different socio-political and cultural objectives and implications, one must then consider the degree to which the model is applicable to the latter datasets without compromising or constraining their revelatory potential.
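To make the stakes concrete, consider a deliberately simple sketch - all figures, series and function names below are invented for illustration. The same gap in a hypothetical survey series can be filled either from the local trend or from an outside "regional average" assumption, and the choice of model quietly shapes the data one then goes on to analyse.

```python
# Illustrative sketch only: hypothetical household-electrification rates
# with gaps marked as None, filled in two different ways to show how the
# chosen model determines the "data" that end up being analysed.

years = [2010, 2012, 2014, 2016, 2018]
rates = [0.42, None, 0.55, None, 0.71]  # invented survey results

def impute_interpolate(values):
    """Fill each gap by linear interpolation between known neighbours."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            lo = max(j for j in range(i) if values[j] is not None)
            hi = min(j for j in range(i + 1, len(values)) if values[j] is not None)
            frac = (i - lo) / (hi - lo)
            filled[i] = values[lo] + frac * (values[hi] - values[lo])
    return filled

def impute_regional_mean(values, regional_mean=0.60):
    """Fill each gap with an external 'regional average' assumption."""
    return [regional_mean if v is None else v for v in values]

print(impute_interpolate(rates))    # gaps filled from the local trend
print(impute_regional_mean(rates))  # gaps filled from an outside model
```

The two methods agree on every observed point yet disagree on every missing one; any downstream analysis inherits whichever assumption was chosen.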

Although one could argue that a model is sufficiently generalizable to be applicable to locally derived data, how one operationalizes the model would still reflect one's response to, and assumptions about, the conditions described. Although mathematical equations and abstracted algorithms could be built into quantitative models to cope with signal noise, non-linear correlations, and complexity further compounded by a multiplicity of variables at every sub-level of computation, the models would still be predicated on certain ideals or predictions about how possible scenarios could unfold - based on known trends, with some standard deviations thrown in, and potential null hypotheses considered. As for qualitative data, although there might be a discernible pattern in, for instance, how human subjects respond when questioned, or behave, under particular circumstances, intentional attempts by the researcher to insert or introduce new variables, whether in a planned or ad hoc manner, could elicit unprecedented responses and produce major departures, ceteris paribus, from the status quo.

In addition, conditions might be such that data from a targeted locality are not readily accessible (either classified, with no mechanism in place for declassification, or kept in a format that cannot be accessed without specialised equipment), mislaid (a very strong possibility, particularly before the age of computerised record-keeping and digitisation, or where the record-keeping infrastructure is poor), or corrupted. In 'newer' democracies that long remained under imperialism (including multiple or continuous colonialism), older data might be scattered across archives in several geographical locations, or may have been lost during the transition to self-determination.

Although one could backcast from the future to the present with available contemporary data and projected hypothetical data based on how one relates the present to the future, forecasting still requires sufficient data for detecting anomalies, as well as the weaker signals that could either reinforce or serve as a foil to the stronger ones. At the same time, backcasting is only really useful if one can gauge the knowns against the unknowns. This ability comes from being able to discern connections that might not be obvious until sufficiently large datasets are available to foreground them. These datasets are the 'big data' that drive data-intensive science towards implementable findings that will inform programmes designed with the ultimate intention of advancing the 17 Sustainable Development Goals, especially in the big science of climate change and in evidence-based medicine and healthcare.
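A toy illustration, with invented sensor readings, of why anomaly detection presupposes sufficient data: a common rule of thumb flags points lying more than two standard deviations from the mean, but with only a handful of observations the mean and deviation - and therefore the threshold itself - are unstable, so a genuine weak signal and a statistical fluke are hard to tell apart.

```python
# Minimal sketch, with invented numbers: flag anomalies as points lying
# more than k standard deviations from the series mean. The shorter the
# series, the less trustworthy the threshold this rule computes.

from statistics import mean, stdev

def flag_anomalies(series, k=2.0):
    """Return indices of points further than k standard deviations from the mean."""
    mu, sigma = mean(series), stdev(series)
    return [i for i, x in enumerate(series) if abs(x - mu) > k * sigma]

readings = [10.1, 10.3, 9.8, 10.0, 10.2, 14.9, 10.1, 9.9]  # hypothetical data
print(flag_anomalies(readings))  # the outlier at index 5 is flagged
```

Note that the outlier itself inflates the mean and standard deviation it is judged against - one reason larger datasets, or more robust estimators, are needed before such flags can be trusted.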

In JSC's inaugural ASEAN Ministers Workshop, data poverty was discussed as one of the major obstacles to designing suitable sustainable development projects, largely because in the least developed ASEAN countries, poor infrastructure makes the collection and maintenance of useful data a challenge. In addition, the political conditions of certain states could exacerbate data poverty. The lack of data could make even the best intentions potentially harmful to the communities where projects are implemented in the name of sustainable development, especially when the assumed success of these projects rests on idealised hypotheses; when there is a failure to account for the necessity of maintenance, whether of the projects or of their outcomes; when there is disconnection from realities on the ground; and when too little thought is given to how the outcomes of the projects could be integrated into local cultures and lifestyles.

Therefore, research aimed at supporting implementation and policy development needs methodologies that will work not merely with minor data gaps, but also where entire chunks of data are missing, especially in data-impoverished locations. The development of such methodologies would require creativity on the part of the researcher, as well as collaboration not merely among academic researchers, but also with the public and private sectors, which generate data for the purposes of governance, enterprise, and social development. When developing projects for meeting the SDGs, it is crucial to form transnational partnerships while building strategies for strong cross-cultural collaboration.

All of the above challenges would have to be tackled in order for applied research to translate into policies that could determine the course of action for meeting the SDGs in ASEAN. This might include data-mining in unconventional places or utilising less standard methods - such as informal approaches involving native informants and crowd-sourcing, both of which have been deployed in citizen science projects and other humanistic endeavours. Data poverty and data access are issues that require urgent attention to counter the hegemony of knowledge derived from more data-accessible and data-rich environments, so that the weak signals from under-represented knowledge regions are not drowned out by more dominant ones.