The ICT4Peace Foundation recently published The potential and challenges of open data for crisis information management and aid efficiency: A preliminary assessment, a short monograph in large part based on a longer report I wrote on Big and Open data about a month ago. It’s a work in progress. Our understanding and support of Big and Open Data is evolving rapidly – with new concerns as well as newfound optimism even in such bureaucratic catacombs as India. In that sense, whatever is written on the topic is very quickly outdated. A personal challenge also remains to keep abreast of these developments in the midst of my own work in Sri Lanka, which this year I’m focussing more energy on. So warts and all, here’s my larger essay on Big and Open Data – and the sections I’m most interested in fleshing out and discussing more are around the ethics of generation, use and dissemination, the potential for investigative journalism that harnesses the skills and expertise of consumers, and the potential for peacebuilding.
Download as a PDF here.
What is Big Data? What is Open Data?
Big data is the new black. From business analytics to governance that is potentially more accountable and efficient, Big Data and the ease with which citizens can access it are redefining how governments, civil society and business operate. References to Big Data have grown exponentially over 2011, and examples of big data abound, sometimes in the unlikeliest of places that have hitherto sought to contain and control information. Big and Open Data is also being driven by what the authors of The Role of Crowdsourcing for Better Governance in Fragile State Contexts call the philosophy of ‘open-source governance’, which advocates an intellectual link between the principles of the open-source and open-content movements, and basic democratic principles. With the objective of enabling ordinary citizens to contribute directly to the formation of policy, open-source governance theoretically provides more direct means to effect change than do periodic elections. In addition to the push for more open data from citizens on the ground, multinational corporations, Bretton Woods institutions and the United Nations system are also embracing Big Data. The World Bank’s online Data Catalog provides download access to over 7,000 indicators from World Bank data sets. The World Bank also currently has three different APIs to provide access to different datasets: one for Indicators (or time series data), one for Projects (or data on the World Bank’s operations), and one for the World Bank financial data (World Bank Finances API). The visualisations that help decision making and planning from Google’s Public Data repository are made possible by Big Data. http://DataCatalogs.org lists 212 open data catalogs from around the world. http://wiki.civiccommons.org/Initiatives is largely US centric, but lists dozens of sources also from around the world with Open Data initiatives.
Hundreds if not thousands of open datasets already exist, and platforms like the Open Knowledge Foundation’s CKAN datahub help ordinary users access them. There is also a growing interest amongst citizens in using and accessing Big and Open Data. As a recent article on Big Data published in the New York Times notes,
What is Big Data? A meme and a marketing term, for sure, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions. There is a lot more data, all the time, growing at 50 percent a year, or more than doubling every two years, estimates IDC, a technology research firm. It’s not just more streams of data, but entirely new ones. For example, there are now countless digital sensors worldwide in industrial equipment, automobiles, electrical meters and shipping crates. They can measure and communicate location, movement, vibration, temperature, humidity, even chemical changes in the air.
Link these communicating sensors to computing intelligence and you see the rise of what is called the Internet of Things or the Industrial Internet. Improved access to information is also fueling the Big Data trend. For example, government data — employment figures and other information — has been steadily migrating onto the Web.
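The arithmetic behind the IDC estimate quoted above is easy to check. The short Python sketch below assumes steady compound growth at 50 percent a year, which is a simplification made purely for illustration; the function names are hypothetical helpers and do not come from IDC or the New York Times.

```python
import math


def growth_factor(annual_rate, years):
    """Return how many times larger a quantity is after compounding."""
    return (1 + annual_rate) ** years


def doubling_time(annual_rate):
    """Return the number of years needed to double at a steady annual rate."""
    return math.log(2) / math.log(1 + annual_rate)


# Assumption: a constant 50 percent annual growth rate, as quoted above.
two_year_factor = growth_factor(0.50, 2)
years_to_double = doubling_time(0.50)

print(f"Growth over two years: {two_year_factor:.2f}x")  # 2.25x
print(f"Years to double: {years_to_double:.2f}")         # 1.71
```

At that rate the volume of data multiplies by 2.25 every two years and doubles in roughly 1.7 years, which is consistent with the quoted phrase "more than doubling every two years".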
The Human Genome Project was perhaps the first celebrated instance of Big Data as “arguably the most valuable dataset the biomedical research community has ever known”. Access to and use of this data is free and unrestricted for scientists worldwide. As the Human Genome Project noted over a decade ago,
Translating the text of the human genome into practical applications that will alleviate suffering is one of the greatest challenges facing humankind. This mission will require the work of tens of thousands of scientists throughout the world. No scientist wanting to advance this cause should be denied the opportunity to do so for lack of access to raw genomic data. Delaying the release of either unfinished or finished genomic DNA sequence data serves no scientific or societal purpose.
The HGP’s prescience is even more relevant today. Can HGP’s commendable openness be applied to the domains of socio-political analytics, crisis information management, Big Data around pandemics, the prevention of famine, conflict early warning, complex political emergencies and the support of peace processes? Each of these is an extremely complex system that generates Big Data. This Big Data in turn gives the potential to engage in these complex processes in a better, more strategic, informed, timely, effective and efficient manner. Big Data undergirds a lot of developmental and humanitarian work even today, and its importance and impact will only grow.
The ‘big’ in Big Data is a moveable, moving target. Big Data is made moveable by advances in transmission speed and technologies over the Internet, making it possible for even raw scientific data to be transferred to multiple locations around the world for offline and offsite analysis and visualisation. The Square Kilometre Array (SKA) project, for example, once fully online will redefine Big Data not just for the scientific community, but possibly go on to re-engineer supercomputing and data transmission technologies as we know them today, given that data transport from the radio dishes will produce ten times the current global internet traffic. What is big today in terms of a dataset will be dwarfed tomorrow by SKA’s raw output, leave aside ancillary operations. But it’s not just big science that produces Big Data. As the World Economic Forum flags,
Sources such as online or mobile financial transactions, social media traffic, and GPS coordinates now generate over 2.5 quintillion bytes of so-called ‘big data’ every day. And the growth of mobile data traffic from subscribers in emerging markets is expected to exceed 100% annually through 2015.
It is clear then that we are awash in data. Yet ironically, in this information glut, knowledge is an increasingly precious commodity. There are also growing concerns about the use and licensing of Big Data, with allegations that big corporations are locking data generated by users, which those users should in fact own and do with as they please, into their own platforms, services and products. At stake is a global, interoperable, accessible data commons, nourished not just by datasets published by the UN, multinational corporations, civil society, NGOs and regional organisations, but by the aggregate information inflows from the billions of transactions made by citizens today using their mobiles and computers, augmented by remote sensing data as well as satellite imagery. From literally the grassroots to geo-stationary orbit, we are enmeshed in information pathways that are opening up to expose hidden data that governments and other actors are fast realising is too complex and large for them to analyse on their own. Big Data, through gaming problems and encouraging citizens to make use of public data through APIs, is helping people help themselves, and through this process, help governments and the UN better understand the dynamics of a complex world. Big Data is no panacea – its mere existence signifies nothing other than 1s and 0s on a hard drive spinning in a data centre. Significant challenges and enduring questions over accessibility, availability, usage rights, platform and API development will shape our appreciation of Big and Open Data in the years to come. This brief paper will look at some of these issues, and also place Big Data within the context of peacebuilding, media and governance.
Overall impact on global information situation
Big Data allows us to extrapolate from large datasets vital patterns and indicators that can influence, progressively or otherwise, the lives of individuals on the ground. At its most benign, it can help, for example, manage traffic and pedestrian flows in mega-cities so that congestion is eased and energy consumption is reduced and, through infrastructure and public services that respond to socio-cultural, seasonal and geographical interaction patterns (e.g. waste disposal, municipal and local government offices, public spaces like parks, parking, entertainment centres, libraries and zoning), significantly improve the quality of life of residents. NYC OpenData, the map of block by block energy consumption in New York, Open Data Manchester, The City of Vancouver Open Data Catalogue, http://data.gov.uk and the many other open data initiatives within the UK, Europe Public Data, http://opendata.go.ke and toronto.ca/open are just a few good examples on this score. The New York City Checkbook site goes further,
A new version of the site expected to launch this year will grant access to detailed information about the contracts related to each individual payment, city vendors and the contracts they’re working on, and how payments made to date stack up against each agency’s budget. The site will also include information about incoming revenues and should also play host to a collection of audits, budget reports and other analysis produced by the network of entities responsible for poring over the city’s finances. “Our goal is to make New York City the most financially transparent government in the United States,” said Ari Hoffnung, who is now the city’s deputy comptroller for public affairs but had been involved in launching the original Checkbook project while working in another part of the Comptroller’s office. The new version of the site, which Hoffnung is calling Checkbook 2.0, is expected to have an application programming interface at launch. In theory, this could enable third-party developers to pull raw data from the city and use them to build anything from a site that tracks how much the city spends on pizza parties for employees to one that provides a daily look at how tax revenues are sizing up against expectations.
Visionary multi-nodal initiatives like Europe Citadel on the Move and FuturICT, out of Switzerland, will use Big Data to “integrate ICTs, complexity science and the social sciences to create a paradigm shift, facilitating a symbiotic co-evolution of ICT and society. Data from FuturICT’s complex globe-spanning ICT system will be leveraged to develop models of techno-socio-economic systems”. Simply put, this is putting Big Data to good use – an SKA type initiative focussed not on the heavens, but on improving and making sense of our increasingly complex terrestrial lives. And there are similar initiatives to contribute to Big Data from large, international developmental agencies, which range from opening up aid data during horrible famines to simply being more transparent about aid delivery in general.
Given the growth in megacities over the next 10 – 15 years, Open Data will allow inhabitants firstly to imagine and implement solutions to address poor service delivery, corruption, congestion, pollution and issues with the general quality of life, and will also help them, in collaboration with or in spite of municipal and local government, foster evidence based and data driven debate on reform, accountability and improvements to efficiency. Big Data in other words is going to be central to governance not just at the national level, but at the sub-national levels as well. And just as Big Data will contribute to solutions, these solutions in turn will generate more data that will feed into ever bigger datasets held by governments and other actors, including civil society. At the time of writing for example, the International OGD Catalog Search alone indexed 710,058 datasets from 115 catalogs in 16 languages representing 32 countries and international organisations. All bets are off on how much the Big Data catalogued on this site alone will grow over 2012.
Importance for humanitarian response, crisis info management
The Crisis Information Management (CiM) strategy is based on the recognition that the UN – its member states and constituent agencies – have significant experience in the identification, prevention, mitigation, response and recovery of all types of crises, natural as well as man-made. The CiM strategy is driven by the UN Office of Information and Communications Technology, based in New York and supported by the ICT4Peace Foundation. Importantly, the CiM strategy flagged, way back in 2008, that crises today require information to be shared within and between agencies of the UN and its partners, and also that interoperability of CiM systems is critical in this regard. After the devastating earthquake in Haiti in 2010, the UN system’s resistance to Big Data from outside its constituent agencies and departments lessened. A number of factors contributed to this.
- Crises over 2010, most notably Haiti in January, demonstrated clearly that actors outside the UN system are today, with their own sophisticated technology platforms and an approach to distributed work, vital first responders producing, disseminating and archiving information before and often despite the UN and governmental actors.
- Challenges within the UN system to respond to crises of Haiti’s scale and the frequency with which such disasters occurred over the course of the year. The Head of UN OCHA, Sir John Holmes, in a strongly worded email in February 2010 expressed his frustration over the UN’s aid effort in Haiti, noting that “only a few clusters have fully dedicated cluster coordinators, information-management focal points and technical support capacity” and adding that the disjointed effort was casting doubts on the UN’s ability to effectively provide relief.
- The creation and curation of the ICT4Peace Foundation’s Crisis Information Management (CiM) wikis for Haiti, Kyrgyzstan, the Gulf Oil Spill, the Chilean earthquake and the Pakistan floods engaged with hundreds of data sources for each wiki, both within the UN system and outside it. Sources like local news and social media, and mainstream media on the web, in particular stood out as vital resources.
- Importantly, the Foundation’s CiM wikis demonstrated that even for Common Operational Datasets (CODs), key information after a sudden-onset crisis often came from sources outside the UN and official government channels – for example with mapping and GIS information, health infrastructure related information, ground conditions and ad hoc IDP camp locations.
Common operational datasets are predictable, core sets of data needed to support operations and decision-making for all actors in a humanitarian response. Some of the CODs, such as data on the affected population and damage to infrastructure, will change during the different phases of the response and therefore will need to be frequently updated and maintained. Other CODs, such as rivers and village locations, are likely to remain the same throughout the response. The CODs are proactively identified and maintained prior to an emergency as part of data preparedness measures and made available by the OCHA (or pre-agreed in-country alternate) within 48 hours of a given humanitarian emergency. All CODs must meet minimum criteria for format and attribute information in accordance with national standards.
Fundamental operational datasets are datasets required to support multiple cluster/sector operations and complement the common operational datasets. These datasets are characterised by thematic areas (such as education facilities) and are made available as soon as possible after the onset of an emergency given availability.
The COD/FOD platform can be seen as Big Data from the UN, made available publicly. The platform is now also nourished by data from sources that are non-traditional for the UN. Less than a year after it went live, as of mid-December 2011, 142 COD datasets were registered on the platform – 76 of which are for priority countries. 36 countries have at least one COD registered. Furthermore, 33 FOD datasets are registered, of which 25 are for priority countries. 16 countries have at least one FOD registered. In a report (Connecting Crisismapping to the United Nations) published as early as October 2010, the ICT4Peace Foundation outlined the possibilities for and potential of crowdsourced information to fertilise UN OCHA’s CODs and FODs.
…at the retreat of CiMAG on 5 October 2010 at IBM Palisades in New York, the ICT4Peace Foundation underscored the point made in the IASC draft document on CODs that “some of the CODs, such as data on the affected population and damage to infrastructure, will change during the different phases of the response and therefore will need to be frequently updated and maintained.” Even without an expansion of the existing CODs, the Foundation’s submission was that crisismapping could and should feed into the updating of UN datasets after a crisis occurs, complementing timely efforts by UN agencies and the cluster leads to update this information.
From the mediagenic and well-known initiatives like Global Pulse to a number of lesser known but no less revolutionary intra-agency platforms and projects to harness Big and Open Data, the UN is, far more rapidly than anticipated at the time of this report’s release, shifting gears to both make its own data open, and open its analysis and decision making frameworks to information sourced from datasets outside its traditional domains of comfort and control. This is a positive development, and needs to be strengthened and supported.
Importance for peacebuilding and democratic governance
In August 2011, Fast Company ran a story on how the US Department of Defence will make otherwise hard to access data on war and conflict available to academics, and help crowdsource military tactics. As the story notes,
The massive, publicly accessible conflict data archive called “The Empirical Studies of Conflict (ESOC)” project, headed up by Stanford and Princeton University academics, will also publish working papers and other research showing their findings. ESOC is funded by a five-year, $8.6 million Defence Department grant. According to Stanford University literature, the Defence Department’s goal is to make hard-to-find data on conflicts and insurgencies available to the academic community… Traditionally, academics in the social sciences (anthropology, economics, political science, etc.) have hesitated to work with the Defence Department directly. Making conflict data publicly available allows the military to sidestep what they perceive as academic bias. The raw data being used by ESOC comes from a variety of sources. Apart from military and government records, the project also makes use of more unorthodox sources – among other things, more than 700 files belonging to Al Qaeda in Iraq and battlefront materials from a variety of conflicts.
As contentious as this sounds, it is a trend that will grow, and one that academics were experimenting with even before the US DoD’s project. Given that the military and intelligence services have some of the most comprehensive datasets on their respective countries, and in some cases, on other countries, we can expect subsets of these big datasets – which are themselves very large – to be parcelled out and openly published. Although commonly associated with big business, ‘data philanthropy’ by governments – the anonymisation and sharing of even sensitive data for public use – will grow, and aid peacebuilding by providing new insights into complex political emergencies, peace negotiations and peace processes, including post-war state-building and reconciliation. Yet, as the Fast Company article goes on to note,
Of course, the military isn’t paying to make all this conflict data available for altruistic reasons. By providing non-classified and declassified data sets and other information caches to academics and researchers, the Defence Department can essentially crowdsource research and analysis.
The impact however of using these datasets can be much more than improving the efficiency and effectiveness of wars. It can contribute to healing, understanding and the identification of socio-economic, ethnic, cultural, religious, partisan, political, tribal and other patterns that help early warning, mitigation, response and recovery. Aid agencies working in some of the most complex political emergencies today are already realising the potential of Open Data to transform their work on the ground. The Visualising Afghanistan initiative by The Asia Foundation notes,
The Open Data Movement, the notion that certain types of data should be “open” and available to all for analysis and evaluation, is an idea that’s creating new opportunities to strengthen development efforts across many fields. For The Asia Foundation, we see opportunities in areas like transparency, accountability, and good governance where open data can help enhance efforts to build more responsive and effective institutions of governance. We developed Visualising Afghanistan to bring even more accessibility to our survey data in Afghanistan. We invite you to have a look, download the data, and build your own visualisation.
Note the Foundation’s appeal for users to create their own visualisations using Open Data, which is key to crowdsourcing perspectives that can in turn help better understand complexity on the ground, and support conflict mitigation work. The potential of this line of thought is captured by Robert Kirkpatrick, Director of UN Global Pulse at the Executive Office of the Secretary-General, in an article that looks at how Big Data is not always a dataset in a single location or website, but needs to also be seen as the aggregation of thousands if not millions of disparate nodes – from remote sensing information to SMSs by citizens – and the potential this holds for massive, real time and regressive data analysis that can save lives. As he avers,
Global Pulse is looking… for the smoke signals that could reveal collective changes in behaviour related to incipient harm in many areas of human life. Our Pulse Labs – beginning with Pulse Lab Kampala late this summer – will allow governments, development organisations, academe and the private sector to come together, experiment with new data, tools and approaches, and develop a formal methodology whereby the earliest smoke signals are used to trigger a process of investigation, verification, and response. We will likely need to analyse data gathered through remote sensing, social network analysis, news media mining, and crowdsourcing of citizen reports, and combine it with traditional indicators already used in crisis monitoring. Only then might we be in a position to understand the underlying causes and risks of future impacts on health, nutrition, livelihoods, or education, and to use this information to improve our capacity to protect vulnerable populations from harm.
Although Kirkpatrick does not refer to violent conflict or peacebuilding, the applicability of the models of sense-making and almost real-time analysis that Global Pulse is developing to the domains of conflict resolution, transformation, early warning and mitigation is obvious. And though the use of the phrase ‘data exhaust’ invariably brings to mind the toxicity of fumes, Kirkpatrick’s point that the by-products of online and mobile interactions, when collated and analysed, can give vital indicators of a region’s risk vectors is an important one. It is similar to the approach used by PAX, which “plans to launch a global digital system to give early warning of wars and genocide”. But Big Data is not just about what is now, and possibly what will be in the future. As flagged by GigaOM,
Benetech is a non-profit organisation that has been contracted by the likes of Amnesty International and Human Rights Watch to address controversial geopolitical issues through data science. Several years ago, they were contracted to analyse a massive trove of secret files from Guatemala’s National Police that were discovered in an abandoned munitions depot. The documents, of which there were over 80 million, detailed state-sanctioned arrests and disappearances that occurred during the country’s decades-long civil conflict that occurred between 1960 and 1996. There had long been whispers of a genocide against the country’s Mayan population during that period, but no hard evidence had previously emerged to verify these claims. Benetech’s scientists set up a random sample of the data to analyse its content for details on missing victims from the decades-long conflict. After exhaustive analysis, Benetech was able to come to the grim conclusion that genocide had in fact occurred in Guatemala. In the process, they were able to give closure to grieving relatives that had wondered about the fate of their loved ones for decades.
The potential of massive datasets in peacebuilding also extends to image analysis of possible war crimes and crimes against humanity. Ushahidi’s Patrick Meier has written about how the members of the Standby Volunteer Task Force (SBTF) Satellite Team tagged the location of hundreds of Syrian tanks and other heavy military equipment on the Tomnod micro-tasking platform using very recent high-resolution satellite imagery provided by Digital Globe. And although it isn’t Open Data, the LRA Crisis Tracker also uses a number of sources (from those on the ground to situation reports from the UN system and other actors) to present a temporal and geo-spatial representation of one of the world’s most brutal terrorist groups, in one of the world’s most unstable regions. Seeing this data presented thus on the web usually compels a reaction, which most often is to find out more about the conflict and ways to help. It presents a complex scenario in a visually compelling form, and weaves multiple datasets onto a single map to present an interactive website that bears witness, and strengthens advocacy in support of conflict transformation, and constructive intervention.
Investigative journalism also benefits from Big Data, though its potential is framed by the ability of journalists to grapple with the tools and platforms that help them manipulate large datasets in their reporting. Using tools like Google Refine and Google Fusion Tables to go through large datasets about campaign financing or tenders awarded over a certain period can yield interesting insights that hold public officials, governments and other actors accountable. Analytical tools and platforms (R) as well as new models of journalism that make available source material and in essence contribute to large datasets of source / reference material (DocumentCloud), new journalism platforms (OpenPublish), new analytical tools (OpenCalais, Recorded Future) and data visualisation platforms (the open source Weave is specifically geared to help with Big Data, the free Tableau Public, IBM Many Eyes and Overview, the News Challenge winner of Associated Press which is built ground up to deal with Big Data for media) are all pushing civic media as well as mainstream journalism to use Big Data, and also add to it. Data journalism is a term being used more and more to help understand how media can both inform, and be informed by Open Data. Newspapers like the Guardian have entire sections online now dedicated to data driven journalism and have even gone on to create APIs and other tools to help subscribers and visitors to their websites help the paper go through Big Data for key stories. The Guardian’s tools to help readers deal with Big Data are free to boot. Online initiatives like Straight Statistics aim to demystify what can be very complex datasets, and as the site notes, “improve the understanding and use of statistics by government, politicians, companies, advertisers and the mass media. By exposing bad practice and rewarding good, we aim to restore public confidence in statistics”.
Big Data is even creating new institutions that leverage the relatively low cost of online publishing with the power of crowdsourced Big Data analysis and reporting. A good model is the Sunlight Foundation, which “develops and encourages new government policies to make it more open and transparent, facilitates searchable, sortable and machine readable databases, builds tools and websites to enable easy access to information, fosters distributed research projects as a community building tool, engages in advocacy for 21st century laws to require that government make data available in real time and trains thousands of journalists and citizens in using data and the web to watchdog Washington”. Another cogent example that highlights the intersection of Big Data and journalism is ScraperWiki, and the potential of projects like Violence Against Journalists in Afghanistan and in particular, the Mine project, in the service of investigative journalism (think of for example decades of human rights violations, or disappearances that have been tracked by various NGOs, which can be mined and visualised for patterns, advocacy and even accountability).
Given that the media plays such a central role in conflict early warning, prevention, mitigation, transformation, recovery and reconciliation, these new platforms and tools – that are not just for mainstream media but for new media initiatives that are community focussed, citizen centric, and activist in nature – can, through Big and Open data published online, contribute more than ever before to public debates on inconvenient truths as well as furnish progressive ideas for reform and democracy promotion. These tools aren’t the exclusive domain of progressive actors, and will be used by those intent on control and misinformation. But the barriers to accessing Big Data, even if one does not have programming expertise, are coming down, and new ideas that use Big Data published by others to support investigative journalism are growing, and this can only help strengthen peacebuilding.
Potential pitfalls, downside
Wikileaks is Big Data, yet none of the reference material read for this paper once refers to it. http://cablesearch.org and http://cablegatesearch.net are two equally competent platforms that mine Wikileaks for its tranche of US diplomatic cables, first released in 2010. OpenLeaks describes itself as “a project that aims at making whistle-blowing safer and more widespread. This will be done by providing dedicated and generally free services to whistleblowers and organisations interested in transparency.” The so-called hacktivist group Anonymous has leaked hundreds of thousands of emails and gigabytes of information that were fire-walled, most recently the emails of the tyrannical Syrian President. Terabytes, even petabytes, of information from whistleblowers and a global community of hackers will invariably pose an ethical challenge to Big Data – what datasets are acceptable for use and what are not? Recalling the section on Big Data licensing, is there some marker, some standard that prescribes the use of once confidential material placed in the public domain, and furthermore, of information products (reports, visualisations, situation reports) that include these datasets as source material?
Wikileaks was picked up by repressive governments and activists in those countries with equal fervour – one seeking to outwit the other in ascertaining the nature and degree of confidential information and action that could be taken as a result. This information was used for damage control as well as the targeting of activists. Assange often makes the point that he knows of no one killed on account of the cables released by Wikileaks. That may well be true, but the real terror amongst activists about their names being revealed, and the sheer embarrassment and loss of face the cables have caused those who, sometimes with valid reason, have projected a very different public persona, must also be taken into account when using the cables in reporting, referencing and activism. On the other hand, large datasets of confidential information in the public domain can also result in public accountability. As this article from GigaOM notes,
NYU Ph.D. student conducts comprehensive analysis of Wikileaks data for greater insight into the Afghanistan conflict: Drew Conway is a Ph.D. student at New York University who also runs the popular, data-centric Zero Intelligence Agents blog. Last year, he analysed several terabytes worth of Wikileaks data to determine key trends around U.S. and coalition troop activity in Afghanistan. Conway used the R statistics language first to sort the overall flow of information in the five Afghanistan regions, categorised by type of activity (enemy, neutral, ally), and then to identify key patterns from the data. His findings gave credence to a number of popular theories on troop activity there–that there were seasonal spikes in conflict with the Taliban and most coalition activity stemmed from the “Ring Road” that surrounds the capitol, Kabul, to name a few. Through this work, Conway helped the public glean additional insight into the state of affairs for American troops in Afghanistan and the high degree of combat they experienced there.
Yet juxtapose this with what Julian Assange said of the release of US cables via Wikileaks related to Afghanistan,
he insisted that any risk to informants’ lives was outweighed by the overall importance of publishing the information. Mr Assange said: “No one has been harmed, but should anyone come to harm of course that would be a matter of deep regret – our goal is justice to innocents, not to harm them. That said, if we were forced into a position of publishing all of the archives or none of the archives we would publish all of the archives because it’s extremely important to the history of this war.”
These two very different examples about the same initiative (the datasets were different, though both were leaked and, before being published on Wikileaks, confidential) bring into sharp focus the challenge of banning outright the use of controversial Big Datasets, as well as the inherent dangers of Assange’s chutzpah in assuming that this data in the public domain serves only to strengthen democracy and progressive groups. The problem is replayed through other actors and platforms that are inextricably entwined with the generation of Big Data. For example, the same vigilante group that hacked into and published the current Syrian President’s email also recently hacked into the FBI, US Department of Justice, Recording Industry Association of America and Motion Picture Association of America websites. It is very clear that Anonymous itself and other loose, global groups like it will prise open fire-walled, sensitive information and contribute significantly to Open Data. As this op-ed by a member of Anonymous on Al Jazeera notes,
Today, I predict that Anonymous and entities like it will become far more significant over the next few years than is expected by most of our similarly irrelevant pundits – and this will, no doubt, turn out to be just as much of an understatement as anything else that has been written on the subject.
The fact is that the technological infrastructure that allows these movements has been in place for well under a decade – but phenomena such as WikiLeaks and Anonymous have already appeared, expanded, and even become players within the geopolitical environment; others have come about since. This is the future, whether one approves or not, and the failure on the part of governments and media alike to understand, and contend with the rapid change now afoot, ought to remind everyone concerned why it is that this movement is necessary in the first place.
If understanding events, contextualising complex processes and designing strategic responses to save lives are going to be based on information that is illegal in some jurisdictions, and carries a heavy extra-legal penalty for using in other contexts, there needs to be an urgent discussion on what role information from whistleblowers in particular, as well as Big Data gleaned and published through illegal interception, surveillance and hacking can and should play in peacebuilding processes, as well as other domains like governance and humanitarian aid.
Linked to the vexed issue of licensing is the question of data sourcing. Output and products that have as sources Big Data that is generated by Anonymous, Wikileaks and OpenLeaks risk being legally and politically persecuted and discriminated against in multiple jurisdictions, even if the value of including this information in analysis and visualisation augments progressive advocacy, activism and accountability in governance. This means that individuals in no way connected to or remotely associated with the group that published sensitive, classified information who go on to use this information in their analytics, visualisations and other output risk persecution and even prosecution. This is new terrain that is not covered by the laws and regulations of any one jurisdiction or set of existing guidelines. Apart from governments, intelligence services and the military – who are more strictly bound to data use and sourcing frameworks – once even previously classified information is in the public domain, someone, somewhere will find a use for it and make use of it. The very nature of the web is such that it doesn’t matter where the Big Data resides – its access, use and distribution can be global.
This has major implications for, inter alia, data secrecy laws, whistle-blowing protection and enabling legislation, Right to Information and Freedom of Information legislation, investigative journalism, national and global activism and indeed, the manner in which repressive regimes will themselves use distributed intelligence and disparate datasets to strengthen their dictatorships. On the other hand, and as a countervailing force, how can Big and Open Data help with conflict transformation and peacebuilding?
One entry point to a brief exploration on the points enumerated above is the report Big Data, Big Impact: New Possibilities for International Development released by the World Economic Forum. The report correctly notes that,
“But now many policy-makers, corporate leaders and development experts are realising the potential applications, like the examples above, for the enormous amounts of data created by and about the individuals who use these services.”
The emphasis in the original suggests WEF is acutely aware that Big Data is not just about Big Science, but is increasingly if not already about individuals as producers and consumers of information. Individuals have inalienable rights and protection under the Universal Declaration of Human Rights and other domestic and international legal instruments. It is unclear how this extends to large datasets that aggregate the information they produce, which can, even when anonymously or pseudonymously collated, directly impact their lives. A rights based approach to Big Data governance is missing. The WEF report hints at this when it notes that for a thriving data commons, end user empowerment and access are important, and that
“Individuals must have a moderate degree of financial literacy, affordable access to a mobile device, and a mobile network connection, in addition to control over their own information.”
The emphasis, which is not in the original, seeks to flag a central concern over Big Data today. The disturbing trend is that in fact, the control of information by its producers – citizens, individuals, people – is getting increasingly rare. This is not necessarily a bad thing. An SMS can contribute to a much larger pool that goes on to influence a decision to send election monitors to a certain area, based on a crowdsourced map that shows hotspots, in order to mitigate election related violence. The sender of the SMS has no control over how her/his text is used, but sends it in good faith and with an implicit relinquishing of ownership when posted to the public domain through a hotline or shortcode. The flipside of course is that there is no guarantee that this information will be used only for social good, and not to promote hate and harm. Another problem is the lack of information expiration. Data generated for example in the run up to an election resides for months and years online, contributing of course to valuable regression analysis and violence modelling, but also serving as a source of information for violence by victors against other political opponents and parties. When combined with electoral registries, voting patterns of specific areas and demographic information, Big Data can provide a tableau for planned violence over the longer term, ironically using the same datasets that could have contributed to its mitigation and avoidance in the short term. So while the WEF report notes that,
“When individuals have multiple SIM cards, it is impossible to aggregate data from each SIM back to the same individual. This data is most useful if it can be attached to demographic indicators, which allow the data to tell a story about the habits of a segment of the population. Improved methods of tying subscriptions to demographic information are needed to ensure data generated by mobile devices is as individualised as possible.”
it is unclear how such an information management regime will contribute to social and political stability in a complex political emergency or nascent democracy with competing faultlines.
Again, as the WEF report notes,
“The private sector maintains vast troves of transactional data, much of which is ‘data exhaust’, or data created as a by-product of other transactions. With the use of mobile phones, much of this data can be associated with individuals and their locations. The public sector in most countries also maintains enormous datasets in the form of census data, health indicators, and tax and expenditure information.”
WEF does not peg risk to this information in the public domain, and arguably, the benefits do outweigh any harmful use. However, the absence of any comprehensive “do no harm” framework for data management – from production to use – does raise the spectre of Big Data exacerbating violence, and disturbingly, in a geographically dispersed manner too, since public datasets aren’t necessarily governed by sovereign boundaries. The availability and analysis of Big Data can have negative effects on the marketplace and service delivery as well. If for example average visits by citizens in a particular area to medical centres or hospitals are transposed over a map that shows population demographics, and other transportation layers and traffic patterns are added, the resulting picture is a comprehensive overview of the actual health of a community or region. This is information insurance companies can easily use to target specific communities, by raising premiums on for example health insurance, or in an area that shows a high incidence of pedestrian collisions and congestion, vehicle insurance. This same concern is shared by Steve Lohr, writing in the New York Times,
Data is tamed and understood using computer and mathematical models. These models, like metaphors in literature, are explanatory simplifications. They are useful for understanding, but they have their limits. A model might spot a correlation and draw a statistical inference that is unfair or discriminatory, based on online searches, affecting the products, bank loans and health insurance a person is offered, privacy advocates warn.
Logistics, transportation, medical and demographic information is useful in the public domain, but mashing them up can provide insights that can be then used invisibly against specific groups of people. This algorithmic discrimination is already occurring, and will only be fuelled by more and more data placed online. The availability of the data in aggregate will inform the decisions made on the more granular – Big Data in other words will increasingly impact small lives, and the assumption that those already vulnerable in society today stand to benefit from more data online needs to be critically questioned. Eli Pariser in a TED Talk notes that even if logged out, Google used fifty-seven indicators to personally tailor search query results. Imagine this level of machine-level discriminatory data use across multiple datasets, utterly beyond the control of any one individual or in fact, any one guardian of a data set, that in turn influences how we interact with and see our world.
It is a new paradigm of Big Data mashups for which the rules of fair play have not yet fully evolved.
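To make the mashup concern concrete, here is a minimal sketch in Python. Every area name and figure in it is invented for illustration and drawn from no real dataset, and the function names `mash_up` and `flag_areas`, along with the threshold, are hypothetical. It joins two separately published aggregate tables on a shared area key and shows how the combination can single out specific communities, even though neither table names an individual.

```python
# Hypothetical, invented figures: neither table identifies any individual.
clinic_visits = {"Area A": 1200, "Area B": 300, "Area C": 2700}  # visits per year
population = {"Area A": 4000, "Area B": 3000, "Area C": 3000}    # residents


def mash_up(visits, residents):
    """Join two aggregate datasets on their shared area key and
    derive clinic visits per resident for each area."""
    shared = visits.keys() & residents.keys()
    return {area: visits[area] / residents[area] for area in sorted(shared)}


def flag_areas(rates, threshold):
    """Return the areas whose visits per resident exceed the threshold,
    i.e. the communities a third party could choose to treat differently."""
    return [area for area, rate in rates.items() if rate > threshold]


rates = mash_up(clinic_visits, population)
flagged = flag_areas(rates, threshold=0.5)  # threshold is an arbitrary assumption
print(rates)
print(flagged)
```

Neither input is sensitive on its own. It is the derived rate that enables targeting, which suggests that any assessment of harm may need to look at combinations of datasets and not only at each dataset in isolation.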
Big Data licensing and appropriate use
A good definition of openness acts as a standard that ensures different open datasets are ‘interoperable’ and therefore do form a commons. Licensing is important because it reduces uncertainty. Without a license you don’t know where you, as a user, stand: when are you allowed to use this data? Are you allowed to give to others? To distribute your own changes, etc?
Together, a definition of openness, plus a set of conformant licenses deliver clarity and simplicity. Not only is interoperability ensured but people can know at a glance, and without having to go through a whole lot of legalese, what they are free to do.
Thus, licensing and definitions are important even though they are only a small part of the overall picture. If we get them wrong they will keep on getting in the way of everything else. If we get them right we can stop worrying about them and focus our full energies on other things. – Excerpt from Open Data Commons FAQ
The question of licensing Big Data is often overlooked as one that would only interest the geeks or lawyers. It is in fact one of the most under-studied and important determinants of the utility of Big Data. The case was most recently and eloquently made by Patrick Meier, the Director of Crisis Mapping & Partnerships at Ushahidi, on the announcement by the World Bank that they were partnering with Google to “access Google Map Maker’s global mapping platform, allowing the collection, viewing, search and free access to data of geoinformation in over 150 countries and 60 languages.” The concern however is over Google’s licensing agreement for the data that it harvests from citizens in those one hundred and fifty countries and in sixty languages. As Meier rightly cautions,
“Or is this about using citizen cartographers (as free labor?) for commercial purposes? Will Google push Map Maker data to Google Maps & Google Earth products, i.e., expanding market share & commercial interests? Contrast this with the World Bank’s Open Data for Resilience Initiative (OpenDRI), which uses open source software and open data to empower local communities and disaster risk managers.”
The significant danger is that the license agreement that underwrites this new partnership is gobbledegook for most who will see it as an opportunity to map themselves, their ideas, challenges and aspirations using Google’s tools. In other words, this is no ordinary agreement. These are two giants coming together to create an opportunity for participatory mapping that is unprecedented, and precisely because of this, attention must be focussed on what is a retrogressive licensing scheme that sadly plays into the worst fears and perceptions of the Bank in developing countries, and the worst concerns of Google as a corporate, parasitic entity. Meier’s blog post flags World Bank’s OpenDRI initiative, part of the Global Facility For Disaster Reduction And Recovery (GFDRR), a partnership of 39 countries and 8 international organisations committed to helping developing countries reduce their vulnerability to natural hazards and adapt to climate change. As OpenDRI notes,
An example of OpenDRI in action is haitidata.org. This website is built using a free and open source software tool, GeoNode, and makes risk assessment data produced following the 2010 Haiti earthquake available for anyone to download and use. This tool enables individuals and organizations to contribute data and download data, thereby enabling people to collaborate and utilise these data required to rebuild Haiti.
Note that data is not hostage to any single use case, platform, technology or tool. Also note for example the licensing of the toronto.ca/open initiative’s datasets. Its introduction notes,
You are encouraged to Use and re-Use City of Toronto Datasets available under this Licence, freely and flexibly, with only a few conditions.
It goes on to note that the user / citizen is free to use the datasets to,
- copy, publish, distribute and transmit the Datasets;
- adapt the Datasets;
- exploit the Datasets commercially, for example, by combining it with other Datasets, or by including it in Your own products or applications.
Again, one notes the emphasis on the data commons instead of a particular platform or programme. Open Data Commons provides a very useful template on how Big Data can also be made Open Data through two licenses,
- Public Domain, which puts all your material in the ‘Public Domain’
- Share-Alike (plus Attribution), which like the GPL (or CC Attribution Share-Alike) requires public re-users of your data to share back changes (and attribute)
The issue of licensing clearly will deeply influence the manner in which Big Data is produced, archived, published, disseminated and actually used. One way to help ensure a data commons could be to encourage a study that looks at how the ‘do no harm’ principle, from medicine, can be transposed to guide the ethics around the generation and use of Big Data. As noted online,
Nonmaleficence, which derives from the maxim, is one of the principal precepts of medical ethics that all medical students are taught in medical school and is a fundamental principle for emergency medical services around the world. Another way to state it is that “given an existing problem, it may be better not to do something, or even to do nothing, than to risk causing more harm than good.” It reminds the physician and other health care providers that they must consider the possible harm that any intervention might do.
Can we apply the ethics of medicine to also inform Big Data, particularly datasets in, related to or applicable in fragile States, as well as those that are from, or can be used amongst, vulnerable groups of society? Can we study how we can create risk metrics for Big Data that, based on metadata, provide not just new perspectives on a problem, but also flag within those perspectives the possible dangers of implementation and wider dissemination? Or is it the case that the world of mashups is so varied, that any technical implementation to assess risk of use is bound to fail? Should then the emphasis be on a set of ethical guidelines that bind the users of Big Datasets, across all datasets, to ensure their use and reference does not willingly bring harm to others? There are of course serious challenges to such an approach. Tim McNamara writing to the CrisisMappers Google Group recently raised concerns over the applicability of any ethical guidelines for a community as dispersed, diverse and disparate as Digital Humanitarians, or virtual volunteer groups increasingly present and influential in humanitarian domains. He notes,
- There are too many interests (humanitarian, political activism, revolutionary, anarchistic) to create a single ethical standard.
- Given there are so many interests, it’s likely that any oversight/coordination body will lack legitimacy by some of the players.
- Professional bodies’ ethics boards are predicated on the ability to exclude people from the profession. That doesn’t apply here.
- A code will miss every new entrant to this space. There is no induction process, therefore it will be hard to get buy-in.
Limiting, or even attempting to frame appropriate use of Big and Open Data faces similar challenges. A way out can be through the establishment of a forum that can recognise progressive uses of Big Data, and make them known, so that the recognition internationally inspires similar use cases. Big Data use can be gamified, and the winners of open calls can be called upon to scale up their initiatives, and promote them more widely. Rather than censure the initiatives that directly harm or contribute to violence, the emphasis could be on a platform for debate and discussion, not unlike the model of Creative Commons, that posits for consideration and emulation good examples, and the promise of greater recognition through a set of guidelines anchored to ‘do no harm’ principles, that maximise progressive use of Big and Open Data – recognising fully that the actual use of large datasets will often fall outside these guidelines, and that as with any media and public information, it will always be a contest between those who use information for good, and those who seek to wield it for parochial gain, hate and harm.
Ergo, licensing and appropriate use guidelines, along with in some cases the consideration of possible punitive measures against individuals and groups who use Big Data to harm and promote hate, need to be urgently discussed at national, regional and global levels. Even if a single, global compact and accepted framework isn’t the outcome of these discussions, there should at least be platforms – some of which could be driven by and located in the UN – for the debate of best practices, lessons identified and learnt and case studies over the access and use of Big Data.
Dangers of Big Data
The dangers of Big Data are less well explored than its potential. Again at its simplest, the availability of large datasets in the public domain does not mean that this information will be made use of, or that when it is used, it will be harnessed in a democratic or progressive manner. There is almost no rights based perspective on the production, dissemination, usage rights and archival of Big Data today. The animating assumption is that when data is published freely and openly, the angels will come. This may not be the case in fragile states, where even if information in the public domain is scarce, analysis of Big Data (e.g. social media, crowdsourced maps, surveillance of online and mobile communications) can help regressive regimes track and target communities and people they perceive as threats or inconvenient. Here, even aggregate information can place communities and specific individuals at risk. In November 2011, the Wall Street Journal published a remarkable story based on an undercover investigation which revealed,
“…a trove of 200-plus marketing documents, spanning 36 companies, [which included] hacking tools that enable governments to break into people’s computers and cellphones, and ‘massive intercept’ gear that can gather all Internet communications in a country.”
This too is Big Data, but it’s not exactly Open Data. And there’s the rub. Today’s championing of Big Data is often erroneously conflated with Open Data, the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. Surveillance on the scale made possible by the platforms flagged in the WSJ report can contribute to massive datasets that are extremely granular in nature, but it is unlikely they will ever be published as Open Data. Any large dataset placed on the Internet today is seen as progress. It can well be such, but one also needs to be sensitive to availability (is the data always where it is supposed to be?), access (how to get access as well as terms of access) and use (what if any restrictions govern the use of the data, and any output that uses it). The choice of platform, the means of access (availability of APIs) and the intricacies of data licensing in other words must inform a critical appreciation of Big Data. Other important considerations include timeliness and reliability of data in the public domain.
- The conflation of Big Data with Open Data is dangerous, for large datasets aren’t necessarily those that are open for use by anyone, and even those published on the web could have licensing restrictions that hold hostage any output using the source material to the discretion of the dataset’s owner.
- There is no guarantee that Big and Open Data will improve governance in fragile states. As the authors of The Adverse Effects of Sunshine: A Field Experiment on Legislative Transparency in an Authoritarian Assembly note,
An influential literature has demonstrated that transparency can improve the performance of political officials in democracies, particularly within elected legislatures. Although the direct causal link for this relationship runs through the incentives created by voters’ responses to the new information, recent donor projects have begun to export transparency interventions to authoritarian regimes under the assumption that NGOs and media can substitute for voters in these systems. These interventions, however, are at odds with an emerging literature, which argues that authoritarian parliaments primarily serve the role of cooptation and limited power-sharing, where complaints can be raised in a manner that does not threaten regime stability. Under these conditions, transparency may have perverse effects.
The authors go on to note that,
[The paper’s findings] should make us cautious about the export of transparency to authoritarian systems, where similarly-named institutions play vastly different roles in the polity… While we cannot say that increased transparency definitively decreases performance, as performance in legislative debates actually improved, we can say that its effects in non-democratic systems are nuanced and hard to predict. Caution should precede all legislative transparency programs in authoritarian countries.
In other words, the assumption that a more transparent, accountable government is the inevitable result of the continued and increased publication and use of Big and Open Data needs to be seriously questioned. It may well be the case for example that those who stand to benefit the most from massive data analytics in repressive regimes are not vulnerable citizens or communities at risk of violence, but those in power.
- The loss of information will be felt more sharply in the age of Big Data. Big and Open Data exist on the web, and there is no guarantee that these information domains – increasingly used and useful – have the necessary failsafe mechanisms against data loss arising out of machine failure, software faults or malicious user interventions. As I have noted earlier, this raises a number of questions including,
- What are the technologies that can be used for cheap, reliable, easy to use local data storage that can create mini data centres for communities of users unable to afford comprehensive backup solutions of their own? Is there a case here for e-gov initiatives that actually promote backup solutions amongst citizens?
- What are the data standards that can be used to store information produced today by the machines that will replace PCs and mobile phones 25 years hence?
- Social networking sites are information black-holes as well as rich in personal information. If a site goes out of business, so does the information. How can we prevent this?
- For organisations such as the UN and even large NGOs (as well as corporations) information management in an age where there is more produced than can be stored is a nightmare. The organisations I work with can’t even find what they are producing today, leave aside searching for and accessing information produced a few years ago. How can institutional memory survive in a context of inevitable information loss?
To not address the challenge of data reliability and redundancy (business continuity) when there is increasing reliance on Big Data sources is akin to driving blindfolded – it is an accident waiting to happen, the consequences of which can have ripple effects globally. Data loss can lead to real violence on the ground and the loss of irreplaceable data – like witness testimony, human rights violation mapping and cases and archives of crisis generated information. This is particularly pertinent for crisis information management platforms.
- As far back as 2008, the British government said that it was “considering spending up to £12 billion on a database to monitor and store the internet browsing habits, e-mail and telephone records of everyone in Britain” and that “GCHQ, the government’s eavesdropping centre, has already been given up to £1 billion to finance the first stage of the project”. The resulting data is Big, but not Open. The possibility of data leaks cannot be ruled out. Corporations like Facebook, Google and Twitter today have petabytes of data. This information is mainly used for business analytics, but there are real and growing concerns over data privacy, ownership and access plus the use of this information that could potentially discriminate against individuals or particular sets of people, on the lines of geography, demographics, habits, likes, discursive and engagement patterns and other personal markers. As a World Economic Forum (WEF) report from January 2011 flags, “Fundamental questions about privacy, property, global governance, human rights – essentially around who should benefit from the products and services built upon personal data – are major uncertainties shaping the opportunity.”
- Linked to the previous point, the world of Big Data presents new, complex challenges that didn’t hitherto exist. For example, the aggregation, analysis and dissemination of information from disparate data sources can lead to more, not reduced risk. Big and Open Datasets, which are greater than the sum of their individual records, can lead to specific datasets being curtailed or censored, in turn affecting the whole system. For example, a dataset that plots demographics together with a dataset that plots electoral violence when mashed together can provide a powerful tool for a repressive government to engineer pre- and post-election violence in increasingly sophisticated ways, using information in datasets that are quite useful to have in the public domain. The contest – not always progressive – between Big Data (aggregate information) and individual records (granular information) will grow.
- The same WEF report referenced in this section flags the emergence of a new ‘Asset Class’. And that is just the beginning. Increasing the control that individuals have over the manner in which their personal data is collected, managed and shared will spur a host of new services and applications. As some put it, personal data will be the new “oil” – a valuable resource of the 21st century. It will emerge as a new asset class touching all aspects of society. WEF does not see this new Asset Class as a risk, but unless there is fair and open access, non-discriminatory policies on the use of, and an equitable distribution of information around Big Data – the so-called new ‘oil’ of the local and global economy – the perception of the individuals, groups and companies that fall into this class could be as gatekeepers who firewall public domain information for parochial gain and profit. This can unleash social and political violence that Big Data itself may be partly to blame for inciting, and that its mere availability may not be able to prevent.
- Calls for more transparent government in the West, which result in the release of ever increasing amounts of information to the public domain, can (and perhaps should) also include details of aid to other countries. What is on paper a very desirable idea may, not unlike Wikileaks, have unintended consequences on the ground. For example, the support to human rights organisations and activists in fragile states and emergent democracies could be part of the Big Data released by home governments. Repressive regimes and other actors can then use this information to correlate funding to dissident voices and groups, targeting them through campaigns of hate and harm. The interplay of fragile state politics and Big Data availability is complex, and also takes us back to the issue of licensing – where progressive use and dissemination in one context can lead to disastrous consequences in another. A Do No Harm framework needs to address Big Data governance, but such a framework is proving elusive to forge.
“So what do I think open data can achieve? Simply put: more effective assistance to disaster-affected communities. What sort of data do we need to achieve that? Equally simply: operationally relevant data.”
Big and Open Data can substantially support better crisis information management and development cooperation, for instance by helping implement UN OCHA’s Common and Fundamental Operational Datasets Registry. In addition, open government data and open data initiatives, combined with or leveraged by crisis mapping and crowdsourcing, can further enhance aid efficiency, transparency and humanitarian operations. Even though, as enumerated earlier in this paper, the initiative is riddled with challenges on data ownership and use, this point was eloquently underscored by Caroline Anstey, a Managing Director of the World Bank, in a recent op-ed in the International Herald Tribune in January 2011.
Big Data will, in the coming years, simply become part of the information we take for granted in our daily lives. What is big today will be dwarfed by what is produced by an increasing number of Earth’s inhabitants in the future, from devices as small as a mobile phone through to the output of big science, which will exceed many times over today’s global internet traffic. It’s mind-boggling to think of the information out there even today. So much of it is lost to analysis, erased, forgotten, marginalised, and locked in. Through the initiatives and expertise of governments, NGOs and transnational corporations, information is being unlocked and publicly published. Citizens with new tools and devices are producing their own information. From the hyper-local to the national and international, this information is helping feed into not just journalism, but models of analysis and action. In sum, the production of information is changing as rapidly as its consumption. Knowledge creation remains a big challenge, and it is not one that will go away.
Making sense of what we know, and knowing that we know so little, must be the humble foundation of initiatives that further champion Big and Open Data. Big Data is the story today. Tomorrow, it should be what narratives Big Data supports. The data itself is not the goal; rather, the goal is what is achieved through it. And what greater achievement than initiatives, platforms and tools that help save lives?
 In a related development, World Vision seems to have organised a workshop in 2011 to look at, inter alia, Do No Harm in crisismapping operations, which today generate a lot of data from multiple sources soon after the onset of a large-scale disaster. As Patrick Meier notes in a blog post (http://irevolution.net/2012/02/12/stranger-than-fiction), “the ICRC has taken the lead on drafting protocols relevant to a data 2.0 world in which volunteer networks and disaster-affected communities are increasingly digital”. While these protocols, which we can expect to see over 2012, are specifically geared to inform crisismapping, the datasets generated by volunteer networks and those affected by disaster on the ground, coupled with traditional UN system, home country government, regional organisational, international and satellite imagery inputs, could also be governed by the same protocols, or by extensions of those protocols that govern appropriate use.
 https://groups.google.com/forum/?fromgroups#!topic/crisismappers/nhh5l5zBoUs (you need prior registration with the Group to access topic threads)