The Information Commissioner’s Office (ICO) is the UK’s independent authority set up to uphold information rights in the public interest, promoting openness by public bodies and data privacy for individuals (Information Commissioner’s Office, 2016).
It is predicted that by 2020, nearly 1.7 megabytes of new information will be generated every second for every human being on the planet, and most of this data will be available on the internet (Zwolenski and Weatherill, 2014).
Websites such as Twitter, Google and Facebook provide access to their data, and this data can be obtained with suitable permission from their sites (Ray, 2015). Not all websites provide such access to users. This may be because they do not want users to extract the content on their sites, or because they lack the technical support to provide such access. In such cases, web scraping techniques come into the picture.
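As an illustration of access with permission, official interfaces typically require each request to carry an access token granted by the site. The sketch below shows the general pattern using only Python's standard library; the endpoint and token are hypothetical placeholders, not any real site's API:

```python
import urllib.request

# Hypothetical endpoint and token -- placeholders, not a real API.
API_URL = "https://api.example.com/v1/posts"
ACCESS_TOKEN = "my-granted-token"

def build_authorized_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request carrying the permission (token) the site granted."""
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
    )

req = build_authorized_request(API_URL, ACCESS_TOKEN)
# urllib.request.urlopen(req) would perform the actual call; the point here
# is that the request only proceeds with the granted credential attached.
```

Sites that offer such interfaces can revoke the token if the permission is misused, which is exactly the control that scraping bypasses.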
Web scraping is a collection of techniques and methodologies for collecting information from the internet. This is generally done by computer software that replicates the web browsing performed by humans to collect information from different websites (Techopedia.com, 2018). Scraper programs are used to extract specific information from a site, such as search results, datasets and product details. The information obtained can be reused on the scraper’s own sites, used for research and analysis, or sold for profit to a competitor.
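A minimal sketch of such an extraction step, using only Python's standard library; the HTML snippet stands in for a downloaded page, and the tag and class names are assumptions for illustration:

```python
from html.parser import HTMLParser

class ProductScraper(HTMLParser):
    """Collect the text of every <span class="product"> element in a page."""
    def __init__(self):
        super().__init__()
        self.in_product = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "product") in attrs:
            self.in_product = True

    def handle_data(self, data):
        if self.in_product:
            self.products.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_product = False

# Stand-in for a page a scraper would download with urllib or similar.
page = """
<html><body>
  <span class="product">Widget A</span>
  <span class="product">Widget B</span>
  <span class="other">ignore me</span>
</body></html>
"""

scraper = ProductScraper()
scraper.feed(page)
print(scraper.products)  # ['Widget A', 'Widget B']
```

A real scraper would combine this parsing step with automated page fetching and crawling, which is what allows it to harvest data at a scale no human browser could.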
Issues of Web Scraping
The use of web scraping techniques has led to various issues. The data present on the internet is not equivalent to open data made available for research purposes. This further raises questions about the practices of web scrapers in accessing the data available online.
According to EU/UK law, there are several rules and regulations associated with handling personal data. It is essential for researchers to obtain permission before accessing personal data from the parties concerned.
The criticality of the issues associated with scraping techniques can be understood by considering a case study on social scraping, the automated capture of online data in social research (Marres and Weltevrede, 2013), involving OkCupid.
OkCupid is an online dating site that runs a proprietary algorithm to match users based on their profiles, which include various items of personal information.
In 2016, a group of researchers at a Danish university used web scraping techniques to create a dataset containing the profile information of 70,000 OkCupid users. Personal information such as age, political views and sexual orientation was collected from OkCupid. The researchers published the resulting dataset on an online forum named the Open Science Framework, where researchers are encouraged to share raw data to increase collaboration and transparency across the social sciences. They never contacted the company OkCupid or its users about using the data (Zimmer, 2018). The incident raised several legal and ethical concerns.
The major ethical issue with techniques that access personal data is that the people involved may not want their private information to be analyzed by others. The researchers did not reveal users’ names, but the collected dataset contained fields such as user IDs, by which individuals could easily be traced. Not all individuals present on social networking sites are active, and each has a different motive for being there. Users had the option to change their privacy settings to restrict the visibility of their profiles. Such profiles are not intended to be publicly visible, a factor the researchers failed to take into account when publishing the dataset.
Companies that handle personal data should be aware of all the rules and regulations associated with it. When people entrust a company with their personal details, the company should understand that it is responsible for the resulting outcomes. When these established rules are breached, users are entitled to claim compensation from the company through the courts. This can result in serious losses for the company, and its reputation is put in jeopardy as well. With increasing public awareness of poor data security, the company could also lose potential new clients and business.
In the UK, the Data Protection Act (DPA) 1998 sets out rules controlling how personal information may be used by organizations. The DPA contains eight principles explaining how personal data must be handled (ICO, 2018). Techniques like web scraping tend to violate some of these principles. For example, principle 6 of the DPA states that personal data shall be processed in accordance with the rights of the data subject. Under this principle, a company is obliged to provide its users with a valid reason and description when their personal data is given to any other organization. OkCupid could not provide such a justification to its users as to why their personal data had been made available on a different forum as a result of web scraping.
What can be done
Organizations should create awareness among their employees about such technologies and their effects, and companies that work with individuals’ personal data should increase their security measures. Using a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) is considered one of the most effective ways to prevent scrapers from accessing a website. A CAPTCHA consists of distorted text that can be interpreted by humans but not easily by web scraping algorithms.
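On the scraper’s side, a basic good-practice measure is to honour the access rules a site publishes in its robots.txt file before fetching anything; Python’s standard library ships a parser for this. A minimal sketch, with the rules inlined as a stand-in for the file a site would serve at its /robots.txt URL (the paths are assumptions for illustration):

```python
from urllib.robotparser import RobotFileParser

# Inlined robots.txt rules -- a stand-in for the file a site would serve
# at https://example.com/robots.txt.
rules = [
    "User-agent: *",
    "Disallow: /profiles/",
]

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved scraper checks each URL before fetching it.
print(rp.can_fetch("MyScraper", "https://example.com/public/page"))     # True
print(rp.can_fetch("MyScraper", "https://example.com/profiles/alice"))  # False
```

A check like this would have flagged user-profile pages as off-limits; it is no substitute for legal permission, but it is the minimum courtesy a scraper can extend to a site.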
When there is a security breach, it should immediately be reported to the ICO, and working as a team to minimize its effects can help in resolving such issues. The ICO can analyze the situation and provide proper help to the organization to rectify the damage.