FAT* 2018 Conference Notes, Day 2

Keynote 2: Deborah Hellman

Deborah Hellman of UVA Law starts the day off with a keynote on justice and fairness. She opens with a quote from Sidney Morgenbesser about what is unfair and what is unjust, asking if fairness is about treating everyone the same. She follows with a quote from Anatole France — “In its majestic equality, the law forbids rich and poor alike to sleep under bridges, beg in the streets and steal loaves of bread.” In practice, policies that formally treat everyone the same affect people in different ways.

Hypothesis 1: Treat like cases alike.
This hypothesis relies on choosing a proxy by which to classify people and decide how to treat them differently. That is, if treating everyone the same is unfair because the situations they’re in lead to different outcomes, classify them into different cases based on their situations, and treat each case separately. This hypothesis seems to fall apart depending on how the classifications are made and which outcomes they are intended to produce. This leads to the next hypothesis…

Hypothesis 2: It’s the thought that counts.
These traits are usually adopted for bad reasons. The classifications are made to impose differing treatments with moral decisions that are misguided or unjust. For example, an employer may avoid hiring women between the ages of 25 and 40 to avoid having to pay women who may have children to take care of. The goal is not to avoid employing women, but to increase productivity. The intent behind the classification is itself misguided or flawed.

Hypothesis 3: “Anti-Classification”
The use of classifications, in particular classifications based on certain traits e.g. race, gender, can lead to unintended effects and denigration.

Hypothesis 4: Bad Effects
Certain classifications themselves can compound injustice — for example, charging higher life insurance rates to battered women.

Hypothesis 5: Expressing Denigration
For example, “All teenagers must sit in the back of the bus” and “All blacks must sit in the back of the bus” express different ideas. Regardless of the intention, there is denigration inherent in the classification. She cites Justice Harlan’s dissent in Plessy v. Ferguson.

Indirect Discrimination and the Duty to Avoid Compounding Injustice
The Empty Idea of Equality
Even Imperfect Algorithms Can Improve the Criminal Justice System

Discussion: Cynthia Dwork

Session 3: Fairness in Computer Vision and NLP

Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification (data)

Joy Buolamwini gives a talk on her now infamous paper on the poor performance of facial analysis technologies on non-white, non-male faces. She uses a more diverse dataset to benchmark various APIs. After reporting the poor performance to various companies, some actually improved their models to account for the underrepresented classes.

Analyze, Detect and Remove Gender Stereotyping from Bollywood Movies

Taneea Agrawaal presents her analysis of gender stereotyping in Bollywood movies. The analysis was done with a database of Bollywood movies going back into the 1940s, along with movie trailers from the last decade and a few released movie scripts. Syntax analysis is done to extract verbs related to males and females to study the actions associated with each. She argues that the stories told and representations expressed in movies affect society’s perception of itself and subsequent actions. For example, Eat Pray Love caused an increase in solo female travel, and Brave and Hunger Games caused a sharp increase in female participation in archery.

Mixed Messages? The Limits of Automated Social Media Content Analysis

Natasha Duarte presents a talk focused on how NLP is being used to detect and flag content online for surveillance and law enforcement (for example, to detect and remove terrorist content from the internet). She argues that NLP tools are limited because they must be trained on domain-specific datasets to be effective in particular domains, and governments generally use pre-packaged solutions which are not designed for these domains. Manual human effort and language and context-specific work is necessary for any successful NLP system.

Session 4: Fair Classification

The cost of fairness in binary classification

Bob Williamson presents his research which frames adding fairness to binary classification as imposing a constraint. There must be a cost to this constraint, and Williamson presents a mathematical approach to measuring that cost.

Decoupled Classifiers for Group-Fair and Efficient Machine Learning

Nicole Immorlica shows that “training a separate classifier for each group (1) outperforms the optimal single classifier in both accuracy and fairness metrics, (2) and can be done in a black-box manner, thus leveraging existing code bases.” With the caveat that it “requires monotonic loss and access to sensitive attributes at classification time.”
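
As a rough illustration of the decoupling idea, the sketch below fits a one-dimensional threshold rule per group and compares it with a single pooled threshold. It is not the paper's algorithm: the toy data, the group names, and the `fit_threshold` and `predict` helpers are all made up for this example. It also shows the caveat from the quote, since `predict` needs the group at classification time.

```python
from collections import defaultdict


def fit_threshold(points):
    """Return (threshold, direction, errors) with the fewest training errors.

    points: list of (score, label) pairs with label in {0, 1}.
    direction +1 predicts 1 when score >= threshold,
    direction -1 predicts 1 when score < threshold.
    """
    candidates = sorted({score for score, _ in points})
    candidates.append(candidates[-1] + 1)  # a threshold above every score
    best = None
    for t in candidates:
        for direction in (+1, -1):
            errors = sum(
                1
                for score, label in points
                if int((score >= t) if direction == 1 else (score < t)) != label
            )
            if best is None or errors < best[2]:
                best = (t, direction, errors)
    return best


# Toy data: (group, score, label). High scores are positive in group "a",
# while the relationship is reversed in group "b".
data = [
    ("a", 1, 0), ("a", 2, 0), ("a", 3, 1), ("a", 4, 1),
    ("b", 1, 1), ("b", 2, 1), ("b", 3, 0), ("b", 4, 0),
]

# One classifier for everyone.
pooled = fit_threshold([(score, label) for _, score, label in data])

# One classifier per group.
by_group = defaultdict(list)
for group, score, label in data:
    by_group[group].append((score, label))
decoupled = {group: fit_threshold(points) for group, points in by_group.items()}


def predict(group, score):
    """Classify with the group's own model (requires the sensitive attribute)."""
    t, direction, _ = decoupled[group]
    return int(score >= t) if direction == 1 else int(score < t)


pooled_errors = pooled[2]
decoupled_errors = sum(model[2] for model in decoupled.values())
```

On this deliberately adversarial toy set no single threshold can do better than chance, while the per-group thresholds fit each group exactly.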

A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions

Alexandra Chouldechova presents a case study in which a model was used to distill information about CPS cases to create risk scores to aid call center workers in case routing. She discusses some of the pitfalls of the model, and how improvements were made to address them along the way. She ends by emphasizing that this model is just one small black box which acts as one signal among many in a larger system of processes and decision-making.

Fairness in Machine Learning: Lessons from Political Philosophy

Reuben Binns takes a mix of philosophy and computer science to nudge the debate around ML fairness from “textbook”/legal definitions of fairness to one that goes back to more philosophical roots. It follows a trend at the conference of focusing on the context in which models are used, the moral goals and decisions of the models, and a re-analysis of concepts of fairness that the rest of the field may consider standard.

Session 5: FAT Recommenders, Etc.

Runaway Feedback Loops in Predictive Policing (code)

Carlos Scheidegger discusses a mathematical method, Polya Urns, that he’s used to discover feedback loops in PredPol. Such systems are based on a definition of fairness which states that areas with more crime should receive a higher allocation of police resources. He discusses the flaws of such methods and suggests some strategies to avoid these feedback loops.
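
To make the urn intuition concrete, here is a toy simulation in the spirit of the talk. It is a sketch under my own assumptions rather than the paper's code: two regions, one patrol per day sent to a region with probability proportional to its ball count, and a ball added only when crime is observed in the patrolled region. The function name, rates, and starting counts are hypothetical.

```python
import random


def simulate_urn(crime_rates, initial_balls, days, seed=0):
    """Return each region's share of the urn after `days` patrols."""
    rng = random.Random(seed)
    balls = list(initial_balls)
    for _ in range(days):
        # Send the patrol to a region with probability proportional to its balls.
        region = rng.choices(range(len(balls)), weights=balls)[0]
        # Crime is only observed where the patrol was sent, so only that
        # region can gain a ball today.
        if rng.random() < crime_rates[region]:
            balls[region] += 1
    total = sum(balls)
    return [count / total for count in balls]


# Both regions have the same underlying crime rate; only the starting
# counts differ.
shares = simulate_urn(crime_rates=[0.5, 0.5], initial_balls=[2, 1], days=1000)
```

Because observations only come from where police are already sent, the final shares depend on the starting counts and early draws as well as on the true rates. Changing `seed` or `initial_balls` is a quick way to see that.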

All The Cool Kids, How Do They Fit In?: Popularity and Demographic Biases in Recommender Evaluation and Effectiveness (code)

Michael Ekstrand asks: Who receives what benefits in our recommender systems?

Recommendation Independence (code)

Toshihiro Kamishima

Balanced Neighborhoods for Multi-sided Fairness in Recommendation

Robin Burke

FAT* 2018 Conference Notes, Day 1

This weekend, I am at the Conference on Fairness, Accountability, and Transparency (FAT*) at NYU. This conference has been around for a few years in various forms — previously as a workshop at larger ML conferences — but has really grown into its own force, attracting researchers and practitioners from computer science, the social sciences, and law/policy fields. I will do my best to document the most interesting bits and pieces from each session below.

Keynote 1: Latanya Sweeney

Sweeney has an amazing tech+policy background in this field — the work she did on de-anonymization of “anonymized” data led to the creation of HIPAA. She has also done interesting work on Discrimination in Online Ad Delivery (article). She argues that technology in a sense dictates the laws we live by. Her work has centered around specific case studies that point out the algorithmic flaws of technologies that seem normal and benign in our daily lives. Technical approaches include an “Exclusivity Index”, which takes a probabilistic approach to defining behavior that is anomalous in particular sub-groups. Two noted examples of unintended consequences of algorithms are discriminatory pricing algorithms in Airbnb and the leaking of location data through Facebook Messenger.

In the subsequent discussion with Jason Schultz, the focus is on laws and regulation. She states that there are 2000+ US privacy laws, but because they are so fragmented, they are rendered completely ineffective in comparison to blanket EU privacy laws. The case is made that EU laws have teeth, and in practice may raise the data privacy bar for users all over the world. She also stresses the need for work across groups, including technologists, advocacy groups, and policy makers. She presents a bleak view of the current landscape, but also presents reasons to be optimistic.

Session 1: Online Discrimination and Privacy

Potential for Discrimination in Online Targeted Advertising

Till Speicher presents a paper on the feasibility of various methods of using Facebook for discriminatory advertising. There are three methods presented:

Attribute-based targeting, which lets advertisers select certain traits of an audience they wish to target. These attributes can be official ones tracked by Facebook (~1100), or “free-form” attributes such as a user’s Likes.
PII-based targeting, which relies on public data such as voter records. Speicher takes NC voter records and is able to filter out certain groups by race, then re-upload the filtered voter data to create an audience.
“Look-alike” targeting, which takes an audience created from either of the above methods and scales it automatically — discrimination scaling as a service!
These methods make it clear how Facebook’s ad platform could be used to target and manipulate large groups of people. Speicher suggests that the best methods to mitigate such efforts may be based on the outcome of targeting (i.e. focusing on who is targeted, rather than how).

Discrimination in Online Personalization: A Multidisciplinary Inquiry

Amit Datta and Jael Makagon present this study on how online ad platforms can be used for discriminatory advertising (e.g. to target a specific gender for a job advertisement). See past work here: Automated Experiments on Ad Privacy Settings: A Tale of Opacity, Choice, and Discrimination. Jael has a law background, and walks the audience through different anti-discrimination laws and which parties may be held responsible in different scenarios. He describes a mess of laws that don’t quite apply to any party in the discrimination scenarios. Amit describes cases where advertisers can play active rather than passive roles in discriminatory advertising, and Jael describes the legal implications that can result from that.

They ultimately call out a “mismatch between responsibility and capability” in the advertising world, and they propose policy and technology-based changes that may be effective in preventing such discrimination.

Privacy for All: Ensuring Fair and Equitable Privacy Protections

Michael Ekstrand and Hoda Mehrpouyan ask “Is privacy fair?”. They start by discussing definitions of privacy, including:

Seclusion
Limitation
Non-intrusion
Control
Contextual integrity

Ekstrand argues that the tools we use to assess fairness of decision-making systems can be used to analyze privacy in systems. He raises three questions:

Are technical or non-technical privacy protection schemes fair?
When and how do privacy protection technologies or policies improve or impede the fairness of the systems they affect?
When and how do technologies or policies aimed at improving fairness enhance or reduce the privacy protections of the people involved?

They mention examples where Muslim taxi drivers were outed in anonymized NYC TLC data, and where James Comey’s personal Twitter account was discovered using public data. They discuss the cost of guarantees of privacy for certain schemes and definitions of privacy, and how that affects “fairness” for different definitions of fairness.

Relevant work:

Session 2: Interpretability and Explainability

“Meaningful Information” and the Right to Explanation

Andrew Selbst starts his talk asking why explainability is important, saying “what is inexplicable is unaccountable”. In his eyes, explainability brings a chain of decision-making that leads to accountability. He then explains some aspects of GDPR and asks if it contains an implicit “right to explanation” in some of its provisions. He cites current legal arguments that discuss whether or not such a right exists:

Notably, Selbst says that deep learning isn’t actually at risk of being banned, in particular because such a requirement targets completely automated systems, implying that deep learning systems are fine to use as long as they are just one factor in a larger explainable system with a human in the loop.

Interpretable Active Learning (code)

Richard Phillips gives a talk on using LIME for active learning. By applying LIME to assess which features cause certainty in model classifications during active learning, their method can be used across populations to show if models are biased for or against certain subgroups.

Interventions over Predictions: Reframing the Ethical Debate for Actuarial Risk Assessment

Chelsea Barabas argues that the debate around pre-trial risk assessment tools is shaped by old assumptions about the role risk assessment plays in these trials. Old risk-based systems considered factors that were drawn from the social theories of criminal behavior of the time, which have since changed. They also focused on traits of the individual, which neglected to consider broader social factors in these cases. She also criticizes regression-based risk assessment in particular, due to the pitfalls of drawing conclusions from correlation vs. causation. She advocates for seeing risk not as a static thing to be predicted, but as a dynamic factor to be mitigated. She also discusses how we can use a causal framework of statistics and experiment design to ask better questions about risk assessment.

She also points to the recent work of Virginia Eubanks and Seth Prins:

Can we avoid reductionism in risk reduction?
An Investigation of the Causal Association between Changes in Social Relationships and Changes in Substance Use and Criminal Offending During the Transition from Adolescence to Adulthood

Tutorials 1

Quantifying and Reducing Gender Stereotypes in Word Embeddings

Understanding the Context and Consequences of Pre-trial Detention

21 Fairness Definitions and Their Politics

Arvind Narayanan gives a “survey of various definitions of fairness and the arguments behind them” which can act as “‘trolley problems’ for fairness in ML”.

Algorithmic decision making and the cost of fairness
Rather than maximizing accuracy, the goal should be about “how to make algorithmic systems support human values”.

Group fairness — do outcomes systematically differ between demographic groups (or other population groups)?
- Fair prediction with disparate impact: A study of bias in recidivism prediction instruments
- “What do different stakeholders want of the binary classifier?”
  - Decisionmaker: “Of those I’ve labeled high-risk, how many will recidivate?” — Predictive value AKA Precision — equalized under Predictive parity
  - Defendant: “What’s the probability I’ll be incorrectly classified high-risk?” — False positive rate — equalized under Error rate balance
  - Society [hiring vs. criminal justice]: “Is the selected set demographically balanced?” — Selection probability — equalized under Demographic parity
- Different metrics matter to different stakeholders — no “right” metric.
Individual fairness — “equal thresholds” — generally impossible to pick a single threshold for all groups that equalizes both FPR and FNR
Utility: Algorithmic decision making and the cost of fairness
Tradeoffs:
- Between various measures of group fairness.
- Between group fairness and individual fairness.
- Between fairness and utility.
Tension between disparate treatment and disparate impact — finding creative case-by-case workarounds doesn’t “scale” for algorithmic decision making.
In training vs. classification: Does mitigating ML’s disparate impact require disparate treatment?
Ineffectiveness of “blindness” — Equality of Opportunity in Supervised Learning
- Bias is “just” a side effect of maximizing accuracy
- ML is great at picking up on proxies in data.
Unacknowledged affirmative action:
- Measurement bias, historical prejudice
- What is the problem to which fair machine learning is the solution?
Demographic parity assumes no intrinsic differences:
- An algorithm for removing sensitive information: application to race-independent recidivism prediction
Individual fairness: “Similar individuals should be treated similarly” — Fairness Through Awareness
Process fairness: The Case for Process Fairness in Learning: Feature Selection for Fair Decision Making
Diversity: Diversity in Big Data: A Review
Stereotype mirroring and exaggeration: Unequal Representation and Gender Stereotypes in Image Search Results for Occupations
- To what extent should ML models reflect societal stereotypes? Default view in tech world is that stereotype mirroring is “unbiased” and “correct”.
Dataset bias: Unbiased Look at Dataset Bias
Representations — should they be debiased?
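
The three stakeholder questions in the group-fairness notes above each correspond to a quantity that can be read off a per-group confusion matrix. A minimal sketch, assuming 0/1 predictions and outcomes; the `group_metrics` helper and the records are made up for illustration and do not come from the tutorial.

```python
def group_metrics(records):
    """Compute per-group precision, false positive rate, and selection rate.

    records: iterable of (group, predicted_high_risk, actually_recidivated),
    where the last two values are 0 or 1.
    """
    counts = {}
    for group, predicted, actual in records:
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
        if predicted and actual:
            c["tp"] += 1
        elif predicted and not actual:
            c["fp"] += 1
        elif not predicted and actual:
            c["fn"] += 1
        else:
            c["tn"] += 1

    metrics = {}
    for group, c in counts.items():
        total = sum(c.values())
        flagged = c["tp"] + c["fp"]      # everyone labeled high-risk
        negatives = c["fp"] + c["tn"]    # everyone who did not recidivate
        metrics[group] = {
            # Decisionmaker: of those labeled high-risk, how many recidivate?
            "precision": c["tp"] / flagged if flagged else None,
            # Defendant: chance of being wrongly labeled high-risk.
            "false_positive_rate": c["fp"] / negatives if negatives else None,
            # Society: what share of the group is selected?
            "selection_rate": flagged / total,
        }
    return metrics


# Toy records: both groups are flagged at the same rate, but the
# predictions are much less reliable for group "a".
records = [
    ("a", 1, 1), ("a", 1, 0), ("a", 0, 0), ("a", 0, 1),
    ("b", 1, 1), ("b", 1, 1), ("b", 0, 0), ("b", 0, 0),
]
metrics = group_metrics(records)
```

In this toy data the selection rates match across groups while precision and false positive rate do not, which is one small instance of the point that satisfying one group-fairness metric says nothing about the others.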

Tutorials 2

Auditing Black Box Models

People Analytics and Employment Selection: Opportunities and Concerns

A Shared Lexicon for Research and Practice in Human-Centered Software Systems

Thoughts on the Potential of Open Data in Cities

The promise of Open Data has drawn most major US cities to implement some sort of program making city data available online and easily accessible to the general public. Citizen hackers, activists, news media, researchers, and more have all made use of the data in novel ways. However, these uses have largely been more information-based than action-based, and there remains work to be done in using Open Data to drive decisions in government and policy-making at all levels, from local to federal. Below I present some of the challenges and opportunities available in making use of Open Data in more meaningful ways.

Challenges

Standardization and Organization

Open Data is dirty data. There is no set standard between different cities for how data should be formatted, and even similar datasets within a city are often not interoperable. Departments at all levels of government often act independently in publishing their data, so even if most datasets are available from the same repository (e.g. Socrata), their organization and quality can differ significantly. Without a cohesive set of standards between cities, it is difficult to adapt applications built for one city to others.

Automation

The way data is uploaded and made accessible must be improved. Datasets are often frozen and uploaded in bulk, so that when someone downloads a dataset, they download it for a particular period in time, and if they want newer data, they must either wait until it is released or find the bulk download for the newer data. This involves more human effort both in the process of uploading the data and in downloading and processing that data. Instead, new data should be made immediately accessible as a stream with old data going back as far in time as it is available. This allows someone to access exactly as much data as they need without the hassle of combing through multiple datasets, and it removes the curator’s need to constantly compile and update newer datasets.

Accessibility

Compared to the amount of data that the government stores, very little of it is digital and very little of what is digital is publicly available. The filing cabinet should not be a part of the government storage media. Making all data digital from the start makes it simpler to analyze and release. Finally, much of the data the government releases is in awkward formats such as XLSX and PDF that are not easily machine-readable. If the data is not readily available and easily accessible, it in effect does not exist.

Transparency

Most of the publicly accessible records that the government has are not readily available unless FOILed. The transparency argument of Open Data could be taken to a completely new level of depth and thoroughness if information at all levels of government were made readily available digitally as soon as it was generated. Law enforcement records, public meetings, political records, judicial records, finance records, and any other operation of government that can be publicly audited by its people should be digitally available to the public from the moment it is entered into a government system.

Private Sector Data

Companies such as Uber and Airbnb have come to collect immense amounts of data on transportation and real estate that have historically fallen under regulated jurisdiction. Decisions should be reached with private companies to allow governments to access as much data as is necessary to ensure proper regulation of these utilities. This data should in turn be added to the public record along with official government data on these utilities.

Opportunities

Analytic Technologies

Policy-making should be actively informed by the nature of a constituency. Data-driven decision making is much hyped, but making it a reality requires software that easily and quickly gives decision-makers the information they need. From the city to the federal level, governments should have dashboards that summarize information on all aspects of citizens’ lives. These dashboards can contain information about traffic, pollution, crime, utilities, health, finance, education, and more. Lots of this data already exists within governments, and surely there exist some dashboards that analyze and visualize these properties individually, but to combine all available data on the population of a city can give significantly more insight into a decision than any one of these datasets alone.

Predictive Technologies

Governments have data going far back into history. Cities like New York have logged every service request for years, and that data is readily available digitally. Using the right statistical analysis on periodic data like heating requests, cities can start to predict which buildings might be at risk for heating violations in the winter, and can address such issues before they happen. The same can be applied to potholes, graffiti, pollution issues and essentially any city-wide phenomenon that occurs regularly. More precise preventative measures can be taken with more confidence, and eventually, the 311 call itself can be ruled out entirely.

Future Outlook

These ideas have the potential to radically change the way we engage with our cities and our politics. We can make decisions based unambiguously on what is happening in the world, and we can refine those decisions based on measured changes in the world over time. A population can know exactly if its citizens are getting healthier, safer, and smarter, and how to aid in these pursuits. Areas of governance that need more attention and potential approaches will become increasingly obvious as more information is combined and analyzed in meaningful ways. Decisions and their outcomes can be made with more confidence based on a more rigorous process. By making the most of Open Data, we can go beyond interesting information and begin to drive political action that directly benefits our cities, states, and nation.

A Note on Privacy

All of the ideas presented above have serious implications for the privacy of individuals and populations. These ideas have only considered the best-case uses of data in our society. Whether a government is analyzing granular data or data on a population in bulk, care must be taken to respect the privacy of its citizens. There is ongoing dialogue about how to balance data collection and privacy, and it is essential that governments and citizens take part in this dialogue as new technologies are developed and our societies become more data-driven.

Reinvent 311 Mobile Content Challenge: Homeless Helper NYC

NYC 311, with help from Stack Exchange, held the Reinvent 311 Mobile Content Challenge, which called on developers to use NYC’s revamped Open 311 Inquiry API to make city information more readily available on people’s mobile devices.

I started out focused on education data, but it was messy and too loosely organized to be of any immediate use. If you want to extract meaningful information from it, you could, but it would take some cleaning up and organizing to make useful. It isn’t as easy as displaying an API call from a mobile device.

After looking through more of the available data and considering the different use cases, I settled on an app designed to help homeless people in the city. This seemed like a terrible idea at first — how do you use technology to help those without access to it?

A few ideas came to mind. A map of food banks and soup kitchens (along with directions) could be useful. There were also lists of intake shelters online, but no coherent sets of shelters, so a map of shelters would also be useful. Finally, there were also information-based services that the city offered — information on food stamps, homeless prevention, youth counseling, job services, and other outreach information. Putting all of this data together wouldn’t be terribly difficult, and it would create an app that someone might find helpful. Homeless people typically don’t have access to smartphones, but outreach groups like Coalition for the Homeless could use technology to help those without it.

The app is modeled after the data from the API, with different objects for each type of API response. This allowed me to easily create maps given any set of facilities (shelters, food banks, etc.), and information pages given any city service. In this sense, the Open Inquiry API’s design has allowed for flexibility in adapting the app to easily include more data.

One of the main strengths of the Open 311 API is that once an app is created for one city, it should be seamlessly compatible with data from another city, since the API calls are all the same — all that changes is the city that serves up the data. This is a fantastic ideal outcome, but the implementation is slightly off, especially in this particular use case. The data I used needed minor refinements — nothing extreme, but I had to manually decide which services were relevant to this audience, since there is no “homeless” category. I also whipped up some quick Python scripts to add more useful latitude/longitude data from street addresses using the Google Maps Geocoding API.
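
For readers curious what the geocoding step looks like, here is a minimal sketch of the idea. It is not the original script: the function names are my own, the API key is a placeholder you would supply yourself, and it assumes the Geocoding API's JSON endpoint and its `results[0].geometry.location` response shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL for a single street address."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def extract_lat_lng(payload):
    """Pull (lat, lng) out of a parsed response, or None if nothing matched."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(address, api_key):
    """Look up one address over the network (requires a valid API key)."""
    with urlopen(build_geocode_url(address, api_key)) as response:
        return extract_lat_lng(json.load(response))
```

Keeping the URL building and response parsing separate from the network call makes it easy to check the parsing against a saved response before running it over a full list of facilities.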

Perhaps the biggest flaw in the data is the lack of information in API calls. One of the goals outlined by the people at NYC 311 was to reduce the number of calls to 311 asking for information that was readily available online. A mobile app is a great way to make this information more immediately available, but for most services, nothing more than a description was offered. The “More Information” and “FAQs” sections simply said “Call 311 for…” — the data isn’t useful if it simply redirects to the old method of calling for information.

Despite the lack of completeness and consistency in the city’s data, the potential for interesting tools and visualizations is clear. There’s a ton of data; the challenge is in sifting through it and making it easily usable.

The demo event itself was fantastic fun. The other contestants had impressive applications, most of which focused on low-income resources and real estate data. I met Joel Spolsky, who offered some sharp and honest advice for the contestants (think StackOverflow, but in person), Noel Hidalgo, who runs BetaNYC as a part of Code for America (check out their projects here), the talented people of StackExchange, designers from HUGE, and a gaggle of kind people from NYC 311. In the end, Homeless Helper NYC won a prize for “Best presentation of 311 information targeting a specific audience,” and it’s out now on the Play Store. You can also contribute to the source code on GitHub!