
Beyond Preferences in AI Alignment

🌈 Abstract

The article discusses the limitations of the dominant preferentist approach to AI alignment, which assumes that (1) preferences are an adequate representation of human values, (2) human rationality can be understood in terms of maximizing the satisfaction of preferences, and (3) AI systems should be aligned with the preferences of one or more humans. The authors argue that these assumptions are problematic and propose conceptual and technical alternatives for AI alignment.

🙋 Q&A

[01] Rational Choice Theory as a Descriptive Framework

1. What are the key issues with using rational choice theory as a descriptive model of human behavior and decision-making?

  • Rational choice theory assumes that human behavior can be modeled as the (approximate) maximization of expected utility, with preferences represented as utility or reward functions (see the formula sketched after this list). However, this fails to account for:
    • Systematic deviations from optimality exhibited by humans due to bounded rationality, satisficing behavior, and the use of heuristics.
    • The inability of utility functions to capture the thick semantic content of human values and the possible incommensurability of those values.
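
For reference, the model that rational choice theory takes as its starting point can be written as expected utility maximization (notation is ours, not the article's): an agent with beliefs P and utility function u over outcomes chooses

$$a^{*} = \arg\max_{a \in A} \, \mathbb{E}\big[u(o) \mid a\big] = \arg\max_{a \in A} \sum_{o \in O} P(o \mid a)\, u(o).$$

The bullets above list what this single scalar objective fails to capture about human behavior and values.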

2. How can resource rationality provide a more flexible framework for modeling the relationship between preferences and behavior?

  • Resource rationality posits that seemingly irrational human behavior can often be understood as arising from the rational use of limited computational resources. This provides a generative principle for hypothesizing possible deviations from standard rationality, and then testing whether such deviations occur in humans.
  • Embedding resource rationality priors in probabilistic models of human decision-making can enable systems to infer human goals and preferences from failed plans and mistaken reasoning, while placing greater evidential weight on decisions made after lengthier deliberation.
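
A minimal sketch of how this could look computationally, assuming a simple Boltzmann-style choice model in which longer deliberation is treated as yielding more nearly optimal choices (the names, toy utilities, and the `deliberation_time` weighting are our illustrative assumptions, not the article's method):

```python
import numpy as np

def choice_likelihood(action, goal, deliberation_time, utilities, base_beta=0.5):
    """P(action | goal, deliberation_time): a softmax over the goal's action
    utilities, with inverse temperature growing with deliberation time, so
    more deliberate choices count as stronger evidence about the goal."""
    acts = list(utilities[goal])
    beta = base_beta * deliberation_time
    logits = beta * np.array([utilities[goal][a] for a in acts])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[acts.index(action)]

def infer_goal(observations, utilities):
    """Posterior over candidate goals (uniform prior) given a list of
    (action, deliberation_time) observations."""
    goals = list(utilities)
    posterior = np.ones(len(goals)) / len(goals)
    for action, t in observations:
        posterior *= [choice_likelihood(action, g, t, utilities) for g in goals]
        posterior /= posterior.sum()
    return dict(zip(goals, posterior))

# Toy setup: two candidate goals defined over the same two actions.
utilities = {"goal_A": {"left": 1.0, "right": 0.0},
             "goal_B": {"left": 0.0, "right": 1.0}}

# The same choice of "right" is weaker evidence when made quickly (t=1)
# than after longer deliberation (t=5).
print(infer_goal([("right", 1.0)], utilities))  # mildly favors goal_B
print(infer_goal([("right", 5.0)], utilities))  # strongly favors goal_B
```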

[02] Beyond Expected Utility Theory as a Normative Standard

1. What are the key limitations of using expected utility theory (EUT) as a normative standard for rational agency?

  • Coherence arguments for EUT, such as money pump arguments, are not as strong as often assumed. Rational agents need not comply with the axioms of EUT, especially given computational and practical limitations.
  • EUT alone does not provide much informative content about the likely goals and behaviors of advanced AI systems. Many kinds of behavior can trivially be described in terms of utility maximization (see the example after this list).
  • Alternative analytical lenses, such as mechanistic, economic, evolutionary, and resource-rational analyses, may be more promising for understanding and aligning advanced AI systems.
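
To make the triviality point concrete (a standard illustration, not an example taken from the article): for any observed behavior policy $\pi$, one can define

$$u_{\pi}(s, a) = \begin{cases} 1 & \text{if } a = \pi(s) \\ 0 & \text{otherwise,} \end{cases}$$

under which the agent maximizes expected utility by construction. Calling a system a utility maximizer therefore constrains predictions only when combined with substantive claims about which utility function it maximizes.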

2. How can the design of AI systems move beyond globally coherent agents that comply with EUT?

  • Building AI systems as "tools" with locally coherent preferences, rather than globally coherent utility maximizers, can better respect human value pluralism and avoid problematic incentives like context manipulation.
  • Locally complete preferences, where trajectories are only comparable within fixed context schedules but incomparable across contexts, can maintain tool-like locality despite global scope.
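
One way such locally complete preferences might be implemented, as a hedged sketch (the class and method names are ours, not from the article): comparisons are answered only when two trajectories share a context, and the system abstains rather than imposing a global ranking across contexts.

```python
from typing import Optional

class LocalPreferences:
    """Preferences that are complete within each context but make no
    cross-context comparisons, preserving a tool-like, local scope."""

    def __init__(self, context_scores: dict[str, dict[str, float]]):
        # context_scores[context][trajectory] = score, valid only within that context
        self.context_scores = context_scores

    def compare(self, traj_a: str, traj_b: str,
                context_a: str, context_b: str) -> Optional[str]:
        """Return the preferred trajectory when both share a context,
        or None (incomparable) when the contexts differ."""
        if context_a != context_b:
            return None  # no cross-context trade-off is forced
        scores = self.context_scores[context_a]
        return traj_a if scores[traj_a] >= scores[traj_b] else traj_b

prefs = LocalPreferences({
    "writing_help": {"draft_email": 0.9, "do_nothing": 0.1},
    "scheduling":   {"book_meeting": 0.8, "do_nothing": 0.2},
})
print(prefs.compare("draft_email", "do_nothing", "writing_help", "writing_help"))  # draft_email
print(prefs.compare("draft_email", "book_meeting", "writing_help", "scheduling"))  # None (incomparable)
```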

[03] Beyond Single-Principal Alignment as Preference Matching

1. Why are reward learning methods like RLHF limited in their ability to align AI systems with human values?

  • Reward learning methods suffer from the same representational limits as reward functions and utility functions more broadly. They cannot adequately capture the dynamic, socially constructed, and potentially incomplete nature of human preferences.
  • Optimizing a learned reward function is only appropriate for sufficiently narrow and bounded AI systems. It is inadequate for more ambitious, globally-scoped AI assistants that need to be aligned with the underlying values that generate human preferences.
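
For concreteness, here is a minimal sketch of the kind of model standard reward learning from pairwise feedback relies on, a Bradley-Terry-style fit of one scalar reward per response (the toy data and function names are ours); the scalar bottleneck it creates is the representational limit described above.

```python
import numpy as np

def fit_rewards(n_items, comparisons, lr=0.1, steps=500):
    """Fit scalar rewards so that P(i preferred over j) = sigmoid(r[i] - r[j]),
    by gradient ascent on the pairwise log-likelihood."""
    r = np.zeros(n_items)
    for _ in range(steps):
        grad = np.zeros(n_items)
        for i, j in comparisons:                   # i was preferred over j
            p = 1.0 / (1.0 + np.exp(-(r[i] - r[j])))
            grad[i] += 1.0 - p
            grad[j] -= 1.0 - p
        r += lr * grad
    return r

# Toy data: response 1 preferred over 0 and over 2; response 0 preferred over 2.
comparisons = [(1, 0), (1, 2), (0, 2)]
print(fit_rewards(3, comparisons))  # one scalar per response, roughly r[1] > r[0] > r[2]
```

Whatever structure the annotators' judgments had (context-dependence, incompleteness, conflicting values), a model of this form can only express it as a single number per response.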

2. What would be required to align AI systems that operate across contexts and over extended periods of time?

  • AI systems need to learn how each person's preferences are dynamically constructed, and be aligned to the underlying values that generate those preferences, rather than just the preferences themselves.
  • When preferences are incomplete or conflict across time, AI systems should be aligned with normative ideals about how to assist in such situations, rather than just preference matching.
  • Contextual preference models and methods for reasoning about norms and values (beyond just preferences) will be necessary for this more ambitious form of alignment.
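
As a contrast with the context-free scalar reward model sketched above, a contextual preference model would at minimum condition its predictions on the situation in which options are evaluated, so the same pair of options can be ranked differently in different contexts. A deliberately simple, hypothetical sketch (the contexts, options, and scores are ours):

```python
# context -> option -> predicted desirability for this user, in this context
contextual_reward = {
    "urgent_request":   {"brief_answer": 0.9, "detailed_answer": 0.4},
    "learning_session": {"brief_answer": 0.3, "detailed_answer": 0.8},
}

def preferred(option_a: str, option_b: str, context: str) -> str:
    """Predict which option the user prefers, conditioned on the context."""
    scores = contextual_reward[context]
    return option_a if scores[option_a] >= scores[option_b] else option_b

print(preferred("brief_answer", "detailed_answer", "urgent_request"))    # brief_answer
print(preferred("brief_answer", "detailed_answer", "learning_session"))  # detailed_answer
```

Methods for reasoning about norms and values would go well beyond a lookup table of this kind; the toy example only illustrates why a single context-free ranking cannot represent preferences that are constructed anew in each situation.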
