What are human values, and how do we align AI to them?
Abstract
The article discusses the problem of aligning AI systems with human values. It proposes a process called Moral Graph Elicitation (MGE) to elicit and reconcile diverse human values into an alignment target for training language models. The key contributions are:
- Defining six criteria an alignment target must satisfy to shape model behavior in accordance with human values: fine-grained, generalizable, scalable, robust, legitimate, and auditable.
- Introducing the concept of "values cards" to represent human values as concrete attentional policies, and the "moral graph" as a data structure to reconcile these values.
- Describing the MGE process to elicit values cards from people and construct the moral graph.
- Reporting on a case study with 500 representative Americans, showing promising results across the six criteria.
Q&A
[01] Defining Human Values and Aligning AI to Them
1. What are the key criteria an alignment target must satisfy to shape model behavior in accordance with human values? The article proposes six key criteria:
- Fine-grained: The alignment target should provide meaningful, context-specific guidance for model behavior.
- Generalizable: The elicited values should transfer well to previously unseen situations.
- Scalable: Wiser values should be obtained as more participants are added to the elicitation process.
- Robust: It should be hard for a resourceful third party to influence the target.
- Legitimate: The people affected by the model should recognize and endorse the values used to align the model.
- Auditable: The alignment target should be explorable and interpretable by human beings.
2. How does the article's proposed approach of "values cards" and "moral graph" aim to meet these criteria?
- Values cards represent human values as concrete "attentional policies" - the criteria people pay attention to when making meaningful choices. This makes the values fine-grained and auditable.
- The moral graph reconciles these values cards by capturing relationships of "wisdom" between them, allowing for context-specific guidance and generalizability.
- The process of eliciting and constructing the moral graph, called Moral Graph Elicitation (MGE), is designed to be scalable and robust against ideological manipulation.
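To make the "values card" idea concrete, here is a minimal sketch of how such a card might be represented in code. The class name, fields, and example content are illustrative assumptions, not the article's actual schema; the key point is that a value is stored as a context plus a list of concrete attentional policies rather than an abstract slogan.

```python
from dataclasses import dataclass

@dataclass
class ValuesCard:
    """Hypothetical representation of a values card: a value expressed as
    the concrete criteria (attentional policies) a person attends to when
    making a meaningful choice in a given context."""
    title: str
    context: str                     # the kind of situation the value applies to
    attentional_policies: list[str]  # concrete things to pay attention to

# Example card (content invented for illustration)
card = ValuesCard(
    title="Informed autonomy",
    context="User asks for advice on a major life decision",
    attentional_policies=[
        "OPTIONS the person may not have considered",
        "TRADEOFFS they care about but have not articulated",
        "MOMENTS where they want reassurance rather than answers",
    ],
)
```

Storing policies as explicit strings is what makes the target auditable: a human can read each criterion and judge whether a model's response actually attended to it.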
3. How does the moral graph approach differ from other alignment proposals like reinforcement learning from human feedback (RLHF) and constitutional AI (CAI)?
- RLHF relies on comparisons from a small set of paid labelers, which lacks legitimacy and auditability.
- CAI uses high-level principles that are not fine-grained enough to provide clear guidance.
- The moral graph aims to be more fine-grained, generalizable, scalable, robust, legitimate, and auditable than these existing approaches.
[02] Moral Graph Elicitation (MGE) Process
1. How does the MGE process elicit values from people? The process uses a language model to interview participants about their values in particular contexts, drawing out "attentional policies" - the concrete criteria they pay attention to when making meaningful choices. These are then formatted into "values cards".
2. How does MGE construct the moral graph from the elicited values? The moral graph is a data structure that captures relationships of "wisdom" between values cards. Participants are shown stories of purported value transitions (someone coming to see one value as wiser than another) and asked whether they agree that, in the given context, the new value is wiser. Aggregating these endorsements allows the "wisest" values to bubble up.
3. What are the key innovations in the MGE process compared to other value elicitation approaches? The key innovations are:
- Values cards that represent values as concrete attentional policies, rather than abstract words or slogans.
- The moral graph structure that captures relationships of wisdom between values, rather than just aggregating them.
- The use of story-based prompts to have participants evaluate the relative wisdom of values.
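The graph construction described above can be sketched as code. This is a simplified illustration under stated assumptions, not the article's actual algorithm: nodes are values-card identifiers, each directed edge records participant endorsements that one value is wiser than another in a context, and the "wisest" values in a context are those endorsed as wiser than something while no value is endorsed as wiser than them. The class name, method names, and example values are hypothetical.

```python
from collections import defaultdict

class MoralGraph:
    """Toy moral graph: directed edges (less_wise -> wiser) per context,
    weighted by how many participants endorsed the transition."""

    def __init__(self):
        # (from_value, to_value, context) -> endorsement count
        self.edges = defaultdict(int)

    def add_transition(self, from_value: str, to_value: str, context: str):
        """Record one participant endorsing to_value as wiser than from_value."""
        self.edges[(from_value, to_value, context)] += 1

    def winning_values(self, context: str, min_endorsements: int = 1):
        """Values endorsed as wiser than some other value in this context,
        with no value endorsed as wiser than them ('wisest bubble up')."""
        wiser_than_something, superseded = set(), set()
        for (src, dst, ctx), n in self.edges.items():
            if ctx == context and n >= min_endorsements:
                wiser_than_something.add(dst)
                superseded.add(src)
        return wiser_than_something - superseded

# Invented example: two endorsed transitions in one context
g = MoralGraph()
g.add_transition("strict rule-following", "care for the person", "divisive advice request")
g.add_transition("care for the person", "informed autonomy", "divisive advice request")
print(g.winning_values("divisive advice request"))  # {'informed autonomy'}
```

Because edges are keyed by context, the same pair of values can stand in different wisdom relations in different situations, which is what gives the structure its fine-grained, context-specific guidance.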
[03] Case Study Evaluation
1. What were the main findings from the case study with 500 representative Americans? The case study found that the moral graph produced by the MGE process showed promising results across the six criteria:
- Robustness: Participants endorsed the values they articulated, even if their initial responses were ideological.
- Fine-grainedness: The moral graph provides clear guidance on which values apply in which contexts.
- Generalizability: Participants were able to apply values from other scenarios to their chosen scenario.
- Scalability: Expertise and "wiser" values tended to rise to the top of the moral graph.
- Legitimacy: Participants felt the process was fair and representative of their values.
2. How did the moral graph perform compared to other approaches like CCAI? The article found that the moral graph outperformed CCAI on measures of legitimacy, such as participants feeling well-represented and respecting the other participants' views.
3. What are some of the limitations and future work discussed for the moral graph approach? Limitations include:
- The case study was limited to a US population, so it's unclear how well it generalizes to other cultures.
- The process requires significant time and compute from participants.
- There may be some contexts with irreconcilable value conflicts that the moral graph cannot resolve.
Future work is needed to address these limitations and further evaluate the approach.