aepo package

Subpackages

Submodules

aepo.cli module

aepo.cli.aepo()
Parameters:
  • --cache_dir – the directory to cache the dataset.

  • --split – the split of the dataset to use.

  • --output – the path of the output file.

  • --num_instructions – the number of instructions.

  • --num_responses – the maximum number of responses per instruction in the dataset.

  • --num_annotations – the number of annotations available per instruction. To generate a pairwise preference dataset, set to 2.

  • --similarity_measure – the similarity measure to use for diverse MBR.

  • --diversity_penalty – the diversity penalty for diverse MBR.

  • --reward_model – the repository name of the reward model on the Hugging Face Hub. Default is OpenAssistant/reward-model-deberta-v3-large-v2.

  • --west_of_n – use the west-of-n strategy to generate the preference dataset.

  • --access_token – the read access token for the Hugging Face API.

  • --use_sample_cache – use the cached sample dataset.

  • --use_matrix_cache – use the cached similarity matrix.

  • --debug – enable debug mode.

Returns:

None

The command line interface of AEPO.
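
The options above can be combined as in the following minimal sketch, which only assembles and prints a command line and does not run anything. Several details are assumptions that this page does not confirm: the executable name `aepo`, the dataset being passed as a positional argument, the placeholder dataset and output names, and the treatment of options such as `--debug` as value-less switches. `build_aepo_command` is a hypothetical helper and not part of the package.

```python
import shlex


def build_aepo_command(dataset, **options):
    """Assemble an argv list for the AEPO CLI (hypothetical helper).

    Assumptions: the executable is named `aepo`, the dataset is a positional
    argument, and boolean options are value-less switches.
    """
    argv = ["aepo", dataset]
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            argv.append(flag)              # switch such as --debug
        elif value is False or value is None:
            continue                       # leave unset options out
        else:
            argv.extend([flag, str(value)])
    return argv


command = build_aepo_command(
    "my-org/my-dataset",     # placeholder dataset name
    split="train",
    num_instructions=10,
    num_responses=8,
    num_annotations=2,       # 2 gives a pairwise preference dataset
    output="aepo_output",    # placeholder output path
    debug=True,
)
print(shlex.join(command))
```

Building the list first and joining it with `shlex.join` keeps values with spaces or shell metacharacters safely quoted if the command is later pasted into a terminal.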

aepo.preprocess module

aepo.preprocess.ds2csv(ds: Dataset, sample_dir: str, num_instructions: int = 4, num_responses: int = 32)

Convert the dataset to CSV files.

Parameters:
  • ds (datasets.Dataset) – the annotation-efficient dataset.

  • sample_dir (str) – the directory to save the CSV files.

  • num_instructions (int) – the number of instructions.

  • num_responses (int) – the number of responses per instruction used for AEPO.

Returns:

None

aepo.preprocess.read_dataset(file_path: str, split: str, access_token: str | None = None) → Dataset
Parameters:
  • file_path (str) – the path of the input dataset file, or its repository name on the Hugging Face Hub.

  • split (str) – the split of the dataset to use.

  • access_token (str | None) – the read access token for the Hugging Face API.

Returns:

the annotation-efficient dataset.

Return type:

datasets.Dataset

Read the dataset from a file or the Hugging Face Hub.
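
As a rough sketch of how the two functions in this module might be chained, the helper below reads a dataset with `read_dataset` and passes the result to `ds2csv`, using only the signatures documented above. `export_samples` is a hypothetical wrapper, reading the token from the `HF_TOKEN` environment variable is an assumed convention, and it is also an assumption that the dataset returned by `read_dataset` is suitable as the `ds` argument of `ds2csv`.

```python
import os


def export_samples(file_path, split, sample_dir,
                   num_instructions=4, num_responses=32, env=os.environ):
    """Read a dataset and write it out as CSV files (hypothetical wrapper)."""
    # Imported lazily so this module loads even when aepo is not installed.
    from aepo.preprocess import ds2csv, read_dataset

    # Assumed convention: take the read token from HF_TOKEN if it is set.
    # A missing token is fine because access_token defaults to None.
    token = env.get("HF_TOKEN")
    ds = read_dataset(file_path, split, access_token=token)

    # Creating the directory up front is harmless if ds2csv also creates it.
    os.makedirs(sample_dir, exist_ok=True)
    ds2csv(ds, sample_dir,
           num_instructions=num_instructions,
           num_responses=num_responses)
    return ds
```

A call such as `export_samples("my-org/my-dataset", "train", "samples", num_instructions=10, num_responses=8)` would then use a placeholder dataset name and write into a local `samples` directory.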

Module contents