soft_search.label package

Submodules

soft_search.label.model_selection module

soft_search.label.model_selection.fit_and_eval_all_models(test_size: float = 0.2, seed: int = 0, archive: bool = False, train_transformer: bool = True, push_transformer: bool = False) → DataFrame

soft_search.label.regex module

soft_search.label.regex.label(df: DataFrame, apply_column: str = 'text', label_column: str = 'regex_match') → DataFrame

Add a new column, in place, to the provided pandas DataFrame that labels each row as software predicted or not, based solely on a regex match against various software-like and adjacent terms.

Parameters:

df: pd.DataFrame: The pandas DataFrame to which the regex-matched software outcome label column is added in place.
apply_column: str: The column to use for “prediction”. Default: “text”
label_column: str: The name of the column to add with outcome “prediction”. Default: “regex_match”

Returns:

pd.DataFrame: The same pandas DataFrame but with a new column added in-place containing the software outcome “prediction”.
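The in-place behaviour described above can be illustrated with a minimal sketch. The pattern `SOFTWARE_LIKE_PATTERN` and the helper `regex_label_sketch` are hypothetical: they are not the terms or the implementation used by soft_search, and the boolean label values are an assumption. Only the parameter names and defaults mirror the signature above.

```python
import re

import pandas as pd

# Hypothetical pattern for illustration only; the library's actual
# term list is not reproduced here.
SOFTWARE_LIKE_PATTERN = re.compile(r"software|algorithm|toolkit", re.IGNORECASE)


def regex_label_sketch(
    df: pd.DataFrame,
    apply_column: str = "text",
    label_column: str = "regex_match",
) -> pd.DataFrame:
    """Add a boolean column marking rows whose text matches the pattern."""
    # Assigning to df[label_column] mutates the caller's DataFrame,
    # so the returned object is the same one that was passed in.
    df[label_column] = df[apply_column].apply(
        lambda text: SOFTWARE_LIKE_PATTERN.search(str(text)) is not None
    )
    return df


example = pd.DataFrame(
    {
        "text": [
            "We will release open-source software.",
            "A field study of river deltas.",
        ]
    }
)
labelled = regex_label_sketch(example)
```

Because the column is added in place, `labelled` and `example` refer to the same DataFrame; pass `df.copy()` if the original must stay unchanged.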

soft_search.label.semantic_logit module

soft_search.label.semantic_logit.label() → None

soft_search.label.semantic_logit.train(train_df: str | Path | DataFrame, test_df: str | Path | DataFrame, text_col: str = 'abstract_text', label_col: str = 'label', model_storage_path: str | Path = Path('/home/runner/work/eager/eager/soft-search-semantic-logit.pkl')) → Tuple[Path, Pipeline, EvaluationMetrics]

soft_search.label.tfidf_logit module

soft_search.label.tfidf_logit.label() → None

soft_search.label.tfidf_logit.train(train_df: str | Path | DataFrame, test_df: str | Path | DataFrame, text_col: str = 'abstract_text', label_col: str = 'label', model_storage_path: str | Path = Path('/home/runner/work/eager/eager/soft-search-tfidf-logit-from-abstract.pkl')) → Tuple[Path, Pipeline, EvaluationMetrics]

soft_search.label.transformer module

soft_search.label.transformer.label(df: DataFrame, apply_column: str = 'abstract_text', label_column: str = 'transformer_label', model: str | Path = 'evamaxfield/soft-search') → DataFrame

Add a new column, in place, to the provided pandas DataFrame that labels each row as software predicted or not, using a trained transformer model.

Parameters:

df: pd.DataFrame: The pandas DataFrame to in-place add a column with the software predicted outcome labels.
apply_column: str: The column to use for “prediction”. Default: “abstract_text”
label_column: str: The name of the column to add with outcome “prediction”. Default: “transformer_label”
model: Union[str, Path]: The path to the stored model. Default: https://huggingface.co/evamaxfield/soft-search (latest CI model)

Returns:

pd.DataFrame: The same pandas DataFrame but with a new column added in-place containing the software outcome prediction.

See also

soft_search.nsf.get_nsf_dataset: Function to get an NSF dataset for prediction.

Examples

Example application to a new NSF dataset.

>>> from soft_search import constants, nsf
>>> from soft_search.label import transformer
>>> df = nsf.get_nsf_dataset(
...     "2016-01-01",
...     "2017-01-01",
...     dataset_fields=[constants.NSFFields.abstractText],
... )
>>> predicted = transformer.label(
...     df,
...     apply_column=constants.NSFFields.abstractText,
... )

soft_search.label.transformer.train(train_df: str | Path | DataFrame, test_df: str | Path | DataFrame, text_col: str = 'abstract_text', label_col: str = 'label', model_storage_path: str | Path = Path('/home/runner/work/eager/eager/soft-search-transformer'), base_model: str = 'distilbert-base-uncased-finetuned-sst-2-english', extra_training_args: Dict[str, Any] | None = None) → Tuple[Path, Trainer, TrainOutput, EvaluationMetrics]

Fine-tune a transformer model to classify the provided labels.

This function both fine-tunes the transformer and evaluates its performance on the test data.

Parameters:

train_df: Union[str, Path, pd.DataFrame]: The data to use for training. Only CSV file format is supported when providing a file path.
test_df: Union[str, Path, pd.DataFrame]: The data to use for evaluation. Only CSV file format is supported when providing a file path.
text_col: str: The column name which contains the raw text. Default: “abstract_text”
label_col: str: The column name which contains the labels. Default: “label”
model_storage_path: Union[str, Path]: The path to store the model to. Default: “soft-search-transformer/”
base_model: str: The base model to fine-tune. Default: “distilbert-base-uncased-finetuned-sst-2-english”
extra_training_args: Optional[Dict[str, Any]]: Any extra arguments to pass to the Trainer object.

Returns:

Path: The path to the stored model.
Trainer: The Trainer object.
TrainOutput: The final output of the trainer.train() call.
EvaluationMetrics: The evaluation metrics.
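Since train_df and test_df accept CSV file paths, one way to prepare inputs is sketched below. The toy rows, the label values, the 25% hold-out share, and the file names are all hypothetical placeholders; only the column names follow the documented defaults for text_col and label_col. The call to train() is left as a comment because it requires soft_search to be installed and runs a full fine-tuning job.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data; the label values are placeholders and may not
# match the label set the library expects.
data = pd.DataFrame(
    {
        "abstract_text": [
            "We will develop and release an open-source analysis toolkit.",
            "This project studies sediment transport in river deltas.",
            "The team will build a simulation package for climate models.",
            "Workshops will train graduate students in field methods.",
        ],
        "label": ["positive", "negative", "positive", "negative"],
    }
)

# Hold out a share of the rows for evaluation; the remainder is used for training.
test_split = data.sample(frac=0.25, random_state=0)
train_split = data.drop(test_split.index)

# Write both splits as CSV, the only file format supported for path inputs.
out_dir = Path(tempfile.mkdtemp())
train_path = out_dir / "train.csv"
test_path = out_dir / "test.csv"
train_split.to_csv(train_path, index=False)
test_split.to_csv(test_path, index=False)

# With soft_search installed, the paths could then be passed to train():
# from soft_search.label import transformer
# model_path, trainer, train_output, metrics = transformer.train(
#     train_df=train_path,
#     test_df=test_path,
# )
```

Keeping the two splits in separate files ensures that the rows used for evaluation are never seen during fine-tuning.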

Module contents

Different software outcome labellers.

soft_search.label.load_tfidf_logit_for_prediction_from_abstract() → Pipeline

soft_search.label.load_tfidf_logit_for_prediction_from_outcomes() → Pipeline