soft_search.label package

Submodules

soft_search.label.model_selection module

soft_search.label.model_selection.fit_and_eval_all_models(test_size: float = 0.2, seed: int = 0, archive: bool = False, train_transformer: bool = True, push_transformer: bool = False) → DataFrame

soft_search.label.regex module

soft_search.label.regex.label(df: DataFrame, apply_column: str = 'text', label_column: str = 'regex_match') → DataFrame

Add a new column to the provided pandas DataFrame, in place, labelling each row as software predicted or not. The label is based solely on a regex match against software-like and adjacent terminology.

Parameters:
df: pd.DataFrame

The pandas DataFrame to which the column of regex-matched software outcome labels is added in place.

apply_column: str

The column to use for “prediction”. Default: “text”

label_column: str

The name of the column to add with outcome “prediction”. Default: “regex_match”

Returns:
pd.DataFrame

The same pandas DataFrame but with a new column added in-place containing the software outcome “prediction”.

See also

soft_search.nsf.get_nsf_dataset

Function to get an NSF dataset for prediction.
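
The in-place behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions: the pattern below is hypothetical (the terms soft_search actually matches are not listed on this page), the helper name label_rows is invented for the sketch, a dict of column lists stands in for the pandas DataFrame, and storing the outcome as a boolean is also an assumption.

```python
import re
from typing import Dict, List

# Hypothetical pattern for illustration only; the real terminology
# matched by soft_search.label.regex is not shown in this reference.
SOFTWARE_LIKE = re.compile(
    r"\b(software|algorithm|toolkit|open[- ]source)\b",
    re.IGNORECASE,
)


def label_rows(
    table: Dict[str, List],
    apply_column: str = "text",
    label_column: str = "regex_match",
) -> Dict[str, List]:
    """Add a boolean column to ``table`` in place and return the same object."""
    table[label_column] = [
        bool(SOFTWARE_LIKE.search(text)) for text in table[apply_column]
    ]
    return table


# A dict of column lists stands in for a DataFrame in this sketch.
table = {
    "text": [
        "We will release an open-source toolkit for genome assembly.",
        "This award funds a summer workshop for teachers.",
    ]
}
labelled = label_rows(table)
```

As with the documented function, the returned object is the one that was passed in, so callers that need the unlabelled data afterwards should copy it first.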

soft_search.label.regex.train(df: str | Path | DataFrame, text_col: str = 'abstract_text', label_col: str = 'label') → EvaluationMetrics

soft_search.label.semantic_logit module

soft_search.label.semantic_logit.label() → None

soft_search.label.semantic_logit.train(train_df: str | Path | DataFrame, test_df: str | Path | DataFrame, text_col: str = 'abstract_text', label_col: str = 'label', model_storage_path: str | Path = Path('/home/runner/work/eager/eager/soft-search-semantic-logit.pkl')) → Tuple[Path, Pipeline, EvaluationMetrics]

soft_search.label.tfidf_logit module

soft_search.label.tfidf_logit.label() → None

soft_search.label.tfidf_logit.train(train_df: str | Path | DataFrame, test_df: str | Path | DataFrame, text_col: str = 'abstract_text', label_col: str = 'label', model_storage_path: str | Path = Path('/home/runner/work/eager/eager/soft-search-tfidf-logit-from-abstract.pkl')) → Tuple[Path, Pipeline, EvaluationMetrics]

soft_search.label.transformer module

soft_search.label.transformer.label(df: DataFrame, apply_column: str = 'abstract_text', label_column: str = 'transformer_label', model: str | Path = 'evamaxfield/soft-search') → DataFrame

Add a new column to the provided pandas DataFrame, in place, labelling each row as software predicted or not using a trained transformer model.

Parameters:
df: pd.DataFrame

The pandas DataFrame to which the column of predicted software outcome labels is added in place.

apply_column: str

The column to use for “prediction”. Default: “abstract_text”

label_column: str

The name of the column to add with outcome “prediction”. Default: “transformer_label”

model: Union[str, Path]

The path to the stored model. Default: https://huggingface.co/evamaxfield/soft-search (latest CI model)

Returns:
pd.DataFrame

The same pandas DataFrame but with a new column added in-place containing the software outcome prediction.

See also

soft_search.nsf.get_nsf_dataset

Function to get an NSF dataset for prediction.

Examples

Example application to a new NSF dataset.

>>> from soft_search import constants, nsf
>>> from soft_search.label import transformer
>>> df = nsf.get_nsf_dataset(
...     "2016-01-01",
...     "2017-01-01",
...     dataset_fields=[constants.NSFFields.abstractText],
... )
>>> predicted = transformer.label(
...     df,
...     apply_column=constants.NSFFields.abstractText,
... )

soft_search.label.transformer.train(train_df: str | Path | DataFrame, test_df: str | Path | DataFrame, text_col: str = 'abstract_text', label_col: str = 'label', model_storage_path: str | Path = Path('/home/runner/work/eager/eager/soft-search-transformer'), base_model: str = 'distilbert-base-uncased-finetuned-sst-2-english', extra_training_args: Dict[str, Any] | None = None) → Tuple[Path, Trainer, TrainOutput, EvaluationMetrics]

Fine-tune a transformer model to classify the provided labels.

This function will both train and evaluate the performance of the fine-tuned transformer.

Parameters:
train_df: Union[str, Path, pd.DataFrame]

The data to use for training. Only CSV file format is supported when providing a file path.

test_df: Union[str, Path, pd.DataFrame]

The data to use for testing (evaluation). Only CSV file format is supported when providing a file path.

text_col: str

The column name which contains the raw text. Default: “abstract_text”

label_col: str

The column name which contains the labels. Default: “label”

model_storage_path: Union[str, Path]

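The path to store the model to. Default: “soft-search-transformer/” (the absolute path in the signature above appears to be this directory as resolved on the machine that built the documentation)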

base_model: str

The base model to fine-tune. Default: “distilbert-base-uncased-finetuned-sst-2-english”

extra_training_args: Optional[Dict[str, Any]]

Any extra arguments to pass to the Trainer object.

Returns:
Path

The path to the stored model.

Trainer

The Trainer object.

TrainOutput

The final output of the trainer.train() call.

EvaluationMetrics

The evaluation metrics.

See also

label

A function to apply a model across a pandas DataFrame.

Examples

Example training from supplied manually labelled data.

>>> from soft_search.data import load_joined_soft_search_2022
>>> from soft_search.label import transformer
>>> from sklearn.model_selection import train_test_split
>>> df = load_joined_soft_search_2022()
>>> train, test = train_test_split(
...     df,
...     test_size=0.2,
...     stratify=df["label"]
... )
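>>> # test_df has no default, and train returns a 4-tuple
>>> model_path, trainer, train_output, metrics = transformer.train(train, test)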
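
The stratify=df["label"] argument in the example above keeps the ratio of labels the same in the training and test splits. The idea can be sketched in pure Python as follows. This is not scikit-learn's implementation; the function name stratified_split and the label values are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable],
    test_size: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), keeping per-label proportions."""
    rng = random.Random(seed)

    # Group row indices by their label value.
    by_label: Dict[Hashable, List[int]] = defaultdict(list)
    for index, value in enumerate(labels):
        by_label[value].append(index)

    train_idx: List[int] = []
    test_idx: List[int] = []
    for indices in by_label.values():
        # Shuffle within each label group, then cut off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical, imbalanced label column: 10 positive rows, 40 negative rows.
labels = ["software"] * 10 + ["no-software"] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=0)
```

Because the cut is made inside each label group, the minority label is represented in the test split in the same proportion as in the full data, which matters when the evaluation metrics are computed on an imbalanced label column.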

Module contents

Different software outcome labellers.

soft_search.label.load_tfidf_logit_for_prediction_from_abstract() → Pipeline

soft_search.label.load_tfidf_logit_for_prediction_from_outcomes() → Pipeline