soft_search.label package#
Submodules#
soft_search.label.model_selection module#
soft_search.label.regex module#
- soft_search.label.regex.label(df: DataFrame, apply_column: str = 'text', label_column: str = 'regex_match') → DataFrame
Add a new column to the provided pandas DataFrame, in place, labelling each row as software predicted or not based solely on a regex match against software-like and adjacent terminology.
- Parameters:
- df: pd.DataFrame
The pandas DataFrame to which the column of regex-matched software outcome labels is added in place.
- apply_column: str
The column to use for “prediction”. Default: “text”
- label_column: str
The name of the column to add with outcome “prediction”. Default: “regex_match”
- Returns:
- pd.DataFrame
The same pandas DataFrame but with a new column added in-place containing the software outcome “prediction”.
See also
soft_search.nsf.get_nsf_dataset
Function to get an NSF dataset for prediction.
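The regex labelling described above can be sketched without pandas. This is a minimal illustration, not the library's implementation: the pattern `SOFTWARE_LIKE`, the helper `label_rows`, and the boolean label values are all assumptions, and plain dicts stand in for DataFrame rows. It mirrors the documented behaviour of adding the label column in place and returning the same object.

```python
import re

# Hypothetical pattern for illustration only; the terms that
# soft_search actually matches are not listed in this section.
SOFTWARE_LIKE = re.compile(
    r"\b(software|toolkit|open[- ]source|library|algorithm)\b",
    re.IGNORECASE,
)


def label_rows(rows, apply_column="text", label_column="regex_match"):
    """Add `label_column` to every row in place and return the same list."""
    for row in rows:
        text = row.get(apply_column) or ""
        # Assumption: a boolean label; the real label values may differ.
        row[label_column] = bool(SOFTWARE_LIKE.search(text))
    return rows


rows = [
    {"text": "We will release an open-source toolkit for data analysis."},
    {"text": "This award funds field work on alpine lake sediments."},
]
labelled = label_rows(rows)
print([r["regex_match"] for r in labelled])
```

Because the rows are mutated in place, `labelled` and `rows` refer to the same list, just as the documented function returns the same DataFrame it was given.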
- soft_search.label.regex.train(df: str | Path | DataFrame, text_col: str = 'abstract_text', label_col: str = 'label') → EvaluationMetrics
soft_search.label.semantic_logit module#
- soft_search.label.semantic_logit.train(train_df: str | Path | DataFrame, test_df: str | Path | DataFrame, text_col: str = 'abstract_text', label_col: str = 'label', model_storage_path: str | Path = Path('/home/runner/work/eager/eager/soft-search-semantic-logit.pkl')) → Tuple[Path, Pipeline, EvaluationMetrics]
soft_search.label.tfidf_logit module#
- soft_search.label.tfidf_logit.train(train_df: str | Path | DataFrame, test_df: str | Path | DataFrame, text_col: str = 'abstract_text', label_col: str = 'label', model_storage_path: str | Path = Path('/home/runner/work/eager/eager/soft-search-tfidf-logit-from-abstract.pkl')) → Tuple[Path, Pipeline, EvaluationMetrics]
soft_search.label.transformer module#
- soft_search.label.transformer.label(df: DataFrame, apply_column: str = 'abstract_text', label_column: str = 'transformer_label', model: str | Path = 'evamaxfield/soft-search') → DataFrame
Add a new column to the provided pandas DataFrame, in place, labelling each row as software predicted or not using a trained transformer model.
- Parameters:
- df: pd.DataFrame
The pandas DataFrame to which the column of predicted software outcome labels is added in place.
- apply_column: str
The column to use for “prediction”. Default: “abstract_text”
- label_column: str
The name of the column to add with outcome “prediction”. Default: “transformer_label”
- model: Union[str, Path]
The path to the stored model. Default: https://huggingface.co/evamaxfield/soft-search (latest CI model)
- Returns:
- pd.DataFrame
The same pandas DataFrame but with a new column added in-place containing the software outcome prediction.
See also
soft_search.nsf.get_nsf_dataset
Function to get an NSF dataset for prediction.
Examples
Example application to a new NSF dataset.
>>> from soft_search import constants, nsf
>>> from soft_search.label import transformer
>>> df = nsf.get_nsf_dataset(
...     "2016-01-01",
...     "2017-01-01",
...     dataset_fields=[constants.NSFFields.abstractText],
... )
>>> predicted = transformer.label(
...     df,
...     apply_column=constants.NSFFields.abstractText,
... )
- soft_search.label.transformer.train(train_df: str | Path | DataFrame, test_df: str | Path | DataFrame, text_col: str = 'abstract_text', label_col: str = 'label', model_storage_path: str | Path = Path('/home/runner/work/eager/eager/soft-search-transformer'), base_model: str = 'distilbert-base-uncased-finetuned-sst-2-english', extra_training_args: Dict[str, Any] | None = None) → Tuple[Path, Trainer, TrainOutput, EvaluationMetrics]
Fine-tune a transformer model to classify the provided labels.
This function both fine-tunes the transformer and evaluates its performance on the test data.
- Parameters:
- train_df: Union[str, Path, pd.DataFrame]
The data to use for training. Only CSV file format is supported when providing a file path.
- test_df: Union[str, Path, pd.DataFrame]
The data to use for testing (evaluation). Only CSV file format is supported when providing a file path.
- text_col: str
The column name which contains the raw text. Default: “abstract_text”
- label_col: str
The column name which contains the labels. Default: “label”
- model_storage_path: Union[str, Path]
The path to store the model to. Default: “soft-search-transformer/”
- base_model: str
The base model to fine-tune. Default: “distilbert-base-uncased-finetuned-sst-2-english”
- extra_training_args: Optional[Dict[str, Any]]
Any extra arguments to pass to the Trainer object.
- Returns:
- Path
The path to the stored model.
- Trainer
The Trainer object.
- TrainOutput
The final output of the trainer.train() call.
- EvaluationMetrics
The evaluation metrics.
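The `train_df` and `test_df` parameters accept either a CSV file path or data that is already in memory. One way such an input could be normalised is sketched below using only the standard library. The helper `load_records`, the list-of-dicts representation, and the sample column values are hypothetical stand-ins; the library itself works with pandas DataFrames and its internal loading logic is not shown in this section.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Union

Records = List[Dict[str, str]]


def load_records(data: Union[str, Path, Records]) -> Records:
    """Return rows as a list of dicts, reading a CSV file when given a path."""
    if isinstance(data, (str, Path)):
        path = Path(data)
        # Mirror the documented restriction: only CSV paths are accepted.
        if path.suffix.lower() != ".csv":
            raise ValueError(f"Only CSV files are supported, got: {path.name}")
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    # Already-loaded data is passed through unchanged.
    return data


# Usage: write a tiny CSV to a temporary directory, then load it by path.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "train.csv"
    csv_path.write_text(
        "abstract_text,label\nBuilds a parser,software\n",
        encoding="utf-8",
    )
    from_path = load_records(csv_path)

# In-memory data is returned as-is.
from_memory = load_records(from_path)
```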
See also
label
A function to apply a model across a pandas DataFrame.
Examples
Example training from supplied manually labelled data.
>>> from soft_search.data import load_joined_soft_search_2022
>>> from soft_search.label import transformer
>>> from sklearn.model_selection import train_test_split
>>> df = load_joined_soft_search_2022()
>>> train, test = train_test_split(
...     df,
...     test_size=0.2,
...     stratify=df["label"],
... )
>>> model_path, trainer, train_output, metrics = transformer.train(train, test)
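The `stratify` argument in the example above keeps the label proportions roughly equal in the training and test sets. The idea can be sketched in plain Python as follows. This is an illustration only and not scikit-learn's algorithm: `stratified_split`, the toy rows, and the label values `"software"` and `"no-software"` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", test_size=0.2, seed=0):
    """Split rows so each label keeps roughly the same share in both parts."""
    # Group rows by their label value.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        # Take the same fraction of every label group for the test set.
        n_test = round(len(shuffled) * test_size)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Toy data: 10 rows of one label and 40 of the other.
rows = (
    [{"abstract_text": f"software abstract {i}", "label": "software"} for i in range(10)]
    + [{"abstract_text": f"other abstract {i}", "label": "no-software"} for i in range(40)]
)
train, test = stratified_split(rows, test_size=0.2)
```

With a 1:4 label ratio in the input, the ten test rows keep that ratio (two `"software"` rows and eight others), which matters when one class is much rarer than the other.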
Module contents#
Different software outcome labellers.