soft_search.data package#

Submodules#

soft_search.data.irr module#

soft_search.data.irr.calc_fleiss_kappa(data: str | Path | DataFrame) float[source]#

Calculate the Fleiss Kappa score as a metric for inter-rater reliability for the soft-search dataset.

Parameters:
data: Union[str, Path, pd.DataFrame]

The path to the dataset (as parquet) or an in-memory DataFrame.

Returns:
float

The kappa statistic for the data.

Notes

See interpretation of Fleiss Kappa Statistic: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900052/table/t3-biochem-med-22-3-276-4/?report=objectonly
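For intuition, the statistic that this function reports can be computed from a subjects-by-categories count matrix. The following is a minimal, pure-Python illustration of Fleiss' Kappa itself (not the package's implementation, which reads the soft-search dataset from parquet or a DataFrame):

```python
# Fleiss' kappa from a subjects-by-categories count matrix.
# counts[i][j] = number of raters who assigned subject i to category j.

def fleiss_kappa(counts: list[list[int]]) -> float:
    n_subjects = len(counts)
    n_raters = sum(counts[0])  # assumes every subject has the same rater count
    # Mean per-subject agreement P_bar
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects
    # Chance agreement P_e from the category marginals
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    grand_total = n_subjects * n_raters
    p_e = sum((t / grand_total) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Two raters, three subjects, two categories, perfect agreement.
print(fleiss_kappa([[2, 0], [0, 2], [2, 0]]))  # 1.0
```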

soft_search.data.irr.print_irr_summary_stats(do_print: bool = True) float[source]#

Print useful statistics and summary stats using the stored inter-rater reliability data.

Prints:

- Cohen's Kappa statistic for each potential model
- Mean number of examples for each label between the two annotators
- The rows which differ between the two annotators

Parameters:
do_print: bool

Whether to actually print the table. Default: True (yes, print the table)

Returns:
float

The overall Fleiss Kappa statistic.
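The per-model Cohen's Kappa mentioned above measures pairwise agreement between two annotators. A minimal pure-Python sketch of that statistic, with invented labels for illustration (not this package's implementation):

```python
# Cohen's kappa for two annotators labelling the same items.
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each annotator's label marginals
    p_e = sum(ca[lbl] / n * cb[lbl] / n for lbl in ca.keys() | cb.keys())
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations: agreement on 3 of 4 items.
ann1 = ["software", "software", "not", "not"]
ann2 = ["software", "not", "not", "not"]
print(cohens_kappa(ann1, ann2))  # 0.5
```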

soft_search.data.soft_search_2022 module#

class soft_search.data.soft_search_2022.SoftSearch2022DatasetFields[source]#

Bases: object

abstract_text = 'abstract_text'#
from_template_repo = 'from_template_repo'#
is_a_fork = 'is_a_fork'#
label = 'label'#
nsf_award_id = 'nsf_award_id'#
project_outcomes = 'project_outcomes'#
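These classes act as column-name constants for the dataset. A short sketch of how such a constant holder is typically used to key into a row, with field names taken from the attributes documented above (the row values are invented):

```python
# Column-name constants matching the attributes documented above;
# referencing columns through the class keeps lookups typo-safe.
class SoftSearch2022DatasetFields:
    abstract_text = "abstract_text"
    from_template_repo = "from_template_repo"
    is_a_fork = "is_a_fork"
    label = "label"
    nsf_award_id = "nsf_award_id"
    project_outcomes = "project_outcomes"

# A toy row keyed by the documented column names (values hypothetical).
row = {
    SoftSearch2022DatasetFields.nsf_award_id: "1234567",
    SoftSearch2022DatasetFields.label: "software-predicted",
}
print(row[SoftSearch2022DatasetFields.label])  # software-predicted
```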
class soft_search.data.soft_search_2022.SoftSearch2022IRRDatasetFields[source]#

Bases: object

annotator = 'annotator'#
include_in_definition = 'include_in_definition'#
most_recent_commit_datetime = 'most_recent_commit_datetime'#
notes = 'notes'#
soft_search.data.soft_search_2022.load_github_repos_with_nsf_refs_2022() DataFrame[source]#

Load the GitHub repositories with references to NSF dataset.

Created via the get-github-repositories-with-nsf-ref bin script.

Returns:
pd.DataFrame

The dataset.

soft_search.data.soft_search_2022.load_linked_github_repositories_with_nsf_awards_2022() DataFrame[source]#

Load the GitHub repositories linked to specific NSF award IDs dataset.

Created via the find-nsf-award-ids-in-github-readmes-and-link bin script.

Returns:
pd.DataFrame

The dataset.

soft_search.data.soft_search_2022.load_soft_search_2022_training() DataFrame[source]#

Load the Software Search 2022 manually labelled dataset.

Returns:
pd.DataFrame

The dataset.

soft_search.data.soft_search_2022.load_soft_search_2022_training_irr() DataFrame[source]#

Load the Software Search 2022 Inter-Rater Reliability labelled dataset.

Returns:
pd.DataFrame

The dataset.

Module contents#

Stored datasets.

soft_search.data.load_soft_search_2022_training() DataFrame[source]#

Load the Software Search 2022 manually labelled dataset.

Returns:
pd.DataFrame

The dataset.

soft_search.data.load_soft_search_2022_training_irr() DataFrame[source]#

Load the Software Search 2022 Inter-Rater Reliability labelled dataset.

Returns:
pd.DataFrame

The dataset.