soft_search.data package#

Submodules#

soft_search.data.irr module#

soft_search.data.irr.calc_fleiss_kappa(data: str | Path | DataFrame) float[source]#

Calculate the Fleiss Kappa score as a metric for inter-rater reliability for the soft-search dataset.

Parameters:
data: Union[str, Path, pd.DataFrame]

The path to the dataset (as parquet) or an in-memory DataFrame.

Returns:
float

The kappa statistic for the data.

Notes

See interpretation of Fleiss Kappa Statistic: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900052/table/t3-biochem-med-22-3-276-4/?report=objectonly
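For intuition, the statistic that this function reports can be computed from a subjects-by-categories count matrix. The following is a minimal, pure-Python illustration of Fleiss' Kappa itself (not the package's implementation, which reads the soft-search dataset from parquet or a DataFrame):

```python
# Fleiss' kappa from a subjects-by-categories count matrix.
# counts[i][j] = number of raters who assigned subject i to category j.

def fleiss_kappa(counts: list[list[int]]) -> float:
    n_subjects = len(counts)
    n_raters = sum(counts[0])  # assumes every subject has the same rater count
    # Mean per-subject agreement P_bar
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects
    # Chance agreement P_e from the category marginals
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    grand_total = n_subjects * n_raters
    p_e = sum((t / grand_total) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Two raters, three subjects, two categories, perfect agreement.
print(fleiss_kappa([[2, 0], [0, 2], [2, 0]]))  # 1.0
```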

soft_search.data.irr.print_irr_summary_stats(do_print: bool = True) float[source]#

Print useful statistics and summary stats using the stored inter-rater reliability data.

Prints:

- Cohen's Kappa statistic for each potential model
- Mean number of examples for each label between the two annotators
- The rows which differ between the two annotators

Parameters:
do_print: bool

Whether to actually print the table. Default: True (yes, print the table)

Returns:
float

The overall Fleiss Kappa statistic.
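The per-model Cohen's Kappa mentioned above measures pairwise agreement between two annotators. A minimal pure-Python sketch of that statistic, with invented labels for illustration (not this package's implementation):

```python
# Cohen's kappa for two annotators labelling the same items.
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each annotator's label marginals
    p_e = sum(ca[lbl] / n * cb[lbl] / n for lbl in ca.keys() | cb.keys())
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations: agreement on 3 of 4 items.
ann1 = ["software", "software", "not", "not"]
ann2 = ["software", "not", "not", "not"]
print(cohens_kappa(ann1, ann2))  # 0.5
```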

soft_search.data.soft_search_2022 module#

class soft_search.data.soft_search_2022.SoftSearch2022DatasetFields[source]#

Bases: object

abstract_text = 'abstract_text'#
from_template_repo = 'from_template_repo'#
is_a_fork = 'is_a_fork'#
label = 'label'#
nsf_award_id = 'nsf_award_id'#
project_outcomes = 'project_outcomes'#
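These classes act as column-name constants for the dataset. A short sketch of how such a constant holder is typically used to key into a row, with field names taken from the attributes documented above (the row values are invented):

```python
# Column-name constants matching the attributes documented above;
# referencing columns through the class keeps lookups typo-safe.
class SoftSearch2022DatasetFields:
    abstract_text = "abstract_text"
    from_template_repo = "from_template_repo"
    is_a_fork = "is_a_fork"
    label = "label"
    nsf_award_id = "nsf_award_id"
    project_outcomes = "project_outcomes"

# A toy row keyed by the documented column names (values hypothetical).
row = {
    SoftSearch2022DatasetFields.nsf_award_id: "1234567",
    SoftSearch2022DatasetFields.label: "software-predicted",
}
print(row[SoftSearch2022DatasetFields.label])  # software-predicted
```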
class soft_search.data.soft_search_2022.SoftSearch2022IRRDatasetFields[source]#

Bases: object

annotator = 'annotator'#
include_in_definition = 'include_in_definition'#
most_recent_commit_datetime = 'most_recent_commit_datetime'#
notes = 'notes'#
soft_search.data.soft_search_2022.load_github_repos_with_nsf_refs_2022() DataFrame[source]#

Load the GitHub repositories with references to NSF dataset.

Created via the get-github-repositories-with-nsf-ref bin script.

Returns:
pd.DataFrame

The dataset.

soft_search.data.soft_search_2022.load_linked_github_repositories_with_nsf_awards_2022() DataFrame[source]#

Load the GitHub repositories linked to specific NSF award IDs dataset.

Created via the find-nsf-award-ids-in-github-readmes-and-link bin script.

Returns:
pd.DataFrame

The dataset.

soft_search.data.soft_search_2022.load_soft_search_2022_training() DataFrame[source]#

Load the Software Search 2022 manually labelled dataset.

Returns:
pd.DataFrame

The dataset.

soft_search.data.soft_search_2022.load_soft_search_2022_training_irr() DataFrame[source]#

Load the Software Search 2022 Inter-Rater Reliability labelled dataset.

Returns:
pd.DataFrame

The dataset.

Module contents#

Stored datasets.

soft_search.data.load_soft_search_2022_training() DataFrame[source]#

Load the Software Search 2022 manually labelled dataset.

Returns:
pd.DataFrame

The dataset.

soft_search.data.load_soft_search_2022_training_irr() DataFrame[source]#

Load the Software Search 2022 Inter-Rater Reliability labelled dataset.

Returns:
pd.DataFrame

The dataset.