Age | Commit message | Author |
|
split_data_frame should only split the data frame. It should not
filter out metadata columns.
|
|
Earlier, we were generating unique SNPs in genotype frames by dropping
duplicates. This meant we couldn't control the number of SNPs.
Rejection sampling is also not an option because it is too expensive.
So, we now generate unique SNPs directly, by first generating a list
with unique elements and then converting to a data frame.
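
The difference between deduplicating afterwards and generating unique
elements up front can be sketched in plain Python. This is only an
illustration under assumptions: the project's tests use hypothesis
strategies, whereas the hypothetical unique_snps helper below uses
random.sample, and the chromosome names and position range are made up.

```python
import random


def unique_snps(n, chromosomes=("1", "2", "X"), max_position=1000, seed=0):
    """Return exactly n distinct (chromosome, position) pairs.

    Hypothetical helper for illustration only; the name, arguments and
    ranges are assumptions, not the project's API.
    """
    rng = random.Random(seed)
    # Encode every possible SNP as one integer and sample without
    # replacement, so uniqueness holds by construction and the number
    # of SNPs is exactly n. No duplicates are dropped and no draws are
    # rejected.
    codes = rng.sample(range(len(chromosomes) * max_position), n)
    return [(chromosomes[code // max_position], code % max_position + 1)
            for code in codes]


snps = unique_snps(5)
# One dict per SNP, ready to be converted to a data frame.
rows = [{"chromosome": c, "position": p} for c, p in snps]
```

Because the list is unique before it becomes a data frame, the caller
decides the number of SNPs instead of discovering it after duplicates
are removed.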
|
|
Abstract out generation of genotype frame metadata (namely chromosome,
position and reference) from summaries and genotype_frames into a
new helper function genotype_metadata.
|
|
Add keys strategy, and use it.
|
|
It is much simpler and more robust to compare the expected and actual
data frames directly.
|
|
A cat-phenotype subcommand is coming. Hence rename this.
|
|
Promote phenotype_reserved_column_name_p from helpers.strategies to
is_phenotype_metadata_column in pyhegp.serialization.
|
|
pd.concat duplicates the metadata columns, and is generally the wrong
approach to the problem.
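
One way such duplication can arise is sketched below. The frames and
column names are made up for illustration, and the use of pd.merge as
the alternative is an assumption; the commit does not say what replaced
pd.concat.

```python
import pandas as pd

# Two hypothetical genotype frames sharing the same metadata columns.
a = pd.DataFrame({"chromosome": ["1", "1"],
                  "position": [10, 20],
                  "sample1": [0.0, 1.0]})
b = pd.DataFrame({"chromosome": ["1", "1"],
                  "position": [10, 20],
                  "sample2": [2.0, 0.0]})

# Column-wise concatenation keeps both copies of the metadata columns.
concatenated = pd.concat([a, b], axis=1)

# Joining on the metadata columns keeps a single copy of each.
merged = pd.merge(a, b, on=["chromosome", "position"])

print(list(concatenated.columns))
print(list(merged.columns))
```

Joining on the metadata also aligns rows by SNP rather than by row
order, which matters when the input frames list SNPs differently.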
|
|
Test cat_genotype extensively using hypothesis.
|
|
Promote genotype_reserved_column_name_p from helpers.strategies to
is_genotype_metadata_column in pyhegp.serialization, and use it
everywhere.
|
|
These strategies may be used by other test modules as well.
|
|
We distinguish CLI subcommand functions using the _command suffix.
This way, we don't have to concoct weird names for the actual
workhorse functions.
To remain consistent, we also suffix _command to the command testing
functions.
|
|
Make output ciphertext file path implicit; infer it by appending
".hegp" to the plaintext file. We take inspiration from GnuPG.
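
A minimal sketch of the path inference, assuming a hypothetical helper
named ciphertext_path (the name is not taken from the project):

```python
from pathlib import Path


def ciphertext_path(plaintext_path):
    """Return the plaintext path with ".hegp" appended to its file name.

    Hypothetical helper for illustration only.
    """
    plaintext_path = Path(plaintext_path)
    # Append to the full file name. Path.with_suffix would instead
    # replace an existing extension such as ".tsv".
    return plaintext_path.with_name(plaintext_path.name + ".hegp")


print(ciphertext_path("data/genotype.tsv"))
```

Appending rather than replacing the extension keeps the original file
name recoverable from the ciphertext name.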
|
|
We were testing for zero exit status. Now, in addition, we test for
the existence of output files. This is slightly more robust.
|
|
* pyhegp/pyhegp.py: Import reduce from functools.
(pool_summaries, encrypt_genotype): New functions.
(pool): Use pool_summaries.
(encrypt): Use encrypt_genotype.
* tests/test_pyhegp.py: Import pandas; Summary, read_summary and
read_genotype from pyhegp.serialization.
(test_pool, test_encrypt): New tests.
* test-data/encrypt-test-encrypted-genotype.tsv,
test-data/encrypt-test-genotype.tsv, test-data/encrypt-test-key,
test-data/encrypt-test-summary, test-data/pool-test-complete-summary,
test-data/pool-test-summary1, test-data/pool-test-summary2: New files.
|
|
* doc/file-formats.md (File formats)[key file]: New section.
* pyhegp/serialization.py: Import numpy.
(read_key, write_key): New functions.
* pyhegp/pyhegp.py: Import write_key from pyhegp.serialization.
(encrypt): Use write_key.
* tests/test_serialization.py: Import arrays and array_shapes from
hypothesis.extra.numpy; approx from pytest; read_key and write_key
from pyhegp.serialization.
(test_read_write_key_are_inverses): New test.
|
|
* pyhegp/pyhegp.py (genotype_summary): New function.
(summary): Use genotype_summary.
(encrypt): Compute summary if not provided.
* tests/test_pyhegp.py (test_simple_workflow): Remove xfail mark.
|
|
* README.md (How to use): Indent down into "Joint/federated analysis
with many data owners" section.
[Simple data sharing]: New section.
* doc/generate-images.sh: Add simple workflow.
* doc/workflow.png: Rename to doc/joint-workflow.png.
* doc/workflow.uml: Rename to doc/joint-workflow.uml.
* doc/simple-workflow.png, doc/simple-workflow.uml: New files.
* tests/test_pyhegp.py: Import pytest.
(test_simple_workflow): New test.
* test-data/genotype.tsv: New file.
|
|
* tests/test_pyhegp.py: Import CliRunner from click.testing, and main
from pyhegp.pyhegp.
(test_joint_workflow): New test.
* test-data/genotype0.tsv, test-data/genotype1.tsv,
test-data/genotype2.tsv, test-data/genotype3.tsv: New files.
|
|
* pyhegp/pyhegp.py: Import pandas.
(summary, pool, encrypt, cat): Use pandas data frames and new data
format.
* pyhegp/serialization.py: Import csv and pandas.
(Summary)[mean, std]: Delete fields.
[data]: New field.
(read_summary, write_summary, read_genotype, write_genotype): Use
pandas data frames and new data format.
* tests/test_serialization.py: Import column, columns and data_frames
from hypothesis.extra.pandas; pandas; negate from pyhegp.utils. Do not
import hypothesis.extra.numpy and approx from pytest.
(tabless_printable_ascii_text, chromosome_column, position_column,
reference_column, sample_names): New variables.
(summaries, genotype_reserved_column_name_p, genotype_frames): New
functions.
(test_read_write_summary_are_inverses): Use pandas data frames and new
data format.
(test_read_write_genotype_are_inverses): Use pandas for testing.
* doc/file-formats.md (File formats)[summary file]: Describe new
standard.
[genotype file]: New section.
* .guix/pyhegp-package.scm (pyhegp-package): Import python-pandas
from (gnu packages python-science).
(python-pyhegp)[propagated-inputs]: Add python-pandas.
* pyproject.toml (dependencies): Add pandas.
|
|
* tests/test_pyhegp.py (negate): Move to pyhegp.utils.
Import negate from pyhegp.utils.
* pyhegp/utils.py: New file.
|
|
* tests/test_pyhegp.py (test_pool_stats): Set relative tolerance to
1e-6.
|
|
* tests/test_serialization.py: Import read_genotype and write_genotype
from pyhegp.serialization.
(test_read_write_genotype_are_inverses): New test.
|
|
* tests/test_pyhegp.py: Import math.
(square_matrices, negate, is_singular): New functions.
(test_conservation_of_solutions): New test.
|
|
* pyhegp/pyhegp.py (hegp_encrypt, hegp_decrypt): Do not standardize or
unstandardize.
(encrypt): Standardize before calling hegp_encrypt.
* tests/test_pyhegp.py (test_hegp_encryption_decryption_are_inverses):
Do not pass mean and standard deviation for standardization and
unstandardization.
|
|
* tests/test_pyhegp.py (test_hegp_encryption_decryption_are_inverses):
Do not test encryption on order 1 matrices.
|
|
* pyhegp/pyhegp.py (hegp_encrypt): Standardize before encryption.
(hegp_decrypt): Unstandardize after decryption.
(encrypt): Pass in mean and standard deviation from summary file to
hegp_encrypt.
* tests/test_pyhegp.py (test_hegp_encryption_decryption_are_inverses):
Pass in mean and standard deviation to hegp_encrypt.
|
|
* pyhegp/pyhegp.py (standardize): Standardize using mean and standard
deviation, instead of the minor allele frequency.
(unstandardize): New function.
* tests/test_pyhegp.py: Import standardize and unstandardize from
pyhegp.pyhegp.
(no_column_zero_standard_deviation): New function.
(test_standardize_unstandardize_are_inverses): New test.
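
The inverse relationship being tested can be sketched on a single
column in plain Python. The function names and signatures below are
simplified assumptions for illustration, the project's own functions
may operate on whole matrices, and the choice of population standard
deviation is also an assumption.

```python
import math
import statistics


def standardize_column(column, mean, std):
    """Centre on the mean and scale by the standard deviation."""
    # std must be nonzero; a constant column cannot be standardized.
    return [(x - mean) / std for x in column]


def unstandardize_column(column, mean, std):
    """Undo standardize_column given the same mean and std."""
    return [x * std + mean for x in column]


column = [0.0, 1.0, 2.0, 1.0]
mean = statistics.fmean(column)
std = statistics.pstdev(column)

roundtrip = unstandardize_column(standardize_column(column, mean, std),
                                 mean, std)
```

The round trip only holds up to floating-point error, so a comparison
with a tolerance (for example math.isclose) is appropriate.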
|
|
* pyhegp/pyhegp.py: Import namedtuple from collections, and
read_summary from pyhegp.serialization.
(Stats): New type.
(pool_stats, pool): New functions.
* tests/test_pyhegp.py: Import Stats and pool_stats from
pyhegp.pyhegp.
(test_pool_stats): New test.
|
|
* doc/file-formats.md, pyhegp/serialization.py,
tests/test_serialization.py: New files.
|
|
It may be better to sample a smaller set of matrices finely than a
large set of matrices coarsely.
* tests/test_pyhegp.py (test_hegp_encryption_decryption_are_inverses):
Use default array shapes when testing encryption/decryption.
|
|
* tests/test_pyhegp.py (test_hegp_encryption_decryption_are_inverses):
Reduce maximum matrix size to 100.
|
|
* pyhegp/__init__.py: New file.
* pyhegp.py: Move to pyhegp/pyhegp.py.
* test_pyhegp.py: Move to tests/test_pyhegp.py. Import from
pyhegp.pyhegp instead of from pyhegp.
* pyproject.toml (project.scripts)[pyhegp]: Switch to
pyhegp.pyhegp:main.
|