| Age | Commit message | Author |
|
Earlier, we required that (chromosome, position, reference) be
unique. We now tighten this restriction and require that (chromosome,
position) be unique. Therefore, there can be only one reference allele
at any given chromosome and position.
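
A minimal sketch of the difference between the two rules, using pandas.
The frame, its column names and the helper has_unique_snps are
hypothetical and only illustrate the check; they are not taken from
pyhegp itself.

```python
import pandas as pd

# Hypothetical genotype frame; the column names are assumed for illustration.
genotype = pd.DataFrame({
    "chromosome": ["chr1", "chr1", "chr2"],
    "position": [100, 100, 100],
    "reference": ["A", "G", "A"],
    "sample1": [0.0, 1.0, 2.0],
})


def has_unique_snps(frame, key=("chromosome", "position")):
    """Return True if no two rows share the same values in the key columns."""
    return not frame.duplicated(subset=list(key)).any()


# Old rule: the first two rows differ in reference, so the frame passes.
old_rule = has_unique_snps(genotype, ("chromosome", "position", "reference"))
# New rule: the first two rows share (chr1, 100), so the frame is rejected.
new_rule = has_unique_snps(genotype)
```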
|
Allow generation of block diagonal keys, and extend the tests to cover
different numbers of blocks.
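
As a rough illustration of what a block diagonal key can look like, the
sketch below assumes that each block is a random orthogonal matrix. That
assumption, and the names random_orthogonal and block_diagonal_key, are
not taken from the commit; the actual key generation in pyhegp may
differ.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix from the QR factorisation of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Flip column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))


def block_diagonal_key(block_sizes, rng):
    """Place one independent orthogonal block per entry of block_sizes on the diagonal."""
    total = sum(block_sizes)
    key = np.zeros((total, total))
    offset = 0
    for size in block_sizes:
        key[offset:offset + size, offset:offset + size] = random_orthogonal(size, rng)
        offset += size
    return key


rng = np.random.default_rng(0)
key = block_diagonal_key([2, 3, 1], rng)
```

Because every block is orthogonal, the whole key is orthogonal as well,
and entries outside the blocks stay exactly zero.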
|
pyhegp crashed if the optional reference column was absent. We now
handle it correctly, and add several test cases to catch this in the
future.
|
Phenotype frames are split by sample IDs. This corresponds to
splitting along the index, unlike genotype frames, which need to be
split along the columns.
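
The two kinds of split can be sketched with pandas as follows. The
frames, the "sample-id" column name and the sample groups are made up
for illustration and are not the layout pyhegp necessarily uses.

```python
import pandas as pd

# Hypothetical phenotype frame: one row per sample.
phenotype = pd.DataFrame({
    "sample-id": ["s1", "s2", "s3", "s4"],
    "height": [1.0, 2.0, 3.0, 4.0],
})
# Hypothetical genotype frame: one column per sample, plus metadata columns.
genotype = pd.DataFrame({
    "chromosome": ["chr1", "chr1"],
    "position": [1, 2],
    "s1": [0.0, 1.0], "s2": [1.0, 1.0], "s3": [2.0, 0.0], "s4": [0.0, 2.0],
})

sample_groups = [["s1", "s2"], ["s3", "s4"]]
metadata = ["chromosome", "position"]

# Phenotype frames: select rows, that is, split along the index.
phenotype_parts = [phenotype[phenotype["sample-id"].isin(group)]
                   for group in sample_groups]
# Genotype frames: select sample columns, keeping the metadata columns.
genotype_parts = [genotype[metadata + group] for group in sample_groups]
```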
|
split_data_frame should only split the data frame. It should not
filter out metadata columns.
|
Earlier, we were generating unique SNPs in genotype frames by dropping
duplicates. This meant we couldn't control the number of SNPs.
Rejection sampling is also not an option because it is too expensive.
So, we now generate unique SNPs directly, by first generating a list
with unique elements and then converting to a data frame.
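
One way to express this with hypothesis is sketched below: a list
strategy with unique=True fixes the number of SNPs up front, and the
list is then mapped to a data frame. The strategy names snp and
snp_frames, the chromosome names and the position range are
hypothetical and do not come from the pyhegp test suite.

```python
import pandas as pd
from hypothesis import given, strategies as st

# A single SNP, drawn as a (chromosome, position) pair.
snp = st.tuples(st.sampled_from(["chr1", "chr2", "chrX"]),
                st.integers(min_value=1, max_value=10**6))


def snp_frames(number_of_snps):
    """Strategy for frames with exactly number_of_snps distinct SNPs."""
    return (st.lists(snp, min_size=number_of_snps, max_size=number_of_snps,
                     unique=True)
            .map(lambda rows: pd.DataFrame(rows,
                                           columns=["chromosome", "position"])))


@given(snp_frames(5))
def check_frame(frame):
    # The size is controlled directly, and no duplicates need to be dropped.
    assert len(frame) == 5
    assert not frame.duplicated().any()
```

Calling check_frame() runs the property over many generated frames.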
|
Abstract out generation of genotype frame metadata (namely chromosome,
position and reference) from summaries and genotype_frames into a
new helper function genotype_metadata.
|
Add keys strategy, and use it.
|
It is so much simpler and much more robust to simply compare expected
and actual data frames.
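
A small sketch of this style of test, using
pandas.testing.assert_frame_equal. The function under test,
rename_samples, is a hypothetical stand-in and not part of pyhegp.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def rename_samples(frame, mapping):
    """Hypothetical function under test: rename sample columns."""
    return frame.rename(columns=mapping)


actual = rename_samples(
    pd.DataFrame({"position": [1, 2], "a": [0.0, 1.0]}),
    {"a": "sample1"})
expected = pd.DataFrame({"position": [1, 2], "sample1": [0.0, 1.0]})

# One comparison checks column names, order, dtypes and values together.
assert_frame_equal(actual, expected)
```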
|
A cat-phenotype subcommand is coming. Hence rename this.
|
Promote phenotype_reserved_column_name_p from helpers.strategies to
is_phenotype_metadata_column in pyhegp.serialization.
|
pd.concat duplicates the metadata columns, and is generally the wrong
approach to the problem.
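
The duplication is easy to reproduce. In the sketch below, the frames
and column names are made up, and merging on the metadata columns is
shown only as one possible alternative; the commit does not say which
approach replaced pd.concat.

```python
import pandas as pd

a = pd.DataFrame({"chromosome": ["chr1", "chr1"], "position": [1, 2],
                  "s1": [0.0, 1.0]})
b = pd.DataFrame({"chromosome": ["chr1", "chr1"], "position": [1, 2],
                  "s2": [2.0, 0.0]})

# Column-wise concatenation repeats the metadata columns of every frame.
concatenated = pd.concat([a, b], axis=1)

# Assumed alternative: join on the metadata columns so they appear once.
merged = a.merge(b, on=["chromosome", "position"])
```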
|
Test cat_genotype extensively using hypothesis.
|
Promote genotype_reserved_column_name_p from helpers.strategies to
is_genotype_metadata_column in pyhegp.serialization, and use it
everywhere.
|
These strategies may be used by other test modules as well.
|
We distinguish CLI subcommand functions using the _command suffix.
This way, we don't have to concoct weird names for the actual
workhorse functions.
To remain consistent, we also add the _command suffix to the command
testing functions.
|
Make output ciphertext file path implicit; infer it by appending
".hegp" to the plaintext file path. We take inspiration from GnuPG.
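
The inference amounts to a single path operation. The helper name
ciphertext_path below is hypothetical; it only illustrates the rule
described above.

```python
from pathlib import Path


def ciphertext_path(plaintext_path):
    """Append ".hegp" to the file name, keeping the original extension."""
    path = Path(plaintext_path)
    return path.with_name(path.name + ".hegp")


output = ciphertext_path("data/genotype.tsv")
```

Here output is data/genotype.tsv.hegp, in the same way that GnuPG
appends ".gpg" to the name of the file it encrypts.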
|
We were testing for zero exit status. Now, in addition, we test for
the existence of output files. This is slightly more robust.
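
A sketch of such a test with click's CliRunner follows. The command
encrypt_command here is a toy stand-in that merely copies its input; it
is not pyhegp's encrypt subcommand, and the file names are made up.

```python
from pathlib import Path

import click
from click.testing import CliRunner


@click.command()
@click.argument("plaintext", type=click.Path(exists=True))
def encrypt_command(plaintext):
    """Toy stand-in: write PLAINTEXT.hegp next to the input file."""
    Path(plaintext + ".hegp").write_text(Path(plaintext).read_text())


def check_encrypt_command():
    runner = CliRunner()
    # Work in a temporary directory so the test leaves no files behind.
    with runner.isolated_filesystem():
        Path("genotype.tsv").write_text("chromosome\tposition\n")
        result = runner.invoke(encrypt_command, ["genotype.tsv"])
        # Check the exit status and that the output file was written.
        return result.exit_code, Path("genotype.tsv.hegp").exists()
```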
|
* pyhegp/pyhegp.py: Import reduce from functools.
(pool_summaries, encrypt_genotype): New functions.
(pool): Use pool_summaries.
(encrypt): Use encrypt_genotype.
* tests/test_pyhegp.py: Import pandas; Summary, read_summary and
read_genotype from pyhegp.serialization.
(test_pool, test_encrypt): New tests.
* test-data/encrypt-test-encrypted-genotype.tsv,
test-data/encrypt-test-genotype.tsv, test-data/encrypt-test-key,
test-data/encrypt-test-summary, test-data/pool-test-complete-summary,
test-data/pool-test-summary1, test-data/pool-test-summary2: New files.
|
* doc/file-formats.md (File formats)[key file]: New section.
* pyhegp/serialization.py: Import numpy.
(read_key, write_key): New functions.
* pyhegp/pyhegp.py: Import write_key from pyhegp.serialization.
(encrypt): Use write_key.
* tests/test_serialization.py: Import arrays and array_shapes from
hypothesis.extra.numpy; approx from pytest; read_key and write_key
from pyhegp.serialization.
(test_read_write_key_are_inverses): New test.
|
* pyhegp/pyhegp.py (genotype_summary): New function.
(summary): Use genotype_summary.
(encrypt): Compute summary if not provided.
* tests/test_pyhegp.py (test_simple_workflow): Remove xfail mark.
|
* README.md (How to use): Indent down into "Joint/federated analysis
with many data owners" section.
[Simple data sharing]: New section.
* doc/generate-images.sh: Add simple workflow.
* doc/workflow.png: Rename to doc/joint-workflow.png.
* doc/workflow.uml: Rename to doc/joint-workflow.uml.
* doc/simple-workflow.png, doc/simple-workflow.uml: New files.
* tests/test_pyhegp.py: Import pytest.
(test_simple_workflow): New test.
* test-data/genotype.tsv: New file.
|
* tests/test_pyhegp.py: Import CliRunner from click.testing, and main
from pyhegp.pyhegp.
(test_joint_workflow): New test.
* test-data/genotype0.tsv, test-data/genotype1.tsv,
test-data/genotype2.tsv, test-data/genotype3.tsv: New files.
|
* pyhegp/pyhegp.py: Import pandas.
(summary, pool, encrypt, cat): Use pandas data frames and new data
format.
* pyhegp/serialization.py: Import csv and pandas.
(Summary)[mean, std]: Delete fields.
[data]: New field.
(read_summary, write_summary, read_genotype, write_genotype): Use
pandas data frames and new data format.
* tests/test_serialization.py: Import column, columns and data_frames
from hypothesis.extra.pandas; pandas; negate from pyhegp.utils. Do not
import hypothesis.extra.numpy and approx from pytest.
(tabless_printable_ascii_text, chromosome_column, position_column,
reference_column, sample_names): New variables.
(summaries, genotype_reserved_column_name_p, genotype_frames): New
functions.
(test_read_write_summary_are_inverses): Use pandas data frames and new
data format.
(test_read_write_genotype_are_inverses): Use pandas for testing.
* doc/file-formats.md (File formats)[summary file]: Describe new
standard.
[genotype file]: New section.
* .guix/pyhegp-package.scm (pyhegp-package): Import python-pandas
from (gnu packages python-science).
(python-pyhegp)[propagated-inputs]: Add python-pandas.
* pyproject.toml (dependencies): Add pandas.
|
* tests/test_pyhegp.py (negate): Move to pyhegp.utils.
Import negate from pyhegp.utils.
* pyhegp/utils.py: New file.
|
* tests/test_pyhegp.py (test_pool_stats): Set relative tolerance to
1e-6.
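
For reference, a relative tolerance of 1e-6 with pytest's approx
behaves as sketched below. The arrays are made up and unrelated to the
actual test_pool_stats data.

```python
import numpy as np
from pytest import approx

expected = np.array([1.0, 2.0, 3.0])
# Within tolerance: each element is off by a relative error of 5e-7.
close = expected * (1 + 5e-7)
# Outside tolerance: each element is off by a relative error of 1e-5.
far = expected * (1 + 1e-5)

close_matches = close == approx(expected, rel=1e-6)
far_matches = far == approx(expected, rel=1e-6)
```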
|