pyhegp - Homomorphic encryption of genotypes and phenotypes

Age	Commit message	Author
5 days	Require that (chromosome, position) in genotype frames is unique. HEAD main	Arun Isaac
	Earlier, we required that (chromosome, position, reference) be unique. We now tighten this restriction, requiring that (chromosome, position) be unique. Therefore, there can be only one reference allele at any given chromosome and position.
5 days	Restrict reference to single character.	Arun Isaac

5 days	Do not drop zero standard deviation SNPs with --only-center.	Arun Isaac

5 days	Remove unnecessary is_phenotype_metadata_column import.	Arun Isaac

5 days	Move SNP deletion out of encrypt_genotype function.	Arun Isaac

2026-01-16	Allow generation of block diagonal keys.	Arun Isaac
	Allow generation of block diagonal keys, and extend tests to cover different numbers of blocks.
2026-01-16	Add BlockDiagonalMatrix.	Arun Isaac

2026-01-16	Remove unnecessary imports from test_serialization.py.	Arun Isaac

2026-01-16	Add --only-center option.	Arun Isaac

2026-01-16	Separate centering from normalization.	Arun Isaac

2026-01-16	Add intercept column in phenotypes file.	Arun Isaac

2025-11-28	Handle absent optional reference column.	Arun Isaac
	pyhegp was crashing if the optional reference column was absent. We now handle it correctly, and add several test cases to catch this in the future.
2025-09-06	Catenate phenotype frames along the index.	Arun Isaac

2025-09-06	Split catenable phenotype frames along the index.	Arun Isaac
	Phenotype frames are split by sample IDs. This corresponds to splitting along the index, unlike genotype frames which need to be split along the columns.
2025-09-06	Generalize split_data_frame to split along any axis.	Arun Isaac

2025-09-06	Simplify split_data_frame so it is more composable.	Arun Isaac
	split_data_frame should only split the data frame. It should not be filtering out metadata columns.
2025-09-05	Generate unique SNPs in genotype frames without dropping duplicates.	Arun Isaac
	Earlier, we were generating unique SNPs in genotype frames by dropping duplicates. This meant we couldn't control the number of SNPs. Rejection sampling is also not an option because it is too expensive. So, we now generate unique SNPs directly, by first generating a list with unique elements and then converting to a data frame.
2025-09-05	Deduplicate genotype frame metadata generation.	Arun Isaac
	Abstract out generation of genotype frame metadata (namely chromosome, position and reference) from summaries and genotype_frames into a new helper function genotype_metadata.
2025-09-05	Drop SNPs with a zero standard deviation.	Arun Isaac

2025-09-04	Avoid wildcard import from helpers.strategies.	Arun Isaac

2025-09-04	Limit values in genotype and phenotype strategies.	Arun Isaac

2025-09-04	Test that ciphertext does not contain NA values.	Arun Isaac

2025-09-04	Parameterize number of samples in phenotype frame strategy.	Arun Isaac

2025-09-04	Parameterize number of samples in genotype frame strategy.	Arun Isaac

2025-09-04	Parameterize presence of reference column in genotype frame strategy.	Arun Isaac

2025-09-04	Add keys strategy.	Arun Isaac
	Add keys strategy, and use it.
2025-09-04	Compare complete frame in test_cat_*.	Arun Isaac
	It is so much simpler and much more robust to simply compare expected and actual data frames.
2025-09-04	Do not import unused settings from hypothesis.	Arun Isaac

2025-09-04	Test cat_phenotype.	Arun Isaac

2025-09-02	Rename cat subcommand to cat-genotype.	Arun Isaac
	A cat-phenotype subcommand is coming. Hence rename this.
2025-09-02	Add is_phenotype_metadata_column.	Arun Isaac
	Promote phenotype_reserved_column_name_p from helpers.strategies to is_phenotype_metadata_column in pyhegp.serialization.
2025-09-02	Drop duplicates in generated test phenotype frames.	Arun Isaac

2025-09-02	Merge, not concat, genotype frames.	Arun Isaac
	pd.concat duplicates the metadata columns, and is generally the wrong approach to the problem.
2025-09-02	Test cat_genotype.	Arun Isaac
	Test cat_genotype extensively using hypothesis.
2025-09-02	Add is_genotype_metadata_column.	Arun Isaac
	Promote genotype_reserved_column_name_p from helpers.strategies to is_genotype_metadata_column in pyhegp.serialization, and use it everywhere.
2025-09-02	Drop duplicates in generated test genotype frames.	Arun Isaac

2025-09-02	Move hypothesis strategies to separate file.	Arun Isaac
	These strategies may be used by other test modules as well.
2025-09-02	Suffix CLI subcommand functions with _command.	Arun Isaac
	We distinguish CLI subcommand functions using the _command suffix. This way, we don't have to concoct weird names for the actual workhorse functions. To remain consistent, we also suffix _command to the command testing functions.
2025-09-01	Do not require output ciphertext file path.	Arun Isaac
	Make output ciphertext file path implicit; infer it by appending ".hegp" to the plaintext file path. We take inspiration from GnuPG.
2025-09-01	Use open method of Path object, rather than the open function.	Arun Isaac

2025-09-01	Test for existence of output files.	Arun Isaac
	We were testing for zero exit status. Now, in addition, we test for the existence of output files. This is slightly more robust.
2025-09-01	Add phenotype file format and serialization functions.	Arun Isaac

2025-08-06	Subset to common SNPs.	Arun Isaac
	* pyhegp/pyhegp.py: Import reduce from functools. (pool_summaries, encrypt_genotype): New functions. (pool): Use pool_summaries. (encrypt): Use encrypt_genotype.
	* tests/test_pyhegp.py: Import pandas; Summary, read_summary and read_genotype from pyhegp.serialization. (test_pool, test_encrypt): New tests.
	* test-data/encrypt-test-encrypted-genotype.tsv, test-data/encrypt-test-genotype.tsv, test-data/encrypt-test-key, test-data/encrypt-test-summary, test-data/pool-test-complete-summary, test-data/pool-test-summary1, test-data/pool-test-summary2: New files.
2025-08-06	Standardize key files.	Arun Isaac
	* doc/file-formats.md (File formats)[key file]: New section.
	* pyhegp/serialization.py: Import numpy. (read_key, write_key): New functions.
	* pyhegp/pyhegp.py: Import write_key from pyhegp.serialization. (encrypt): Use write_key.
	* tests/test_serialization.py: Import arrays and array_shapes from hypothesis.extra.numpy; approx from pytest; read_key and write_key from pyhegp.serialization. (test_read_write_key_are_inverses): New test.
2025-08-06	Compute summary on encryption if not provided.	Arun Isaac
	* pyhegp/pyhegp.py (genotype_summary): New function. (summary): Use genotype_summary. (encrypt): Compute summary if not provided.
	* tests/test_pyhegp.py (test_simple_workflow): Remove xfail mark.
2025-08-06	Add simple workflow.	Arun Isaac
	* README.md (How to use): Indent down into "Joint/federated analysis with many data owners" section. [Simple data sharing]: New section.
	* doc/generate-images.sh: Add simple workflow.
	* doc/workflow.png: Rename to doc/joint-workflow.png.
	* doc/workflow.uml: Rename to doc/joint-workflow.uml.
	* doc/simple-workflow.png, doc/simple-workflow.uml: New files.
	* tests/test_pyhegp.py: Import pytest. (test_simple_workflow): New test.
	* test-data/genotype.tsv: New file.
2025-08-06	Test joint workflow CLI.	Arun Isaac
	* tests/test_pyhegp.py: Import CliRunner from click.testing, and main from pyhegp.pyhegp. (test_joint_workflow): New test.
	* test-data/genotype0.tsv, test-data/genotype1.tsv, test-data/genotype2.tsv, test-data/genotype3.tsv: New files.
2025-08-06	Standardize file formats in the likeness of plink files.	Arun Isaac
	* pyhegp/pyhegp.py: Import pandas. (summary, pool, encrypt, cat): Use pandas data frames and new data format.
	* pyhegp/serialization.py: Import csv and pandas. (Summary)[mean, std]: Delete fields. [data]: New field. (read_summary, write_summary, read_genotype, write_genotype): Use pandas data frames and new data format.
	* tests/test_serialization.py: Import column, columns and data_frames from hypothesis.extra.pandas; pandas; negate from pyhegp.utils. Do not import hypothesis.extra.numpy and approx from pytest. (tabless_printable_ascii_text, chromosome_column, position_column, reference_column, sample_names): New variables. (summaries, genotype_reserved_column_name_p, genotype_frames): New functions. (test_read_write_summary_are_inverses): Use pandas data frames and new data format. (test_read_write_genotype_are_inverses): Use pandas for testing.
	* doc/file-formats.md (File formats)[summary file]: Describe new standard. [genotype file]: New section.
	* .guix/pyhegp-package.scm (pyhegp-package): Import python-pandas from (gnu packages python-science). (python-pyhegp)[propagated-inputs]: Add python-pandas.
	* pyproject.toml (dependencies): Add pandas.
2025-08-06	Move negate to pyhegp.utils.	Arun Isaac
	* tests/test_pyhegp.py (negate): Move to pyhegp.utils. Import negate from pyhegp.utils.
	* pyhegp/utils.py: New file.
2025-08-06	Loosen relative tolerance in test_pool_stats.	Arun Isaac
	* tests/test_pyhegp.py (test_pool_stats): Set relative tolerance to 1e-6.