Standardize file formats in the likeness of plink files.

* pyhegp/pyhegp.py: Import pandas.
(summary, pool, encrypt, cat): Use pandas data frames and new data format.
* pyhegp/serialization.py: Import csv and pandas.
(Summary)[mean, std]: Delete fields.
[data]: New field.
(read_summary, write_summary, read_genotype, write_genotype): Use pandas data frames and new data format.
* tests/test_serialization.py: Import column, columns and data_frames from hypothesis.extra.pandas; pandas; negate from pyhegp.utils. Do not import hypothesis.extra.numpy and approx from pytest.
(tabless_printable_ascii_text, chromosome_column, position_column, reference_column, sample_names): New variables.
(summaries, genotype_reserved_column_name_p, genotype_frames): New functions.
(test_read_write_summary_are_inverses): Use pandas data frames and new data format.
(test_read_write_genotype_are_inverses): Use pandas for testing.
* doc/file-formats.md (File formats)[summary file]: Describe new standard.
[genotype file]: New section.
* .guix/pyhegp-package.scm (pyhegp-package): Import python-pandas from (gnu packages python-science).
(python-pyhegp)[propagated-inputs]: Add python-pandas.
* pyproject.toml (dependencies): Add pandas.
author: Arun Isaac 2025-08-04 12:52:39 +0100
committer: Arun Isaac 2025-08-06 22:40:41 +0100
commit: bcdb235949c06db07172b0c6355a0059436b86fb (patch)
tree: ef2a4ab4ea6d3da60a894b50eb7a2470021852f1 /doc/file-formats.md
parent: 2dc2efa7f77deb5ebcf7b80abefc162474614b2c (diff)
download: pyhegp-bcdb235949c06db07172b0c6355a0059436b86fb.tar.gz
pyhegp-bcdb235949c06db07172b0c6355a0059436b86fb.tar.lz
pyhegp-bcdb235949c06db07172b0c6355a0059436b86fb.zip
1 files changed, 12 insertions, 1 deletions
diff --git a/doc/file-formats.md b/doc/file-formats.md
index 4d3bfcd..be8162f 100644
--- a/doc/file-formats.md
+++ b/doc/file-formats.md
@@ -5,7 +5,18 @@ The summary file is ASCII encoded. It consists of two sections—the header and
 
 The first line of the header section MUST be `# pyhegp summary file version 1`. Subsequent lines of the header section are a list of key-value pairs. Each line MUST be `#`, optional whitespace, the key, a single space character and then the value. The key MUST NOT contain whitespace or control characters, and MUST NOT begin with a `#` character. The value MAY contain whitespace characters, but MUST NOT contain control characters.
 
-The data section is a tab-separated table of numbers. The first line of the data section is a vector of means—one for each SNP. The second line is a vector of standard deviations—one for each SNP.
+The data section is a tab-separated table of numbers. The first line MUST be a header with column labels. Each row corresponds to one SNP. The columns labelled `chromosome`, `position`, `reference`, `mean` and `standard-deviation` contain the chromosome, the position of the SNP on the chromosome, the reference allele, the mean dosage and the standard deviation of the dosage for that SNP. Column labels are case-sensitive.
+
+The `reference` column is optional, and SHOULD be absent in pooled summary files.
 
 Here is an example summary file.
 `TODO: Add example.`
+
+## genotype file
+
+The genotype file is a tab-separated values (TSV) file. The first line MUST be a header with column labels. Each row corresponds to one SNP. The columns labelled `chromosome`, `position` and `reference` contain the chromosome, the position on the chromosome and the reference allele for that SNP. Other columns each contain dosage values for one sample. The headers of these columns MUST be their sample identifiers. Column headers are case-sensitive.
+
+The `reference` column is optional, and SHOULD be absent in encrypted genotype files.
+
+Here is an example genotype file.
+`TODO: Add example.`
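
The two formats described in the diff above can be read with pandas roughly as follows. This is a minimal sketch, not pyhegp's actual `read_summary` or `read_genotype`: the function names `read_summary_sketch` and `read_genotype_sketch`, the header key `n`, the sample identifiers and all data values are hypothetical, and treating `chromosome` and `position` as required columns is an assumption drawn from the fact that only `reference` is called optional.

```python
import io

import pandas as pd

VERSION_LINE = "# pyhegp summary file version 1"
# Reserved genotype column labels; every other column is a sample.
RESERVED_COLUMNS = ("chromosome", "position", "reference")

# Made-up example files (hypothetical key "n", samples and values).
SUMMARY_TSV = (
    "# pyhegp summary file version 1\n"
    "# n 2\n"
    "chromosome\tposition\treference\tmean\tstandard-deviation\n"
    "chr1\t1000\tA\t0.5\t0.5\n"
    "chr1\t2000\tG\t1.5\t0.5\n"
)
GENOTYPE_TSV = (
    "chromosome\tposition\treference\tsample1\tsample2\n"
    "chr1\t1000\tA\t0\t1\n"
    "chr1\t2000\tG\t2\t1\n"
)


def read_summary_sketch(file):
    """Return (headers, data) parsed from a text-mode summary file."""
    lines = file.read().splitlines()
    if not lines or lines[0] != VERSION_LINE:
        raise ValueError("first line must be the version line")
    headers = {}
    i = 1
    # Header lines: "#", optional whitespace, key, one space, value.
    while i < len(lines) and lines[i].startswith("#"):
        key, value = lines[i][1:].lstrip().split(" ", 1)
        headers[key] = value
        i += 1
    # The remaining lines form the tab-separated data section.
    data = pd.read_csv(io.StringIO("\n".join(lines[i:])), sep="\t",
                       dtype={"chromosome": str})
    return headers, data


def read_genotype_sketch(file):
    """Return (frame, sample_columns) parsed from a genotype TSV file."""
    frame = pd.read_csv(file, sep="\t", dtype={"chromosome": str})
    # Assumption: chromosome and position are mandatory.
    for label in ("chromosome", "position"):
        if label not in frame.columns:
            raise ValueError(f"missing required column: {label}")
    # Column labels are case-sensitive, so compare them exactly.
    samples = [c for c in frame.columns if c not in RESERVED_COLUMNS]
    return frame, samples


headers, summary = read_summary_sketch(io.StringIO(SUMMARY_TSV))
genotype, samples = read_genotype_sketch(io.StringIO(GENOTYPE_TSV))
```

Reading `chromosome` as a string keeps labels such as `1` and `chr1` from being coerced to different types across files.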