author: Arun Isaac, 2025-08-04 12:52:39 +0100
committer: Arun Isaac, 2025-08-06 22:40:41 +0100
commit: bcdb235949c06db07172b0c6355a0059436b86fb
tree: ef2a4ab4ea6d3da60a894b50eb7a2470021852f1 /doc/file-formats.md
parent: 2dc2efa7f77deb5ebcf7b80abefc162474614b2c
Standardize file formats in the likeness of plink files.
* pyhegp/pyhegp.py: Import pandas.
(summary, pool, encrypt, cat): Use pandas data frames and new data
format.
* pyhegp/serialization.py: Import csv and pandas.
(Summary)[mean, std]: Delete fields.
[data]: New field.
(read_summary, write_summary, read_genotype, write_genotype): Use
pandas data frames and new data format.
* tests/test_serialization.py: Import column, columns and data_frames
from hypothesis.extra.pandas; pandas; negate from pyhegp.utils. Do not
import hypothesis.extra.numpy and approx from pytest.
(tabless_printable_ascii_text, chromosome_column, position_column,
reference_column, sample_names): New variables.
(summaries, genotype_reserved_column_name_p, genotype_frames): New
functions.
(test_read_write_summary_are_inverses): Use pandas data frames and new
data format.
(test_read_write_genotype_are_inverses): Use pandas for testing.
* doc/file-formats.md (File formats)[summary file]: Describe new
standard.
[genotype file]: New section.
* .guix/pyhegp-package.scm (pyhegp-package): Import python-pandas
from (gnu packages python-science).
(python-pyhegp)[propagated-inputs]: Add python-pandas.
* pyproject.toml (dependencies): Add pandas.
Diffstat (limited to 'doc/file-formats.md')
-rw-r--r-- doc/file-formats.md | 13
1 file changed, 12 insertions, 1 deletion
diff --git a/doc/file-formats.md b/doc/file-formats.md
index 4d3bfcd..be8162f 100644
--- a/doc/file-formats.md
+++ b/doc/file-formats.md
@@ -5,7 +5,18 @@ The summary file is ASCII encoded. It consists of two sections—the header and
 
 The first line of the header section MUST be `# pyhegp summary file version 1`. Subsequent lines of the header section are a list of key-value pairs. Each line MUST be `#`, optional whitespace, the key, a single space character and then the value. The key MUST NOT contain whitespace or control characters, and MUST NOT begin with a `#` character. The value MAY contain whitespace characters, but MUST NOT contain control characters.
 
-The data section is a tab-separated table of numbers. The first line of the data section is a vector of means—one for each SNP. The second line is a vector of standard deviations—one for each SNP.
+The data section is a tab-separated table of numbers. The first line MUST be a header with column labels. Each row corresponds to one SNP. The columns labelled `chromosome`, `position`, `reference`, `mean` and `standard-deviation` contain the chromosome, the position of the SNP on the chromosome, the reference allele, the mean dosage and the standard deviation of the dosage for that SNP. Column labels are case-sensitive.
+
+The `reference` column is optional, and SHOULD be absent in pooled summary files.
 
 Here is an example summary file.
 `TODO: Add example.`
+
+## genotype file
+
+The genotype file is a tab-separated values (TSV) file. The first line MUST be a header with column labels. Each row corresponds to one SNP. The columns labelled `chromosome`, `position` and `reference` contain the chromosome, the position on the chromosome and the reference allele for that SNP. Other columns each contain dosage values for one sample. The headers of these columns MUST be their sample identifiers. Column headers are case-sensitive.
+
+The `reference` column is optional, and SHOULD be absent in encrypted genotype files.
+
+Here is an example genotype file.
+`TODO: Add example.`
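
Both examples in the documentation are still marked TODO, so here is a minimal sketch of how files in the two formats described above might be read. It uses only the Python standard library rather than the pandas-based readers this commit adds to pyhegp/serialization.py. The function names `parse_summary` and `parse_genotype`, the header key `example-key`, the sample names and all data values are hypothetical. The sketch also assumes that the header section of a summary file is the leading run of lines beginning with `#`, which the excerpt of the specification shown in the diff does not state outright.

```python
import csv
import io

VERSION_LINE = "# pyhegp summary file version 1"
RESERVED = {"chromosome", "position", "reference"}


def parse_summary(text):
    """Split summary-file text into (header dict, list of row dicts)."""
    lines = text.splitlines()
    if not lines or lines[0] != VERSION_LINE:
        raise ValueError("first line must be the version line")
    header = {}
    i = 1
    # Assumption: the header is the leading run of lines starting with '#'.
    while i < len(lines) and lines[i].startswith("#"):
        # '#', optional whitespace, key, a single space, then the value.
        key, _, value = lines[i][1:].lstrip().partition(" ")
        header[key] = value
        i += 1
    rows = []
    for record in csv.DictReader(lines[i:], delimiter="\t"):
        row = {
            "chromosome": record["chromosome"],
            "position": int(record["position"]),
            "mean": float(record["mean"]),
            "standard-deviation": float(record["standard-deviation"]),
        }
        # The reference column is optional.
        if "reference" in record:
            row["reference"] = record["reference"]
        rows.append(row)
    return header, rows


def parse_genotype(text):
    """Return (sample identifiers, list of row dicts) from genotype-file text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Every column that is not reserved holds the dosages of one sample.
    samples = [name for name in reader.fieldnames if name not in RESERVED]
    rows = []
    for record in reader:
        rows.append({
            "chromosome": record["chromosome"],
            "position": int(record["position"]),
            "reference": record.get("reference"),  # None when the column is absent
            "dosages": {name: float(record[name]) for name in samples},
        })
    return samples, rows


# Hypothetical pooled summary file (no reference column).
SUMMARY_EXAMPLE = (
    "# pyhegp summary file version 1\n"
    "# example-key some value\n"
    "chromosome\tposition\tmean\tstandard-deviation\n"
    "chr1\t1000\t0.5\t0.25\n"
    "chr1\t2000\t1.0\t0.5\n"
)

# Hypothetical plaintext genotype file with two samples.
GENOTYPE_EXAMPLE = (
    "chromosome\tposition\treference\tsample1\tsample2\n"
    "chr1\t1000\tA\t0\t1\n"
    "chr1\t2000\tG\t2\t1\n"
)

header, summary_rows = parse_summary(SUMMARY_EXAMPLE)
samples, genotype_rows = parse_genotype(GENOTYPE_EXAMPLE)
```

Column labels are matched exactly, in line with the case-sensitivity rule in both format descriptions. The sketch does not enforce the remaining MUST constraints on header keys and values.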