bottle_john·2022년 4월 27일


목록 보기

🔍 Copy Number Variation

  • Germline copy number variants (CNVs)
  • somatic copy number alterations (SCNAs)

🖇 PipeLine / Workflows


=> automation of following processes

우선 input 으로 BAM 파일이 필요

< When matched normal sample, reference are available >

# From baits, and tumor / normal BAMs batch *Tumor.bam --normal *Normal.bam \
    --targets my_baits.bed --annotate refFlat.txt \
    --fasta hg19.fasta --access data/access-5kb-mappable.hg19.bed \
    --output-reference my_reference.cnn --output-dir results/ \
    --diagram --scatter
# Reusing a reference for additional samples batch *Tumor.bam -r Reference.cnn -d results/

# Reusing targets and antitargets to build a new reference, but no analysis batch -n *Normal.bam --output-reference new_reference.cnn \
    -t my_targets.bed -a my_antitargets.bed \
    -f hg19.fasta -g data/access-5kb-mappable.hg19.bed

추가적인 옵션은 이 두개이며 결과를 그래프로 시각화 해준다. 만일 원하지 않으면 따로 써주지 않아도 된다.

  • -diagram : diagram을 그려주는 옵션
  • -scatter : scatter plot을 그려주는 옵션
    => equivalent to independent processes access hg19.fa -o access.hg19.bed autobin *.bam -t baits.bed -g access.hg19.bed [--annotate refFlat.txt --short-names]

# For each sample... coverage Sample.bam -o Sample.targetcoverage.cnn coverage Sample.bam baits.antitarget.bed -o Sample.antitargetcoverage.cnn

# With all normal samples... reference *Normal.{,anti}targetcoverage.cnn --fasta hg19.fa -o my_reference.cnn

# For each tumor sample... fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn my_reference.cnn -o Sample.cnr segment Sample.cnr -o Sample.cns

# Optionally, with --scatter and --diagram scatter Sample.cnr -s Sample.cns -o Sample-scatter.pdf diagram Sample.cnr -s Sample.cns -o Sample-diagram.pdf

< When matched normal is not available > batch *Tumor.bam -n -t my_baits.bed -f hg19.fasta \
    --split --access data/access-5kb-mappable.hg19.bed \
    --output-reference my_flat_reference.cnn -d example2/
  • -t option; target, my_baits.bed 는 준비된 파일이어야하는지?
  • --split --access ; 'calculate the sequence-accessible coordinates' , 이건 cnvkit 폴더안에 있다, 근데 나는 hg38 이용예정, 이걸 써도 될지?
  • output-reference ; 생성된 FlatReference
  • -d ; output directory


Prepare a BED file of baited regions for use with CNVkit. target my_baits.bed --annotate refFlat.txt --split -o my_targets.bed
BED file -> baited genomic regions ~ maybe unequal size, --split option would divide to the size close to specified by --average-size

  • my baits.bed
  • refFlat.txt ; if BED file doesn't have short, informative names for each bait interval
  • -mytargets.bed


Calculate the sequence-accessible coordinates in chromosomes from the given reference genome, output as a BED file.

If your reference genome is the UCSC human genome hg19, a BED file of the sequencing-accessible regions is included in the CNVkit distribution as data/access-5kb-mappable.hg19.bed. If you’re not using hg19, consider building the “access” file yourself from your reference genome sequence (say, mm10.fasta) using the access command: access mm10.fasta -s 10000 -o access-10kb.mm10.bed
=> use this file in the next step to ensure off-target bins (“antitargets”) are allocated only in chromosomal regions that can be mapped.

if there are ohter known unmappable, variable, poorly sequenced regions that should be excluded access hg19.fa -x excludes.bed -o access-excludes.hg19.bed

  • hg38 과 같이 주어져있는 Referece file 이 있을때, 이 ref으로부터 interval list를 포함한 genomic coordinates 가 기록된 bed 파일이 access 명령어를 통해 만들어짐
~/GATK_best_practice/ access  /BiO/Share/Tools/gatk-bundle/hg38/Homo_sapiens_assembly38.fasta -o access.hg38.bed


Given "target" BED , derive BED file off-target/"antitarget" regions antitarget my_targets.bed -g data/access-5kb-mappable.hg19.bed -o my_antitargets.bed


quicly estimate read counts/ depth in BAM => estimate resonable on/off target bin size autobin *.bam -t my_targets.bed -g access.hg19.bed autobin *.bam -m amplicon -t my_targets.bed autobin *.bam -m wgs -b 50000 -g access.hg19.bed --annotate refFlat.txt

The BAM index (.bai) is used to quickly determine the total number of reads present in a file, and random sampling of targeted regions (-t) is used to estimate average on-target read depth much faster than the coverage command.


calculate coverage in given regions from read depth
pileup ~ - ountnone option coverage Sample.bam targets.bed -o Sample.targetcoverage.cnn coverage Sample.bam antitargets.bed -o Sample.antitargetcoverage.cnn


compile CN reference from given files, given reference genome ( -f) calculate GC content, repeat-masked proportion
( A reference should be constructed specifically for each target capture panel, using a BED file listing the genomic coordinates of the baited regions. Ideally, the control or “normal” samples used to build the reference should match the type of sample (e.g. FFPE-extracted or fresh DNA) and library preparation protocol or kit used for the test (e.g. tumor) samples.) reference *coverage.cnn -f ucsc.hg19.fa -o Reference.cnn

BUT 나는 paired/pooled normal 이 없다! => "Flat" reference of neutral copy number reference -o FlatReference.cnn -f ucsc.hg19.fa -t targets.bed -a antitargets.bed
1) extract copy number information from one / small # of tumor sample
2) create 'dummy' reference to use as input to batch command
3) evaluate suitablity for analysis by repeating CNVkit analysis


combine uncorrected target & antitarget coverage tables (.cnn), correct for biases in regional coverage and GC content fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn Reference.cnn -o Sample.cnr

how it works ; 
  A weight is assigned to each remaining bin depending on:

  1. The size of the bin;
  2. The deviation of the bin’s log2 value in the reference from 0;
  3. The “spread” of the bin in the reference.


infer discrete copy number segments from give coverage table segment Sample.cnr -o Sample.cns


Given segmented log2 ratio estimates (.cns), derive each segment’s absolute integer copy number using either: call Sample.cns -o call Sample.cns -y -m threshold -t=-1.1,-0.4,0.3,0.7 -o call Sample.cns -y -m clonal --purity 0.65 -o call Sample.cns -y -v Sample.vcf -m clonal --purity 0.7 -o call Sample.cns -y -v Sample.vcf -o

🗂 Review from Paper

Calculation of off-target intervals

  • antitarget command ; accepts a list of targeted regions,
  • Browser Extensible Data (BED) or GATK/Picard interval list format, and divides the off-target regions between each target into large bins

Estimation of copy number by read depth

Normalization of test samples to the reference

Segmentation and calling absolute copy number

Data summarization, reporting and visualization

Bioinformatician, Data Scientist & Analysis [지망생]

0개의 댓글