One-hot encoding methods for DNA sequences in deep learning

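When analyzing DNA sequences with deep learning models, the sequences must first be one-hot encoded. Below are three ways to one-hot encode DNA sequences with PyTorch, combined into a single script that also measures how long each method takes to process 128 DNA sequences: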

    import time
    import torch
    import torch.nn.functional as F
    import numpy as np

    # Map each base to an integer index
    mapping = {'A': 0, 'T': 1, 'C': 2, 'G': 3}
    # Define the list of DNA sequences
    sequences = ['ATCG' * 250] * 128  # 128 sequences, each 1000 bp long

    # Method 1: torch.nn.functional.one_hot
    start_time = time.time()
    onehot_sequences1 = []
    for sequence in sequences:
        index_sequence = [mapping[base] for base in sequence]
        onehot_sequence = F.one_hot(torch.tensor(index_sequence), num_classes=4).float()
        onehot_sequences1.append(onehot_sequence)
    end_time = time.time()
    method1_time = end_time - start_time

    # Method 2: index the rows of torch.eye
    start_time = time.time()
    onehot_matrix = torch.eye(4)
    onehot_sequences2 = []
    for sequence in sequences:
        index_sequence = [mapping[base] for base in sequence]
        onehot_sequence = onehot_matrix[index_sequence]
        onehot_sequences2.append(onehot_sequence)
    end_time = time.time()
    method2_time = end_time - start_time

    # Method 3: encode with numpy, then convert to a tensor
    start_time = time.time()
    onehot_matrix = np.eye(4)
    onehot_sequences3 = []
    for sequence in sequences:
        index_sequence = [mapping[base] for base in sequence]
        onehot_sequence = onehot_matrix[index_sequence]
        onehot_sequences3.append(onehot_sequence)
    # Unlike methods 1 and 2, this also stacks the results into one (128, 1000, 4) tensor
    onehot_sequences3 = torch.from_numpy(np.array(onehot_sequences3)).float()
    end_time = time.time()
    method3_time = end_time - start_time

    print("Method 1 time:", method1_time)
    print("Method 2 time:", method2_time)
    print("Method 3 time:", method3_time)

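Test results (in seconds):

    Method 1 time: 0.09143757820129395
    Method 2 time: 0.02177143096923828
    Method 3 time: 0.035161733627319336

In this run, indexing torch.eye (method 2) was the fastest, followed by numpy (method 3) and F.one_hot (method 1).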
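The timings above come from a single run, so they will vary between machines and between runs. As a rough sketch of how the comparison could be made more repeatable, the snippet below wraps each method in a function, checks that all three produce the same encoding, and reports the best of several timed passes using `time.perf_counter`. The function names, the `best_time` helper and the `repeats=5` setting are illustrative assumptions, not part of the original script. With the mapping used in this post, "ATCG" encodes to the 4x4 identity matrix.

```python
import time

import numpy as np
import torch
import torch.nn.functional as F

# Same base-to-index mapping as in the post
MAPPING = {'A': 0, 'T': 1, 'C': 2, 'G': 3}


def encode_one_hot(seq):
    """Method 1: torch.nn.functional.one_hot on an index tensor."""
    idx = torch.tensor([MAPPING[base] for base in seq])
    return F.one_hot(idx, num_classes=4).float()


def encode_torch_eye(seq):
    """Method 2: pick rows of a 4x4 identity tensor."""
    idx = [MAPPING[base] for base in seq]
    return torch.eye(4)[idx]


def encode_numpy_eye(seq):
    """Method 3: pick rows of a numpy identity matrix, then convert."""
    idx = [MAPPING[base] for base in seq]
    return torch.from_numpy(np.eye(4)[idx]).float()


def best_time(fn, sequences, repeats=5):
    """Hypothetical helper: best wall-clock time over `repeats` full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for seq in sequences:
            fn(seq)
        best = min(best, time.perf_counter() - start)
    return best


sequences = ['ATCG' * 250] * 128  # same workload as in the post
reference = encode_one_hot("ATCG")
for fn in (encode_one_hot, encode_torch_eye, encode_numpy_eye):
    # All three methods should return the same (length, 4) float tensor
    assert torch.equal(fn("ATCG"), reference)
    print(fn.__name__, "best of 5:", best_time(fn, sequences))
```

Taking the minimum over several passes reduces the influence of one-off effects such as warm-up on the first call. Note that these functions time only the per-sequence encoding, so the final stacking step that method 3 performs in the original script is not included here.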

 
