Parsing a list of strings for speed

Nov 30 2020

Background

I have a function, get_player_path, that takes a list of strings, player_file_list, and an int value, total_players. For the sake of the example, I have shortened the list of strings and set the int value to a very small number.

Each string in player_file_list has the form year-date/player_id/some_random_file.file_extension or year-date/player_id/IDATs/some_random_number/some_random_file.file_extension

Issue

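What I'm essentially trying to achieve here is to go through this list and store every unique year-date/player_id path in a set until its length reaches the value of total_players.

My current approach doesn't seem the most efficient to me, and I'm wondering whether there is any way to speed up my get_player_path function.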

Code

def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        player_file = player_file.split("/")
        file_path = f"{player_file[0]}/{player_file[1]}/"
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)


player_file_list = [
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
]

print(get_player_path(player_file_list, 2))

Output

['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']

Answers

PedroLobito Nov 30 2020 at 21:08

I'll leave this solution here; it can probably be improved further. I hope it helps.

player_file_list = (
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
)

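def get_player_path(l, n):
    # Collect unique "date/player_id" prefixes until n of them are found.
    pfl = set()
    for i in l:
        i = "/".join(i.split("/")[0:2])
        pfl.add(i)  # set.add already ignores duplicates
        if len(pfl) == n:
            return pfl

    # The loop finished without reaching n unique prefixes.
    if n > len(pfl):
        print("not enough matches")
    return None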

print(get_player_path(player_file_list, 2))
# {'2020-10-27/31001804320549', '2020-10-28/31001804320548'}

Python demo

1 janluke Nov 30 2020 at 22:25

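First, let's analyze your function:

Assuming the path length is bounded by a relatively "small" number, the loop should take linear time (O(n)) in the length of the input list.
The sort uses O(n log(n)) comparisons.

So, as the list grows, the cost of sorting dominates. You can micro-optimize the loop as much as you like, but as long as you keep the sort at the end, your effort won't make much of a difference, even with large lists.

If you're just writing a Python script, your approach is fine. If you really needed performance with huge lists, you would probably be using another language. Nevertheless, if you really care about performance or want to learn something new, you could try one of the following approaches:

Replace the generic sorting algorithm with one that is specific to strings. See here, for example.
Use a trie, which removes the need to sort. This may be better in theory but is probably worse in practice.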
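
To make the trie idea concrete, here is a minimal sketch in plain Python. The names build_trie and iter_sorted are hypothetical helpers, not part of any library, and the trie is a nested dict keyed by single characters with the empty string marking the end of a stored path. Children are visited in key order at each node, so the paths come out in lexicographic order without a final sort over whole strings.

```python
END = ""  # assumed end-of-word marker; "" sorts before every real character


def build_trie(paths):
    """Insert every path into a nested-dict trie, one character per level."""
    root = {}
    for path in paths:
        node = root
        for ch in path:
            node = node.setdefault(ch, {})
        node[END] = True  # duplicates simply overwrite the same marker
    return root


def iter_sorted(node, prefix=""):
    """Yield the stored paths in lexicographic order."""
    for key in sorted(node):  # only orders the few children of one node
        if key == END:
            yield prefix
        else:
            yield from iter_sorted(node[key], prefix + key)


prefixes = [
    "2020-10-28/31001804320548",
    "2020-10-27/31001804320549",
    "2020-10-28/31001804320548",
]
print(list(iter_sorted(build_trie(prefixes))))
# ['2020-10-27/31001804320549', '2020-10-28/31001804320548']
```

In pure Python the per-character dictionary work is likely to cost more than the built-in sort, which fits the caveat that this can be worse in practice.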

For completeness, here is a micro-optimization that assumes the date has a fixed length of 10 characters:

def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        end = player_file.find('/', 12)       # <--- len(date) + len('/') + 1
        file_path = player_file[:end]         # <---
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)

If the IDs also have a fixed length, as in your example list, you don't need any split or find:

LENGTH = DATE_LENGTH + ID_LENGTH + 1   # 1 is for the slash between date and id
...
for player_file in player_file_list:
    file_path = player_file[:LENGTH]
...

Edit: fixed the LENGTH initialization; I forgot to add 1.
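
Putting the fixed-length variant together, a complete version might look like the sketch below. The function name get_player_path_fixed and the values DATE_LENGTH = 10 and ID_LENGTH = 14 are assumptions read off the example list; they only hold if every entry really has a 10-character date and a 14-character ID.

```python
DATE_LENGTH = 10  # assumed: dates look like "2020-10-27"
ID_LENGTH = 14    # assumed: IDs look like "31001804320549"
LENGTH = DATE_LENGTH + ID_LENGTH + 1  # 1 is for the slash between date and id


def get_player_path_fixed(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        # A plain slice replaces both split() and find().
        player_files_to_process.add(player_file[:LENGTH])
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)


sample = [
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
]
print(get_player_path_fixed(sample, 2))
# ['2020-10-27/31001804320549', '2020-10-28/31001804320548']
```

Note that the slice does not include the trailing slash that the original function appends.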

AajKaal Nov 30 2020 at 21:19

Since the list is already sorted, use a dict so that you don't need to sort. If you still need sorting, you can always use sorted in the return statement. Add import re and replace the function as follows:
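
A minimal sketch of this idea follows. It is an assumption about what such a function could look like, not necessarily the answerer's exact code: the name get_player_path_ordered and the regular expression are illustrative. It relies on dicts preserving insertion order (guaranteed since Python 3.7), so if the input is already sorted, the keys come out sorted as well.

```python
import re

# Assumed pattern: the first two slash-separated components of a path.
PREFIX_RE = re.compile(r"[^/]+/[^/]+")


def get_player_path_ordered(player_file_list, total_players):
    seen = {}  # keys keep their insertion order
    for player_file in player_file_list:
        match = PREFIX_RE.match(player_file)
        if match is None:
            continue  # skip entries that do not look like date/id/...
        seen[match.group(0) + "/"] = None
        if len(seen) == total_players:
            break
    return list(seen)  # no sort needed if the input was already sorted


sample = [
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-29/31001804320547/31001804320549.json",
]
print(get_player_path_ordered(sample, 2))
# ['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']
```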