
pd.read_csv(), pd.to_csv() -> Modin 👍

· 2 min read


Whether you are doing machine learning, handling big data, or running a small crawling job, you end up working with csv files a lot, and if you use pandas that means calling pd.read_csv() / pd.to_csv() all the time.

As the data grows, however, performance problems appear: reads and writes take longer and longer, and the inconveniences keep piling up.

To get past that limit, I went looking for more efficient alternatives that offer distributed computing.
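Before switching libraries, it helps to measure how read time actually grows with file size on your own machine. The following is a minimal sketch, assuming pandas is installed; the helper names `make_csv` and `time_read`, the column names, and the row counts are all hypothetical and chosen only for illustration.

```python
import csv
import os
import tempfile
import time

import pandas as pd


def make_csv(path, n_rows):
    """Write a synthetic CSV with n_rows data rows (hypothetical columns)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value", "label"])
        for i in range(n_rows):
            writer.writerow([i, i * 0.5, f"row{i}"])


def time_read(path):
    """Return (seconds, row_count) for one full pd.read_csv() of path."""
    start = time.perf_counter()
    df = pd.read_csv(path)
    return time.perf_counter() - start, len(df)


results = []
with tempfile.TemporaryDirectory() as tmp:
    for n_rows in (1_000, 10_000, 100_000):
        path = os.path.join(tmp, f"sample_{n_rows}.csv")
        make_csv(path, n_rows)
        seconds, rows = time_read(path)
        results.append((n_rows, rows, seconds))
        print(f"{n_rows:>7} rows: {seconds:.4f}s")
```

The absolute numbers depend on the machine and disk, so treat them as a baseline to compare the alternatives against rather than as a benchmark.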

For read_csv()

import pandas as pd
import dask.dataframe as dd

# Time a full read of a large CSV file with pandas (eager)
# (%timeit is an IPython/Jupyter magic, so run this in a notebook)
%timeit pd.read_csv('large_dataset.csv')
# Time a full read of the same file with dask; dd.read_csv() is lazy,
# so .compute() is needed to actually load the data
%timeit dd.read_csv('large_dataset.csv').compute()

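The example above uses dask. Because dask splits the file into partitions and reads them in parallel, the time saved grows with the size of the data; for small files the scheduling overhead can cancel out the gain.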
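To make the partitioning idea concrete, here is a rough sketch that uses only the Python standard library. It is not how dask is implemented; it only illustrates splitting a CSV file into byte ranges that end on line boundaries and parsing each range independently. The function names `split_offsets`, `read_partition`, and `read_csv_parallel` are hypothetical, and the sketch assumes no quoted field contains an embedded newline.

```python
import csv
import io
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def split_offsets(path, n_parts):
    """Return (start, end) byte ranges that begin and end on line boundaries."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.readline()                      # skip the header row
        bounds = [f.tell()]
        for i in range(1, n_parts):
            f.seek(max(bounds[-1], size * i // n_parts))
            f.readline()                  # advance to the next line boundary
            bounds.append(f.tell())
    bounds.append(size)
    # Drop empty ranges, which can appear when the file is very small
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e > s]


def read_partition(path, start, end):
    """Parse the rows stored in one byte range of the file."""
    with open(path, "rb") as f:
        f.seek(start)
        chunk = f.read(end - start).decode("utf-8")
    return list(csv.reader(io.StringIO(chunk, newline="")))


def read_csv_parallel(path, n_parts=4):
    """Read all data rows by parsing the partitions concurrently."""
    ranges = split_offsets(path, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        # pool.map preserves the order of the ranges, so rows stay in file order
        parts = pool.map(lambda r: read_partition(path, *r), ranges)
        return [row for part in parts for row in part]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        writer.writerows([i, i * 2] for i in range(1000))
    rows = read_csv_parallel(path, n_parts=4)

print(len(rows), rows[0], rows[-1])
```

Threads in pure Python do not give a real CPU speedup here, so this only shows the structure of the approach: each partition can be handed to a separate worker, which is what lets a library like dask scale the read across cores or machines.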

For to_csv()

import pandas as pd
import fastparquet

# Saving a DataFrame to a Parquet file
df = pd.DataFrame({'column1': [1, 2, 3], 'column2': ['a', 'b', 'c']})
fastparquet.write('output.parquet', df)

As shown above, fastparquet saves the DataFrame in the Parquet format instead of CSV, which is more efficient in both disk space and performance.

As described above, there are plenty of options, but the one I ended up looking into is Modin!

How to read a csv with Modin

import modin.pandas as pd

# Reading a CSV file with Modin
df = pd.read_csv('data.csv')

Simply swapping the pandas import as shown above is an easy way to get a performance gain, because Modin spreads the work across the available CPU cores.

How to write a csv with Modin

import modin.pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'column1': [1, 2, 3], 'column2': ['a', 'b', 'c']})
# Writing the DataFrame to a CSV file with Modin
df.to_csv('output.csv', index=False)

Writing a file works the same way as reading one: swap the import and it works right away, and the file can be saved faster and more efficiently.

Among the many alternatives, I am introducing Modin because importing modin.pandas is all it takes; the rest of the code can be used without modification.
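Because only the import line changes, one way to keep a script usable on machines without Modin is a guarded import. This is a small sketch of that pattern, not something Modin requires; the `backend` variable is a hypothetical name used only for reporting.

```python
try:
    # Use Modin when it is installed
    import modin.pandas as pd
    backend = "modin"
except ImportError:
    # Otherwise fall back to plain pandas; the rest of the code is unchanged
    import pandas as pd
    backend = "pandas"

print(f"DataFrame backend: {backend}")
```

Note that Modin also needs an execution engine such as Ray or Dask to be installed, and a missing engine typically surfaces on the first operation rather than at import time, so this guard only covers the case where Modin itself is absent.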

If you need a feature that only pandas provides and want to go back to pandas, you can switch easily as shown below.

import modin.pandas as pd

# Reading a CSV file with Modin
df = pd.read_csv('data.csv')
# Perform some data analysis with Modin
# Switch to pandas: convert the Modin DataFrame into a regular pandas DataFrame
df = df._to_pandas()
# Continue working with pandas
df.head()

Conclusion

Libraries keep evolving and new alternatives keep appearing, so if you hit a wall with pandas, it is worth looking for other approaches and studying them.