
💬 Why I ended up doing this

The most important feature of our graduation project is a chatbot: when a young person preparing for independence (자립준비청년) asks a question in text, the chatbot provides the information they need. To make the information the chatbot provides more accurate, we decided to use RAG. As part of building the RAG pipeline, we decided to build a vector DB with Chroma DB and fill it with crawled information that is useful to these young people (housing information, employment information, local ordinances, and so on).

 

🐥 Let's crawl with selenium!

https://wikidocs.net/137914

 

6) Dynamic web crawling - Introduction to selenium and basic usage (wikidocs.net)

I followed the link above while working on this.

 

 


selenium์€ ๋™์  ํฌ๋กค๋ง์— ์ ํ•ฉํ•ด์„œ ์„ ํƒํ•˜๊ฒŒ ๋˜์—ˆ๋‹ค. ๋จผ์ € ์šฐ๋ฆฌ๋Š” ์ž๋ฆฝ์ •๋ณดON์ด๋ผ๋Š” ํ™ˆํŽ˜์ด์ง€ ์ƒ์˜ ์ž๋ฆฝ ์ง€์› ์‚ฌ์—…์— ๋Œ€ํ•œ txt์™€ ์ด๋ฏธ์ง€ ํŒŒ์ผ์„ ๊ฐ€์ ธ์˜ค๊ธฐ๋กœ ํ–ˆ๋‹ค. 

 

๋จผ์ € selenium์„ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๊ด€๋ จ ํŒจํ‚ค์ง€๋ฅผ ์„ค์น˜ํ•ด์•ผ ํ•œ๋‹ค. ๋‚˜๋Š” ์•„๋‚˜์ฝ˜๋‹ค๋กœ ๊ฐ€์ƒํ™˜๊ฒฝ์„ ์„ธํŒ…ํ•ด์ฃผ์—ˆ๋‹ค.

pip install selenium

 

Once the installation succeeds, we are ready to start accessing the website.

Write the following imports in your Python code.

import os
import requests
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import random
from selenium import webdriver

 

from selenium.webdriver.common.by import By

Needed to look up elements in the HTML by class name or id.

 

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Needed to wait until a given element on the web page has loaded.

 

from selenium import webdriver

Needed to control the browser automatically.
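To make the role of WebDriverWait more concrete, here is a minimal sketch of the polling idea behind it, written in plain Python. It is an illustration only and not selenium's actual implementation; wait_until, its parameters, and the fake element check are hypothetical names made up for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy comes back within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# A fake "page" whose element only shows up on the third check.
attempts = {"count": 0}


def element_is_loaded():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(element_is_loaded, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

In the crawler itself, the condition is one of the expected_conditions helpers such as EC.presence_of_element_located, and the timeout is the second argument passed to WebDriverWait.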

 

# Start a Chrome browser session
driver = webdriver.Chrome()

# Go to the main page
driver.get("https://jaripon.ncrc.or.kr/home/kor/main.do")

# Handle the alert dialog, if one appears
try:
    WebDriverWait(driver, 5).until(EC.alert_is_present())
    alert = driver.switch_to.alert
    alert.accept()
    print("Alert accepted.")
except Exception as e:
    print(f"No alert found or error occurred: {e}")

# Move to the independence support information page
script = "fn_menu_move('/home/kor/support/projectMng/index.do', '3');"
driver.execute_script(script)

 

After moving to the main page, we write code that sends a request to move to the page that lists the independence support programs.

There is a reason we do not go straight to that page!

If you access the page directly, you get an error message like "비정상적 접근입니다" ("abnormal access").

So, to avoid this error, we use JavaScript.

driver.execute_script(JavaScript code)

Written this way, we can control the browser's behavior with JavaScript code.

script = "fn_menu_move('/home/kor/support/projectMng/index.do', '3');"

At this point you might ask how fn_menu_move can work when we never defined it ourselves!

<a href="javascript:void(0);" onclick="fn_menu_move('/home/kor/servic/login/index.do','25');">
	<span class="text">로그인</span>
</a>

 

Because the function already exists on the site itself, as in the snippet above, we can simply call it.

Other sites may not have such a function, though, so be careful when applying this approach elsewhere.
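Since fn_menu_move exists only on this particular site, one way to be a little more defensive is to wrap the call in a check that the function is defined. The sketch below is just one possible approach: build_menu_move_script is a hypothetical helper, and json.dumps is used here only to quote the arguments safely.

```python
import json


def build_menu_move_script(path, menu_id, func_name="fn_menu_move"):
    """Return JavaScript that calls func_name(path, menu_id) if it is defined.

    The script returns true when the call was made and false otherwise, so
    the Python side can tell whether the page provides the function.
    """
    args = ", ".join(json.dumps(str(value)) for value in (path, menu_id))
    return (
        f"if (typeof {func_name} === 'function') "
        f"{{ {func_name}({args}); return true; }} "
        "return false;"
    )


script = build_menu_move_script("/home/kor/support/projectMng/index.do", "3")
print(script)

# With a live driver (not created here) it could be used like this:
# moved = driver.execute_script(script)
# if not moved:
#     print("fn_menu_move is not defined on this page")
```

On a site without that function, the script returns false instead of raising a JavaScript error, which makes the failure easier to spot.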

# Wait for the page to load and check for the div.gallery_list > ul.list element
try:
    gallery_list = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.gallery_list ul.list"))
    )
    print("gallery_list found.")

    # Visit each li > a element and save its text and images
    for idx in range(10):  # process 10 items
        try:
            # Load the current li > a elements
            li_elements = gallery_list.find_elements(By.CSS_SELECTOR, "li > a")

            # Click the a tag to move to the detail page
            a_tag = li_elements[idx]
            driver.execute_script("arguments[0].click();", a_tag)

            # Wait until the new page has loaded
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )

            # Get the text and image URLs from <div class="editor_view">
            editor_view = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "div.editor_view"))
            )

            # Get the text content and save it
            text_content = editor_view.text
            text_filename = f"content_{idx + 1}.txt"
            with open(text_filename, "w", encoding="utf-8") as file:
                file.write(text_content)
            print(f"Saved text content to {text_filename}.")

            # Download and save the images
            image_elements = editor_view.find_elements(By.TAG_NAME, "img")
            for img_idx, img in enumerate(image_elements):
                img_url = img.get_attribute("src")
                if img_url:
                    img_data = requests.get(img_url).content
                    image_filename = f"content_{idx + 1}_image_{img_idx + 1}.jpg"
                    with open(image_filename, "wb") as img_file:
                        img_file.write(img_data)
                    print(f"Saved image to {image_filename}.")

            # Go back to the previous page
            driver.back()

            # Re-enter the support information page (using JavaScript)
            driver.execute_script(script)
            gallery_list = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "div.gallery_list ul.list"))
            )

            # Wait a random interval so the traffic looks human
            time.sleep(random.uniform(1, 3))

        except Exception as e:
            print(f"Error processing link {idx + 1}: {e}")
            break

except Exception as e:
    print(f"Error finding gallery_list or list items: {e}")

# Close the browser
driver.quit()

 

The code above is really the core of this post. Let's go through it step by step...!

gallery_list = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.gallery_list ul.list"))
    )
print("gallery_list found.")

WebDriverWait waits (up to 10 seconds here) until the element has loaded, and By.CSS_SELECTOR is used to find a ul tag with the class name list inside a div tag with the class name gallery_list.

Now let's look at the code inside the loop.

Inside the loop, we open the page for each program and then save that program's text and image files locally.

# Load the current li > a elements
li_elements = gallery_list.find_elements(By.CSS_SELECTOR, "li > a")

# Click the a tag to move to the detail page
a_tag = li_elements[idx]
driver.execute_script("arguments[0].click();", a_tag)

# Wait until the new page has loaded
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "body"))
)

# Get the text and image URLs from <div class="editor_view">
editor_view = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.editor_view"))
)

Finding each tag as shown above is the first step.

# Get the text content and save it
text_content = editor_view.text
text_filename = f"content_{idx + 1}.txt"
with open(text_filename, "w", encoding="utf-8") as file:
    file.write(text_content)
print(f"Saved text content to {text_filename}.")

# Download and save the images
image_elements = editor_view.find_elements(By.TAG_NAME, "img")
for img_idx, img in enumerate(image_elements):
    img_url = img.get_attribute("src")
    if img_url:
        img_data = requests.get(img_url).content
        image_filename = f"content_{idx + 1}_image_{img_idx + 1}.jpg"
        with open(image_filename, "wb") as img_file:
            img_file.write(img_data)
        print(f"Saved image to {image_filename}.")

 

Next, we save the text and images from that tag.

For text, editor_view.text gives you the text without any extra work.

However, since the content is in Korean, the encoding must be set to utf-8 when writing the file, or the characters may come out garbled.

Python's built-in open function lets us save the txt file.

Using with ... as file spares you from having to call close after open, so it is convenient to write it this way.

 

with open(text_filename, "w", encoding="utf-8") as file:
                file.write(text_content)

If you would rather not use with ~ as file, you can write

file = open(text_filename, "w", encoding="utf-8")
file.write(text_content)
file.close()

instead. Note that close must be called with parentheses; a bare file.close does not close the file.

 

For reference, "w" means write and "r" means read; since we want to write here, we pass "w" as the second parameter of open.
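The points above about with, the "w" and "r" modes, and utf-8 can be checked with a small round trip. This is a minimal sketch that writes to a temporary directory rather than the crawler's working directory; the sample text is made up for the example.

```python
import os
import tempfile

text_content = "자립준비청년 지원 사업 안내"  # sample Korean text (made up)

with tempfile.TemporaryDirectory() as tmp_dir:
    text_filename = os.path.join(tmp_dir, "content_1.txt")

    # "w" creates the file (or overwrites it); the file is closed
    # automatically when the with block ends.
    with open(text_filename, "w", encoding="utf-8") as file:
        file.write(text_content)
    file_closed_after_with = file.closed

    # "r" opens the same file for reading; use the same encoding.
    with open(text_filename, "r", encoding="utf-8") as file:
        read_back = file.read()

print(file_closed_after_with, read_back == text_content)
```

Reading the file back with the same encoding returns the original Korean text unchanged.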

 

Images take a bit more work than text.

Find the img tags -> read each img tag's src attribute -> if a src is present, download it and save it.

Here the second parameter of open is "wb", which opens the file for writing in binary mode.

The bytes returned by requests.get(img_url).content are written to the file as they are, so no extra module such as pickle is needed to handle the image as binary data.
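The crawler saves every image with a .jpg name regardless of its real type. As a possible refinement, the sketch below takes the extension from the URL path and falls back to .jpg when there is none. image_extension and save_bytes are hypothetical helpers, the example.com URLs are placeholders, and the literal bytes stand in for requests.get(img_url).content.

```python
import os
import tempfile
from urllib.parse import urlparse


def image_extension(img_url, default=".jpg"):
    """Return the lowercase file extension found in the URL path, or default."""
    path = urlparse(img_url).path
    ext = os.path.splitext(path)[1].lower()
    return ext if ext else default


def save_bytes(path, data):
    """Write raw bytes to path in binary mode and return the byte count."""
    with open(path, "wb") as img_file:
        return img_file.write(data)


# Placeholder URLs, used only to show the behaviour.
ext_with_query = image_extension("https://example.com/upload/banner.PNG?ver=2")
ext_missing = image_extension("https://example.com/image/view")
print(ext_with_query, ext_missing)

with tempfile.TemporaryDirectory() as tmp_dir:
    ext = image_extension("https://example.com/a.gif")
    target = os.path.join(tmp_dir, "content_1_image_1" + ext)
    written = save_bytes(target, b"GIF89a")  # stand-in for downloaded bytes
    saved_name = os.path.basename(target)

print(saved_name, written)
```

The URL is only a hint: a server can serve any image type from any path, so this does not guarantee the extension matches the actual content.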

 

Once the work on that page is done, we move back to the support programs page.

After the loop finishes, we close the browser.

If you run this on Google Colab, you must enable headless mode or the code will not work.
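Headless mode is switched on through Chrome options. The following is a minimal sketch, assuming a recent Chrome and selenium 4; the exact flag list is an assumption that tends to be used in container-like environments such as Colab, and make_headless_driver is a hypothetical helper name.

```python
# Chrome flags often used in container-like environments such as Colab.
# This list is an assumption; adjust it to your own environment.
HEADLESS_CHROME_ARGS = [
    "--headless=new",           # run Chrome without opening a window
    "--no-sandbox",             # commonly needed when running as root in a container
    "--disable-dev-shm-usage",  # avoid problems with a small /dev/shm
]


def make_headless_driver():
    """Create a headless Chrome driver (requires selenium and Chrome)."""
    # Imported here so the flag list above can be inspected without selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in HEADLESS_CHROME_ARGS:
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


# In the crawler, this would replace the plain webdriver.Chrome() call:
# driver = make_headless_driver()
```

Chrome itself still has to be installed in the Colab runtime; that setup step is not covered here.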

 

🐥 Let's save the files crawled with selenium to Google Drive

So far the files are saved locally, but we also need to store them in Google Drive.

For that, we have to use the Google Drive API.

 

https://console.cloud.google.com

 

Google Cloud Platform

Open the page above.

 

 

๋‚˜๋Š” ์ด๋ฏธ storage๋ฅผ ๋งŒ๋“ค์–ด๋‘” ์ƒํƒœ๋ผ ์ด๋Ÿฐ ํŽ˜์ด์ง€๊ฐ€ ๋œจ๋Š”๋ฐ ์ฒ˜์Œ์ธ ๊ฒฝ์šฐ์—๋Š” ํ”„๋กœ์ ํŠธ๋ฅผ ๋งŒ๋“ค๋ผ๊ณ  ํ•˜๋Š” ๋‚ด์šฉ์ด ๋œฐ ๊ฑฐ๋‹ค.

์–ด์จŒ๋“  ์ฒ˜์Œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ด๋ณด๋ฉด

"์ƒˆ ํ”„๋กœ์ ํŠธ"๋ฅผ ํด๋ฆญํ•ด์ค€๋‹ค.

๊ทธ๋Ÿฌ๋ฉด ์ด๋Ÿฐ ์‹์œผ๋กœ ํ”„๋กœ์ ํŠธ ์ด๋ฆ„์„ ์ง€์ •ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.

์กฐ์ง์ด ์—†์œผ๋ฉด ์ƒ์„ฑ์ด ์•ˆ๋˜๋Š”๋ฐ ๋‚˜๋Š” ํ•™๊ต ์ด๋ฉ”์ผ์ด๋ผ ํ•™๊ต๋ฅผ ์กฐ์ง์œผ๋กœ ์„ ํƒํ•ด์ฃผ์—ˆ๋‹ค.

๋ฉ”๋‰ด๋ฐ” > API ๋ฐ ์„œ๋น„์Šค > ์‚ฌ์šฉ ์„ค์ •๋œ API ๋ฐ ์„œ๋น„์Šค๋ฅผ ํด๋ฆญํ•œ๋‹ค.

์ด๋Ÿฐ ํ™”๋ฉด์ด ๋‚˜์˜ค๋Š”๋ฐ ์ด๋•Œ "+ API ๋ฐ ์‚ฌ๋น„์Šค ์‚ฌ์šฉ ์„ค์ •"์„ ๋ˆ„๋ฅธ๋‹ค.

ํ•ด๋‹น ํ™”๋ฉด์ด ๋‚˜์˜ค๋Š”๋ฐ ๊ฒ€์ƒ‰์ฐฝ์— google drive api๋ฅผ ๊ฒ€์ƒ‰ํ•œ๋‹ค.

์—ฌ๊ธฐ์„œ ์ œ์ผ ์ƒ๋‹จ์— ๋‚˜์˜ค๋Š” Google Drive API๋ฅผ ์„ ํƒํ•œ๋‹ค.

์‚ฌ์šฉ ๋ฒ„ํŠผ์„ ๋ˆŒ๋Ÿฌ์ค€๋‹ค.

์˜ค๋ฅธ์ชฝ ์ƒ๋‹จ์˜ ์‚ฌ์šฉ์ž ์ธ์ฆ ์ •๋ณด ๋งŒ๋“ค๊ธฐ๋ฅผ ์„ ํƒํ•œ๋‹ค.

์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ๋ฐ์ดํ„ฐ๋ฅผ ์„ ํƒํ•˜๊ณ  ๋‹ค์Œ์„ ๋ˆŒ๋Ÿฌ์ค€๋‹ค.

๊ทธ๋Ÿฌ๋ฉด ์œ„์™€ ๊ฐ™์€ ์ฐฝ์ด ๋‚˜์˜ค๊ณ  ์ด ์ฐฝ์—์„œ service ์•„์ด๋””๋งŒ ์ ์–ด์ฃผ๋ฉด ๋œ๋‹ค.

 

Now go to the OAuth consent screen and choose the User Type. Choose Internal if only users within your organization will use the app, or External if you also want to allow outside users. I chose External for now.

I entered the app name and the user support email. This is just an example, so I gave the app a throwaway name, but you should not name yours like this!!!

Scroll down, enter the developer contact information, and click Save and Continue.

For the remaining steps, just keep clicking Save and Continue and you are done!!

 

Now go back to Credentials and click the download icon in the Actions column.

Choose Download JSON here.

Rename the downloaded JSON file to credentials.json!

Next, click the service account's email address (blurred out in my screenshot).

There, choose Add Key > Create new key.

Select JSON and click Create; a JSON file is downloaded, which you rename to service_account.json.

Put credentials.json and service_account.json, downloaded above, in the same directory as the crawling Python file. Of the two, the code in this post only reads service_account.json.

 

Then create a folder in Google Drive,

click Manage access on the right,

and add the email of the service account created earlier as an Editor.

Finally, from the folder's address https://drive.google.com/drive/folders/~~~~~ copy the part shown as tildes; this is the folder ID.
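Copying the ID by hand works fine, but it can also be pulled out of the folder URL in code. This is a small sketch under the assumption that the ID is the path segment right after folders; extract_folder_id is a hypothetical helper and EXAMPLE_FOLDER_ID is a made-up placeholder, not a real folder.

```python
from urllib.parse import urlparse


def extract_folder_id(folder_url):
    """Return the path segment that follows 'folders' in a Drive folder URL."""
    segments = [s for s in urlparse(folder_url).path.split("/") if s]
    if "folders" not in segments:
        raise ValueError(f"not a Drive folder URL: {folder_url}")
    index = segments.index("folders")
    if index + 1 >= len(segments):
        raise ValueError(f"no folder ID in URL: {folder_url}")
    return segments[index + 1]


# "EXAMPLE_FOLDER_ID" is a placeholder used only for this demonstration.
url = "https://drive.google.com/drive/folders/EXAMPLE_FOLDER_ID?usp=sharing"
FOLDER_ID = extract_folder_id(url)
print(FOLDER_ID)
```

Query parameters such as ?usp=sharing are ignored because only the URL path is inspected.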

 

Add the code below to the existing Python file.

import io
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseUpload
from google.oauth2.service_account import Credentials

# Path to the service account key file
SERVICE_ACCOUNT_FILE = 'service_account.json'

# Authenticate with the Google Drive API (service account)
def authenticate_google_drive():
    credentials = Credentials.from_service_account_file(
        SERVICE_ACCOUNT_FILE,
        scopes=['https://www.googleapis.com/auth/drive']
    )
    return build('drive', 'v3', credentials=credentials)

# Google Drive shared folder ID
FOLDER_ID = 'paste the part you copied earlier here!!'

# File upload function
def upload_to_drive(service, file_name, file_data, mime_type):
    file_metadata = {
        'name': file_name,
        'parents': [FOLDER_ID]
    }
    media = MediaIoBaseUpload(io.BytesIO(file_data), mimetype=mime_type)
    file = service.files().create(body=file_metadata, media_body=media, fields='id').execute()
    print(f"Uploaded {file_name} to shared folder with ID: {file.get('id')}")

# Authenticate with the Google Drive API
drive_service = authenticate_google_drive()

 

In the part that saves the content, replace the existing file.write code with the following.

upload_to_drive(drive_service, text_file_name, text_data, "text/plain")

When you run the code, you can see the files being uploaded to Google Drive automatically!
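upload_to_drive wraps file_data in io.BytesIO, so it expects bytes rather than a str. The sketch below shows how the three arguments for a text upload could be prepared; prepare_text_upload is a hypothetical helper and the sample text is made up.

```python
import io


def prepare_text_upload(idx, text_content):
    """Return (file_name, file_data, mime_type) for one crawled page."""
    file_name = f"content_{idx + 1}.txt"
    file_data = text_content.encode("utf-8")  # str -> bytes
    return file_name, file_data, "text/plain"


name, data, mime = prepare_text_upload(0, "지원 사업 안내")
stream = io.BytesIO(data)  # the same wrapping upload_to_drive applies
decoded = stream.read().decode("utf-8")
print(name, mime, decoded)

# With an authenticated service (not created here):
# upload_to_drive(drive_service, name, data, mime)
```

Images need no such step, because requests.get(img_url).content is already bytes.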

 

🐥 Full code

import os
import requests
import random
import time
import io
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseUpload
from google.oauth2.service_account import Credentials

# Path to the service account key file
SERVICE_ACCOUNT_FILE = 'service_account.json'

# Authenticate with the Google Drive API (service account)
def authenticate_google_drive():
    credentials = Credentials.from_service_account_file(
        SERVICE_ACCOUNT_FILE,
        scopes=['https://www.googleapis.com/auth/drive']
    )
    return build('drive', 'v3', credentials=credentials)

# Google Drive shared folder ID
FOLDER_ID = ''

# File upload function
def upload_to_drive(service, file_name, file_data, mime_type):
    file_metadata = {
        'name': file_name,
        'parents': [FOLDER_ID]
    }
    media = MediaIoBaseUpload(io.BytesIO(file_data), mimetype=mime_type)
    file = service.files().create(body=file_metadata, media_body=media, fields='id').execute()
    print(f"Uploaded {file_name} to shared folder with ID: {file.get('id')}")

# Set up the WebDriver
driver = webdriver.Chrome()

# Go to the main page
driver.get("https://jaripon.ncrc.or.kr/home/kor/main.do")

# Handle the alert dialog, if one appears
try:
    WebDriverWait(driver, 5).until(EC.alert_is_present())
    alert = driver.switch_to.alert
    alert.accept()
except Exception:
    pass

# Move to the target page with JavaScript
script = "fn_menu_move('/home/kor/support/projectMng/index.do', '3');"
driver.execute_script(script)

# Authenticate with the Google Drive API
drive_service = authenticate_google_drive()

try:
    gallery_list = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.gallery_list ul.list"))
    )

    for idx in range(10):
        try:
            li_elements = gallery_list.find_elements(By.CSS_SELECTOR, "li > a")
            a_tag = li_elements[idx]
            driver.execute_script("arguments[0].click();", a_tag)

            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )

            editor_view = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "div.editor_view"))
            )

            # Upload the text
            text_content = editor_view.text
            if text_content:
                text_data = text_content.encode('utf-8')
                text_file_name = f"content_{idx + 1}.txt"
                upload_to_drive(drive_service, text_file_name, text_data, "text/plain")

            # Upload the images
            image_elements = editor_view.find_elements(By.TAG_NAME, "img")
            for img_idx, img in enumerate(image_elements):
                img_url = img.get_attribute("src")
                if img_url:
                    img_data = requests.get(img_url).content
                    image_file_name = f"content_{idx + 1}_image_{img_idx + 1}.jpg"
                    upload_to_drive(drive_service, image_file_name, img_data, "image/jpeg")

            driver.back()
            driver.execute_script(script)
            gallery_list = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "div.gallery_list ul.list"))
            )
            time.sleep(random.uniform(1, 3))

        except Exception as e:
            print(f"Error processing link {idx + 1}: {e}")
            break

except Exception as e:
    print(f"Error finding gallery_list or list items: {e}")

driver.quit()

 

🐥 When pushing to GitHub, don't forget to add the JSON files to .gitignore!
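Adding the two lines to .gitignore by hand is enough, but as a small safeguard the check can also be scripted. The sketch below is one possible way to do it: ensure_ignored is a hypothetical helper, and the demonstration runs in a temporary directory rather than a real repository.

```python
import os
import tempfile

SECRET_FILES = ["credentials.json", "service_account.json"]


def ensure_ignored(gitignore_path, entries):
    """Append any entries missing from the .gitignore file; return those added."""
    content = ""
    if os.path.exists(gitignore_path):
        with open(gitignore_path, "r", encoding="utf-8") as f:
            content = f.read()
    existing = {line.strip() for line in content.splitlines()}
    missing = [entry for entry in entries if entry not in existing]
    if missing:
        with open(gitignore_path, "a", encoding="utf-8") as f:
            # Start on a fresh line if the file does not end with a newline.
            if content and not content.endswith("\n"):
                f.write("\n")
            for entry in missing:
                f.write(entry + "\n")
    return missing


# Demonstration in a temporary directory instead of a real repository.
with tempfile.TemporaryDirectory() as repo_dir:
    path = os.path.join(repo_dir, ".gitignore")
    first_run = ensure_ignored(path, SECRET_FILES)
    second_run = ensure_ignored(path, SECRET_FILES)

print(first_run, second_run)
```

Note that .gitignore only affects untracked files; a key file that has already been committed stays in the repository history.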
