Scraping

김아무개 · May 30, 2023
Spring Boot 🍃

Scraping is the act of pulling information out of a website.

Scraping the wrong way can get you sued, so be careful.

/robots.txt

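This is a file in which the site you want to scrape lists the URIs of data it does not want taken.
Rules in the file are grouped per User-agent.
At a minimum, do not access any path listed under Disallow in the group that applies to your client.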
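
As a rough illustration of the Disallow check, here is a minimal sketch using only the Java standard library. The class name `RobotsSketch`, its method names, and the sample rules are all made up for this example. It only reads the `User-agent: *` group and does plain prefix matching, so `Allow` lines and wildcard patterns are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RobotsSketch {

    // Collects the Disallow path prefixes from the "User-agent: *" group(s) of a robots.txt body.
    // Simplification: Allow rules and wildcard patterns ("*", "$") are not handled.
    public static List<String> disallowedForAll(String robotsTxt) {
        List<String> disallowed = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;
        for (String rawLine : robotsTxt.split("\\R")) {
            // Drop comments and surrounding whitespace.
            String line = rawLine.replaceAll("#.*", "").trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or not a "field: value" line
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean isWildcard = value.equals("*");
                // Consecutive User-agent lines share one group of rules.
                inWildcardGroup = lastWasUserAgent ? (inWildcardGroup || isWildcard) : isWildcard;
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
        return disallowed;
    }

    // A path is treated as allowed when it starts with none of the disallowed prefixes.
    public static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical robots.txt content, not copied from any real site.
        String sample = String.join("\n", Arrays.asList(
                "User-agent: *",
                "Disallow: /private/",
                "Disallow: /search",
                "",
                "User-agent: SomeBot",
                "Disallow: /"));
        List<String> rules = disallowedForAll(sample);
        System.out.println(rules);                           // [/private/, /search]
        System.out.println(isAllowed("/search?q=x", rules)); // false
        System.out.println(isAllowed("/quote/AAPL", rules)); // true
    }
}
```

For real use, a parser that follows the full Robots Exclusion Protocol would be the safer choice. This sketch is only meant to make the idea concrete.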

How to find robots.txt: append /robots.txt to the site's root URL

e.g. https://finance.yahoo.com/robots.txt
e.g. https://finance.naver.com/robots.txt
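
If you only have the URL of some page on the site, the robots.txt location can be derived from it. This is a small sketch with `java.net.URI`. The class and method names are made up for this example.

```java
import java.net.URI;

public class RobotsLocation {

    // robots.txt sits at the root of the host, so resolving "/robots.txt"
    // against any page URL keeps the scheme and host and drops the path and query.
    public static String robotsUrlFor(String pageUrl) {
        return URI.create(pageUrl).resolve("/robots.txt").toString();
    }

    public static void main(String[] args) {
        System.out.println(robotsUrlFor("https://search.naver.com/search.naver?where=view"));
        // prints https://search.naver.com/robots.txt
    }
}
```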

Scraping example using Jsoup

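import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

String url = "https://search.naver.com/search.naver?where=view&sm=tab_jum&query=%EC%8A%A4%ED%81%AC%EB%9E%98%ED%95%91";
try {
    Connection connection = Jsoup.connect(url);

    // Specify a user agent to avoid a 403 error
    connection.userAgent("Mozilla/5.0");

    Document document = connection.get(); // send a GET request and parse the response

    Elements elements = document.getElementsByClass("lst_total"); // the result <ul>
    if (elements.isEmpty()) {
        // Guard against get(0) failing when the page markup has changed
        throw new IllegalStateException("No element with class lst_total found");
    }

    Element ul = elements.get(0); // first matching <ul>

    for (Element li : ul.children()) {
        // Title element inside each <li>
        Elements a = li.getElementsByClass("total_tit");

        String txt = a.text();
        System.out.println(txt);
    }
} catch (IOException e) {
    throw new RuntimeException(e);
}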
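
The `query` parameter in the URL above is just the search keyword "스크래핑" percent-encoded as UTF-8. Here is a small sketch of building the same URL from a keyword with the standard library. The class and method names are made up, and `URLEncoder.encode(String, Charset)` assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrlSketch {

    // Builds the Naver VIEW search URL used in the example for an arbitrary keyword.
    // Note: URLEncoder does form encoding, so a space in the keyword becomes "+".
    public static String viewSearchUrl(String keyword) {
        String encoded = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        return "https://search.naver.com/search.naver?where=view&sm=tab_jum&query=" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(viewSearchUrl("스크래핑"));
        // prints https://search.naver.com/search.naver?where=view&sm=tab_jum&query=%EC%8A%A4%ED%81%AC%EB%9E%98%ED%95%91
    }
}
```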

Result

Only the titles are scraped from the Naver VIEW search results for "스크래핑" (scraping).

신한 쏠 비대면 서류제출 스크래핑 리얼리뷰
1) 내집마련 신혼부부 디딤돌대출 후기 (자격확인,주택금융공사 스크래핑, 서류준비)
[부동산정책] 특례보금자리론금리, 신청 방법, 확인사항 feat.스크래핑가능서류정리
[단독]"스크래핑 금지 땐 사업 차질"…마이데이터 관련 행정처분 유예 요청
스크래핑 진행으로 무서류대출가능한곳 찾아주셔서 감사합니다
[리스틀리] 웹 크롤링 툴 :: 클릭 몇 번으로 웹 데이터 엑셀 수집 - 웹스크래핑 ???
하나은행 단독 청년내일저축계좌 가입 자격조건 확인(모바일 스크래핑)
비트릭스, ‘스크린 스크래핑’을 활용한 원화 입출금 솔루션 제공
[신용대출]하나원큐 신용대출 한도조회 스크래핑 오류 해결법 총정리 (하나은행 모바일 앱)
광주은행 모바일프라임론 신용대출 후기와 스크래핑 조건
개인사업자신용대출서류 스크래핑해서 올라가는데요.
디딤돌스크래핑
특례보금자리론 신청 방법, 스크래핑 오류 대응 방법
특례보금자리론 다가구주택 신청 스크래핑, 감정평가 안내
스크래핑보드에 신용카드 연동
특례 보금자리론(전세보증금 반환 용도) 신청, 서류 스크래핑 오류 및 해결 방법
세무사랑 전자세금계산서 스크래핑 ㅣ 원클릭택스 부가세 실무
카드 매입 스크래핑 기능 질문드립니다.
세무사랑 프로그램으로 4대보험 자료 당겨와서 급여대장에 반영하기 (스크래핑)
2화. [대출후기1] 특례보금자리론, 대출까지 얼마나 걸릴까?/주택공사어플, 스크래핑, 온라인신청
특례보금자리론 스크래핑 및 서류 제출에 대한 모든것
RPA에서 스크래핑을 잘 활용하려면
RPA 스크래핑의 뜻
지능 스크래핑 도구 ScrapeStorm과 LISTLY의 기능 비교 분석
필요 서류, 주택금융공사, 신한은행, 스크래핑, 과정 요약, 잔금일, 매매 계약서, 신혼부부, 고정금리...
기웅정보통신, 스크래핑 기반 ‘보험금 자동청구’ 특허 취득
[세무사랑] 스크래핑(원클릭택스)
RPA 스크래핑 기능중에 이미지 추출 가능한가요?
토스 핀크 비상금대출 스크래핑 거절나면 어디로 알아봐야할까?
주택금융공사 소득스크래핑 미제출시

Process finished with exit code 0

It turned out there are many different ways to get Elements out of the Document fetched through the Connection.

Just pick whichever one fits the items you need when scraping! 🙈
