{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "IntrotoML-Lecture-6_2.ipynb", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "4vknzA5K1dq2" }, "source": [ "## For more details about the BeautifulSoup documentation go to: [click here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n", "\n", "### Before you begin\n", "If running locally you need to make sure that you have beautifulsoup4 installed. \n", "`conda install beautifulsoup4` or \n", "`pip install beautifulsoup4`\n", "\n", "It should already be installed on colab. \n" ] }, { "cell_type": "markdown", "metadata": { "id": "2E_vSEPJ2nPf" }, "source": [ "# All html documents have structure. Here, we can see a basic html page. " ] }, { "cell_type": "code", "metadata": { "id": "X9wgJCq91drI" }, "source": [ "html_doc = \"\"\"\n", "
<html><head><title>The Dormouse's story</title></head>\n", "<body>\n", "<p class=\"title\"><b>The Dormouse's story</b></p>\n", "\n", "<p class=\"story\">Once upon a time there were three little sisters; and their names were\n", "<a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\">Elsie</a>,\n", "<a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> and\n", "<a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>;\n", "and they lived at the bottom of a well.</p>\n", "\n", "<p class=\"story\">...</p>\n", "\"\"\"" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Tt20VJZp1drO" }, "source": [ "<html><head><title>The Dormouse's story</title></head><body>\n", "<p class=\"story\">Once upon a time there were three little sisters; and their names were\n", "<a href=\"http://example.com/elsie\" class=\"sister\" id=\"link1\">Elsie</a>,\n", "<a href=\"http://example.com/lacie\" class=\"sister\" id=\"link2\">Lacie</a> and\n", "<a href=\"http://example.com/tillie\" class=\"sister\" id=\"link3\">Tillie</a>;\n", "and they lived at the bottom of a well.</p>\n", "<p class=\"story\">...</p>\n", "</body></html>" ] }
\n", "Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.
\n", "...
\n", "" ] }, { "cell_type": "code", "metadata": { "id": "j8GsdPrC1drP" }, "source": [ "from bs4 import BeautifulSoup\n", "\n", "soup = BeautifulSoup(html_doc, 'html.parser')\n", "\n", "print(soup.prettify())\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "WTbE3qYh1drZ" }, "source": [ "### A Retreived Beautiful Soup Object \n", "- Can be parsed via dot notation to travers down the hierarchy by *class name*, *tag name*, *tag type*, etc.\n", "\n" ] }, { "cell_type": "code", "metadata": { "scrolled": true, "id": "Z_fKqIc81drb" }, "source": [ "soup" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "YivtiYOy1dri", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "cf3cd811-6565-4cab-ec29-5135d9006b93" }, "source": [ "#Select the title class.\n", "soup.title\n", " \n" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "The Dormouse's story
" ] }, "metadata": { "tags": [] }, "execution_count": 8 } ] }, { "cell_type": "code", "metadata": { "id": "Q8cZlFCc1dr4", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "9754afa3-a9de-4c9d-f86d-e367293a971e" }, "source": [ "#List the class of the first p tag.\n", "soup.p['class']\n", "\n", "\n" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['title']" ] }, "metadata": { "tags": [] }, "execution_count": 9 } ] }, { "cell_type": "code", "metadata": { "id": "CJG0TbO-1dr9", "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "outputId": "325646ee-9929-4fef-e86c-fb2ffb21fd99" }, "source": [ "#List the class of the first a tag.\n", "soup.a\n", "\n", "soup.a.string\n", "\n", "\n" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'Elsie'" ] }, "metadata": { "tags": [] }, "execution_count": 10 } ] }, { "cell_type": "code", "metadata": { "id": "6vbhMqmr1dsB", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "9efee05d-8ea3-4b64-8a0a-be7f709ca3b1" }, "source": [ "#List all a tags.\n", "soup.find_all('a')\n", "\n" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[Elsie,\n", " Lacie,\n", " Tillie]" ] }, "metadata": { "tags": [] }, "execution_count": 11 } ] }, { "cell_type": "code", "metadata": { "id": "PaERPV3s1dsL" }, "source": [ "vals = soup.find_all(\"a\")\n", "for eachval in vals: \n", " print(eachval.string)\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ZpAKma8p1dsO" }, "source": [ "import requests\n", "#The requests module allows you to send HTTP requests using Python.\n", "response = requests.get(\"https://en.wikipedia.org/robots.txt\")\n", "txt = response.text\n", "print(txt)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "VfJ_AUlF1dsR" }, "source": [ "response = requests.get(\"https://www.rpi.edu\")\n", "txt = response.text\n", "soup = BeautifulSoup(txt, 'html.parser')\n", "\n", "print(soup.prettify())" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "4nZAjkNZ1dsX" }, "source": [ "soup.find_all('a')" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "AJmuDqLd1dsq" }, "source": [ "# Experiment with selecting your own website. Selecting out a url. 
\n", "\n", "response = requests.get(\"http://news.baidu.com/\")\n", "txt = response.text\n", "print(txt)\n", "soup = BeautifulSoup(txt, 'html.parser')\n", "\n", "print(soup.prettify())" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "KscfZ2UBak-e" }, "source": [ "#print the top-5 keywords based on their frequency from the text\r\n", " # associated with href links from the url 'https://www.rpi.edu' \r\n", "#for example: dummy website we will utilize\r\n", "#\"dummy website\" \r\n", "\r\n", "import requests\r\n", "from bs4 import BeautifulSoup\r\n", "import operator\r\n", "\r\n", "#The Robots.txt listing who is allowed.\r\n", "response = requests.get(\"https://www.rpi.edu\")\r\n", "txt = response.text\r\n", "#print(txt)\r\n", "\r\n", "\r\n", "soup = BeautifulSoup(txt, 'html.parser')\r\n", "#If you want all the text: \r\n", "#print(soup.get_text())\r\n", "\r\n", "\r\n", "#print(soup.prettify())\r\n", "\r\n", "diction={}\r\n", "\r\n", "vals = soup.find_all(\"a\")\r\n", "for eachval in vals:\r\n", " str1 = eachval.string\r\n", " print(str1)\r\n", " if str1 is not None:\r\n", " substr = (str1.lower()).split(\" \")\r\n", " for ss in substr:\r\n", " try:\r\n", " diction[ss] +=1\r\n", " except: \r\n", " diction[ss] = 1\r\n", "\r\n", "#print(allwords)\r\n", "#print(diction)\r\n", "sorted_x = sorted(diction.items(), key=operator.itemgetter(1), reverse=True)\r\n", "\r\n", "#print(sorted_x)\r\n", "for k in range(5):\r\n", " print(sorted_x[k][0]+ \": \"+ str(sorted_x[k][1])) \r\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "jRGoXbvMwYkm" }, "source": [ "#### Breakout groups exercise -- 2 problems\r\n", "\r\n" ] }, { "cell_type": "markdown", "metadata": { "id": "4izLql2a4C81" }, "source": [ "#### Q1. \r\n", "\r\n", "In the previous example, we utilized the text associated with \r\n", "anchor tags. Now, utilize all the text present in the url to build a dictionary 'diction1' where keys are the words and values are their corresponding frequencies. Then find the top-10 keywords. \r\n", "\r\n", "Hints:\r\n", "- Break the text as line by line -- '\\n'\r\n", "- Then convert each line or sentence into lowercase\r\n", "- Remove all non-alphanumeric characters except 1 whitespace\r\n", "- then split the sentence into words to do the remaining operations" ] }, { "cell_type": "markdown", "metadata": { "id": "wIJdPiBD6IGj" }, "source": [ "####Q2 \r\n", "\r\n", "Now repeat the above process to build another dictionary 'diction2' using another webpage 'https://lallyschool.rpi.edu/' where keys are words in this webpage and corresponding values are their frequencies. \r\n", "\r\n", "Now build a pandas dataframe df using these two dictionaries with corresponding column names as 'diction1' and 'diction2' with index as the words. " ] } ] }