{ "cells": [ { "cell_type": "markdown", "metadata": { "exercise": "task", "tags": [ "task" ] }, "source": [ "# Data Analysis and Plotting in Python with Pandas\n", "\n", " _Carolin Penke, J\u00fclich Supercomputing Centre, Forschungszentrum J\u00fclich, 23 October 2024_\n", " _Based on material by Andreas Herten_" ] }, { "cell_type": "markdown", "metadata": { "exercise": "onlytask", "slideshow": { "slide_type": "skip" } }, "source": [ "**Version: Tasks**" ] }, { "cell_type": "markdown", "metadata": { "exercise": "task", "slideshow": { "slide_type": "fragment" } }, "source": [ "## Task Outline\n", "\n", "* [Task 1](#task1)\n", "* [Task 2](#task2)\n", "* [Task 3](#task3)\n", "* [Task 4](#task4)\n", "* [Task 5](#task5)\n", "* [Task 6](#task6)\n", "* [Task 7](#task7)\n", "* [Task 7B](#task7b)\n", "* [Task 8](#task8)\n", "* [Task 8B](#task8b)" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "exercise": "task", "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "exercise": "task", "slideshow": { "slide_type": "slide" } }, "source": [ "## Task 1\n", "\n", "TASK\n", "\n", "* Create data frame with\n", " - 6 names of dinosaurs, \n", " - their favourite prime number, \n", " - and their favorite color.\n", "* Play around with the frame\n", "* Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d" ] }, { "cell_type": "markdown", "metadata": { "exercise": "nopresentation", "slideshow": { "slide_type": "skip" } }, "source": [ "Jupyter Notebook 101:\n", "\n", "* Execute cell: `shift+enter`\n", "* New cell in front of current cell: `a`\n", "* New cell after current cell: `b`" ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "exercise": "task", "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "happy_dinos = {\n", " \"Dinosaur Name\": [],\n", " \"Favourite Prime\": [],\n", " \"Favourite Color\": []\n", "}\n", "#df_dinos 
= " ] }, { "cell_type": "markdown", "metadata": { "exercise": "task", "slideshow": { "slide_type": "slide" } }, "source": [ "## Task 2\n", "\n", "TASK\n", "\n", "* Read in `data-nest.csv` to `DataFrame`; call it `df` \n", " *(Data was produced with [JUBE](http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/JUBE/_node.html))*\n", "* Get to know it and play a bit with it\n", "* Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d" ] }, { "cell_type": "code", "execution_count": 151, "metadata": { "exercise": "task" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id,Nodes,Tasks/Node,Threads/Task,Runtime Program / s,Scale,Plastic,Avg. Neuron Build Time / s,Min. Edge Build Time / s,Max. Edge Build Time / s,Min. Init. Time / s,Max. Init. Time / s,Presim. Time / s,Sim. Time / s,Virt. Memory (Sum) / kB,Local Spike Counter (Sum),Average Rate (Sum),Number of Neurons,Number of Connections,Min. Delay,Max. Delay\r\n", "5,1,2,4,420.42,10,true,0.29,88.12,88.18,1.14,1.20,17.26,311.52,46560664.00,825499,7.48,112500,1265738500,1.5,1.5\r\n", "5,1,4,4,200.84,10,true,0.15,46.03,46.34,0.70,1.01,7.87,142.97,46903088.00,802865,7.03,112500,1265738500,1.5,1.5\r\n", "5,1,2,8,202.15,10,true,0.28,47.98,48.48,0.70,1.20,7.95,142.81,47699384.00,802865,7.03,112500,1265738500,1.5,1.5\r\n", "5,1,4,8,89.57,10,true,0.15,20.41,23.21,0.23,3.04,3.19,60.31,46813040.00,821491,7.23,112500,1265738500,1.5,1.5\r\n", "5,2,2,4,164.16,10,true,0.20,40.03,41.09,0.52,1.58,6.08,114.88,46937216.00,802865,7.03,112500,1265738500,1.5,1.5\r\n", "5,2,4,4,77.68,10,true,0.13,20.93,21.22,0.16,0.46,3.12,52.05,47362064.00,821491,7.23,112500,1265738500,1.5,1.5\r\n", "5,2,2,8,79.60,10,true,0.20,21.63,21.91,0.19,0.47,2.98,53.12,46847168.00,821491,7.23,112500,1265738500,1.5,1.5\r\n", "5,2,4,8,37.20,10,true,0.13,10.08,11.60,0.10,1.63,1.24,23.29,47065232.00,818198,7.33,112500,1265738500,1.5,1.5\r\n", 
"5,3,2,4,96.51,10,true,0.15,26.54,27.41,0.36,1.22,3.33,64.28,52256880.00,813743,7.27,112500,1265738500,1.5,1.5\r\n" ] } ], "source": [ "!head data-nest.csv" ] }, { "cell_type": "markdown", "metadata": { "exercise": "task", "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Task 3\n", "\n", "TASK\n", "\n", "* Add a column to the Nest data frame form Task 2 called `Threads` which is the total number of threads across all nodes (i.e. the product of threads per task and tasks per node and nodes)\n", "* Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d" ] }, { "cell_type": "code", "execution_count": 182, "metadata": { "exercise": "task", "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": { "exercise": "task", "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Task 4\n", "\n", "TASK\n", "\n", "\n", "* Sort the Nest data frame by threads\n", "* Plot `\"Presim. Time / s\"` and `\"Sim. Time / s\"` of our data frame `df` as a function of threads\n", "* Use a dashed, red line for `\"Presim. Time / s\"`, a blue line for `\"Sim. Time / s\"` (see [API description](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot))\n", "* Don't forget to label your axes and to add a legend _(1st rule of plotting)_\n", "* Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d" ] }, { "cell_type": "markdown", "metadata": { "exercise": "task", "slideshow": { "slide_type": "slide" } }, "source": [ "## Task 5\n", "\n", "TASK\n", "\n", "Use the Nest data frame `df` to:\n", "\n", "1. Make threads index of the data frame (`.set_index()`)\n", "2. Plot `\"Presim. Time / s\"` and `\"Sim. Time / s`\" individually\n", "3. Plot them onto one common canvas!\n", "4. Make them have the same line colors and styles as before\n", "5. Add a legend, add missing axes labels\n", "6. 
Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d" ] }, { "cell_type": "markdown", "metadata": { "exercise": "task", "slideshow": { "slide_type": "slide" } }, "source": [ "## Task 6\n", "\n", "TASK\n", "\n", "* To your `df` Nest data frame, add a column with the unaccounted time (`Unaccounted Time / s`), which is the program runtime minus the sum of average neuron build time, minimal edge build time, minimal initialization time, presimulation time, and simulation time. \n", "(*I know this is technically not super correct, but it will do for our example.*)\n", "* Plot a stacked bar plot of all these columns (except for program runtime) over the threads\n", "* Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d" ] }, { "cell_type": "markdown", "metadata": { "exercise": "task", "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Task 7\n", "\n", "TASK\n", "\n", "* Create a pivot table based on the Nest `df` data frame\n", "* Let the `x` axis show the number of nodes; display the values of the simulation time `\"Sim. Time / s\"` for the tasks per node and threads per task configurations\n", "* Please plot a bar plot\n", "* Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d" ] }, { "cell_type": "markdown", "metadata": { "exercise": "task", "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "## Task 7B (like Bonus)\n", "\n", "TASK\n", "\n", "- Same pivot table as before (that is, `x` with nodes, and columns for Tasks/Node and Threads/Task)\n", "- But now, use `Sim. Time / s` and `Presim. Time / s` as values to show\n", "- Show them as a **stack** of those two values inside the pivot table\n", "- Use pandas' functionality as much as possible!" 
] }, { "cell_type": "markdown", "metadata": { "exercise": "task", "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Task 8 (Super Bonus)\n", "\n", "TASK\n", "\n", "* Create bar chart of top 10 actors (on `x`) and average ratings of their top movies (`y`) based on IMDb data (only if they play in at least two movies)" ] }, { "cell_type": "markdown", "metadata": { "exercise": "task", "slideshow": { "slide_type": "fragment" }, "tags": [] }, "source": [ "* IMDb provides data sets at [datasets.imdbws.com](https://datasets.imdbws.com)\n", "* Can directly be loaded like\n", "```python\n", "pd.read_table('https://datasets.imdbws.com/dataset.tsv.gz', sep=\"\\t\", low_memory=False, na_values=[\"\\\\N\",\"nan\"])\n", "```\n", "* Needed:\n", " * `name.basics.tsv.gz` (for names of actors and movies they are known for)\n", " * `title.ratings.tsv.gz` (for ratings of titles)\n", "* Strategy _suggestions_:\n", " * Use `df.apply()` with custom function\n", " * Custom function: Compute average rating and determine if this entry is eligible for plotting (this _can_ be done at once, but does not need to be)\n", " * Average rating: Look up title IDs as listed in `knownForTitles` in titles dataframe" ] }, { "cell_type": "markdown", "metadata": { "exercise": "task", "slideshow": { "slide_type": "slide" }, "tags": [] }, "source": [ "## Task 8B (Bonuseption)\n", "\n", "TASK\n", "\n", "All of the following are ideas for unique sub-tasks, which can be done individually\n", "* In addition to Task 8, restrict the top titles to those with more than 10000 votes\n", "* For 30 top-rated actors, plot rating vs. age\n", "* For 30 top-rated actors, plot rating vs. 
average runtime of the known-for-titles (using `title.basics.tsv.gz`)" ] }, { "cell_type": "markdown", "metadata": { "exercise": "task" }, "source": [ "Feedback to a.herten@fz-juelich.de\n", "\n", "_Next slide: Further reading_" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" }, "toc-autonumbering": false, "toc-showcode": false, "toc-showmarkdowntxt": false, "toc-showtags": true }, "nbformat": 4, "nbformat_minor": 4 }