diff --git a/notebooks/VISDOM/CSV Data to MySQL for use in VISDOM.ipynb b/notebooks/VISDOM/CSV Data to MySQL for use in VISDOM.ipynb index d858b08..2d0586c 100644 --- a/notebooks/VISDOM/CSV Data to MySQL for use in VISDOM.ipynb +++ b/notebooks/VISDOM/CSV Data to MySQL for use in VISDOM.ipynb @@ -542,7 +542,7 @@ { "data": { "text/plain": [ - "" + "" ] }, "execution_count": 13, @@ -812,11 +812,22 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 18, "metadata": { - "collapsed": true + "collapsed": false }, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "accounts_df['site_pair'] = accounts_df['school_city'] + \"_\" + accounts_df['school_site_name']\n", "\n", @@ -856,16 +867,9 @@ "### Creating the meter_data table in the desired format" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note: this is currently treating the meter_uuid, account_uuid, date and zip5 as integers, but they should more likely be treated as varchar, varchar, datetime and varchar, respectively." - ] - }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 19, "metadata": { "collapsed": false }, @@ -873,10 +877,10 @@ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 18, + "execution_count": 19, "metadata": {}, "output_type": "execute_result" } @@ -915,7 +919,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 20, "metadata": { "collapsed": true }, @@ -928,7 +932,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 21, "metadata": { "collapsed": true }, @@ -941,7 +945,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 22, "metadata": { "collapsed": false }, @@ -1354,6 +1358,1230 @@ " print \"failed sql insert. 
meter_uuid:\" + str(df['meter_uuid'][0]) + \", filename: \" + f.split(\"_Pacific\")[0] + \"...\"" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Causes of sql insert errors" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Spot checks of these insert errors suggest that they (or at least many of them) are caused by attempting to insert rows that duplicate the table's primary key, which is a combination of meter_uuid and date. Looking at the csv files for some of these errors shows that some pairs of csv files do contain duplicated data. Below is an example, which covers the first sql insert error noted above:\n", + "\n", + "```\n", + "failed sql insert. meter_uuid:8496493494, filename: 01100176106751_20122013...\n", + "```\n", + "\n", + "This is caused by these two files being identical (this can also be verified with a text diff tool):\n", + "\n", + "```\n", + "PGE_csv/2012-2013/Electricity/01100176106751_20122013_PacificGasElectric_ELECTRIC_20151104.xml_INTERVAL.csv\n", + "PGE_csv/2012-2013/Electricity/01100170130419_20122013_PacificGasElectric_ELECTRIC_20151104.xml_INTERVAL.csv\n", + "```\n", + "\n", + "Digging further into the original xml files suggests that they are not exactly the same (a text diff tool reports differences, but because it treats each xml file as a single line it is difficult to tell how they differ). Inspecting their data trees manually, however, suggests that they may indeed contain the same data." + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " agreement start d1 d2 d3 d4 d5 d6 \\\n", + "0 8496493494 1341126000 29.940 29.244 28.584 28.464 28.332 27.744 \n", + "1 8496493494 1341212400 30.924 29.724 30.984 28.788 29.640 30.420 \n", + "2 8496493494 1341298800 21.504 21.756 23.112 22.032 21.576 21.600 \n", + "3 8496493494 1341385200 5.700 34.416 26.676 24.336 23.856 23.880 \n", + "4 8496493494 1341471600 21.036 20.676 21.420 22.488 20.976 21.276 \n", + "\n", + " d7 d8 ... d87 d88 d89 d90 d91 d92 \\\n", + "0 28.140 29.460 ... 32.856 32.508 30.324 30.360 31.236 32.556 \n", + "1 30.156 31.092 ... 31.812 31.968 29.052 29.040 30.120 29.712 \n", + "2 21.840 21.660 ... 25.776 24.672 23.520 24.732 24.204 19.752 \n", + "3 23.604 23.628 ... 24.732 24.240 22.596 22.704 22.356 22.740 \n", + "4 20.832 20.508 ... 30.072 35.196 25.692 24.360 25.008 30.636 \n", + "\n", + " d93 d94 d95 d96 \n", + "0 31.800 30.696 30.300 30.000 \n", + "1 30.444 29.136 23.712 22.128 \n", + "2 0.000 0.000 0.000 0.000 \n", + "3 22.512 23.544 22.068 21.780 \n", + "4 22.560 23.076 24.720 22.440 \n", + "\n", + "[5 rows x 98 columns]" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "root = \"PGE_csv/2012-2013/Electricity\"\n", + "f = \"01100176106751_20122013_PacificGasElectric_ELECTRIC_20151104.xml_INTERVAL.csv\"\n", + "df = pd.read_csv(os.path.join(root,f), usecols=usecols)\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " agreement start d1 d2 d3 d4 d5 d6 \\\n", + "0 8496493494 1341126000 29.940 29.244 28.584 28.464 28.332 27.744 \n", + "1 8496493494 1341212400 30.924 29.724 30.984 28.788 29.640 30.420 \n", + "2 8496493494 1341298800 21.504 21.756 23.112 22.032 21.576 21.600 \n", + "3 8496493494 1341385200 5.700 34.416 26.676 24.336 23.856 23.880 \n", + "4 8496493494 1341471600 21.036 20.676 21.420 22.488 20.976 21.276 \n", + "\n", + " d7 d8 ... d87 d88 d89 d90 d91 d92 \\\n", + "0 28.140 29.460 ... 32.856 32.508 30.324 30.360 31.236 32.556 \n", + "1 30.156 31.092 ... 31.812 31.968 29.052 29.040 30.120 29.712 \n", + "2 21.840 21.660 ... 25.776 24.672 23.520 24.732 24.204 19.752 \n", + "3 23.604 23.628 ... 24.732 24.240 22.596 22.704 22.356 22.740 \n", + "4 20.832 20.508 ... 30.072 35.196 25.692 24.360 25.008 30.636 \n", + "\n", + " d93 d94 d95 d96 \n", + "0 31.800 30.696 30.300 30.000 \n", + "1 30.444 29.136 23.712 22.128 \n", + "2 0.000 0.000 0.000 0.000 \n", + "3 22.512 23.544 22.068 21.780 \n", + "4 22.560 23.076 24.720 22.440 \n", + "\n", + "[5 rows x 98 columns]" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "root = \"PGE_csv/2012-2013/Electricity\"\n", + "f = \"01100170130419_20122013_PacificGasElectric_ELECTRIC_20151104.xml_INTERVAL.csv\"\n", + "df = pd.read_csv(os.path.join(root,f), usecols=usecols)\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To see how widespread this duplication in the underlying csv data is in causing the sql insert errors, consider comparisons of all of their top lines via pandas and identifying duplicates as shown below. This suggests that 353 of the 383 sql insert errors may indeed be caused by duplication in the csv data. The identified duplicates pertain to 165 different meter_uuids." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "top_lines = []\n", + "file_names = []\n", + "for root, dirs, files in os.walk(csv_dir):\n", + " for f in files:\n", + " if f.endswith('_INTERVAL.csv'):\n", + " df = pd.read_csv(os.path.join(root,f), usecols=usecols)\n", + " if len(df) > 0:\n", + " file_names.append(f)\n", + " top_lines.append(df.to_dict(orient='records')[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " agreement d1 d10 d11 d12 d13 d14 d15 \\\n", + "0 8.496493e+09 29.9400 28.992 27.24 28.776 28.344 29.1720 27.6600 \n", + "1 8.496493e+09 29.9400 28.992 27.24 28.776 28.344 29.1720 27.6600 \n", + "2 5.637200e+09 5.5600 5.160 5.08 5.000 5.200 5.0800 5.1200 \n", + "3 5.637200e+09 0.2400 0.240 0.24 0.360 0.240 0.2400 0.3600 \n", + "4 5.637199e+09 7.2336 7.320 7.32 7.128 7.128 7.0464 7.0464 \n", + "\n", + " d16 d17 ... d89 d9 d90 d91 d92 \\\n", + "0 28.8720 27.8640 ... 30.324 27.6600 30.36 31.236 32.556 \n", + "1 28.8720 27.8640 ... 30.324 27.6600 30.36 31.236 32.556 \n", + "2 5.0800 5.0800 ... 5.440 5.1600 5.48 5.400 5.400 \n", + "3 0.2400 0.2400 ... 0.360 0.3600 0.24 0.360 0.240 \n", + "4 6.9568 6.9568 ... 0.000 7.4144 0.00 0.000 0.000 \n", + "\n", + " d93 d94 d95 d96 start \n", + "0 31.80 30.696 30.30 30.00 1.341126e+09 \n", + "1 31.80 30.696 30.30 30.00 1.341126e+09 \n", + "2 5.44 5.400 5.32 5.36 1.341126e+09 \n", + "3 0.36 0.240 0.24 0.36 1.341126e+09 \n", + "4 0.00 0.000 0.00 0.00 1.355990e+09 \n", + "\n", + "[5 rows x 98 columns]" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "top_lines_df = pd.DataFrame.from_records(top_lines)\n", + "top_lines_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " agreement d1 d10 d11 d12 d13 d14 d15 \\\n", + "0 8.496493e+09 29.9400 28.992 27.24 28.776 28.344 29.1720 27.6600 \n", + "1 8.496493e+09 29.9400 28.992 27.24 28.776 28.344 29.1720 27.6600 \n", + "2 5.637200e+09 5.5600 5.160 5.08 5.000 5.200 5.0800 5.1200 \n", + "3 5.637200e+09 0.2400 0.240 0.24 0.360 0.240 0.2400 0.3600 \n", + "4 5.637199e+09 7.2336 7.320 7.32 7.128 7.128 7.0464 7.0464 \n", + "\n", + " d16 d17 ... \\\n", + "0 28.8720 27.8640 ... \n", + "1 28.8720 27.8640 ... \n", + "2 5.0800 5.0800 ... \n", + "3 0.2400 0.2400 ... \n", + "4 6.9568 6.9568 ... \n", + "\n", + " d9 d90 d91 d92 d93 d94 d95 d96 start \\\n", + "0 27.6600 30.36 31.236 32.556 31.80 30.696 30.30 30.00 1.341126e+09 \n", + "1 27.6600 30.36 31.236 32.556 31.80 30.696 30.30 30.00 1.341126e+09 \n", + "2 5.1600 5.48 5.400 5.400 5.44 5.400 5.32 5.36 1.341126e+09 \n", + "3 0.3600 0.24 0.360 0.240 0.36 0.240 0.24 0.36 1.341126e+09 \n", + "4 7.4144 0.00 0.000 0.000 0.00 0.000 0.00 0.00 1.355990e+09 \n", + "\n", + " filename \n", + "0 01100170130419_20122013_PacificGasElectric_ELE... \n", + "1 01100176106751_20122013_PacificGasElectric_ELE... \n", + "2 01612340000000_20122013_PacificGasElectric_ELE... \n", + "3 01612340116301_20122013_PacificGasElectric_ELE... \n", + "4 01612340130054_20122013_PacificGasElectric_ELE... \n", + "\n", + "[5 rows x 99 columns]" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "top_lines_df['filename'] = file_names\n", + "top_lines_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " agreement d1 d10 d11 d12 d13 d14 d15 \\\n", + "0 8.496493e+09 29.94 28.992 27.240 28.776 28.344 29.172 27.660 \n", + "1 8.496493e+09 29.94 28.992 27.240 28.776 28.344 29.172 27.660 \n", + "5 5.637199e+09 4.96 5.600 5.600 5.680 5.440 5.600 5.520 \n", + "6 5.637199e+09 4.96 5.600 5.600 5.680 5.440 5.600 5.520 \n", + "18 2.657846e+09 0.00 4.368 3.816 3.640 3.152 3.544 4.664 \n", + "\n", + " d16 d17 ... d9 \\\n", + "0 28.872 27.864 ... 27.66 \n", + "1 28.872 27.864 ... 27.66 \n", + "5 5.680 5.600 ... 5.52 \n", + "6 5.680 5.600 ... 5.52 \n", + "18 4.536 3.616 ... 4.08 \n", + "\n", + " d90 d91 d92 d93 d94 d95 d96 start \\\n", + "0 30.36 31.236 32.556 31.80 30.696 30.30 30.00 1.341126e+09 \n", + "1 30.36 31.236 32.556 31.80 30.696 30.30 30.00 1.341126e+09 \n", + "5 5.44 5.760 5.600 5.36 4.720 4.32 4.32 1.341126e+09 \n", + "6 5.44 5.760 5.600 5.36 4.720 4.32 4.32 1.341126e+09 \n", + "18 NaN NaN NaN NaN NaN NaN NaN 1.365664e+09 \n", + "\n", + " filename \n", + "0 01100170130419_20122013_PacificGasElectric_ELE... \n", + "1 01100176106751_20122013_PacificGasElectric_ELE... \n", + "5 01612340130484_20122013_PacificGasElectric_ELE... \n", + "6 01612340135426_20122013_PacificGasElectric_ELE... \n", + "18 03739810000000_20122013_PacificGasElectric_ELE... 
\n", + "\n", + "[5 rows x 99 columns]" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "duplicated_top_lines = top_lines_df[top_lines_df.duplicated(subset=usecols, keep=False)]\n", + "duplicated_top_lines.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "(645, 292, 165)" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(duplicated_top_lines), len(duplicated_top_lines[['agreement','start']].drop_duplicates()), len(duplicated_top_lines[['agreement']].drop_duplicates()), " + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "353" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(duplicated_top_lines) - len(duplicated_top_lines[['agreement','start']].drop_duplicates())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that there are 383 failed inserts above, so these 353 duplicated top rows could explain most of them.\n", + "\n", + "The first 5 failures are:\n", + "\n", + "```\n", + "failed sql insert. meter_uuid:8496493494, filename: 01100176106751_20122013...\n", + "failed sql insert. meter_uuid:5637199274, filename: 01612340135426_20122013...\n", + "failed sql insert. meter_uuid:5741361055, filename: 04615230123687_20122013...\n", + "failed sql insert. meter_uuid:1626781530, filename: 06616220630038_20122013...\n", + "failed sql insert. meter_uuid:6449693283, filename: 10739651030402_20122013...\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " agreement filename\n", + "0 8.496493e+09 01100170130419_20122013_PacificGasElectric_ELE...\n", + "1 8.496493e+09 01100176106751_20122013_PacificGasElectric_ELE...\n", + "5 5.637199e+09 01612340130484_20122013_PacificGasElectric_ELE...\n", + "6 5.637199e+09 01612340135426_20122013_PacificGasElectric_ELE...\n", + "18 2.657846e+09 03739810000000_20122013_PacificGasElectric_ELE...\n", + "26 2.657846e+09 03739816107395_20122013_PacificGasElectric_ELE...\n", + "27 5.741361e+09 04615230000000_20122013_PacificGasElectric_ELE...\n", + "29 5.741361e+09 04615230123687_20122013_PacificGasElectric_ELE...\n", + "32 2.084624e+09 06616220118729_20122013_PacificGasElectric_ELE...\n", + "33 2.084624e+09 06616220630038_20122013_PacificGasElectric_ELE..." + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "duplicated_top_lines[['agreement','filename']].head(10)" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -1363,7 +2591,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 23, "metadata": { "collapsed": false }, @@ -1496,7 +2724,7 @@ "[3 rows x 100 columns]" ] }, - "execution_count": 22, + "execution_count": 23, "metadata": {}, "output_type": "execute_result" } @@ -1507,7 +2735,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 24, "metadata": { "collapsed": false }, @@ -1537,7 +2765,7 @@ "0 1286757" ] }, - "execution_count": 23, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } @@ -1557,7 +2785,7 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": 25, "metadata": { "collapsed": false }, @@ -1565,10 +2793,10 @@ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 24, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" } @@ -1642,7 +2870,7 @@ { "data": { "text/plain": [ - "" + "" ] }, "execution_count": 27,