{"id":2148,"date":"2018-07-30T16:14:37","date_gmt":"2018-07-30T16:14:37","guid":{"rendered":"http:\/\/blogs.oregonstate.edu\/gemmlab\/?p=2148"},"modified":"2018-07-30T16:26:44","modified_gmt":"2018-07-30T16:26:44","slug":"big-data-big-possibilities-with-bigger-challenges","status":"publish","type":"post","link":"https:\/\/blogs.oregonstate.edu\/gemmlab\/2018\/07\/30\/big-data-big-possibilities-with-bigger-challenges\/","title":{"rendered":"Big Data: Big possibilities with bigger challenges"},"content":{"rendered":"<p><strong>By <a href=\"https:\/\/mmi.oregonstate.edu\/people\/alexa-kownacki\">Alexa Kownacki<\/a>, Ph.D. Student, OSU Department of Fisheries and Wildlife, Geospatial Ecology of Marine Megafauna Lab<\/strong><\/p>\n<p>Did you know that Excel has a maximum number of rows? I do. During Winter Term, for my GIS project, I was using Excel to merge oceanographic data from a publicly available website, and Excel kept quitting. Naturally, I assumed I had caused some sort of computer error. [As an aside, I\u2019ve concluded that most technology problems come down to human error.] So I tried reformatting the data, restarting my computer, restarting the program, etc. Nothing. Then, thanks to the magic of Google, I discovered that Excel allows no more than 1,048,576 rows by 16,384 columns. ONLY 1.05 million rows?! The oceanography data was more than 3 million rows, and that\u2019s after I eliminated data points. This is what happens when we\u2019re dealing with big data.<\/p>\n<p>According to the Merriam-Webster dictionary, big data is an accumulation of data that is too large and complex for processing by traditional database management tools (<a href=\"http:\/\/www.merriam-webster.com\">www.merriam-webster.com<\/a>). 
However, there are articles, like <a href=\"https:\/\/www.forbes.com\/sites\/gilpress\/2014\/09\/03\/12-big-data-definitions-whats-yours\/#2d2a308d13ae\">this one<\/a> from <em>Forbes<\/em>, that discuss the ongoing debate over how to define \u201cbig data\u201d. According to the article, there are 12 major definitions, so I\u2019ll let you decide what qualifies as \u201cbig data\u201d. Either way, I think that when Excel reaches its maximum row capacity, I\u2019m working with big data.<\/p>\n<figure id=\"attachment_2159\" aria-describedby=\"caption-attachment-2159\" style=\"width: 660px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blogs.oregonstate.edu\/gemmlab\/files\/2018\/07\/IMG_9239-e1532967993146.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-2159\" src=\"http:\/\/blogs.oregonstate.edu\/gemmlab\/files\/2018\/07\/IMG_9239-834x1024.jpg\" alt=\"\" width=\"660\" height=\"810\" \/><\/a><figcaption id=\"caption-attachment-2159\" class=\"wp-caption-text\">Collecting oceanography data aboard the R\/V Shimada. Photo source: Alexa K.<\/figcaption><\/figure>\n<p>Here\u2019s the thing: the oceanography data that I referred to was just a snippet of my data. Technically, it\u2019s not even MY data; it&#8217;s data I accessed from <a href=\"https:\/\/coastwatch.pfeg.noaa.gov\/erddap\/index.html\">NOAA\u2019s ERDDAP website<\/a> that had been consistently observed over the time frame of <a href=\"https:\/\/mmi.oregonstate.edu\/gemm-lab\/comparative-health-assessment-bottlenose-dolphin-ecotypes-california\">my dolphin data points<\/a>. You may recall <a href=\"http:\/\/blogs.oregonstate.edu\/gemmlab\/2018\/03\/13\/land-maps-charts-geospatial-ecology\/\">my blog post about maps and geospatial analysis<\/a>, which highlights some of the reasons these variables, such as temperature and salinity, are important. However, what I didn\u2019t mention then was that I spent weeks editing this NOAA data. 
My project on common bottlenose dolphins overlays environmental variables to better understand dolphin population health off California. These variables should have spatiotemporal attributes similar to those of the dolphin data I\u2019m working with, which has a time series beginning in the 1980s. Even without taking out a calculator, I know that equates to a lot of data. Great data: data that will let me answer interesting, pertinent questions. But big data nonetheless.<\/p>\n<p>This is a screenshot of what the oceanography data looked like when I downloaded it to Excel. This format repeats for nearly 3 million rows.<\/p>\n<figure id=\"attachment_2151\" aria-describedby=\"caption-attachment-2151\" style=\"width: 660px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blogs.oregonstate.edu\/gemmlab\/files\/2018\/07\/Screen-Shot-2018-02-15-at-11.01.49-PM-e1532965481515.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-2151\" src=\"http:\/\/blogs.oregonstate.edu\/gemmlab\/files\/2018\/07\/Screen-Shot-2018-02-15-at-11.01.49-PM-1024x553.png\" alt=\"\" width=\"660\" height=\"356\" \/><\/a><figcaption id=\"caption-attachment-2151\" class=\"wp-caption-text\">Excel Screen Shot. Image source: Alexa K.<\/figcaption><\/figure>\n<p>I showed this Excel spreadsheet to my GIS professor, and his response was something akin to \u201choly smokes\u201d, with a few more expletives and a look of horror. It was not the sheer number of rows that shocked him; it was the data format. Nowadays, nearly everyone works with big data. It\u2019s par for the course. However, the way data is formatted is the major split between what I\u2019ll call \u201ceasy\u201d data and \u201chard\u201d data. The oceanography data could have been \u201ceasy\u201d data. It could have listed its many variables in columns. Instead, this data alternated between rows with variable headings and columns with variable headings, for millions of cells. 
And, as described earlier, this is only one example of big data and its challenges.<\/p>\n<p>Data does not always come as text and numbers; sometimes it appears as media such as photographs, videos, and audio files. Big data just got a whole lot bigger. While I was working as a scientist at <a href=\"https:\/\/swfsc.noaa.gov\/\">NOAA\u2019s Southwest Fisheries Science Center<\/a>, one project brought in over 80 terabytes of raw data per year. <a href=\"https:\/\/www.fisheries.noaa.gov\/feature-story\/automatic-whale-detector-version-10\">The project<\/a> centered on the eastern North Pacific gray whale population and, more specifically, its migration. Scientists have observed the gray whale migration annually since 1994 from <a href=\"http:\/\/www.piedrasblancas.org\/index.html\">Piedras Blancas Light Station<\/a> for the northbound migration, and 2 out of every 5 years from <a href=\"http:\/\/www.granitecanyon.org\/\">Granite Canyon Field Station<\/a> (GCFS) for the southbound migration. One of my roles was to ground-truth software that would help transition from humans as observers to computers as observers. One question we assessed was how well a computer \u201ccounted\u201d whales compared to people. To answer it, three infrared cameras at the GCFS recorded during the same time span that human observers were counting the migrating whales. Next, scientists, myself included, would transfer those video files, upwards of 80 TB, from the hard drives to Synology boxes and to a different facility miles away. Synology boxes store arrays of hard drives that can be accessed remotely. To review: three locations, each holding 80 TB of the same raw data. Once the data was saved in triplicate, I could run a computer program to detect whales. In summary, three months of recorded infrared video files require upwards of 240 TB of storage before processing. 
This is big data.<\/p>\n<figure style=\"width: 2048px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/scontent-lax3-2.xx.fbcdn.net\/v\/t31.0-8\/12622189_10153360956907584_6429535768999151180_o.jpg?_nc_cat=0&amp;oh=3b6f79cea4c71bd191c8c21dec60ee05&amp;oe=5C04056C\" alt=\"\" width=\"2048\" height=\"1278\" \/><figcaption class=\"wp-caption-text\">Scientists on an observation shift at Granite Canyon Field Station in Northern California. Photo source: Alexa K.<\/figcaption><\/figure>\n<figure style=\"width: 2048px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"spotlight\" src=\"https:\/\/scontent-lax3-2.xx.fbcdn.net\/v\/t31.0-8\/18738702_10154606610482584_6662076668239505992_o.jpg?_nc_cat=0&amp;oh=b3d3f1d30ff3e40c08887c7d4604676f&amp;oe=5BD36ACC\" alt=\"\" width=\"2048\" height=\"1262\" \/><figcaption class=\"wp-caption-text\">Alexa and another NOAA scientist watching for gray whales at Piedras Blancas Light Station. Photo source: Alexa K.<\/figcaption><\/figure>\n<p>In the <a href=\"https:\/\/mmi.oregonstate.edu\/gemm-lab\">GEMM Laboratory<\/a>, we have so many sources of data that I did not bother trying to count. I\u2019m entering my second year of the Ph.D. program and I already have a hard drive of data that I\u2019ve backed up in three different locations. It\u2019s no longer a matter of \u201cif\u201d you work with big data; it\u2019s \u201chow\u201d. How will you format the data? How will you store the data? How will you maintain back-ups of the data? How will you share this data with collaborators\/funders\/the public?<\/p>\n<p>The wonderful aspect of big data is in the name: big and data. The scientific community can answer more in-depth, challenging questions because it has access to data, and more of it. Data is often the limiting factor in what researchers can do, because a larger sample size allows more questions to be asked and gives greater confidence in the results. 
That, and <a href=\"https:\/\/securelb.imodules.com\/s\/359\/foundation\/index.aspx?sid=359&amp;gid=34&amp;pgid=1982&amp;bledit=1&amp;cid=3007&amp;dids=451&amp;x=51&amp;y=12\">funding of course<\/a>. It\u2019s the reason why when you see GEMM Lab members in the field, we\u2019re not only using drones to capture aerial images of whales, we\u2019re taking fecal, biopsy, and phytoplankton samples. We\u2019re recording the location, temperature, water conditions, wind conditions, cloud cover, date\/time, water depth, and so much more. Because all of this data will help us and help other scientists answer critical questions. Thus, to my fellow scientists, I feel your pain and I applaud you, because I too know that the challenges that come with big data are worth it. And, to the non-scientists out there, hopefully this gives you some insight as to why we scientists ask for external hard drives as gifts.<\/p>\n<figure id=\"attachment_2153\" aria-describedby=\"caption-attachment-2153\" style=\"width: 660px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/blogs.oregonstate.edu\/gemmlab\/files\/2018\/07\/IMG_4747-e1532966347744.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-2153\" src=\"http:\/\/blogs.oregonstate.edu\/gemmlab\/files\/2018\/07\/IMG_4747-768x1024.jpg\" alt=\"\" width=\"660\" height=\"880\" \/><\/a><figcaption id=\"caption-attachment-2153\" class=\"wp-caption-text\">Leila launching the drone to collect aerial images of gray whales to measure body condition. 
Photo source: Alexa K.<\/figcaption><\/figure>\n<figure style=\"width: 2048px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"spotlight\" src=\"https:\/\/scontent-lax3-2.xx.fbcdn.net\/v\/t31.0-8\/21427230_10154902462682584_1887752882706633729_o.jpg?_nc_cat=0&amp;oh=f08cc7e625e737e6e7bc8bdf4e89a6c1&amp;oe=5BDA0302\" alt=\"\" width=\"2048\" height=\"1365\" \/><figcaption class=\"wp-caption-text\">Using the theodolite to collect tracking data on the Pacific Coast Feeding Group in Port Orford, OR. Photo source: Alexa K.<\/figcaption><\/figure>\n<p>References:<\/p>\n<p><a href=\"https:\/\/support.office.com\/en-us\/article\/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3\">https:\/\/support.office.com\/en-us\/article\/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3<\/a><\/p>\n<p><a href=\"https:\/\/www.merriam-webster.com\/dictionary\/big%20data\">https:\/\/www.merriam-webster.com\/dictionary\/big%20data<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Alexa Kownacki, Ph.D. Student, OSU Department of Fisheries and Wildlife, Geospatial Ecology of Marine Megafauna Lab Did you know that Excel has a maximum number of rows? I do. During Winter Term for my GIS project, I was using Excel to merge oceanographic data, from a publicly-available data source website, and Excel continuously quit. 
&hellip; <a href=\"https:\/\/blogs.oregonstate.edu\/gemmlab\/2018\/07\/30\/big-data-big-possibilities-with-bigger-challenges\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Big Data: Big possibilities with bigger challenges<\/span><\/a><\/p>\n","protected":false},"author":8612,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1011750,1],"tags":[1237694,1211813,643004,135571,2064,173914,916414,112406,215873,1237691,1368,635445,1237698,97168,1237692,634945,1237696,712919,336,799,155,97272,1212663,1237576,993645,1237695,1237697,214860],"class_list":["post-2148","post","type-post","status-publish","format-standard","hentry","category-bottlenose-dolphin-population-health","category-uncategorized","tag-aerial-images","tag-alexa-kownacki","tag-big-data","tag-data","tag-data-collection","tag-dolphins","tag-drone","tag-excel","tag-fieldwork","tag-forbes","tag-funding","tag-gemm-lab","tag-geospatial-analysis","tag-gis","tag-granit-canyon","tag-gray-whales","tag-hard-drives","tag-leila-lemos","tag-noaa","tag-oceanography","tag-oregon-state-university","tag-phd","tag-phd-student","tag-piedras-blancas","tag-port-orford","tag-synology","tag-theodolite","tag-uas"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"post_mailing_queue_ids":[],"_links":{"self":[{"href":"https:\/\/blogs.oregonstate.edu\/gemmlab\/wp-json\/wp\/v2\/posts\/2148","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.oregonstate.edu\/gemmlab\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.oregonstate.edu\/g
emmlab\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/gemmlab\/wp-json\/wp\/v2\/users\/8612"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/gemmlab\/wp-json\/wp\/v2\/comments?post=2148"}],"version-history":[{"count":8,"href":"https:\/\/blogs.oregonstate.edu\/gemmlab\/wp-json\/wp\/v2\/posts\/2148\/revisions"}],"predecessor-version":[{"id":2160,"href":"https:\/\/blogs.oregonstate.edu\/gemmlab\/wp-json\/wp\/v2\/posts\/2148\/revisions\/2160"}],"wp:attachment":[{"href":"https:\/\/blogs.oregonstate.edu\/gemmlab\/wp-json\/wp\/v2\/media?parent=2148"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/gemmlab\/wp-json\/wp\/v2\/categories?post=2148"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.oregonstate.edu\/gemmlab\/wp-json\/wp\/v2\/tags?post=2148"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}