I chose to analyze biodiversity data from national parks across the US. I am interested in how the diversity of one park compares to that of another, and in whether the biodiversity of a particular park, or of all the parks in general, has gone up or down in recent years. I am also interested in how the share of each kind of animal species differs between parks, for example the percentage of mammals in one park compared to the percentage of mammals in another. I want to create data visualizations of the percentage of species in each park. I also want to check the percentage of invasive species and the percentage of native species in each national park.
The data I decided to use comes from a dataset called Biodiversity in National Parks. I chose this dataset because it has a large amount of data on the animals recorded in each national park, including Park Name, Category, Order, Family, Scientific Name, Common Names, Occurrence, Nativeness, etc.
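As a rough sketch of how the per-park percentages could be computed with pandas, the example below builds a tiny made-up table that uses the Park Name, Category, and Nativeness column names from the dataset. The park names, the rows, the "Native" / "Not Native" labels, and the helper `percent_by_park` are all assumptions for illustration, not values taken from the real data.

```python
import pandas as pd

# Made-up sample rows standing in for the real species table (illustration only).
species = pd.DataFrame({
    "Park Name": ["Park A", "Park A", "Park A", "Park A", "Park B", "Park B"],
    "Category": ["Mammal", "Bird", "Bird", "Vascular Plant", "Mammal", "Bird"],
    # The "Native" / "Not Native" labels are assumed; None marks a missing value.
    "Nativeness": ["Native", "Native", "Not Native", "Native", "Native", None],
})


def percent_by_park(df, column):
    """Hypothetical helper: percentage of each park's rows per value of `column`.

    Rows where `column` is missing are left out of the counts, so each
    park's percentages are relative to its non-missing rows.
    """
    counts = pd.crosstab(df["Park Name"], df[column])
    return counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)


category_pct = percent_by_park(species, "Category")
nativeness_pct = percent_by_park(species, "Nativeness")

print(category_pct)
print(nativeness_pct)
```

Each row of the resulting tables is one park and each column is one category, so a call such as `category_pct.plot(kind="bar", stacked=True)` (which needs matplotlib installed) would be one way to turn the table into the kind of visualization I have in mind.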
This data will require some preprocessing, such as checking for NaN, null, and empty values. In Species.csv, about 17% of the values in the Occurrence column and 21% of the values in the Nativeness column are null. These nulls will most likely need to be replaced with placeholder values so that they do not distort the results. There is also an odd column named Unnamed: 13 that will have to be removed, since it appears to contain only NaNs in all of its rows and does not seem to be useful. Other columns, such as Nativeness and Abundance, have a worrying number of missing values, and Seasonality is missing 83% of its values, so there may be reason to remove that column entirely. In some cases I would also have to clean the data by filling in "Unknown" in place of NaN values.
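A minimal sketch of these cleaning steps in pandas might look like the following. It runs on a small made-up table rather than the real file, the all-NaN column is written as `Unnamed: 13` (the name pandas gives to a column with no header), and the 80% cutoff `MAX_MISSING_PCT` is a threshold I picked for illustration, not something taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real file (illustration only).
# With the real data this would be something like pd.read_csv("Species.csv"),
# where the path is an assumption.
raw = pd.DataFrame({
    "Park Name": ["Park A", "Park A", "Park A", "Park B", "Park B", "Park B"],
    "Occurrence": ["Present", np.nan, "Present", "", "Present", "Present"],
    "Nativeness": ["Native", np.nan, "Native", "Not Native", np.nan, "Native"],
    "Seasonality": [np.nan, np.nan, np.nan, "Breeder", np.nan, np.nan],
    "Unnamed: 13": [np.nan] * 6,
})

# Treat empty or whitespace-only strings as missing values.
clean = raw.replace(r"^\s*$", np.nan, regex=True)

# Percentage of missing values in each column.
missing_pct = clean.isna().mean().mul(100).round(1)
print(missing_pct)

# Drop columns that contain nothing but NaN (here: "Unnamed: 13").
clean = clean.dropna(axis=1, how="all")

# Drop columns that are mostly empty; the 80% cutoff is an assumed threshold.
MAX_MISSING_PCT = 80
sparse_columns = missing_pct[missing_pct > MAX_MISSING_PCT].index
clean = clean.drop(columns=sparse_columns, errors="ignore")

# Fill the remaining gaps in the categorical columns with a placeholder.
placeholder_columns = ["Occurrence", "Nativeness"]
clean[placeholder_columns] = clean[placeholder_columns].fillna("Unknown")

print(clean)
```

Using a placeholder keeps every species row in the table, at the cost of adding an "Unknown" group that has to be handled, or left out, when percentages are computed later.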
In comparing the data, I wanted to look at the biodiversity of species in one park versus another.
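One simple way to make this comparison is to count the distinct scientific names recorded in each park (species richness) and to look at which species two parks have in common. The sketch below assumes that richness is an acceptable stand-in for biodiversity here. The rows, the park names, and the helper `shared_species` are made up for illustration.

```python
import pandas as pd

# Made-up sample rows (illustration only); the duplicate row mimics a
# species that is listed more than once for the same park.
species = pd.DataFrame({
    "Park Name": ["Park A", "Park A", "Park A", "Park A", "Park B", "Park B"],
    "Scientific Name": [
        "Ursus americanus", "Cervus elaphus", "Corvus corax", "Corvus corax",
        "Ursus americanus", "Lynx rufus",
    ],
})

# Species richness: number of distinct scientific names per park.
richness = (
    species.groupby("Park Name")["Scientific Name"]
    .nunique()
    .sort_values(ascending=False)
)
print(richness)


def shared_species(df, park_a, park_b):
    """Hypothetical helper: set of scientific names recorded in both parks."""
    names_a = set(df.loc[df["Park Name"] == park_a, "Scientific Name"])
    names_b = set(df.loc[df["Park Name"] == park_b, "Scientific Name"])
    return names_a & names_b


print(shared_species(species, "Park A", "Park B"))
```

Richness only counts how many species are present, so two parks with the same richness can still hold very different species, which is why the overlap between parks is worth checking alongside it.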