
Monday, October 31, 2016

Data Modeling with Neo4j: “School Immunization in California” CSV to Graph

1 state, over 9 million children, and 42,981 rows of CSV immunization data. After many rough drafts, I finally landed on an efficient and aesthetically pleasing way to map out the immunization data of children in California (downloaded from the California Department of Education*). In this post, our goal is to walk through the data modeling process to show how this CSV data can be connected meaningfully with Neo4j. What makes this data so interesting is its several levels of location, three distinct grade levels, and a dense record of immunization numbers and percentages, all spanning two separate school years.
After successfully mapping the data, I could easily explore it, answering questions such as: Where in California is the lowest number of children vaccinated? Are fewer parents vaccinating their children in 2015 compared to 2014? Which age group is more up to date on its vaccinations? Furthermore, I was able to clearly visualize the data in small and large quantities using the Neo4j graph.
* The public school dataset can be found at http://www.cde.ca.gov/ds/si/ds/pubschls.asp . The private school dataset can be found at http://www.cde.ca.gov/ds/si/ps/ .

Preparing the Files

Before anything else, we have to modify the CSVs by:
  • Changing the column names: capitalize only the first letter of each word and leave the rest of the word in lowercase.
  • Replacing # with Number, % with Percent, 1 with One, + with Plus, and taking out any dashes and parentheses. In short, spell out the symbols and numbers found in the headers.
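
The header cleanup described above can also be scripted rather than done by hand. Here is a minimal sketch in Python, under a few assumptions: the function name clean_header and the sample headers are hypothetical (they are not taken from the actual dataset), dashes and parentheses are treated as word breaks rather than simply deleted, and the digit 1 is replaced wherever it appears, which would also mangle a header containing a number such as 2014.

```python
import re

# Symbol-to-word substitutions listed in the post. Each word is padded with
# spaces so that it ends up as its own word after the split below.
REPLACEMENTS = {"#": " Number ", "%": " Percent ", "+": " Plus ", "1": " One "}


def clean_header(name: str) -> str:
    """Normalize one CSV column name (hypothetical helper, not from the post)."""
    # Spell out the symbols and the digit 1.
    for symbol, word in REPLACEMENTS.items():
        name = name.replace(symbol, word)
    # Assumption: dashes and parentheses become word breaks instead of
    # gluing the neighboring words together.
    name = re.sub(r"[-()]", " ", name)
    # Capitalize the first letter of each word and lowercase the rest.
    return " ".join(word.capitalize() for word in name.split())


# Made-up sample headers, for illustration only:
# clean_header("# UP-TO-DATE")  returns "Number Up To Date"
# clean_header("% (1+ DOSE)")   returns "Percent One Plus Dose"
```

If you would rather have the dashes removed without leaving a space, change the replacement string in re.sub from " " to "".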

Mapping the Data: Determining the Relationships Between Locations and Schools

Before getting into the code, I took pen to paper by simply drawing out the nodes and their relationships to 

