The most important skill

This can be modeled with the equation log(ExamScore) = 3.75 × log(NumberOfStudents) − 0.02 × SEDPercent.

In order to create that fact table at that grain, we’ll need 243 KB of storage per record.

Because of the way the MEDIAN function is implemented in this software, the execution time of the process grows exponentially.

For practitioners in the domains these statements come from, they are all readily interpretable. For those outside those domains, each one is fairly impenetrable. Worse than being hard to understand, each of these statements leaves almost as much important material unsaid as said.

If clear communication is the most important skill in business, then the most important skill for a data scientist – or arguably any scientist – is the ability to take complex topics and reduce them to material that a layperson can easily understand. More than that, those people must be able to understand you well enough to take effective and timely action based on the information you are relaying to them.

In medicine they call it “bedside manner,” and in business they call it “soft skills,” but in both domains these skills are critical to success. In many ways they are what distinguish the merely great practitioners from those at the absolute pinnacle of their field. One of the clearest examples of this was Dr. Stephen Hawking. His contributions to theoretical physics and cosmology are undeniable, but to the general public he was best known for his ability to explain extremely advanced topics to a popular audience.

Data scientists, particularly those who do not work in organizations large enough to maintain whole departments of data scientists, will most often be communicating with people whose specialties lie in other fields. To do that effectively, they must be able to render statements like the ones above into something like the statements below.

  • For every 1 percent increase in NumberOfStudents, ExamScore increases by 3.802 percent, and for every 1-point increase in SEDPercent, ExamScore decreases by 2.0201 percent. This means that for schools with similar rates of socioeconomic disadvantage, larger ones do better. (The arithmetic behind these figures is sketched after this list.)
  • We’re storing a lot of data per row here, and storage at these volumes can get quite expensive. To keep this much data at this frequency for this many subjects, we need to plan for 50 TB of storage. This will probably cost around $50K once redundancy and backup are taken into account.
  • Because of the number of subjects in this dataset, using the median is going to cause the system to run extremely slowly.
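
For readers who want to see where the first bullet’s percentages come from, here is a minimal Python sketch of the arithmetic, assuming the coefficients from the earlier equation (3.75 on log(NumberOfStudents), −0.02 on SEDPercent). The variable names and the record-count calculation at the end are purely illustrative and are not taken from the original analysis.

```python
import math

# Coefficients from the fitted model quoted earlier:
# log(ExamScore) = 3.75 * log(NumberOfStudents) - 0.02 * SEDPercent
b_log_students = 3.75   # coefficient on log(NumberOfStudents) (an elasticity)
b_sed = -0.02           # coefficient on SEDPercent (a semi-elasticity)

# Log-log term: a 1% increase in NumberOfStudents multiplies ExamScore by 1.01**3.75.
pct_students = (1.01 ** b_log_students - 1) * 100
print(f"+1% students  -> ExamScore up about {pct_students:.3f}%")        # ~3.802%

# Level term: a 1-point increase in SEDPercent multiplies ExamScore by exp(-0.02),
# roughly a 2% drop (the bullet quotes the magnitude as (e**0.02 - 1)*100 = 2.0201%).
pct_sed = (math.exp(b_sed) - 1) * 100
print(f"+1 point SED  -> ExamScore changes by about {pct_sed:.3f}%")      # ~-1.98%

# Storage bullet, as back-of-the-envelope arithmetic (the record count is illustrative):
bytes_per_record = 243 * 1024                        # 243 KB per fact-table record
records_in_50tb = (50 * 1024**4) // bytes_per_record
print(f"50 TB holds roughly {records_in_50tb:,} records at 243 KB each")  # ~221 million
```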

The statements above should be usable and understandable for just about anyone. They don’t use jargon or technical language, and they don’t assume prior knowledge of the subject by the listener. A great data scientist will strive to know their audience and move between the two extremes as appropriate.


Author: Jason Miles

A solution-focused developer, engineer, and data specialist working across diverse industries. He has led data products and citizen data initiatives for almost twenty years and is an expert in enabling organizations to turn data into insight, and then into action. He holds an MS in Analytics from Texas A&M, as well as DAMA CDMP Master and INFORMS CAP-Expert credentials.