Becoming a Hybrid AI Developer/Scientist

One of the most popular discussion topics these days is AI developer vs AI scientist. Rather than switching from one to the other, I offer simple advice to complement your current skillset and become more versatile. For the developer who wants to become a scientist, there are a few trade secrets you can easily learn so that your team does not need to hire a data scientist and can rely on you instead. For the scientist, I explain what to learn to have much more efficient and positive interactions with developers, for the benefit of your company, up to taking full ownership of some development projects. In both cases, I focus on the minimum needed to become a functional hybrid developer/scientist.

Developer: How to Gain Scientist Skills

I have worked with many developers. They can perform most common data science tasks such as clustering or predictive analytics. They may not know the details of these algorithms, but these days it takes just a couple of lines of code using Python libraries. You can find the code online or ask GPT, even for modern deep neural network techniques. In the end, many of these analyses will be automated; even data scientists will stop spending weeks on such projects and spend mere hours instead.

Yet, developers sometimes do not have a good statistical intuition. Or they may not know how to produce great visualizations for stakeholders. That is, visualizations summarizing what is important and efficiently conveying the correct message. Here I focus on the most common problems. They are easy to fix. Now, learn the tricks to become a true full-stack developer who can compete with scientists!

Generic Example

I illustrate these tricks with the design and implementation of an evaluation metric: the KS distance described here.

  • Before any machine learning project, always shuffle the observations in your dataset, unless the natural order is important, as in time series.
  • Split your original data into training and validation sets. Test and evaluate on the validation set, that is, the observations not used to train your model.
  • Try various datasets. Does it work on categorical data? When the number of features is large? When the number of observations is small? How does it handle zip codes and timestamps? Smart encoding (see here) is useful to deal with complex multivariate categories.
  • If computing KS on many small datasets, you should expect to get some false positives (good results flagged as bad). Perform simulations to estimate the number of false positives occurring naturally due to probability laws, and check if you are within standard range.
  • The KS distance is computed using a large number of random nodes and SQL queries. Reasonably increase the number of nodes, if necessary. You want decent accuracy while not overusing CPU or GPU.
  • When testing a new model, benchmark against the base model or baseline. Does the model outperform naive, static predictions? By how much? Consider solutions not based on neural networks.
  • KS is sensitive to the dimension: The number of features. If you use a version that does not correct for the dimension, how would you notice that your KS is not standardized, and how would you standardize it?
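
The false-positive check in the list above can be sketched as follows. This is a minimal illustration, not the KS implementation referenced in this article: it uses the classical univariate two-sample KS distance as a stand-in, standard normal data, and a hypothetical flagging threshold of 0.2. The function names are made up for the example.

```python
import bisect
import random

def ks_distance(a, b):
    """Univariate two-sample KS distance: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def false_positive_rate(n_obs, threshold, n_trials=300, seed=0):
    """Share of trials flagged as 'bad' although both samples share one distribution."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_trials):
        a = [rng.gauss(0, 1) for _ in range(n_obs)]
        b = [rng.gauss(0, 1) for _ in range(n_obs)]
        if ks_distance(a, b) > threshold:  # threshold is an assumed cutoff
            flagged += 1
    return flagged / n_trials

# Small datasets get flagged far more often than large ones at the same cutoff.
for n_obs in (20, 100, 500):
    print(n_obs, false_positive_rate(n_obs, threshold=0.2))
```

If the rate observed on real data is close to the simulated one, the flags are probably just noise. A much higher rate suggests a real problem.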

Additional Advice

Some Python libraries have limitations. Scientists sometimes write their own, better libraries, with a few lines of code and in little time. For instance, you cannot generate data outside the observation range using NumPy quantiles. See my solution here. Scikit-learn clustering uses a similarity matrix and is unfit for sparse data or text clustering. See my solution here. There are better alternatives to dot product and cosine similarity, see here. The same goes for stopwords and stemming from NLTK. A workaround is to use do-not-stop lists.

Many classical statistics and techniques have limitations or drawbacks. You can ignore p-values, do your own statistical tests and confidence intervals using simulations, use my generic regression technique (here) rather than learning dozens of disparate methods, and ignore the dozens of eclectic metrics attached to a confusion matrix. Yet, it is important to know the difference between L1 and L2.
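
As one way to build a confidence interval by simulation, here is a minimal sketch of the percentile bootstrap. It is a standard resampling technique that I am using for illustration, not the generic method linked above. The function name, the exponential test data, and the default settings are assumptions.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for any statistic of a one-dimensional sample."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return estimates[lo_idx], estimates[hi_idx]

# Skewed synthetic data, where a textbook normal-theory interval is less natural.
rng = random.Random(42)
sample = [rng.expovariate(1.0) for _ in range(200)]
print("mean:  ", bootstrap_ci(sample))
print("median:", bootstrap_ci(sample, stat=statistics.median))
```

The same function works for the median, a quantile, or a custom metric, so there is no need to look up a separate formula for each case.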

Finally, to determine optimum sample size, perform simulations based on sub-samples of increasing sizes. Also, group small buckets or bin data to reduce granularity. Statistics computed on small buckets are not reliable. Last but not least: document! In simple English with diagrams, code line numbers and illustrations when possible.
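
The sub-sampling idea can be sketched as follows, under assumptions of my choosing: the statistic of interest is the median, the data is synthetic log-normal, and "stable enough" means the spread of the estimate falls below a hypothetical tolerance. The function names are illustrative.

```python
import random
import statistics

def spread_by_sample_size(data, sizes, stat=statistics.median, n_repeats=200, seed=0):
    """For each sub-sample size, measure how much the statistic varies across draws."""
    rng = random.Random(seed)
    spread = {}
    for size in sizes:
        estimates = [stat(rng.sample(data, size)) for _ in range(n_repeats)]
        spread[size] = statistics.stdev(estimates)
    return spread

def smallest_stable_size(spread, tolerance):
    """Smallest tested size whose spread is within the tolerance, or None."""
    for size in sorted(spread):
        if spread[size] <= tolerance:
            return size
    return None

rng = random.Random(1)
data = [rng.lognormvariate(0, 1) for _ in range(10_000)]
spread = spread_by_sample_size(data, sizes=(50, 200, 800, 3200))
for size, value in spread.items():
    print(size, round(value, 4))
print("suggested size:", smallest_stable_size(spread, tolerance=0.05))
```

The tolerance is a business decision: choose it based on the precision your stakeholders actually need.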

Visualizations

Perhaps the most important piece of advice here is to generate comparable plots. Too many times, I see subplots with different scales for the X-axis, the Y-axis, or both, with a different range in each subplot and different bin widths or numbers of bins. It makes comparisons difficult for the expert, and misleading for the layman. Use the largest range as the “common denominator”. Also, if the distributions are skewed, think about using a log-transform before plotting. Finally, in a scatterplot, points may be hidden due to overlap: use color transparency.
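
Here is a minimal sketch of the "common denominator" idea for histograms: compute one set of bin edges over the largest range, then bin every group with those same edges. The function names are illustrative. The code assumes the data is not constant and, when the log option is on, strictly positive.

```python
import math
import random

def common_bin_edges(groups, n_bins=10, log_scale=False):
    """One set of equal-width bin edges covering the largest range across all groups."""
    values = [math.log10(v) if log_scale else v for g in groups for v in g]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def histogram(values, edges, log_scale=False):
    """Counts per bin, using shared edges so that groups are directly comparable."""
    counts = [0] * (len(edges) - 1)
    lo, hi = edges[0], edges[-1]
    for v in values:
        x = math.log10(v) if log_scale else v
        # The maximum value falls on the last edge; assign it to the last bin.
        idx = min(int((x - lo) / (hi - lo) * len(counts)), len(counts) - 1)
        counts[idx] += 1
    return counts

rng = random.Random(0)
group_a = [rng.lognormvariate(0, 1) for _ in range(1000)]
group_b = [rng.lognormvariate(1, 1) for _ in range(1000)]
edges = common_bin_edges([group_a, group_b], n_bins=8, log_scale=True)
print(histogram(group_a, edges, log_scale=True))
print(histogram(group_b, edges, log_scale=True))
```

With Matplotlib, the same edges can be passed as `bins=edges` to `hist`, and `plt.subplots(..., sharex=True, sharey=True)` keeps the axes aligned across subplots. The `alpha` argument of `scatter` handles color transparency.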

Scientist: How to Gain Developer Skills

My first piece of advice is to adopt the mindset of a developer: focus on simple methods when possible, ones that are easy to test, scale, and maintain. Also, consistently deliver decent results on time, as opposed to reaching for perfection. Then, encapsulate your code: make it easier for developers to productize. Finally, if you are working on your PhD, offer to teach programming classes, and get a part-time job as part of your program. Mine was in statistics; I had a paid job at an image remote sensing company, working on the topic of my thesis. My mentor helped me get the job, which was mostly development and engineering, working on enterprise data to test my models.

After completing my PhD, I fulfilled my military obligations, managing the local “production” database for a small military base, scheduling the exercises and guards for all the staff. There were thousands of lines of SQL code to maintain and upgrade. You could offer to do the same for a non-profit. It helped me land my first real job in the corporate world, for a startup. There I learned how to automate most of my tasks with cron jobs, doing essentially engineering work. I benefited from tail winds from the beginning. If this is not your case, here are some of my recommendations.

  • Create a public Web API or SDK that accepts datasets as inputs, processes them, and returns results to the user. Design your app to support 100k users per day. Write good documentation so that users don’t need to contact you for help. See examples here.
  • Develop and maintain your own Python library on PyPI. Again, with good documentation.
  • Write a smart crawler to parse millions of webpages. Design it so that it can resume from where it stopped in case of a crash, and revisit URLs that failed on the first pass. Optimize speed. Use a distributed architecture. Work on maintenance and augmentation.
  • Work with IT in your company to get permission to create your own, local production environment. Or do it at home as a hobby. Automate your tasks.
  • Teach classes on programming languages. Learn good programming practices while preparing your classes and offer a collaborative environment to students.
  • Find enterprise datasets to work with or create your own (synthetic data). Stress-test your algorithms on these complex datasets. Identify bottlenecks in your algorithms, and fix them.
  • Test your code in multiple environments. Be aware of the version of each library that you use, and dependencies. Master versioning, git, virtual environments, and Docker.
  • Learn how to automate data cleaning and deal with missing values. Master error handling.
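
To make the crawler recommendation concrete, here is a minimal single-machine sketch of the resume-and-retry logic. It makes no attempt at speed or distribution. The `fetch` callable, the JSON checkpoint layout, and all function names are assumptions made for illustration.

```python
import json
import os

def load_done(checkpoint_path):
    """Load results saved by a previous run, or start empty."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_done(done, checkpoint_path):
    """Write to a temporary file, then rename, so a crash cannot corrupt the checkpoint."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(done, f)
    os.replace(tmp_path, checkpoint_path)

def crawl(urls, fetch, checkpoint_path, max_passes=2):
    """Fetch each URL once, skip those already done, and retry failures on later passes."""
    done = load_done(checkpoint_path)
    pending = [u for u in urls if u not in done]
    for _ in range(max_passes):
        failed = []
        for url in pending:
            try:
                done[url] = fetch(url)  # fetch is supplied by the caller
            except Exception:
                # A real crawler should catch specific errors and log them.
                failed.append(url)
            save_done(done, checkpoint_path)
        pending = failed
        if not pending:
            break
    return done, pending  # pending holds the URLs that still fail
```

Calling `crawl` again with the same checkpoint path skips every URL already stored, which is what makes a restart after a crash cheap. Saving after every URL is simple but slow. At scale, you would save in batches.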

There are plenty of enterprise projects, datasets, and case studies to choose from, in my new book “State of the Art in GenAI & LLMs — Creative Projects, with Solution”, available here. It is written for developers as well as scientists interested in moving into engineering and development.

Source: Data Science Central