15:04:57 [Moderator] Can you describe the area of machine learning that you work in?
15:05:05 I have been working on machine learning emulation of dynamical downscaling from lower resolution global climate model to higher resolution regional climate model.
15:05:22 [Moderator] And how long have you been working in that specific area?
15:05:26 Like a year and a half.
15:05:34 [Moderator] Can you describe the type of training data that you work with, such as image, text, tabular data, et cetera?
15:05:41 So the data that we're working with is climate model outputs, which are time-space fields that are regularly sampled.
15:05:51 These are gridded data sets. They've got a grid, the model grid, that has spatial extent, and the output is regular in time. And it's model outputs, so it's
15:06:12 Completely uniform in coverage and structure.
15:06:20 [Moderator] The data sets that you use for training then, are they made from scratch or are they repurposed from other sources?
15:06:27 These are repurposed from other sources. The two data sets in particular that we're using are the ERA5 reanalysis, which is sort of our best estimate of the historical state of the atmosphere, based on running a weather forecast model
15:06:49 That is continuously sort of steered to match up with observations as best as possible.
15:06:56 And then, a data product that was generated by taking a regional climate model, using that other model as the input, and downscaling it to four kilometers over the U.S., which is the spatial resolution where some processes are now
15:07:16 Explicitly resolved instead of being parameterized with approximations. The first one was done just for general weather use; the second one was done for hydrology use in particular. But all of these climate data sets are generated
15:07:38 With the intent that lots of different people will use them in lots of different applications.
15:07:46 Which is a way that climate science is different than a lot of other scientific fields.
15:07:53 [Moderator] And do you create any of the metadata for the data that you process or receive?
15:07:59 Um…
15:08:04 Mostly no, but there's an asterisk there because I had to take one of the data sets and I had to clean it up.
15:08:12 Which involved adding some metadata, but in doing that, I'm following a community standard for metadata.
15:08:23 [Moderator] And would you mind explaining what that community standard is or naming it?
15:08:28 The CF metadata standard is a metadata standard for climate and forecasting data. That's the C and the F.
15:08:36 It's a set of rules about how you should encode metadata into your NetCDF files, and when everybody does it this way, which most people do,
15:08:44 It enables software to be smart about these things, and people have built a lot of really cool software infrastructure on top of the standard.
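To make the standard concrete: CF compliance largely means that each variable carries agreed-upon attributes such as `units` and `standard_name`. The attribute names below are real CF conventions; the checker function and example dictionaries are a hypothetical sketch, not part of any library.

```python
# Sketch of CF-style metadata on a variable, plus a minimal (hypothetical)
# check for the attributes that CF-aware software relies on.
REQUIRED_ATTRS = ("units", "standard_name")

def missing_cf_attrs(var_attrs):
    """Return which of the required CF attributes a variable is missing."""
    return [a for a in REQUIRED_ATTRS if a not in var_attrs]

# Metadata as it might appear on a surface air temperature variable.
tas_attrs = {
    "units": "K",
    "standard_name": "air_temperature",
    "long_name": "Near-Surface Air Temperature",
}
bare_attrs = {"long_name": "some field"}  # no units, no standard_name

print(missing_cf_attrs(tas_attrs))   # []
print(missing_cf_attrs(bare_attrs))  # ['units', 'standard_name']
```

Tools like xarray use exactly this kind of attribute information to decode units, coordinates, and calendars automatically.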
15:08:57 [Moderator] Moving on to the section for searching for data. How often are you searching for data for machine learning purposes within your work?
15:09:05 Like once per project. So hardly ever.
15:09:14 [Moderator] Now, what tools and platforms are you aware of for searching for geospatial machine learning data sets, such as Hugging Face?
15:09:24 I have heard of Hugging Face. Um… I am not really aware of any platforms that are specifically aimed at weather and climate data sets.
15:09:42 I'm aware of a lot of sources for weather and climate data sets, but I'm not aware of any of those that are for machine learning specifically.
15:09:54 [Moderator] Now, what platforms and tools do you use? Not just aware of, but actually use to search for geospatial machine learning data sets?
15:10:03 Oh. Mostly just Google, because for the kind of data sets I'm looking for, you say, okay, I want a data set with these characteristics. And the answer is there are two or fewer of those. Pick one.
15:10:22 Or sometimes the answer is that doesn't exist.
15:10:26 [Moderator] Now, how do you search in that case, just for Google in general? Do you use tags for things like language, size, format, or do you filter by things such as popular or top search results?
15:10:44 I would be searching for characteristics like spatial resolution, spatial coverage, time coverage, and what variables are in the data set.
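The selection criteria just described can be sketched as a simple filter over dataset records; every record, field name, and threshold here is made up for illustration.

```python
# Hypothetical catalog of dataset records, filtered by the characteristics
# the speaker searches on: spatial resolution, time coverage, and variables.
catalog = [
    {"name": "dataset_a", "res_km": 4, "years": (1979, 2023),
     "variables": {"tas", "pr", "wind"}},
    {"name": "dataset_b", "res_km": 100, "years": (1950, 2023),
     "variables": {"tas", "pr"}},
]

def matches(ds, max_res_km, need_years, need_vars):
    """True if a record meets resolution, coverage, and variable requirements."""
    start, end = ds["years"]
    return (ds["res_km"] <= max_res_km
            and start <= need_years[0] and end >= need_years[1]
            and need_vars <= ds["variables"])

hits = [ds["name"] for ds in catalog
        if matches(ds, max_res_km=10, need_years=(1980, 2020),
                   need_vars={"tas", "wind"})]
print(hits)  # ['dataset_a']
```

In practice, as the interview notes, the pool is usually so small that the "filter" is done by reading documentation pages rather than querying a catalog.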
15:11:02 [Moderator] Now, can you walk me through the last search that you can recall for a geospatial machine learning data set? This is the part where if you can go ahead and share your screen, and also if you wouldn't mind talking through the steps of your search.
15:11:19 Um… tricky. Sure. Well, for this project in particular, there wasn't really any search.
15:11:25 [Moderator] As best you can recall, of course.
15:11:32 We started from the data set: we have this data set, and we want to do a machine learning emulation of the process that generated it,
15:11:47 So we can generate similar data sets more cheaply.
15:11:52 So there wasn't really any search. What I will say is, here, let me
15:12:03 Share. What I might do, if I were looking for something about this data set or for a similar data set, is I would probably go to the [redacted], which is the [redacted], this is for [redacted]. So these are all data sets that
15:12:27 Already exist on our system and I don't have to download them.
15:12:31 I could just get a pointer to where they are on our file system, and I would search for something like, okay, I'm interested in the [redacted] data set.
15:12:42 Great. This is the one. But then I might be looking for things like, oh, please tell me, what's the spatial coverage of it? Oh, okay, cool.
15:12:53 Or, you know, what are all the variables in this?
15:13:00 Let's see, resources now.
15:13:07 There's documentation, data access.
15:13:11 Tell me where these sets of files are listed, and things like that.
15:13:16 Um… Another, sort of related, example
15:13:23 Is, you know, I might be looking for something like, okay, gridded climate data observations. I want something that has wind, right?
15:13:37 Like, okay, here, Climate Data Guide is pretty good.
15:13:46 So I haven't used it in ages. Okay, apparently it's not working right now.
15:13:51 I might find a list of things here, and I would scroll through it and read things until I say, oh, that looks like the one I want.
15:14:00 Let's see. I'm trying to remember how I got to one of the data sets that I used for some related analysis most recently.
15:14:13 And where did I find it?
15:14:19 Well, I guess basically the way I found it is that I found some list:
15:14:26 Here's some data sets that exist. And then I said, okay.
15:14:29 What's this one? gridMET. Great. Okay. And then I go to the website and I say, here are the characteristics of this. Oh, it's four kilometers. Great. Contiguous U.S., you know, 1979 to yesterday, daily, fantastic. That sounds like what I need.
15:14:52 And then, you know, the only other question would be what variables it has. And if we go here, you find scripts that would be like, here's how you can download all the data. I download it once and then
15:15:11 Proceed to use it for the next 20 years.
15:15:19 So. Okay.
15:15:18 [Moderator] You can stop sharing, but I appreciate you sharing that. That is helpful.
15:15:26 Yeah. Yes.
15:15:25 [Moderator] We're going to move to interoperability. So can you describe any challenges you've had in using data from different data set platforms? An example here would be Hugging Face and Kaggle.
15:15:38 [Moderator] But it does not have to be those.
15:15:40 Yeah. So the data that I'm using is generally coming from…
15:15:54 The climate data that I use is pretty much always in NetCDF format.
15:16:01 There's one specific hurdle with NetCDF format that I have run into recently.
15:16:07 Which is that NetCDF has something called chunking, which has to do with the way that the data is organized inside the file.
15:16:15 And if the chunking is bad, it dramatically decreases the speed with which you can read data off disk and hand it off to the GPU.
15:16:30 [Moderator] Oh, wow.
15:16:24 Like two orders of magnitude. Or more. And so that's something that is really, really important, but quite possibly none of the data providers even have any idea what chunking they have used.
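A back-of-envelope sketch of why chunking matters for this access pattern: reading one time step of a (time, y, x) variable touches every chunk that overlaps that slice. The grid sizes and chunk shapes below are made up, and the count assumes reads aligned to chunk boundaries.

```python
# How many chunks does one read touch? A chunk layout aligned with your
# access pattern can mean reading one chunk instead of thousands.
import math

def chunks_touched(shape, chunks, read_extent):
    """Chunks overlapped by reading `read_extent` elements along each axis."""
    n = 1
    for size, chunk, extent in zip(shape, chunks, read_extent):
        n *= math.ceil(extent / chunk)
    return n

shape = (10000, 600, 1400)   # hypothetical (time, y, x) archive
one_map = (1, 600, 1400)     # read the full spatial field at one time step

# Chunked one full map per time step: a single chunk is read.
print(chunks_touched(shape, (1, 600, 1400), one_map))    # 1
# Chunked as long time series over small spatial tiles: thousands of chunks.
print(chunks_touched(shape, (10000, 10, 10), one_map))   # 8400
```

The roughly four-orders-of-magnitude gap in chunks touched is one mechanism behind the "two orders of magnitude or more" slowdown the speaker describes.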
15:16:48 So that's with regard to climate data.
15:16:51 I've done some other stuff with wildfire data and its relationship to climate. There what I've been doing is statistical modeling, but I could very well imagine also trying to do some machine learning with that. And in that case,
15:17:09 The interoperability issue is that people distribute things in formats like GIS files or Excel spreadsheets, and you have to read them in and clean them up. The answer is I write code in R to ingest those and convert them into a format
15:17:30 That I can use. And that's just part of the process.
15:17:36 But the interoperability issue there is that you have data coming from two very different disciplines, and the nature and the structure of the data is really different. The climate data is space-time fields. We have
15:18:01 Fields that exist throughout time and space, and we sample them at some interval. So, you know, I have something like, here's what the temperature of the air at the surface of the Earth was over North America,
15:18:17 Sampled at quarter-degree intervals, hourly or daily or whatever. And then the other thing, the one that I'm trying to get to talk to that, is:
15:18:27 I have a list of big fires: when they started, when they ended, how much area they took up, and maybe a latitude-longitude coordinate associated with them.
15:18:43 And maybe not. So part of the scientific problem is figuring out how to relate those two things to each other.
15:18:56 The fact that they are stored in very different formats is just a small piece of the problem.
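A minimal sketch of relating the two structures just described: a point event (a fire) co-located onto a regular quarter-degree grid. The grid origin, the fire record, and all numbers are hypothetical.

```python
# Map a fire's coordinates to indices on a regular lat-lon grid, so its
# dates and location can be matched against a gridded climate field.
def grid_index(lat, lon, lat0=20.0, lon0=-130.0, res=0.25):
    """(row, col) on a quarter-degree grid anchored at (lat0, lon0)."""
    return int((lat - lat0) / res), int((lon - lon0) / res)

fires = [
    {"name": "example_fire", "lat": 40.1, "lon": -105.3,
     "start": "2020-08-01", "end": "2020-08-15"},
]

for fire in fires:
    row, col = grid_index(fire["lat"], fire["lon"])
    # temperature[t][row][col] would then give the co-located climate series
    # over the fire's date range.
    print(fire["name"], (row, col))
```

This nearest-cell lookup is only the simplest possible linkage; as the interview notes, deciding how the two data sets should relate is itself part of the scientific problem.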
15:19:13 Yeah.
15:19:09 [Moderator] We're going to move to a new section, trust in data. How do you determine what is relevant to your research using information like metadata, title, contents, data cards?
15:19:26 Let's see.
15:19:33 That is…
15:19:46 That's generally going to be: what is the chain of provenance that generated this data product?
15:19:55 Because, you know, we're pretty much never working with something that is direct instrument output. It's always something where either somebody ran a model and I'm just using the output from that model, or there's been some kind of
15:20:15 Process, which might involve a model, where people took instrument readings or observational data and combined them into a data product that is uniform in space and time.
15:20:33 Or, for example, one of these fire data sets: there are these reports that get submitted when people are doing fire management activities, and a researcher I know collected them all up and put them into
15:20:52 A database. So really the question for trustworthiness is: where did this data come from? How did it get where it got to?
15:21:10 [Moderator] And that feeds into my next question follow-up, which is how do you eventually pick a data set for machine learning? Is it the quality, the usability, things like number of likes, number of downloads, word of mouth even?
15:21:21 It's almost entirely going to be characteristics of the data. So, like I said, if I want gridded data that is daily, that covers a decent chunk of North America,
15:21:36 And is fairly high resolution, and has variables other than just temperature and precipitation in it,
15:21:47 There are two choices. That's it.
15:21:53 So it's almost entirely a matter of finding a data set that has the characteristics you need in order to answer the research question you want. And the answer is usually, okay,
15:22:08 You got a couple of options at most. And sometimes you say, well, I'll just use both of them.
15:22:14 Sometimes you say, well, this is the only one, so I'll use that.
15:22:18 And sometimes it's like, well, the thing I really want doesn't exist. What's the closest I can get? And then what do I need to do to get it to what I actually need?
15:22:32 [Moderator] Do you read the data cards of data sets?
15:22:36 I don't even know what a data card is.
15:22:38 [Moderator] Here, the simple definition I have is: a structured summary of essential facts about various aspects of ML datasets, needed by stakeholders across the lifecycle for things like responsible AI development, describing the content, design, and evaluation.
15:22:54 [Moderator] Things like that.
15:22:56 No, because I do not have the luxury of being able to choose based on that kind of information.
15:23:09 [Moderator] I had some follow-up questions, but I believe you've already addressed these. What attributes would you look for in data cards? And you mentioned several attributes that you were interested in.
15:23:21 Yeah, yeah.
15:23:23 [Moderator] And then what attributes would you want to see that may be missing from metadata?
15:23:28 Um…
15:23:33 So let's see. I don't know to what extent these have been flagged in a way that is accessible to these metadata search platforms that you're talking about.
15:23:49 When I'm looking for stuff, generally, documentation for the data set is probably going to have things like what it covers in space and time, what the spatial resolution is, and what the variables are.
15:24:03 So those are the things that if those don't exist in your data cards, they should get pushed into them.
15:24:20 Um… What format is it in?
15:24:30 I'm trying to think if there's anything else in particular. I mean, the source of it is important, and
15:24:45 That may be more important going through a metadata search platform than through the way I usually do it, because it may be less obvious when you're going through one of those.
15:25:02 And by source I mean what entity has produced it, and how did they produce it?
15:25:15 Can I easily find out information about the algorithms and the processing they did, and things like that?
15:25:24 And then, what was it derived from? What did it originally come from and how has it been transformed? I think that's pretty important to understanding what you can do with the data and have it be meaningful.
15:25:52 [Moderator] We're going to move on to what is missing from search. And we've touched on several of these things already, but how would you improve data set search for yourself?
15:26:01 Yeah.
15:26:06 Um…
15:26:14 Hmm.
15:26:26 I guess maybe a thing that would be really useful, that I haven't come across, is: how standards compliant is this data?
15:26:40 You know, is it strictly CF compliant? Are there some other standards that it adheres to?
15:26:51 How much work have people done in polishing this data?
15:27:03 [Moderator] And follow up to that, what additions to tools such as Google Dataset Search would you benefit from for searching for machine learning data sets?
15:27:19 Maybe more explicit information about where the data is, what format it's in, how do you get the data?
15:27:29 So, you know, where is it and how do you access it?
15:27:33 And things like that.
15:27:39 [Moderator] Last section we have here. AI assistance in searching, I want to ask, do you use large language model applications such as ChatGPT to assist you in searching for datasets?
15:27:49 No. Like.
15:27:52 Okay.
15:27:55 I might find… Sometimes Google will automatically generate an AI summary of search results for you, and I will glance at that, and sometimes that is useful. But generally speaking,
15:28:17 For the kinds of stuff I'm looking for, I find that in a lot of cases it's the kind of thing where LLMs are not going to produce reliable results.
15:28:29 It's sort of like literature search. My experience with LLMs and literature search is that
15:28:40 If you're looking for, you know, what's the paper that everybody cites about this topic, it's a fine shortcut. It's a fine form of search engine result.
15:28:54 Right. But if you're saying, give me a comprehensive list, it will happily tell you about papers that don't exist. And you read the paper title and you read the authors and you go, yeah,
15:29:05 If somebody were to write a paper about this topic, it would probably be those people and it would probably be in that journal, but they haven't actually written that paper. So this is not helpful to me.
15:29:16 [Moderator] I see.
15:29:20 Like the kinds of stuff I'm using it for is niche enough that LLMs have a much bigger tendency to hallucinate than they do to give you actual answers.
15:29:37 [Moderator] Okay, well, that concludes the main sections I have. I'm going to move to some post-test questions here.
15:29:45 [Moderator] And these are just three separate questions I have before the end.
15:29:46 Yeah. Yeah.
15:29:49 [Moderator] On a scale of one to five, one being very hard for you, five being very easy for you, how would you rate your overall ability to locate a relevant geospatial data set?
15:30:07 Probably three.
15:30:09 [Moderator] And can you explain your rating?
15:30:12 Um…
15:30:18 Sometimes the data I want doesn't exist.
15:30:24 Sometimes when I am searching for data, I have to look in a lot of different places and put some effort into finding out what each data set is,
15:30:33 To see if it'll meet my needs. But also, this is not something I do often.
15:30:43 And in many cases, you say, okay, I want this kind of information, and the available pool of answers is small enough that kind of everybody knows. It's not hard to find out, oh, here are the three options for fire data sets.
15:31:02 And I can find a few things that talk about them. Okay, I guess I'll pick one.
15:31:08 [Moderator] Is there any part of the geospatial ML workflow pipeline that causes you issues or is not satisfactory?
15:31:19 Yes. Um.
15:31:24 Xarray is the go-to library that everybody uses for geospatial data, particularly in climate.
15:31:33 And my issue with it is that Xarray does too much, and does too much automatically.
15:31:41 And for machine learning, there's often a lot of things that it will do that you want to turn off.
15:31:46 And to tell it stop being so smart. You're being smart in a way that is slowing things down and is unhelpful.
15:31:54 And it's not easy. It takes some work to turn off all the smart parts.
15:32:06 [Moderator] Last question, do you have any additional feedback or questions that you would like to pose to developers of geospatial metadata standards?
15:32:17 Um…
15:32:22 I think the thing I would say is…
15:32:28 From my experience with climate data, and with providing future climate projections to end users and impacts users,
15:32:40 What I've seen is that
15:32:46 People who are climate specialists, climate scientists and such,
15:32:52 Their abilities and needs are really different from what I would call climate impacts users: people who don't actually care about climate, but care about something that's affected by climate, and so they need information about what's happening with it.
15:33:06 And the latter group is huge and heterogeneous. And what they need varies tremendously.
15:33:15 From person to person, and their technical capabilities vary tremendously from person to person. So I would say it's important to make sure that you've got a decent sampling of the range of what people might want,
15:33:35 And to try and produce something that is useful to a very broad range of people, not just to specialists.
15:33:51 There are so many possible different things that people want that you can't anticipate all of their needs.
15:34:04 So you just have to try and do something that is going to be as broadly applicable as you can.
15:34:21 And I guess maybe…
15:34:26 I guess maybe the other thing is to be aware that the machine learning community uses a whole lot of jargon that is very opaque to outsiders.
15:34:40 And as more people are getting into using machine learning tools, it becomes more and more important to figure out what jargon you're using, and to put a lot of community effort into explaining what all of it means, or to stop using it,
15:35:03 When you're exposing stuff to outsiders.
15:35:12 [Moderator] That concludes our questions and interview.
