[{"data":1,"prerenderedAt":4},["ShallowReactive",2],{"faqs-content":3},"# FAQs\n\n---\n\n## 1. I can't find the catalog I want. Why?\n\nCosmoHub contains public and private catalogs belonging to different cosmology projects. Public catalogs are available to all CosmoHub users once their email is validated. Private catalogs are only available to users who have the data rights. Data rights are defined by each project.\n\n## 2. Why can't I see the catalogs from my associated project (either Euclid, MICE, PAU or DES)?\n\nValidating a user who claims data rights takes some time: we have to confirm that the user actually belongs to the project and has the data rights. Once this process is done, the user receives a confirmation email and the catalogs of that project become available through the portal.\n\n## 3. Why is my favourite \"X\" file format not available for creating custom catalogs?\n\nWe decided to support four file formats:\n\n- **CSV.BZ2** is our standard format, both for producing custom catalogs and for saving our raw data in the storage file system. There are plenty of bash tools to work with CSV files. We encourage you to read this post from Brian Conelly, which shows very useful commands, and we recommend that Python users use the [pandas dataframe package](https://pandas.pydata.org), which offers easy-to-use methods for working with CSV files. See FAQ 4 for how to read CSV.BZ2 files created through CosmoHub.\n\n- **FITS** was requested by several CosmoHub users. We welcome their encouragement to continue improving our platform. To make FITS files available we had to develop a driver, which can be found in the [recarrayserde repository](https://github.com/ptallada/recarrayserde), to efficiently create FITS files from Hive's output. There are also several packages and applications for working with FITS files, such as Topcat. 
Be careful with large FITS files, since most software opens them and reads them fully into memory.\n\n- [**ASDF**](https://www.asdf-format.org/projects/asdf-standard/en/latest/) is [a new data format for Astronomy](https://www.sciencedirect.com/science/article/pii/S2213133715000645). We added this format in case some pioneer wants to use it, and so that we could include it in our article \"CosmoHub on Hadoop: Interactive analysis and distribution of cosmological data\", [Tallada et al. 2020](https://arxiv.org/pdf/2003.03217).\n\n- [**PARQUET**](https://parquet.apache.org) is an open source, column-oriented data file format designed for efficient data storage and retrieval. It is the most recent format added to CosmoHub. The main reason for adding it, besides its popularity and efficiency, is that it is currently the only one of the four that can store array elements.\n\n### 3.1 How to download CosmoHub files to your computer?\n\nThere are several ways to download CosmoHub files to your computer. Here are two recommended methods:\n\n#### Method 1: Using wget\n\nOpen your terminal and type the following command:\n\n```bash\nwget --content-disposition {LINK}\n```\n\nReplace `{LINK}` with the URL of the file you want to download. This command will download the file and automatically save it with its original filename.\n\n#### Method 2: Using curl\n\nOpen your terminal and type the following command:\n\n```bash\ncurl -JLO {LINK}\n```\n\nAgain, replace `{LINK}` with the URL of the file you want to download. This command will download the file and save it with its original filename.\n\n**Note:** If you are not familiar with the command line, you can also download files directly from the CosmoHub website by clicking on the download button next to the file you want to download.\n\n## 4. 
How to deal with CosmoHub CSV.BZ2 files with the pandas Python library?\n\nThe CSV.BZ2 files delivered by CosmoHub are built by concatenating a series of smaller CSV.BZ2 streams, each of them produced in parallel on the Hadoop nodes. The resulting catalog may be larger than the amount of memory at your disposal, so we recommend working with the results in chunks. The following snippet of Python 3 code shows how to open and work with this kind of catalog:\n\n```python\nimport pandas as pd\n\n# Define the path to the catalog compressed CSV file:\ncatalog_filename = \"/path_to_catalog_filename/XXX.csv.bz2\"\n\n# Define the list of columns that uniquely identify each row\nindex_col = [\"whatever_gal_id\"]\n\n# Define the chunksize for processing the file, in number of rows\nchunksize = 10000\n\n# Open the file using pandas and read it in chunks:\nfor chunk in pd.read_csv(catalog_filename, sep=\",\", index_col=index_col, comment='#', na_values=r'\\N', compression='bz2', chunksize=chunksize):\n    print(chunk.head())\n```\n\n## 5. What User-Defined Functions (UDFs) are available in Expert mode?\n\nThe implemented functions are listed below, with links to their documentation. 
If you need a specific UDF not listed here, please contact us.\n\n### Math functions\n\n- `udf.atan2(double y, double x)`\n- `udf.erfc(double x)`\n\n### HEALPix functions\n\n- `udf.hp_ang2pix(order, theta, phi, nest=False, lonlat=False)`; `udf.hp_ang2pix(order, ra, dec, nest=False, lonlat=True)` [*]\n- `udf.hp_ang2vec(theta, phi, lonlat=False)`; `udf.hp_ang2vec(ra, dec, lonlat=True)`\n- `udf.hp_angdist(dir1, dir2, lonlat=False)`\n- `udf.hp_neighbours(order, theta, phi, nest=False, lonlat=False)`; `udf.hp_neighbours(order, ra, dec, nest=False, lonlat=True)`; `udf.hp_neighbours(order, ipix, nest=False)` [*]\n- `udf.hp_nest2ring(order, ipix)` [*]\n- `udf.hp_npix2nside(npix)`\n- `udf.hp_nside2npix(nside)`\n- `udf.hp_nside2order(nside)`\n- `udf.hp_pix2ang(dir1, dir2, lonlat=False)`\n- `udf.hp_pix2vec(order, ipix, nest=False)` [*]\n- `udf.hp_ring2nest(order, ipix)` [*]\n- `udf.hp_vec2ang(vectors, lonlat=False)`\n- `udf.hp_vec2pix(order, x, y, z, nest=False)` [*]\n\n[*] Because these functions use the Java API internally, the first parameter they take is the order, NOT the nside.\n\n### Array functions\n\n- `udf.array_min(array_column)` - Returns the min of a set of arrays\n- `udf.array_max(array_column)` - Returns the max of a set of arrays\n- `udf.array_sum(array_column)` - Returns the sum of a set of arrays\n- `udf.array_count(array_column)` - Returns the count of a set of arrays\n- `udf.array_avg(array_column)` - Returns the average of a set of arrays\n- `udf.array_stddev_pop(array_column)` - Returns the population standard deviation of a set of arrays\n- `udf.array_stddev_samp(array_column)` - Returns the sample standard deviation of a set of arrays\n- `udf.array_variance(array_column)` - Alias of `udf.array_var_pop(array_column)`; returns the population variance of a set of arrays\n- `udf.array_var_pop(array_column)` - Returns the population variance of a set of arrays\n- `udf.array_var_samp(array_column)` - Returns the sample variance of a set of 
arrays\n\n### ADQL functions\n\n- `udf.adql_area(geom)` - Compute the area, in square degrees, of a given geometry\n- `udf.adql_box(ra, dec, width, height)`; `udf.adql_box(point, width, height)` - Construct an ADQL box from the sky coordinates of its center, a width and a height\n- `udf.adql_centroid(geom)` - Compute the centroid of a given geometry\n- `udf.adql_circle(ra, dec, radius)`; `udf.adql_circle(point, radius)` - Construct an ADQL circle from center sky coordinates and a radius\n- `udf.adql_contains(geom1, geom2)` - Return true if the first geometry is fully contained within the other, false otherwise\n- `udf.adql_coord1(point)` - Returns the first coordinate (right ascension) of an ADQL point\n- `udf.adql_coord2(point)` - Returns the second coordinate (declination) of an ADQL point\n- `udf.adql_distance(ra1, dec1, ra2, dec2)`; `udf.adql_distance(point1, point2)` - Compute the arc length along a great circle between two sky coordinates\n- `udf.adql_intersects(geom1, geom2)` - Return true if both geometries overlap, false otherwise\n- `udf.adql_point(ra, dec)`; `udf.adql_point(ipix)` - Construct an ADQL point type from sky coordinates\n- `udf.adql_polygon(ra1, dec1, ra2, dec2, ra3, dec3, …)`; `udf.adql_polygon(point1, point2, point3, …)` - Construct an ADQL polygon from a sequence of at least 3 sky coordinates\n- `udf.adql_complement(geom)` - Returns the complement of an ADQL region [*]\n- `udf.adql_intersection(region)` - Returns the intersection of all regions. Note that this is an aggregate function [*]\n- `udf.adql_region(geom, [order])` - Returns an ADQL region, represented as a HEALPix rangeset, from an arbitrary ADQLGeometry and an optional resolution [*]\n- `udf.adql_union(region)` - Returns the union of all regions. 
Note that this is an aggregate function [*]\n- `udf.adql_point(ipix)` - Construct an ADQL point type from a HEALPix pixel of order 29 [*]\n\n**NOTE:** all angles and angular distances in ADQL are measured in degrees.\n\n[*] These functions are ADQL extensions and therefore they are not described in the IVOA Astronomical Data Query Language proposed recommendation document.\n\n## 6. How to use CosmoHub to create the input for a HEALPix map?\n\nEnter Expert Mode in Step 4 (Query) and type a query like this one:\n\n```sql\nSELECT udf.hp_ang2pix(12, ra, `dec`, True, True) as hpix_4096_nest, COUNT(*) as count\nFROM cosmohub.gaia_dr3_source\nGROUP BY udf.hp_ang2pix(12, ra, `dec`, True, True)\n```\n\nThis query counts how many objects there are in each HEALPix pixel.\n\nIn this particular case we are using the Gaia Data Release 3 table and the HEALPix User Defined Function `udf.hp_ang2pix` with order = 12, which corresponds to nside = 4096 (read FAQ 5 for more details about HEALPix User Defined Functions).\n\nThe following snippet shows how to open the csv.bz2 CosmoHub file and generate a HEALPix map:\n\n```python\n# Some libraries\n# (%matplotlib inline is a Jupyter magic; remove it when running as a plain script)\n%matplotlib inline\nimport matplotlib.pyplot as plt\nimport pandas as pd\nimport healpy as hp\nfrom matplotlib import cm\nimport numpy as np\n\n# CosmoHub csv.bz2 file\ncatalog_filename = \"/path_to_catalog_filename/XXX.csv.bz2\"\n\n# Define nside and index_col to match the proposed query (order 12 -> nside 4096):\nnside = 4096\nindex_col = 'hpix_' + str(nside) + '_nest'\n\n# Read file into a dataframe\nmapa_df = pd.read_csv(catalog_filename, sep=\",\", index_col=index_col, comment='#', na_values=r'\\N')\n\n# Adding the empty pixel values since HEALPix maps cover the full sky and some footprints do not. 
I also add '0' to those empty pixels:\nhealpix_mapa_df = mapa_df.reindex(index=np.arange(hp.nside2npix(nside)), copy=True, fill_value=0)\n\n# Store it as a HEALPix FITS file in NEST format:\nhp.fitsfunc.write_map('Gaia_DR3_count_map.fits', healpix_mapa_df['count'].values, nest=True)\n\n# Plot it (counts in log10 scale):\ncool_cmap = cm.bone_r\nhp.visufunc.mollview(map=np.log10(healpix_mapa_df['count'].values), cmap=cool_cmap, nest=True)\n```\n\n![Gaia DR3 HEALPix Map](/Gaia_DR3_HEALPix_map.webp)\n\nNote that you can use any aggregate function, not only `COUNT`, so you are not limited to count maps.\n\nThis example estimates the average of the shapeexp_e1 field for \"exponential\" morphological model galaxies:\n\n```sql\nSELECT udf.hp_ang2pix(12, ra, `dec`, True, True) as hpix_4096_nest, AVG(shapeexp_e1) as avg_shapeexp_e1\nFROM cosmohub.legacy_survey_dr8_phz\nWHERE `type` = 'EXP'\nGROUP BY udf.hp_ang2pix(12, ra, `dec`, True, True)\n```\n\n## 7. Why can the performance of similar queries vary so much?\n\nThe current CosmoHub implementation is built on top of a Hadoop cluster whose resources are shared between all users. We have configured two different queues, one for real-time analysis and another for creating custom catalogs. Because resources are shared, performance can vary a lot depending on the number of concurrent queries.\n\n## 8. How can I contact you to ingest my large dataset?\n\nYou can contact us [via email](mailto:support@cosmohub.pic.es).\n\n## 9. I don't want to use CosmoHub anymore. How can I remove my CosmoHub account?\n\nYou can request the removal of your account [via email](mailto:support@cosmohub.pic.es).\n\n## 10. Is there a tutorial?\n\nMore or less. Soon, you will be able to watch some videos on our [CosmoHub YouTube Channel](https://youtube.com)!\n\n## 11. 
Why is CosmoHub so awesome?\n\nCosmoHub is built on top of **Apache Hadoop**:\n\n> \"The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.\"\n\nCosmoHub also uses **Apache Hive**:\n\n> \"The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.\"\n\nWe have set up the platform so that queries over large tables are really fast:\n\n- Hive uses the **Apache Tez** execution engine instead of the venerable MapReduce engine.\n- We configure Hive to use **ORCFile**, a new table storage format that sports fantastic speed improvements through techniques like predicate push-down, compression and more. Using ORCFile for every Hive table should be extremely beneficial to get fast response times for Hive queries.\n- We use the **vectorized query** technique, which improves the performance of operations like scans, aggregations, filters and joins by performing them in batches of 1024 rows at once instead of a single row each time.\n\nIn particular we use one of the most popular Big Data solutions built on top of Hadoop, **Hortonworks**. 
Our platform uses the Hortonworks Data Platform, which includes HDFS, YARN, MapReduce, Hive and Ambari.\n\nThe frontend of CosmoHub is a responsive Web interface powered by: **Vue 3** (progressive JavaScript framework), **Nuxt 3** (full-stack framework), **PrimeVue** (UI component library), **UnoCSS** (utility-first CSS framework), and **TypeScript** for type safety and a better development experience.\n\nFinally, the backend is a **ReST API** powered by **Flask**, which uses the following libraries: flask-restful (ReST framework), sqlalchemy (database ORM), websockets (bidirectional communications), gevent (asynchronous framework), pyhive (hive connection library) and pyhdfs (hdfs bindings).\n",1772105430032]